Compare commits

...

89 Commits

Author SHA1 Message Date
Timothy Kassis
7e8deebf96 Add support for Qiskit from IBM: software stack for quantum computing and algorithms research. Build, optimize, and execute quantum workloads at scale. 2025-11-30 08:46:41 -05:00
Timothy Kassis
7763491813 Add support for PennyLane: a cross-platform library for quantum computing, quantum machine learning, and quantum chemistry. 2025-11-30 08:27:24 -05:00
Timothy Kassis
16e47a1755 Add support for Cirq: library for writing, manipulating, and optimizing quantum circuits, and then running them on quantum computers and quantum simulators. 2025-11-30 08:02:58 -05:00
Timothy Kassis
a077cee836 Add support for QuTiP: open-source software for simulating the dynamics of open quantum systems. 2025-11-30 07:46:17 -05:00
Timothy Kassis
7caef7df68 Fix Reactome database nesting 2025-11-30 06:28:41 -05:00
Timothy Kassis
6ac2a15e39 Add support for GeoPandas 2025-11-27 18:57:26 -05:00
Timothy Kassis
41f272c2bd Add Plotly support 2025-11-27 18:47:08 -05:00
Timothy Kassis
02574ba19d Merge pull request #10 from K-Dense-AI/add_adaptyv_integration
Add support for Adaptyv for protein design
2025-11-24 19:53:28 -05:00
Timothy Kassis
ea638c5618 Add support for Adaptyv for protein design 2025-11-24 19:52:45 -05:00
Timothy Kassis
8e7a791871 Merge pull request #8 from alperyilmaz/patch-1
updated CHEMBL function calls
2025-11-23 04:54:15 -08:00
Alper Yilmaz
3bb0ee77be updated CHEMBL function calls
https://bioservices.readthedocs.io/en/latest/references.html#module-bioservices.chembl describes changes to functions after 1.6.0

also, get_compounds_by_chemblId fails when tested
2025-11-21 12:52:30 +03:00
Timothy Kassis
e5fc882746 Merge pull request #6 from K-Dense-AI:open_alex_integration
Add support for OpenAlex
2025-11-19 13:39:08 -08:00
Timothy Kassis
65b39d45d6 Add support for OpenAlex 2025-11-19 13:38:32 -08:00
Timothy Kassis
c078c98ad2 Add Perplexity search 2025-11-16 16:16:21 -08:00
Timothy Kassis
2e80732340 Add uv installation instructions 2025-11-16 15:36:33 -08:00
Timothy Kassis
2fc3e6a88e Updated installation instructions for all skills to always use uv pip install 2025-11-16 15:34:52 -08:00
Timothy Kassis
d94f21c51f Add support for fluidsim for computational fluid dynamics 2025-11-13 19:46:29 -08:00
Timothy Kassis
19c0b390ee Merge pull request #5 from K-Dense-AI/refactor_skills
Refactor skills into a single installable plugin
2025-11-13 19:21:02 -08:00
Timothy Kassis
54cab8e4b5 Update readme to clarify refactoring 2025-11-13 19:20:09 -08:00
Timothy Kassis
ad2dfc3446 Update readme 2025-11-13 19:17:42 -08:00
Timothy Kassis
63f257d81e Consolidate plugins 2025-11-13 18:57:02 -08:00
Timothy Kassis
8be6c6c307 Consolidate skills 2025-11-13 18:50:42 -08:00
Timothy Kassis
cc99fdb57d Add support for Modal so researchers can scale compute on demand for heavy computational tasks beyond running locally 2025-11-10 10:57:05 -08:00
Timothy Kassis
50fdaf1b04 Add Slack link 2025-11-10 09:24:36 -08:00
Timothy Kassis
82663ee1de Bump version 2025-11-07 09:37:59 -08:00
Timothy Kassis
2873d0e39d Improve the EDA skill 2025-11-07 09:37:14 -08:00
Timothy Kassis
0e4939147f Link to examples 2025-11-06 17:05:17 -08:00
Timothy Kassis
5b7081cbff Add examples 2025-11-06 17:04:04 -08:00
Timothy Kassis
ffad3d81b0 Improve the EDA skill 2025-11-04 17:25:06 -08:00
Timothy Kassis
1225ddecf1 Add support for both geniml and gtars 2025-11-04 10:22:03 -08:00
Timothy Kassis
4ad4f9970f Improve the scikit-learn skill 2025-11-04 10:11:46 -08:00
Timothy Kassis
63a4293f1a Add support for the Denario AI scientist 2025-11-03 17:04:20 -08:00
Timothy Kassis
f124e28509 Update the PyOpenMS skill 2025-11-03 16:55:01 -08:00
Timothy Kassis
c56fa43747 Improve the Hugging Face transformers skill 2025-11-03 16:44:15 -08:00
Timothy Kassis
86d8878eeb Improve the astropy skill 2025-11-03 16:34:14 -08:00
Timothy Kassis
d57129ca3f Bump version 2025-11-03 16:21:47 -08:00
Timothy Kassis
537edff2a1 Improve the arboreto skill 2025-11-03 16:21:26 -08:00
Timothy Kassis
6ddea4786e Improve the anndata skill 2025-11-03 16:09:01 -08:00
Timothy Kassis
094d5aa9f1 Improve the aeon skill 2025-11-03 15:59:29 -08:00
Timothy Kassis
862445f531 Add support for Stable-Baselines3, a PyTorch-based library providing reliable, well-tested implementations of popular deep reinforcement learning algorithms with a simple and consistent API for research and training. 2025-11-02 14:54:16 -08:00
Timothy Kassis
b8c4d2bae1 Add support for PufferLib which is a high-performance, open-source Python toolkit that wraps complex RL environments to look like standard Gym/PettingZoo interfaces and supports massively parallel simulation (1M+ steps/sec) to accelerate deep reinforcement learning. 2025-11-02 14:52:06 -08:00
Timothy Kassis
27d6ee387f Add support for Vaex for fast, memory-efficient exploration and visualization of large tabular datasets using lazy, out-of-core computation. 2025-11-02 14:43:20 -08:00
Timothy Kassis
f32b3f8b42 Add support for SymPY for symbolic mathematics that enables algebraic manipulation, equation solving, calculus, and symbolic computation directly in Python. 2025-11-02 14:30:20 -08:00
Timothy Kassis
ed2d1c4aeb Add support for NetworkX for creating, analyzing, and visualizing complex networks and graphs. 2025-11-02 14:19:20 -08:00
Timothy Kassis
7e3fae3ad1 Add support for LaminDB 2025-10-31 11:39:54 -07:00
Timothy Kassis
97c03a11e5 Add Claude Code installation instructions 2025-10-31 11:21:01 -07:00
Timothy Kassis
b6bb261a71 Consolidate getting started 2025-10-31 09:49:48 -07:00
Timothy Kassis
c0de4269c0 Merge pull request #3 from K-Dense-AI/orion/add-hosted-mcp
Update README to include quick installation instructions for Cursor u…
2025-10-30 16:12:15 -07:00
Haoxuan "Orion" Li
aa5018612f Update README to include quick installation instructions for Cursor users and clarify MCP server usage across various clients. 2025-10-30 15:01:20 -07:00
Timothy Kassis
f12f9c2904 Enhance README by adding a call to action for users to star the repository, promoting visibility and ongoing maintenance. 2025-10-28 17:57:09 -07:00
Timothy Kassis
55af02f2d4 Add star history 2025-10-28 17:52:44 -07:00
Timothy Kassis
346010a1bf Add Markitdown skill to read all sorts of documents 2025-10-28 16:31:01 -07:00
Timothy Kassis
5b37b62147 Update README to reflect the increase in equivalent tools from 500 to 1002 2025-10-28 15:54:00 -07:00
Timothy Kassis
82901e9fbe Add hypothesis generation using HypoGeniC and HypoRefine 2025-10-28 12:53:43 -07:00
Timothy Kassis
752c32949b Add citation 2025-10-28 10:47:44 -07:00
Timothy Kassis
981823a71a Add ScholarEval as a thinking framework 2025-10-28 10:45:34 -07:00
Vinayak Agarwal
26d4fde324 Updated skill descriptions to resolve possible conflicts in their use 2025-10-26 18:52:50 -07:00
Timothy Kassis
1f03feda5c Add protocols.io integration 2025-10-26 11:09:03 -07:00
Timothy Kassis
44285238c4 Fix context initialization by using AGENTS.md instead of AGENT.md 2025-10-26 10:18:45 -07:00
Vinayak Agarwal
af246ac562 Revise literature review template to enhance structure and clarity, including expanded sections for authors' affiliations, review types, and detailed methodology. Update abstract format to include background, objectives, methods, results, and conclusions. Introduce new subsections for protocol registration, quality assessment, and knowledge gaps, ensuring comprehensive guidance for systematic reviews. 2025-10-25 23:30:34 -07:00
Timothy Kassis
04d528c4bc Added support for NeuroKit2 2025-10-25 21:38:31 -07:00
Timothy Kassis
b83942845c Support for aeon for time-series analysis and machine learning 2025-10-25 21:25:50 -07:00
Timothy Kassis
6cefe6f4cc Add support for PyLabRobot for lab automation 2025-10-25 21:23:47 -07:00
Timothy Kassis
a5c2ed9bb6 Add literature review skill 2025-10-25 10:55:36 -07:00
Timothy Kassis
295370a730 Update README.md to replace downloads and workflows badges with an equivalent tools badge for improved clarity. 2025-10-25 10:38:43 -07:00
Timothy Kassis
4fde8cb6de Enhance README.md by adding a downloads badge to track total downloads of the project. 2025-10-24 19:41:22 -07:00
Timothy Kassis
564bd39835 Add support for HistoLab 2025-10-24 13:41:06 -07:00
Timothy Kassis
7240da6e6c Add support for PathML 2025-10-24 13:40:40 -07:00
Timothy Kassis
1871693348 Enhance README.md with detailed instructions for various workflows, emphasizing the importance of organized output and the creation of comprehensive documentation and visualizations. 2025-10-24 09:30:56 -07:00
Timothy Kassis
0e03bbcf38 Update readme 2025-10-24 09:23:19 -07:00
Timothy Kassis
ca6fd369d7 Add DrugBank access 2025-10-24 09:04:43 -07:00
Timothy Kassis
808fa11206 Add DataCommons which enables scientists to programmatically access nodes in the Data Commons knowledge graph. 2025-10-24 09:04:27 -07:00
Timothy Kassis
d923063309 Change license to MIT allowing commercial use! 2025-10-23 20:48:01 -07:00
Timothy Kassis
43dc67a316 Version bump 2025-10-23 10:02:31 -07:00
Timothy Kassis
ee73143b53 Add scikit-survival for survival analysis, allowing users to establish connections between covariates and the time of an event 2025-10-23 10:02:16 -07:00
Timothy Kassis
b1df506eba Add support for PyHealth making healthcare AI simpler and more powerful 2025-10-23 09:50:07 -07:00
Timothy Kassis
bb1c9f4573 Add support for PyDicom 2025-10-23 09:35:46 -07:00
Timothy Kassis
6cb25aea28 Add SimPy, a process-based discrete-event simulation framework 2025-10-23 09:26:44 -07:00
Timothy Kassis
92b4e024d1 Addition of ESM, SHAP, scvi-tools...and much more! 2025-10-23 09:12:41 -07:00
Timothy Kassis
3fbd9ba6a9 Add utility skill to get system resources 2025-10-23 09:12:01 -07:00
Timothy Kassis
62ddd1296c Add ESM3 and ESM C protein models 2025-10-23 09:10:34 -07:00
Timothy Kassis
a11ca9c100 Add scvi-tools, providing models for single-cell sequencing 2025-10-23 09:09:18 -07:00
Timothy Kassis
db2cedb2fb Add SHAP for model explainability 2025-10-23 09:08:27 -07:00
Timothy Kassis
87a608730c Add support for bioRxiv database search and paper downloading 2025-10-23 08:51:45 -07:00
Timothy Kassis
0943c243bb Added full support for ToolUniverse. Claude Code can now use the tools it provides as needed, without the MCP being a gateway. 2025-10-22 14:59:27 -07:00
Timothy Kassis
0b5a648e98 Add Paper2Web 2025-10-22 08:50:06 -07:00
Timothy Kassis
77822efeed Improved Biomni support 2025-10-22 08:38:06 -07:00
Timothy Kassis
71a3c3750f Enhance README with detailed examples for complex scientific workflows across drug discovery, bioinformatics, and clinical genomics, showcasing multi-step processes and integration of various tools. 2025-10-22 08:24:40 -07:00
Timothy Kassis
9584469988 Refactor release notes generation in workflow to remove version header and streamline output 2025-10-22 08:21:43 -07:00
913 changed files with 133,808 additions and 22,029 deletions


@@ -2,134 +2,147 @@
{
"name": "claude-scientific-skills",
"owner": {
"name": "Timothy Kassis",
"email": "timothy.kassis@k-dense.ai"
"name": "K-Dense Inc.",
"email": "contact@k-dense.ai"
},
"metadata": {
"description": "Claude scientific skills from K-Dense Inc",
"version": "1.50.0"
"version": "2.6.0"
},
"plugins": [
{
"name": "scientific-packages",
"description": "Collection of python scientific packages",
"name": "scientific-skills",
"description": "Collection of scientific skills",
"source": "./",
"strict": false,
"skills": [
"./scientific-packages/anndata",
"./scientific-packages/arboreto",
"./scientific-packages/astropy",
"./scientific-packages/biomni",
"./scientific-packages/biopython",
"./scientific-packages/bioservices",
"./scientific-packages/cellxgene-census",
"./scientific-packages/cobrapy",
"./scientific-packages/dask",
"./scientific-packages/datamol",
"./scientific-packages/deepchem",
"./scientific-packages/deeptools",
"./scientific-packages/diffdock",
"./scientific-packages/etetoolkit",
"./scientific-packages/flowio",
"./scientific-packages/gget",
"./scientific-packages/matchms",
"./scientific-packages/matplotlib",
"./scientific-packages/medchem",
"./scientific-packages/molfeat",
"./scientific-packages/polars",
"./scientific-packages/pydeseq2",
"./scientific-packages/pymatgen",
"./scientific-packages/pymc",
"./scientific-packages/pymoo",
"./scientific-packages/pyopenms",
"./scientific-packages/pysam",
"./scientific-packages/pytdc",
"./scientific-packages/pytorch-lightning",
"./scientific-packages/rdkit",
"./scientific-packages/reportlab",
"./scientific-packages/scanpy",
"./scientific-packages/scikit-bio",
"./scientific-packages/scikit-learn",
"./scientific-packages/seaborn",
"./scientific-packages/statsmodels",
"./scientific-packages/torch_geometric",
"./scientific-packages/torchdrug",
"./scientific-packages/transformers",
"./scientific-packages/umap-learn",
"./scientific-packages/zarr-python"
"./scientific-skills/adaptyv",
"./scientific-skills/aeon",
"./scientific-skills/anndata",
"./scientific-skills/arboreto",
"./scientific-skills/astropy",
"./scientific-skills/biomni",
"./scientific-skills/biopython",
"./scientific-skills/bioservices",
"./scientific-skills/cellxgene-census",
"./scientific-skills/cirq",
"./scientific-skills/cobrapy",
"./scientific-skills/dask",
"./scientific-skills/datacommons-client",
"./scientific-skills/denario",
"./scientific-skills/datamol",
"./scientific-skills/deepchem",
"./scientific-skills/deeptools",
"./scientific-skills/diffdock",
"./scientific-skills/esm",
"./scientific-skills/etetoolkit",
"./scientific-skills/flowio",
"./scientific-skills/fluidsim",
"./scientific-skills/geniml",
"./scientific-skills/geopandas",
"./scientific-skills/gget",
"./scientific-skills/gtars",
"./scientific-skills/hypogenic",
"./scientific-skills/histolab",
"./scientific-skills/lamindb",
"./scientific-skills/markitdown",
"./scientific-skills/matchms",
"./scientific-skills/matplotlib",
"./scientific-skills/medchem",
"./scientific-skills/modal",
"./scientific-skills/molfeat",
"./scientific-skills/neurokit2",
"./scientific-skills/networkx",
"./scientific-skills/paper-2-web",
"./scientific-skills/pathml",
"./scientific-skills/pennylane",
"./scientific-skills/perplexity-search",
"./scientific-skills/plotly",
"./scientific-skills/polars",
"./scientific-skills/pydeseq2",
"./scientific-skills/pydicom",
"./scientific-skills/pyhealth",
"./scientific-skills/pymatgen",
"./scientific-skills/pymc",
"./scientific-skills/pylabrobot",
"./scientific-skills/pymoo",
"./scientific-skills/pufferlib",
"./scientific-skills/pyopenms",
"./scientific-skills/pysam",
"./scientific-skills/pytdc",
"./scientific-skills/pytorch-lightning",
"./scientific-skills/qiskit",
"./scientific-skills/qutip",
"./scientific-skills/rdkit",
"./scientific-skills/reportlab",
"./scientific-skills/scanpy",
"./scientific-skills/scvi-tools",
"./scientific-skills/scikit-bio",
"./scientific-skills/scikit-learn",
"./scientific-skills/scikit-survival",
"./scientific-skills/seaborn",
"./scientific-skills/shap",
"./scientific-skills/simpy",
"./scientific-skills/stable-baselines3",
"./scientific-skills/statsmodels",
"./scientific-skills/sympy",
"./scientific-skills/torch_geometric",
"./scientific-skills/torchdrug",
"./scientific-skills/tooluniverse",
"./scientific-skills/transformers",
"./scientific-skills/umap-learn",
"./scientific-skills/vaex",
"./scientific-skills/zarr-python",
"./scientific-skills/alphafold-database",
"./scientific-skills/biorxiv-database",
"./scientific-skills/chembl-database",
"./scientific-skills/clinpgx-database",
"./scientific-skills/clinvar-database",
"./scientific-skills/clinicaltrials-database",
"./scientific-skills/cosmic-database",
"./scientific-skills/drugbank-database",
"./scientific-skills/ena-database",
"./scientific-skills/ensembl-database",
"./scientific-skills/fda-database",
"./scientific-skills/gene-database",
"./scientific-skills/geo-database",
"./scientific-skills/gwas-database",
"./scientific-skills/hmdb-database",
"./scientific-skills/kegg-database",
"./scientific-skills/metabolomics-workbench-database",
"./scientific-skills/openalex-database",
"./scientific-skills/opentargets-database",
"./scientific-skills/pdb-database",
"./scientific-skills/pubchem-database",
"./scientific-skills/pubmed-database",
"./scientific-skills/reactome-database",
"./scientific-skills/string-database",
"./scientific-skills/uniprot-database",
"./scientific-skills/uspto-database",
"./scientific-skills/zinc-database",
"./scientific-skills/exploratory-data-analysis",
"./scientific-skills/hypothesis-generation",
"./scientific-skills/literature-review",
"./scientific-skills/peer-review",
"./scientific-skills/scholar-evaluation",
"./scientific-skills/scientific-brainstorming",
"./scientific-skills/scientific-critical-thinking",
"./scientific-skills/scientific-writing",
"./scientific-skills/statistical-analysis",
"./scientific-skills/scientific-visualization",
"./scientific-skills/document-skills/docx",
"./scientific-skills/document-skills/pdf",
"./scientific-skills/document-skills/pptx",
"./scientific-skills/document-skills/xlsx",
"./scientific-skills/benchling-integration",
"./scientific-skills/dnanexus-integration",
"./scientific-skills/labarchive-integration",
"./scientific-skills/latchbio-integration",
"./scientific-skills/omero-integration",
"./scientific-skills/opentrons-integration",
"./scientific-skills/protocolsio-integration",
"./scientific-skills/get-available-resources"
]
},
{
"name": "scientific-databases",
"description": "Collection of scientific databases",
"source": "./",
"strict": false,
"skills": [
"./scientific-databases/alphafold-database",
"./scientific-databases/chembl-database",
"./scientific-databases/clinpgx-database",
"./scientific-databases/clinvar-database",
"./scientific-databases/clinicaltrials-database",
"./scientific-databases/cosmic-database",
"./scientific-databases/ena-database",
"./scientific-databases/ensembl-database",
"./scientific-databases/fda-database",
"./scientific-databases/gene-database",
"./scientific-databases/geo-database",
"./scientific-databases/gwas-database",
"./scientific-databases/hmdb-database",
"./scientific-databases/kegg-database",
"./scientific-databases/metabolomics-workbench-database",
"./scientific-databases/opentargets-database",
"./scientific-databases/pdb-database",
"./scientific-databases/pubchem-database",
"./scientific-databases/pubmed-database",
"./scientific-databases/reactome-database",
"./scientific-databases/string-database",
"./scientific-databases/uniprot-database",
"./scientific-databases/uspto-database",
"./scientific-databases/zinc-database"
]
},
{
"name": "scientific-thinking",
"description": "Collection of scientific thinking methodologies",
"source": "./",
"strict": false,
"skills": [
"./scientific-thinking/exploratory-data-analysis",
"./scientific-thinking/hypothesis-generation",
"./scientific-thinking/peer-review",
"./scientific-thinking/scientific-brainstorming",
"./scientific-thinking/scientific-critical-thinking",
"./scientific-thinking/scientific-writing",
"./scientific-thinking/statistical-analysis",
"./scientific-thinking/scientific-visualization",
"./scientific-thinking/document-skills/docx",
"./scientific-thinking/document-skills/pdf",
"./scientific-thinking/document-skills/pptx",
"./scientific-thinking/document-skills/xlsx"
]
},
{
"name": "scientific-integrations",
"description": "Collection of scientific platform integrations",
"source": "./",
"strict": false,
"skills": [
"./scientific-integrations/benchling-integration",
"./scientific-integrations/dnanexus-integration",
"./scientific-integrations/labarchive-integration",
"./scientific-integrations/latchbio-integration",
"./scientific-integrations/omero-integration",
"./scientific-integrations/opentrons-integration"
]
},
{
"name": "scientific-context-initialization",
"description": "Always Auto-invoked skill that creates/updates workspace AGENT.md to instruct the agent to always search for existing skills before attempting any scientific task",
"source": "./scientific-helpers/scientific-context-initialization",
"strict": false
}
]
}
}


@@ -57,14 +57,11 @@ jobs:
id: release_notes
if: steps.check_tag.outputs.exists == 'false'
run: |
VERSION="${{ steps.get_version.outputs.version }}"
PREVIOUS_TAG="${{ steps.previous_tag.outputs.previous_tag }}"
# Start release notes
cat > release_notes.md << 'EOF'
## Claude Scientific Skills v$VERSION
### What's Changed
## What's Changed
EOF
@@ -82,9 +79,6 @@ jobs:
git log --pretty=format:"* %s (%h)" --no-merges --max-count=20 >> release_notes.md
fi
# Replace $VERSION in the file
sed -i "s/\$VERSION/$VERSION/g" release_notes.md
cat release_notes.md
- name: Create Release
@@ -92,7 +86,7 @@ jobs:
uses: softprops/action-gh-release@v1
with:
tag_name: ${{ steps.get_version.outputs.tag }}
name: Claude Scientific Skills v${{ steps.get_version.outputs.version }}
name: v${{ steps.get_version.outputs.version }}
body_path: release_notes.md
draft: false
prerelease: false
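Taken together, these hunks drop the versioned title and the `sed` substitution from the generated notes. A rough sketch of the resulting shell step, reconstructed only from the lines visible above, is:

```bash
# Sketch reconstructed from the hunks above; the surrounding job setup
# (version detection, previous-tag lookup) is omitted.

# The notes file now opens with a plain section header instead of a versioned title.
cat > release_notes.md << 'EOF'
## What's Changed
EOF

# Append up to 20 recent non-merge commits. The alternative branch of the
# conditional (closed by the `fi` above) is not visible in the hunk and is left out.
git log --pretty=format:"* %s (%h)" --no-merges --max-count=20 >> release_notes.md

# No sed substitution is needed anymore, since $VERSION no longer appears in the file.
cat release_notes.md
```

The release itself is then named plainly `v${{ steps.get_version.outputs.version }}`, matching the simplified notes.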

.gitignore

@@ -8,4 +8,6 @@ uv.lock
.venv/
.python-version
main.py
main.py
__pycache__/


@@ -1,56 +1,21 @@
# PolyForm Noncommercial License 1.0.0
<https://polyformproject.org/licenses/noncommercial/1.0.0>
MIT License
> **Required Notice:** Copyright © K-Dense Inc. (https://k-dense.ai)
Copyright (c) 2025 K-Dense Inc.
## Acceptance
In order to get any license under these terms, you must agree to them as both strict obligations and conditions to all your licenses.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
## Copyright License
The licensor (**K-Dense Inc.**) grants you a copyright license for the software to do everything you might do with the software that would otherwise infringe the licensor's copyright in it for any permitted purpose. However, you may only distribute the software according to [Distribution License](#distribution-license) and make changes or new works based on the software according to [Changes and New Works License](#changes-and-new-works-license).
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
## Distribution License
The licensor (**K-Dense Inc.**) grants you an additional copyright license to distribute copies of the software. Your license to distribute covers distributing the software with changes and new works permitted by [Changes and New Works License](#changes-and-new-works-license).
## Notices
You must ensure that anyone who gets a copy of any part of the software from you also gets a copy of these terms or the URL for them above, as well as copies of any plain-text lines beginning with `Required Notice:` that the licensor provided with the software. For example:
> Required Notice: Copyright © K-Dense Inc. (https://k-dense.ai)
## Changes and New Works License
The licensor (**K-Dense Inc.**) grants you an additional copyright license to make changes and new works based on the software for any permitted purpose.
## Patent License
The licensor (**K-Dense Inc.**) grants you a patent license for the software that covers patent claims the licensor can license, or becomes able to license, that you would infringe by using the software.
## Noncommercial Purposes
Any noncommercial purpose is a permitted purpose.
## Personal Uses
Personal use for research, experiment, and testing for the benefit of public knowledge, personal study, private entertainment, hobby projects, amateur pursuits, or religious observance, without any anticipated commercial application, is use for a permitted purpose.
## Noncommercial Organizations
Use by any charitable organization, educational institution, public research organization, public safety or health organization, environmental protection organization, or government institution is use for a permitted purpose regardless of the source of funding or obligations resulting from the funding.
## Fair Use
You may have "fair use" rights for the software under the law. These terms do not limit them.
## No Other Rights
These terms do not allow you to sublicense or transfer any of your licenses to anyone else, or prevent the licensor (**K-Dense Inc.**) from granting licenses to anyone else. These terms do not imply any other licenses.
## Patent Defense
If you make any written claim that the software infringes or contributes to infringement of any patent, your patent license for the software granted under these terms ends immediately. If your company makes such a claim, your patent license ends immediately for work on behalf of your company.
## Violations
The first time you are notified in writing that you have violated any of these terms, or done anything with the software not covered by your licenses, your licenses can nonetheless continue if you come into full compliance with these terms and take practical steps to correct past violations within 32 days of receiving notice. Otherwise, all your licenses end immediately.
## No Liability
***As far as the law allows, the software comes as is, without any warranty or condition, and K-Dense Inc. will not be liable to you for any damages arising out of these terms or the use or nature of the software, under any kind of legal claim.***
## Definitions
The **licensor** is **K-Dense Inc.**, the individual or entity offering these terms, and the **software** is the software K-Dense Inc. makes available under these terms.
**You** refers to the individual or entity agreeing to these terms.
**Your company** is any legal entity, sole proprietorship, or other kind of organization that you work for, plus all organizations that have control over, are under the control of, or are under common control with that organization. **Control** means ownership of substantially all the assets of an entity, or the power to direct its management and policies by vote, contract, or otherwise. Control can be direct or indirect.
**Your licenses** are all the licenses granted to you for the software under these terms.
**Use** means anything you do with the software requiring one of your licenses.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

@@ -1,33 +1,61 @@
# Claude Scientific Skills
[![License: PolyForm Noncommercial 1.0.0](https://img.shields.io/badge/License-PolyForm%20Noncommercial-blue.svg)](LICENSE.md)
[![GitHub Stars](https://img.shields.io/github/stars/K-Dense-AI/claude-scientific-skills?style=social)](https://github.com/K-Dense-AI/claude-scientific-skills)
[![Skills](https://img.shields.io/badge/Skills-72%2B-brightgreen.svg)](#what-s-included)
[![Workflows](https://img.shields.io/badge/Workflows-122-orange.svg)](#what-s-included)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE.md)
[![Skills](https://img.shields.io/badge/Skills-127-brightgreen.svg)](#whats-included)
A comprehensive collection of ready-to-use scientific skills for Claude, curated by the K-Dense team.
A comprehensive collection of **127+ ready-to-use scientific skills** for Claude, created by the K-Dense team. Transform Claude into your AI research assistant capable of executing complex multi-step scientific workflows across biology, chemistry, medicine, and beyond.
These skills enable Claude to work with specialized scientific libraries and databases across multiple scientific domains:
- 🧬 Bioinformatics & Genomics
- 🧪 Cheminformatics & Drug Discovery
- 🔬 Proteomics & Mass Spectrometry
- 🤖 Machine Learning & AI
- 🔮 Materials Science & Chemistry
- 📊 Data Analysis & Visualization
These skills enable Claude to seamlessly work with specialized scientific libraries, databases, and tools across multiple scientific domains:
- 🧬 Bioinformatics & Genomics - Sequence analysis, single-cell RNA-seq, gene regulatory networks, variant annotation, phylogenetic analysis
- 🧪 Cheminformatics & Drug Discovery - Molecular property prediction, virtual screening, ADMET analysis, molecular docking, lead optimization
- 🔬 Proteomics & Mass Spectrometry - LC-MS/MS processing, peptide identification, spectral matching, protein quantification
- 🏥 Clinical Research & Precision Medicine - Clinical trials, pharmacogenomics, variant interpretation, drug safety, precision therapeutics
- 🧠 Healthcare AI & Clinical ML - EHR analysis, physiological signal processing, medical imaging, clinical prediction models
- 🖼️ Medical Imaging & Digital Pathology - DICOM processing, whole slide image analysis, computational pathology, radiology workflows
- 🤖 Machine Learning & AI - Deep learning, reinforcement learning, time series analysis, model interpretability, Bayesian methods
- 🔮 Materials Science & Chemistry - Crystal structure analysis, phase diagrams, metabolic modeling, computational chemistry
- 🌌 Physics & Astronomy - Astronomical data analysis, coordinate transformations, cosmological calculations, symbolic mathematics, physics computations
- ⚙️ Engineering & Simulation - Discrete-event simulation, multi-objective optimization, metabolic engineering, systems modeling, process optimization
- 📊 Data Analysis & Visualization - Statistical analysis, network analysis, time series, publication-quality figures, large-scale data processing
- 🧪 Laboratory Automation - Liquid handling protocols, lab equipment control, workflow automation, LIMS integration
- 📚 Scientific Communication - Literature review, peer review, scientific writing, document processing, publication workflows
- 🔬 Multi-omics & Systems Biology - Multi-modal data integration, pathway analysis, network biology, systems-level insights
- 🧬 Protein Engineering & Design - Protein language models, structure prediction, sequence design, function annotation
**Transform Claude Code into an 'AI Scientist' on your desktop!**
> 💼 For substantially more advanced capabilities, compute infrastructure, and enterprise-ready offerings, check out [k-dense.ai](https://k-dense.ai/).
> ⭐ **If you find this repository useful**, please consider giving it a star! It helps others discover these tools and encourages us to continue maintaining and expanding this collection.
---
## 📦 What's Included
This repository provides **127+ scientific skills** organized into the following categories:
- **26+ Scientific Databases** - Direct API access to OpenAlex, PubMed, ChEMBL, UniProt, COSMIC, ClinicalTrials.gov, and more
- **54+ Python Packages** - RDKit, Scanpy, PyTorch Lightning, scikit-learn, BioPython, PennyLane, Qiskit, and others
- **15+ Scientific Integrations** - Benchling, DNAnexus, LatchBio, OMERO, Protocols.io, and more
- **20+ Analysis & Communication Tools** - Literature review, scientific writing, peer review, document processing
Each skill includes:
- ✅ Comprehensive documentation (`SKILL.md`)
- ✅ Practical code examples
- ✅ Use cases and best practices
- ✅ Integration guides
- ✅ Reference materials
---
## 📋 Table of Contents
- [What's Included](#what-s-included)
- [What's Included](#whats-included)
- [Why Use This?](#why-use-this)
- [Getting Started](#getting-started)
- [Claude Code](#claude-code)
- [Any MCP Client](#any-mcp-client-including-chatgpt-cursor-google-adk-openai-agent-sdk-etc)
- [Claude Code](#claude-code-recommended)
- [Cursor IDE](#cursor-ide)
- [Any MCP Client](#any-mcp-client)
- [Prerequisites](#prerequisites)
- [Quick Examples](#quick-examples)
- [Use Cases](#use-cases)
@@ -36,274 +64,376 @@ These skills enable Claude to work with specialized scientific libraries and dat
- [Troubleshooting](#troubleshooting)
- [FAQ](#faq)
- [Support](#support)
- [Join Our Community](#join-our-community)
- [Citation](#citation)
- [License](#license)
---
## 📦 What's Included
| Category | Count | Description |
|----------|-------|-------------|
| 📊 **Scientific Databases** | 24 | PubMed, PubChem, UniProt, ChEMBL, COSMIC, AlphaFold DB, and more |
| 🔬 **Scientific Packages** | 41 | BioPython, RDKit, PyTorch, Scanpy, and specialized tools |
| 🔌 **Scientific Integrations** | 6 | Benchling, DNAnexus, Opentrons, LabArchives, LatchBio, OMERO |
| 🎯 **Context Initialization** | 1 | Auto-invoked skill to ensure Claude uses existing skills effectively |
| 📚 **Documented Workflows** | 122 | Ready-to-use examples and reference materials |
---
## 🚀 Why Use This?
**Save Time** - Skip days of API documentation research and integration work
**Best Practices** - Curated workflows following scientific computing standards
**Production Ready** - Tested and validated code examples
**Regular Updates** - Maintained and expanded by K-Dense team
**Comprehensive** - Coverage across major scientific domains
**Enterprise Support** - Commercial offerings available for advanced needs
### ⚡ **Accelerate Your Research**
- **Save Days of Work** - Skip API documentation research and integration setup
- **Production-Ready Code** - Tested, validated examples following scientific best practices
- **Multi-Step Workflows** - Execute complex pipelines with a single prompt
### 🎯 **Comprehensive Coverage**
- **127+ Skills** - Extensive coverage across all major scientific domains
- **26+ Databases** - Direct access to OpenAlex, PubMed, ChEMBL, UniProt, COSMIC, and more
- **54+ Python Packages** - RDKit, Scanpy, PyTorch Lightning, scikit-learn, PennyLane, Qiskit, and others
### 🔧 **Easy Integration**
- **One-Click Setup** - Install via Claude Code or MCP server
- **Automatic Discovery** - Claude automatically finds and uses relevant skills
- **Well Documented** - Each skill includes examples, use cases, and best practices
### 🌟 **Maintained & Supported**
- **Regular Updates** - Continuously maintained and expanded by K-Dense team
- **Community Driven** - Open source with active community contributions
- **Enterprise Ready** - Commercial support available for advanced needs
---
## 🎯 Getting Started
### Claude Code
Register this repository as a Claude Code Plugin marketplace by running:
Choose your preferred platform to get started:
### 🖥️ Claude Code (Recommended)
> 📚 **New to Claude Code?** Check out the [Claude Code Quickstart Guide](https://docs.claude.com/en/docs/claude-code/quickstart) to get started.
**Step 1: Install Claude Code**
**macOS:**
```bash
curl -fsSL https://claude.ai/install.sh | bash
```
**Windows:**
```powershell
irm https://claude.ai/install.ps1 | iex
```
**Step 2: Register the Marketplace**
```bash
/plugin marketplace add K-Dense-AI/claude-scientific-skills
```
Then, to install a specific set of skills:
**Step 3: Install Skills**
1. Select **Browse and install plugins**
2. Select **claude-scientific-skills**
3. Choose from:
- `scientific-databases` - Access to 24 scientific databases
- `scientific-packages` - 40 specialized Python packages
- `scientific-thinking` - Analysis tools and document processing
- `scientific-integrations` - Lab automation and platform integrations
- `scientific-context-initialization` - Ensures Claude searches for and uses existing skills
4. Select **Install now**
1. Open Claude Code
2. Select **Browse and install plugins**
3. Choose **claude-scientific-skills**
4. Select **scientific-skills**
5. Click **Install now**
After installation, simply mention the skill or describe your task - Claude Code will automatically use the appropriate skills!
**That's it!** Claude will automatically use the appropriate skills when you describe your scientific tasks. Make sure to keep the plugin up to date!
> 💡 **Tip**: If you find that Claude isn't utilizing the installed skills as much as you'd like, install the `scientific-context-initialization` skill. It automatically creates/updates an `AGENTS.md` file in your workspace that instructs Claude to always search for and use existing skills before attempting any scientific task. This ensures Claude leverages documented patterns, authentication methods, working examples, and best practices from the repository.
---
### Any MCP Client (including ChatGPT, Cursor, Google ADK, OpenAI Agent SDK, etc.)
Use our newly released MCP server that allows you to use any Claude Skill in any client!
### ⌨️ Cursor IDE
🔗 **[claude-skills-mcp](https://github.com/K-Dense-AI/claude-skills-mcp)**
One-click installation via our hosted MCP server:
<a href="https://cursor.com/en-US/install-mcp?name=claude-scientific-skills&config=eyJ1cmwiOiJodHRwczovL21jcC5rLWRlbnNlLmFpL2NsYXVkZS1zY2llbnRpZmljLXNraWxscy9tY3AifQ%3D%3D">
<picture>
<source srcset="https://cursor.com/deeplink/mcp-install-light.svg" media="(prefers-color-scheme: dark)">
<source srcset="https://cursor.com/deeplink/mcp-install-dark.svg" media="(prefers-color-scheme: light)">
<img src="https://cursor.com/deeplink/mcp-install-dark.svg" alt="Install MCP Server" style="height:2.7em;"/>
</picture>
</a>
---
### 🔌 Any MCP Client
Access all skills via our MCP server in any MCP-compatible client (ChatGPT, Google ADK, OpenAI Agent SDK, etc.):
**Option 1: Hosted MCP Server** (Easiest)
```
https://mcp.k-dense.ai/claude-scientific-skills/mcp
```
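For clients that are configured through a JSON settings file rather than a one-click link, an entry pointing at the hosted server might look roughly like the following. This is a sketch: only the URL comes from above, and the surrounding key names (`mcpServers`, the server label) are assumptions that vary by client.

```json
{
  "mcpServers": {
    "claude-scientific-skills": {
      "url": "https://mcp.k-dense.ai/claude-scientific-skills/mcp"
    }
  }
}
```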
**Option 2: Self-Hosted** (More Control)
🔗 **[claude-skills-mcp](https://github.com/K-Dense-AI/claude-skills-mcp)** - Deploy your own MCP server
---
## ⚙️ Prerequisites
- **Python**: 3.8+ (3.10+ recommended for best compatibility)
- **Claude Code**: Latest version or any MCP-compatible client
- **Python**: 3.9+ (3.12+ recommended for best compatibility)
- **uv**: Python package manager (required for installing skill dependencies)
- **Client**: Claude Code, Cursor, or any MCP-compatible client
- **System**: macOS, Linux, or Windows with WSL2
- **Dependencies**: Automatically handled by individual skills (check `SKILL.md` files for specific requirements)
### Installing uv
The skills use `uv` as the package manager for installing Python dependencies. Install it using the instructions for your operating system:
**macOS and Linux:**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
**Windows:**
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
**Alternative (via pip):**
```bash
pip install uv
```
After installation, verify it works by running:
```bash
uv --version
```
For more installation options and details, visit the [official uv documentation](https://docs.astral.sh/uv/).
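With `uv` on your PATH, individual skill dependencies are then installed with `uv pip install` (see the Troubleshooting section below). For instance, using package names that appear elsewhere in this README purely as examples:

```bash
# Illustrative only; run inside an activated virtual environment.
# The package names are examples drawn from the skills listed in this README.
uv pip install rdkit scanpy biopython
```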
---
## 💡 Quick Examples
Once you've installed the skills, you can ask Claude:
Once you've installed the skills, you can ask Claude to execute complex multi-step scientific workflows. Here are some example prompts:
### Cheminformatics
### 🧪 Drug Discovery Pipeline
**Goal**: Find novel EGFR inhibitors for lung cancer treatment
**Prompt**:
```
"Use PubChem to find information about aspirin and calculate its molecular properties"
Use available skills you have access to whenever possible. Query ChEMBL for EGFR inhibitors (IC50 < 50nM), analyze structure-activity relationships
with RDKit, generate improved analogs with datamol, perform virtual screening with DiffDock
against AlphaFold EGFR structure, search PubMed for resistance mechanisms, check COSMIC for
mutations, and create visualizations and a comprehensive report.
```
### Bioinformatics
**Skills Used**: ChEMBL, RDKit, datamol, DiffDock, AlphaFold DB, PubMed, COSMIC, scientific visualization
---
### 🔬 Single-Cell RNA-seq Analysis
**Goal**: Comprehensive analysis of 10X Genomics data with public data integration
**Prompt**:
```
"Analyze this protein sequence using BioPython and predict its secondary structure"
Use available skills you have access to whenever possible. Load 10X dataset with Scanpy, perform QC and doublet removal, integrate with Cellxgene
Census data, identify cell types using NCBI Gene markers, run differential expression with
PyDESeq2, infer gene regulatory networks with Arboreto, enrich pathways via Reactome/KEGG,
and identify therapeutic targets with Open Targets.
```
### Data Analysis
**Skills Used**: Scanpy, Cellxgene Census, NCBI Gene, PyDESeq2, Arboreto, Reactome, KEGG, Open Targets
---
### 🧬 Multi-Omics Biomarker Discovery
**Goal**: Integrate RNA-seq, proteomics, and metabolomics to predict patient outcomes
**Prompt**:
```
"Perform exploratory data analysis on this RNA-seq dataset and create publication-quality plots"
Use available skills you have access to whenever possible. Analyze RNA-seq with PyDESeq2, process mass spec with pyOpenMS, integrate metabolites from
HMDB/Metabolomics Workbench, map proteins to pathways (UniProt/KEGG), find interactions via
STRING, correlate omics layers with statsmodels, build predictive model with scikit-learn,
and search ClinicalTrials.gov for relevant trials.
```
### Drug Discovery
**Skills Used**: PyDESeq2, pyOpenMS, HMDB, Metabolomics Workbench, UniProt, KEGG, STRING, statsmodels, scikit-learn, ClinicalTrials.gov
---
### 🎯 Virtual Screening Campaign
**Goal**: Discover allosteric modulators for protein-protein interactions
**Prompt**:
```
"Search ChEMBL for kinase inhibitors with IC50 < 100nM and visualize their structures"
Use available skills you have access to whenever possible. Retrieve AlphaFold structures, identify interaction interface with BioPython, search ZINC
for allosteric candidates (MW 300-500, logP 2-4), filter with RDKit, dock with DiffDock,
rank with DeepChem, check PubChem suppliers, search USPTO patents, and optimize leads with
MedChem/molfeat.
```
### Literature Review
**Skills Used**: AlphaFold DB, BioPython, ZINC, RDKit, DiffDock, DeepChem, PubChem, USPTO, MedChem, molfeat
---
### 🏥 Clinical Variant Interpretation
**Goal**: Analyze VCF file for hereditary cancer risk assessment
**Prompt**:
```
"Search PubMed for recent papers on CRISPR-Cas9 applications in cancer therapy"
Use available skills you have access to whenever possible. Parse VCF with pysam, annotate variants with Ensembl VEP, query ClinVar for pathogenicity,
check COSMIC for cancer mutations, retrieve gene info from NCBI Gene, analyze protein impact
with UniProt, search PubMed for case reports, check ClinPGx for pharmacogenomics, generate
clinical report with ReportLab, and find matching trials on ClinicalTrials.gov.
```
### Protein Structure
**Skills Used**: pysam, Ensembl, ClinVar, COSMIC, NCBI Gene, UniProt, PubMed, ClinPGx, ReportLab, ClinicalTrials.gov
---
### 🌐 Systems Biology Network Analysis
**Goal**: Analyze gene regulatory networks from RNA-seq data
**Prompt**:
```
"Retrieve the AlphaFold structure prediction for human p53 and analyze confidence scores"
Use available skills you have access to whenever possible. Query NCBI Gene for annotations, retrieve sequences from UniProt, identify interactions via
STRING, map to Reactome/KEGG pathways, analyze topology with Torch Geometric, reconstruct
GRNs with Arboreto, assess druggability with Open Targets, model with PyMC, visualize
networks, and search GEO for similar patterns.
```
**Skills Used**: NCBI Gene, UniProt, STRING, Reactome, KEGG, Torch Geometric, Arboreto, Open Targets, PyMC, GEO
> 📖 **Want more examples?** Check out [docs/examples.md](docs/examples.md) for comprehensive workflow examples and detailed use cases across all scientific domains.
---
## 🔬 Use Cases
### Drug Discovery Research
- Screen compound libraries from PubChem and ZINC
- Analyze bioactivity data from ChEMBL
- Predict molecular properties with RDKit and DeepChem
- Perform molecular docking with DiffDock
### 🧪 Drug Discovery & Medicinal Chemistry
- **Virtual Screening**: Screen millions of compounds from PubChem/ZINC against protein targets
- **Lead Optimization**: Analyze structure-activity relationships with RDKit, generate analogs with datamol
- **ADMET Prediction**: Predict absorption, distribution, metabolism, excretion, and toxicity with DeepChem
- **Molecular Docking**: Predict binding poses and affinities with DiffDock
- **Bioactivity Mining**: Query ChEMBL for known inhibitors and analyze SAR patterns
### Bioinformatics Analysis
- Process genomic sequences with BioPython
- Analyze single-cell RNA-seq data with Scanpy
- Query gene information from Ensembl and NCBI Gene
- Identify protein-protein interactions via STRING
### 🧬 Bioinformatics & Genomics
- **Sequence Analysis**: Process DNA/RNA/protein sequences with BioPython and pysam
- **Single-Cell Analysis**: Analyze 10X Genomics data with Scanpy, identify cell types, infer GRNs with Arboreto
- **Variant Annotation**: Annotate VCF files with Ensembl VEP, query ClinVar for pathogenicity
- **Gene Discovery**: Query NCBI Gene, UniProt, and Ensembl for comprehensive gene information
- **Network Analysis**: Identify protein-protein interactions via STRING, map to pathways (KEGG, Reactome)
### Materials Science
- Analyze crystal structures with Pymatgen
- Predict material properties
- Design novel compounds and materials
### 🏥 Clinical Research & Precision Medicine
- **Clinical Trials**: Search ClinicalTrials.gov for relevant studies, analyze eligibility criteria
- **Variant Interpretation**: Annotate variants with ClinVar, COSMIC, and ClinPGx for pharmacogenomics
- **Drug Safety**: Query FDA databases for adverse events, drug interactions, and recalls
- **Precision Therapeutics**: Match patient variants to targeted therapies and clinical trials
### Clinical Research
- Search clinical trials on ClinicalTrials.gov
- Analyze genetic variants in ClinVar
- Review pharmacogenomic data from ClinPGx
- Access cancer mutations from COSMIC
### 🔬 Multi-Omics & Systems Biology
- **Multi-Omics Integration**: Combine RNA-seq, proteomics, and metabolomics data
- **Pathway Analysis**: Enrich differentially expressed genes in KEGG/Reactome pathways
- **Network Biology**: Reconstruct gene regulatory networks, identify hub genes
- **Biomarker Discovery**: Integrate multi-omics layers to predict patient outcomes
### Academic Research
- Literature searches via PubMed
- Patent landscape analysis using USPTO
- Data visualization for publications
- Statistical analysis and hypothesis testing
### 📊 Data Analysis & Visualization
- **Statistical Analysis**: Perform hypothesis testing, power analysis, and experimental design
- **Publication Figures**: Create publication-quality visualizations with matplotlib and seaborn
- **Network Visualization**: Visualize biological networks with NetworkX
- **Report Generation**: Generate comprehensive PDF reports with ReportLab
### 🧪 Laboratory Automation
- **Protocol Design**: Create Opentrons protocols for automated liquid handling
- **LIMS Integration**: Integrate with Benchling and LabArchives for data management
- **Workflow Automation**: Automate multi-step laboratory workflows
---
## 📚 Available Skills
### 🗄️ Scientific Databases
**24 comprehensive databases** including PubMed, PubChem, UniProt, ChEMBL, AlphaFold DB, COSMIC, Ensembl, KEGG, and more.
This repository contains **127+ scientific skills** organized across multiple domains. Each skill provides comprehensive documentation, code examples, and best practices for working with scientific libraries, databases, and tools.
📖 **[Full Database Documentation →](docs/scientific-databases.md)**
### Skill Categories
<details>
<summary><strong>View all databases</strong></summary>
#### 🧬 **Bioinformatics & Genomics** (15+ skills)
- Sequence analysis: BioPython, pysam, scikit-bio
- Single-cell analysis: Scanpy, AnnData, scvi-tools, Arboreto, Cellxgene Census
- Genomic tools: gget, geniml, gtars, deepTools, FlowIO, Zarr
- Phylogenetics: ETE Toolkit
- **AlphaFold DB** - AI-predicted protein structures (200M+ predictions)
- **ChEMBL** - Bioactive molecules and drug-like properties
- **ClinPGx** - Clinical pharmacogenomics and gene-drug interactions
- **ClinVar** - Genomic variants and clinical significance
- **ClinicalTrials.gov** - Global clinical studies registry
- **COSMIC** - Somatic cancer mutations database
- **ENA** - European Nucleotide Archive
- **Ensembl** - Genome browser and annotations
- **FDA Databases** - Drug approvals, adverse events, recalls
- **GEO** - Gene expression and functional genomics
- **GWAS Catalog** - Genome-wide association studies
- **HMDB** - Human metabolome database
- **KEGG** - Biological pathways and molecular interactions
- **Metabolomics Workbench** - NIH metabolomics data
- **NCBI Gene** - Gene information and annotations
- **Open Targets** - Therapeutic target identification
- **PDB** - Protein structure database
- **PubChem** - Chemical compound data (110M+ compounds)
- **PubMed** - Biomedical literature database
- **Reactome** - Curated biological pathways
- **STRING** - Protein-protein interaction networks
- **UniProt** - Protein sequences and annotations
- **USPTO** - Patent and trademark data
- **ZINC** - Commercially-available compounds for screening
#### 🧪 **Cheminformatics & Drug Discovery** (10+ skills)
- Molecular manipulation: RDKit, Datamol, Molfeat
- Deep learning: DeepChem, TorchDrug
- Docking & screening: DiffDock
- Drug-likeness: MedChem
- Benchmarks: PyTDC
</details>
#### 🔬 **Proteomics & Mass Spectrometry** (2 skills)
- Spectral processing: matchms, pyOpenMS
---
#### 🏥 **Clinical Research & Precision Medicine** (8+ skills)
- Clinical databases: ClinicalTrials.gov, ClinVar, ClinPGx, COSMIC, FDA Databases
- Healthcare AI: PyHealth, NeuroKit2
- Variant analysis: Ensembl, NCBI Gene
### 🔬 Scientific Packages
**41 specialized Python packages** organized by domain.
#### 🖼️ **Medical Imaging & Digital Pathology** (3 skills)
- DICOM processing: pydicom
- Whole slide imaging: histolab, PathML
📖 **[Full Package Documentation →](docs/scientific-packages.md)**
#### 🤖 **Machine Learning & AI** (15+ skills)
- Deep learning: PyTorch Lightning, Transformers, Stable Baselines3, PufferLib
- Classical ML: scikit-learn, scikit-survival, SHAP
- Time series: aeon
- Bayesian methods: PyMC
- Optimization: PyMOO
- Graph ML: Torch Geometric
- Dimensionality reduction: UMAP-learn
- Statistical modeling: statsmodels
<details>
<summary><strong>Bioinformatics & Genomics (11 packages)</strong></summary>
#### 🔮 **Materials Science, Chemistry & Physics** (7 skills)
- Materials: Pymatgen
- Metabolic modeling: COBRApy
- Astronomy: Astropy
- Quantum computing: Cirq, PennyLane, Qiskit, QuTiP
- AnnData, Arboreto, BioPython, BioServices, Cellxgene Census
- deepTools, FlowIO, gget, pysam, PyDESeq2, Scanpy
#### ⚙️ **Engineering & Simulation** (3 skills)
- Computational fluid dynamics: FluidSim
- Discrete-event simulation: SimPy
- Data processing: Dask, Polars, Vaex
</details>
#### 📊 **Data Analysis & Visualization** (10+ skills)
- Visualization: Matplotlib, Seaborn, Plotly
- Geospatial analysis: GeoPandas
- Network analysis: NetworkX
- Symbolic math: SymPy
- PDF generation: ReportLab
- Data access: Data Commons
<details>
<summary><strong>Cheminformatics & Drug Discovery (8 packages)</strong></summary>
#### 🧪 **Laboratory Automation** (3 skills)
- Liquid handling: PyLabRobot
- Protocol management: Protocols.io
- LIMS integration: Benchling, LabArchives
- Datamol, DeepChem, DiffDock, MedChem, Molfeat, PyTDC, RDKit, TorchDrug
#### 🔬 **Multi-omics & Systems Biology** (5+ skills)
- Pathway analysis: KEGG, Reactome, STRING
- Multi-omics: BIOMNI, Denario, HypoGeniC
- Data management: LaminDB
</details>
#### 🧬 **Protein Engineering & Design** (2 skills)
- Protein language models: ESM
- Cloud laboratory platform: Adaptyv (automated protein testing and validation)
<details>
<summary><strong>Proteomics & Mass Spectrometry (2 packages)</strong></summary>
#### 📚 **Scientific Communication** (9+ skills)
- Literature: OpenAlex, PubMed, Literature Review
- Web search: Perplexity Search (AI-powered search with real-time information)
- Writing: Scientific Writing, Peer Review
- Document processing: DOCX, PDF, PPTX, XLSX, MarkItDown
- Publishing: Paper-2-Web
- matchms, pyOpenMS
#### 🔬 **Scientific Databases** (26+ skills)
- Protein: UniProt, PDB, AlphaFold DB
- Chemical: PubChem, ChEMBL, DrugBank, ZINC, HMDB
- Genomic: Ensembl, NCBI Gene, GEO, ENA, GWAS Catalog
- Clinical: ClinVar, COSMIC, ClinicalTrials.gov, ClinPGx, FDA Databases
- Pathways: KEGG, Reactome, STRING
- Targets: Open Targets
- Metabolomics: Metabolomics Workbench
- Patents: USPTO
</details>
#### 🔧 **Infrastructure & Platforms** (5+ skills)
- Cloud compute: Modal
- Genomics platforms: DNAnexus, LatchBio
- Microscopy: OMERO
- Automation: Opentrons
- Tool discovery: ToolUniverse
<details>
<summary><strong>Machine Learning & Deep Learning (8 packages)</strong></summary>
> 📖 **For complete details on all skills**, see [docs/scientific-skills.md](docs/scientific-skills.md)
- PyMC, PyMOO, PyTorch Lightning, scikit-learn, statsmodels
- Torch Geometric, Transformers, UMAP-learn
</details>
<details>
<summary><strong>Materials Science & Chemistry (3 packages)</strong></summary>
- Astropy, COBRApy, Pymatgen
</details>
<details>
<summary><strong>Data Analysis & Visualization (5 packages)</strong></summary>
- Dask, Matplotlib, Polars, ReportLab, Seaborn
</details>
<details>
<summary><strong>Additional Packages (4 packages)</strong></summary>
- BIOMNI (Multi-omics), ETE Toolkit (Phylogenetics)
- scikit-bio (Sequence analysis), Zarr (Array storage)
</details>
---
### 🧠 Scientific Thinking & Analysis
**Comprehensive analysis tools** and document processing capabilities.
📖 **[Full Thinking & Analysis Documentation →](docs/scientific-thinking.md)**
**Analysis & Methodology:**
- Exploratory Data Analysis (automated statistics and insights)
- Hypothesis Generation (structured frameworks)
- Peer Review (comprehensive evaluation toolkit)
- Scientific Brainstorming (ideation workflows)
- Scientific Critical Thinking (rigorous reasoning)
- Scientific Visualization (publication-quality figures)
- Scientific Writing (IMRAD format, citation styles)
- Statistical Analysis (testing and experimental design)
**Document Processing:**
- DOCX, PDF, PPTX, XLSX manipulation and analysis
- Tracked changes, comments, and formatting preservation
- Text extraction, table parsing, and data analysis
---
### 🔌 Scientific Integrations
**6 platform integrations** for lab automation and workflow management.
📖 **[Full Integration Documentation →](docs/scientific-integrations.md)**
- **Benchling** - R&D platform and LIMS integration
- **DNAnexus** - Cloud genomics and biomedical data analysis
- **LabArchives** - Electronic Lab Notebook (ELN) integration
- **LatchBio** - Workflow platform and cloud execution
- **OMERO** - Microscopy and bio-image data management
- **Opentrons** - Laboratory automation protocols
> 💡 **Looking for practical examples?** Check out [docs/examples.md](docs/examples.md) for comprehensive workflow examples across all scientific domains.
---
@@ -357,26 +487,19 @@ Contributors are recognized in our community and may be featured in:
Your contributions help make scientific computing more accessible and enable researchers to leverage AI tools more effectively!
📖 **[Contributing Guidelines →](CONTRIBUTING.md)** *(coming soon)*
---
## 🔧 Troubleshooting
### Common Issues
**Problem: Claude not using installed skills**
- Solution: Install the `scientific-context-initialization` skill
- This creates an `AGENTS.md` file that instructs Claude to search for and use existing skills before attempting tasks
- After installation, Claude will automatically leverage documented patterns, examples, and best practices
**Problem: Skills not loading in Claude Code**
- Solution: Ensure you've installed the latest version of Claude Code
- Try reinstalling the plugin: `/plugin marketplace add K-Dense-AI/claude-scientific-skills`
**Problem: Missing Python dependencies**
- Solution: Check the specific `SKILL.md` file for required packages
- Install dependencies: `pip install package-name`
- Install dependencies: `uv pip install package-name`
**Problem: API rate limits**
- Solution: Many databases have rate limits. Review the specific database documentation
@@ -394,29 +517,41 @@ Your contributions help make scientific computing more accessible and enable res
## ❓ FAQ
### General Questions
**Q: Is this free to use?**
A: Yes! This project is MIT licensed, allowing free use for any purpose including commercial projects.
**Q: Why are all skills grouped into one plugin instead of separate plugins?**
A: We believe good science in the age of AI is inherently interdisciplinary. Bundling all skills into a single plugin makes it trivial for you (and Claude) to bridge across fields—e.g., combining genomics, cheminformatics, clinical data, and machine learning in one workflow—without worrying about which individual skills to install or wire together.
**Q: Can I use this for commercial projects?**
A: Absolutely! The MIT License allows both commercial and noncommercial use without restrictions.
**Q: How often is this updated?**
A: We regularly update skills to reflect the latest versions of packages and APIs. Major updates are announced in release notes.
**Q: Can I use this with other AI models?**
A: The skills are optimized for Claude but can be adapted for other models with MCP support. The MCP server works with any MCP-compatible client.
### Installation & Setup
**Q: Do I need all the Python packages installed?**
A: No! Only install the packages you need. Each skill specifies its requirements in its `SKILL.md` file.
**Q: What if a skill doesn't work?**
A: First check the [Troubleshooting](#troubleshooting) section. If the issue persists, file an issue on GitHub with detailed reproduction steps.
**Q: Do the skills work offline?**
A: Database skills require internet access to query APIs. Package skills work offline once Python dependencies are installed.
### Contributing
**Q: Can I contribute my own skills?**
A: Absolutely! We welcome contributions. See the [Contributing](#contributing) section for guidelines and best practices.
**Q: How do I report bugs or suggest features?**
A: Open an issue on GitHub with a clear description. For bugs, include reproduction steps and expected vs actual behavior.
---
@@ -428,19 +563,73 @@ Need help? Here's how to get support:
- 🐛 **Bug Reports**: [Open an issue](https://github.com/K-Dense-AI/claude-scientific-skills/issues)
- 💡 **Feature Requests**: [Submit a feature request](https://github.com/K-Dense-AI/claude-scientific-skills/issues/new)
- 💼 **Enterprise Support**: Contact [K-Dense](https://k-dense.ai/) for commercial support
- 🌐 **MCP Support**: Visit the [claude-skills-mcp](https://github.com/K-Dense-AI/claude-skills-mcp) repository or use our hosted MCP server
---
## 🎉 Join Our Community!
**We'd love to have you join us!** 🚀
Connect with other scientists, researchers, and AI enthusiasts using Claude for scientific computing. Share your discoveries, ask questions, get help with your projects, and collaborate with the community!
🌟 **[Join our Slack Community](https://join.slack.com/t/k-densecommunity/shared_invite/zt-3iajtyls1-EwmkwIZk0g_o74311Tkf5g)** 🌟
Whether you're just getting started or you're a power user, our community is here to support you. We share tips, troubleshoot issues together, showcase cool projects, and discuss the latest developments in AI-powered scientific research.
**See you there!** 💬
---
## 📖 Citation
If you use Claude Scientific Skills in your research or project, please cite it as:
### BibTeX
```bibtex
@software{claude_scientific_skills_2025,
  author = {{K-Dense Inc.}},
  title = {Claude Scientific Skills: A Comprehensive Collection of Scientific Tools for Claude AI},
  year = {2025},
  url = {https://github.com/K-Dense-AI/claude-scientific-skills},
  note = {skills covering databases, packages, integrations, and analysis tools}
}
```
### APA
```
K-Dense Inc. (2025). Claude Scientific Skills: A comprehensive collection of scientific tools for Claude AI [Computer software]. https://github.com/K-Dense-AI/claude-scientific-skills
```
### MLA
```
K-Dense Inc. Claude Scientific Skills: A Comprehensive Collection of Scientific Tools for Claude AI. 2025, github.com/K-Dense-AI/claude-scientific-skills.
```
### Plain Text
```
Claude Scientific Skills by K-Dense Inc. (2025)
Available at: https://github.com/K-Dense-AI/claude-scientific-skills
```
We appreciate acknowledgment in publications, presentations, or projects that benefit from these skills!
---
## 📄 License
This project is licensed under the **MIT License**.
**Copyright © 2025 K-Dense Inc.** ([k-dense.ai](https://k-dense.ai/))
### Key Points:
- ✅ **Free for any use** (commercial and noncommercial)
- ✅ **Open source** - modify, distribute, and use freely
- ✅ **Permissive** - minimal restrictions on reuse
- ⚠️ **No warranty** - provided "as is" without warranty of any kind
See [LICENSE.md](LICENSE.md) for full terms.
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=K-Dense-AI/claude-scientific-skills&type=date&legend=top-left)](https://www.star-history.com/#K-Dense-AI/claude-scientific-skills&type=date&legend=top-left)

`docs/examples.md` (new file, 2521 lines; diff suppressed because it is too large to display)
@@ -1,28 +0,0 @@
# Scientific Databases
- **AlphaFold DB** - AI-predicted protein structure database with 200M+ predictions, confidence metrics (pLDDT, PAE), and Google Cloud bulk access
- **ChEMBL** - Bioactive molecule database with drug-like properties (2M+ compounds, 19M+ activities, 13K+ targets)
- **ClinPGx** - Clinical pharmacogenomics database (successor to PharmGKB) providing gene-drug interactions, CPIC clinical guidelines, allele functions, drug labels, and pharmacogenomic annotations for precision medicine and personalized pharmacotherapy (consolidates PharmGKB, CPIC, and PharmCAT resources)
- **ClinVar** - NCBI's public archive of genomic variants and their clinical significance with standardized classifications (pathogenic, benign, VUS), E-utilities API access, and bulk FTP downloads for variant interpretation and precision medicine research
- **ClinicalTrials.gov** - Comprehensive registry of clinical studies conducted worldwide (maintained by U.S. National Library of Medicine) with API v2 access for searching trials by condition, intervention, location, sponsor, study status, and phase; retrieve detailed trial information including eligibility criteria, outcomes, contacts, and locations; export to CSV/JSON formats for analysis (public API, no authentication required, ~50 req/min rate limit)
- **COSMIC** - Catalogue of Somatic Mutations in Cancer, the world's largest database of somatic cancer mutations (millions of mutations across thousands of cancer types, Cancer Gene Census, mutational signatures, structural variants, and drug resistance data)
- **ENA (European Nucleotide Archive)** - Comprehensive public repository for nucleotide sequence data and metadata with REST APIs for accessing sequences, assemblies, samples, studies, and reads; supports advanced search, taxonomy lookups, and bulk downloads via FTP/Aspera (rate limit: 50 req/sec)
- **Ensembl** - Genome browser and bioinformatics database providing genomic annotations, sequences, variants, and comparative genomics data for 250+ vertebrate species (Release 115, 2025) with comprehensive REST API for gene lookups, sequence retrieval, variant effect prediction (VEP), ortholog finding, assembly mapping (GRCh37/GRCh38), and region analysis
- **FDA Databases** - Comprehensive access to all FDA (Food and Drug Administration) regulatory databases through openFDA API covering drugs (adverse events, labeling, NDC, recalls, approvals, shortages), medical devices (adverse events, 510k clearances, PMA, UDI, classifications), foods (recalls, adverse events, allergen tracking), animal/veterinary medicines (species-specific adverse events), and substances (UNII/CAS lookup, chemical structures, molecular data) for drug safety research, pharmacovigilance, regulatory compliance, and scientific analysis
- **GEO (Gene Expression Omnibus)** - High-throughput gene expression and functional genomics data repository (264K+ studies, 8M+ samples) with microarray, RNA-seq, and expression profile access
- **GWAS Catalog** - NHGRI-EBI catalog of published genome-wide association studies with curated SNP-trait associations (thousands of studies, genome-wide significant associations p≤5×10⁻⁸), full summary statistics, REST API access for variant/trait/gene queries, and FTP downloads for genetic epidemiology and precision medicine research
- **HMDB (Human Metabolome Database)** - Comprehensive metabolomics resource with 220K+ metabolite entries, detailed chemical/biological data, concentration ranges, disease associations, pathways, and spectral data for metabolite identification and biomarker discovery
- **KEGG** - Kyoto Encyclopedia of Genes and Genomes for biological pathway analysis, gene-to-pathway mapping, compound searches, and molecular interaction networks (pathway enrichment, metabolic pathways, gene annotations, drug-drug interactions, ID conversion)
- **Metabolomics Workbench** - NIH Common Fund metabolomics data repository with 4,200+ processed studies, standardized nomenclature (RefMet), mass spectrometry searches, and comprehensive REST API for accessing metabolite structures, study metadata, experimental results, and gene/protein-metabolite associations
- **Open Targets** - Comprehensive therapeutic target identification and validation platform integrating genetics, omics, and chemical data (200M+ evidence strings, target-disease associations with scoring, tractability assessments, safety liabilities, known drugs from ChEMBL, GraphQL API) for drug target discovery, prioritization, evidence evaluation, drug repurposing, competitive intelligence, and mechanism research
- **NCBI Gene** - Work with NCBI Gene database to search, retrieve, and analyze gene information including nomenclature, sequences, variations, phenotypes, and pathways using E-utilities and Datasets API
- **Protein Data Bank (PDB)** - Access 3D structural data of proteins, nucleic acids, and biological macromolecules (200K+ structures) with search, retrieval, and analysis capabilities
- **PubChem** - Access chemical compound data from the world's largest free chemical database (110M+ compounds, 270M+ bioactivities)
- **PubMed** - Access to PubMed literature database with advanced search capabilities
- **Reactome** - Curated pathway database for biological processes and molecular interactions (2,825+ human pathways, 16K+ reactions, 11K+ proteins) with pathway enrichment analysis, expression data analysis, and species comparison using Content Service and Analysis Service APIs
- **STRING** - Protein-protein interaction network database (5000+ genomes, 59.3M proteins, 20B+ interactions) with functional enrichment analysis, interaction partner discovery, and network visualization from experimental data, computational prediction, and text-mining
- **UniProt** - Universal Protein Resource for protein sequences, annotations, and functional information (UniProtKB/Swiss-Prot reviewed entries, TrEMBL unreviewed entries) with REST API access for search, retrieval, ID mapping, and batch operations across 200+ databases
- **USPTO** - United States Patent and Trademark Office data access including patent searches, trademark lookups, patent examination history (PEDS), office actions, assignments, citations, and litigation records; supports PatentSearch API (ElasticSearch-based patent search), TSDR (Trademark Status & Document Retrieval), Patent/Trademark Assignment APIs, and additional specialized APIs for comprehensive IP analysis
- **ZINC** - Free database of commercially-available compounds for virtual screening and drug discovery (230M+ purchasable compounds in ready-to-dock 3D formats)


@@ -1,21 +0,0 @@
# Scientific Integrations
## Laboratory Information Management Systems (LIMS) & R&D Platforms
- **Benchling Integration** - Toolkit for integrating with Benchling's R&D platform, providing programmatic access to laboratory data management including registry entities (DNA sequences, proteins), inventory systems (samples, containers, locations), electronic lab notebooks (entries, protocols), workflows (tasks, automation), and data exports using Python SDK and REST API
## Cloud Platforms for Genomics & Biomedical Data
- **DNAnexus Integration** - Comprehensive toolkit for working with the DNAnexus cloud platform for genomics and biomedical data analysis. Covers building and deploying apps/applets (Python/Bash), managing data objects (files, records, databases), running analyses and workflows, using the dxpy Python SDK, and configuring app metadata and dependencies (dxapp.json setup, system packages, Docker, assets). Enables processing of FASTQ/BAM/VCF files, bioinformatics pipelines, job execution, workflow orchestration, and platform operations including project management and permissions
## Laboratory Automation
- **Opentrons Integration** - Toolkit for creating, editing, and debugging Opentrons Python Protocol API v2 protocols for laboratory automation using Flex and OT-2 robots. Enables automated liquid handling, pipetting workflows, hardware module control (thermocycler, temperature, magnetic, heater-shaker, absorbance plate reader), labware management, and complex protocol development for biological and chemical experiments
## Electronic Lab Notebooks (ELN)
- **LabArchives Integration** - Toolkit for interacting with LabArchives Electronic Lab Notebook (ELN) REST API. Provides programmatic access to notebooks (backup, retrieval, management), entries (creation, comments, attachments), user authentication, site reports and analytics, and third-party integrations (Protocols.io, GraphPad Prism, SnapGene, Geneious, Jupyter, REDCap). Includes Python scripts for configuration setup, notebook operations, and entry management. Supports multi-regional API endpoints (US, UK, Australia) and OAuth authentication
## Workflow Platforms & Cloud Execution
- **LatchBio Integration** - Integration with the Latch platform for building, deploying, and executing bioinformatics workflows. Provides comprehensive support for creating serverless bioinformatics pipelines using Python decorators, deploying Nextflow/Snakemake pipelines, managing cloud data (LatchFile, LatchDir) and structured Registry (Projects, Tables, Records), configuring computational resources (CPU, GPU, memory, storage), and using pre-built Latch Verified workflows (RNA-seq, AlphaFold, DESeq2, single-cell analysis, CRISPR editing). Enables automatic containerization, UI generation, workflow versioning, and execution on scalable cloud infrastructure with comprehensive data management
## Microscopy & Bio-image Data
- **OMERO Integration** - Toolkit for interacting with OMERO microscopy data management systems using Python. Provides comprehensive access to microscopy images stored in OMERO servers, including dataset and screening data retrieval, pixel data analysis, annotation and metadata management, regions of interest (ROIs) creation and analysis, batch processing, OMERO.scripts development, and OMERO.tables for structured data storage. Essential for researchers working with high-content screening data, multi-dimensional microscopy datasets, or collaborative image repositories


@@ -1,62 +0,0 @@
# Scientific Packages
## Bioinformatics & Genomics
- **AnnData** - Annotated data matrices for single-cell genomics and h5ad files
- **Arboreto** - Gene regulatory network inference using GRNBoost2 and GENIE3
- **BioPython** - Sequence manipulation, NCBI database access, BLAST searches, alignments, and phylogenetics
- **BioServices** - Programmatic access to 40+ biological web services (KEGG, UniProt, ChEBI, ChEMBL)
- **Cellxgene Census** - Query and analyze large-scale single-cell RNA-seq data
- **gget** - Efficient genomic database queries (Ensembl, UniProt, NCBI, PDB, COSMIC)
- **pysam** - Read, write, and manipulate genomic data files (SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences) with pileup analysis, coverage calculations, and bioinformatics workflows
- **PyDESeq2** - Differential gene expression analysis for bulk RNA-seq data
- **Scanpy** - Single-cell RNA-seq analysis with clustering, marker genes, and UMAP/t-SNE visualization
## Cheminformatics & Drug Discovery
- **Datamol** - Molecular manipulation and featurization with enhanced RDKit workflows
- **DeepChem** - Molecular machine learning, graph neural networks, and MoleculeNet benchmarks
- **DiffDock** - Diffusion-based molecular docking for protein-ligand binding prediction
- **MedChem** - Medicinal chemistry analysis, ADMET prediction, and drug-likeness assessment
- **Molfeat** - 100+ molecular featurizers including fingerprints, descriptors, and pretrained models
- **PyTDC** - Therapeutics Data Commons for drug discovery datasets and benchmarks
- **RDKit** - Cheminformatics toolkit for molecular I/O, descriptors, fingerprints, and SMARTS
- **TorchDrug** - PyTorch-based machine learning platform for drug discovery with 40+ datasets, 20+ GNN models for molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning
## Proteomics & Mass Spectrometry
- **matchms** - Processing and similarity matching of mass spectrometry data with 40+ filters, spectral library matching (Cosine, Modified Cosine, Neutral Losses), metadata harmonization, molecular fingerprint comparison, and support for multiple file formats (MGF, MSP, mzML, JSON)
- **pyOpenMS** - Comprehensive mass spectrometry data analysis for proteomics and metabolomics (LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines like Comet, Mascot, MSGF+)
## Machine Learning & Deep Learning
- **PyMC** - Bayesian statistical modeling and probabilistic programming
- **PyMOO** - Multi-objective optimization with evolutionary algorithms
- **PyTorch Lightning** - Deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. Automates training workflows (40+ tasks including epoch/batch iteration, optimizer steps, gradient management, checkpointing), supports multi-GPU/TPU training with DDP/FSDP/DeepSpeed strategies, includes LightningModule for model organization, Trainer for automation, LightningDataModule for data pipelines, callbacks for extensibility, and integrations with TensorBoard, Wandb, MLflow for experiment tracking
- **scikit-learn** - Machine learning algorithms, preprocessing, and model selection
- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
- **Torch Geometric** - Graph Neural Networks for molecular and geometric data
- **Transformers** - State-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Provides 1M+ pre-trained models accessible via pipelines (text-classification, NER, QA, summarization, translation, text-generation, image-classification, object-detection, ASR, VQA), comprehensive training via Trainer API with distributed training and mixed precision, flexible text generation with multiple decoding strategies (greedy, beam search, sampling), and Auto classes for automatic architecture selection (BERT, GPT, T5, ViT, BART, etc.)
- **UMAP-learn** - Dimensionality reduction and manifold learning
## Materials Science & Chemistry
- **Astropy** - Astronomy and astrophysics (coordinates, cosmology, FITS files)
- **COBRApy** - Constraint-based metabolic modeling and flux balance analysis
- **Pymatgen** - Materials structure analysis, phase diagrams, and electronic structure
## Data Analysis & Visualization
- **Dask** - Parallel computing for larger-than-memory datasets with distributed DataFrames, Arrays, Bags, and Futures
- **Matplotlib** - Publication-quality plotting and visualization
- **Polars** - High-performance DataFrame operations with lazy evaluation
- **Seaborn** - Statistical data visualization with dataset-oriented interface, automatic confidence intervals, publication-quality themes, colorblind-safe palettes, and comprehensive support for exploratory analysis, distribution comparisons, correlation matrices, regression plots, and multi-panel figures
- **ReportLab** - Programmatic PDF generation for reports and documents
## Phylogenetics & Trees
- **ETE Toolkit** - Phylogenetic tree manipulation, visualization, and analysis
## Genomics Tools
- **deepTools** - NGS data analysis (ChIP-seq, RNA-seq, ATAC-seq) with BAM/bigWig files
- **FlowIO** - Flow Cytometry Standard (FCS) file reading and manipulation
- **scikit-bio** - Bioinformatics sequence analysis and diversity metrics
- **Zarr** - Chunked, compressed N-dimensional array storage
## Multi-omics & Integration
- **BIOMNI** - Multi-omics data integration with LLM-powered analysis

`docs/scientific-skills.md` (new file, 187 lines)

@@ -0,0 +1,187 @@
# Scientific Skills
## Scientific Databases
- **AlphaFold DB** - Comprehensive AI-predicted protein structure database from DeepMind providing 200M+ high-confidence protein structure predictions covering UniProt reference proteomes and beyond. Includes confidence metrics (pLDDT for per-residue confidence, PAE for pairwise accuracy estimates), structure quality assessment, predicted aligned error matrices, and multiple structure formats (PDB, mmCIF, AlphaFold DB format). Supports programmatic access via REST API, bulk downloads through Google Cloud Storage, and integration with structural analysis tools. Enables structure-based drug discovery, protein function prediction, structural genomics, comparative modeling, and structural bioinformatics research without experimental structure determination
- **ChEMBL** - Comprehensive manually curated database of bioactive molecules with drug-like properties maintained by EMBL-EBI. Contains 2M+ unique compounds, 19M+ bioactivity measurements, 13K+ protein targets, and 1.1M+ assays from 90K+ publications. Provides detailed compound information including chemical structures (SMILES, InChI), bioactivity data (IC50, EC50, Ki, Kd values), target information (protein families, pathways), ADMET properties, drug indications, clinical trial data, and patent information. Features REST API access, web interface, downloadable data files, and integration with other databases (UniProt, PubChem, DrugBank). Use cases: drug discovery, target identification, lead optimization, bioactivity prediction, chemical biology research, and drug repurposing
- **ClinPGx** - Clinical pharmacogenomics database (successor to PharmGKB) providing gene-drug interactions, CPIC clinical guidelines, allele functions, drug labels, and pharmacogenomic annotations for precision medicine and personalized pharmacotherapy (consolidates PharmGKB, CPIC, and PharmCAT resources)
- **ClinVar** - NCBI's public archive of genomic variants and their clinical significance with standardized classifications (pathogenic, benign, VUS), E-utilities API access, and bulk FTP downloads for variant interpretation and precision medicine research
- **ClinicalTrials.gov** - Comprehensive registry of clinical studies conducted worldwide (maintained by U.S. National Library of Medicine) with API v2 access for searching trials by condition, intervention, location, sponsor, study status, and phase; retrieve detailed trial information including eligibility criteria, outcomes, contacts, and locations; export to CSV/JSON formats for analysis (public API, no authentication required, ~50 req/min rate limit)
- **COSMIC** - Catalogue of Somatic Mutations in Cancer, the world's largest database of somatic cancer mutations (millions of mutations across thousands of cancer types, Cancer Gene Census, mutational signatures, structural variants, and drug resistance data)
- **DrugBank** - Comprehensive bioinformatics and cheminformatics database containing detailed drug and drug target information (9,591+ drug entries including 2,037 FDA-approved small molecules, 241 biotech drugs, 96 nutraceuticals, 6,000+ experimental compounds) with 200+ data fields per entry covering chemical structures (SMILES, InChI), pharmacology (mechanism of action, pharmacodynamics, ADME), drug-drug interactions, protein targets (enzymes, transporters, carriers), biological pathways, external identifiers (PubChem, ChEMBL, UniProt), and physicochemical properties for drug discovery, pharmacology research, interaction analysis, target identification, chemical similarity searches, and ADMET predictions
- **ENA (European Nucleotide Archive)** - Comprehensive public repository for nucleotide sequence data and metadata with REST APIs for accessing sequences, assemblies, samples, studies, and reads; supports advanced search, taxonomy lookups, and bulk downloads via FTP/Aspera (rate limit: 50 req/sec)
- **Ensembl** - Genome browser and bioinformatics database providing genomic annotations, sequences, variants, and comparative genomics data for 250+ vertebrate species (Release 115, 2025) with comprehensive REST API for gene lookups, sequence retrieval, variant effect prediction (VEP), ortholog finding, assembly mapping (GRCh37/GRCh38), and region analysis (see the REST query sketch at the end of this section)
- **FDA Databases** - Comprehensive access to all FDA (Food and Drug Administration) regulatory databases through openFDA API covering drugs (adverse events, labeling, NDC, recalls, approvals, shortages), medical devices (adverse events, 510k clearances, PMA, UDI, classifications), foods (recalls, adverse events, allergen tracking), animal/veterinary medicines (species-specific adverse events), and substances (UNII/CAS lookup, chemical structures, molecular data) for drug safety research, pharmacovigilance, regulatory compliance, and scientific analysis
- **GEO (Gene Expression Omnibus)** - NCBI's comprehensive public repository for high-throughput gene expression and functional genomics data. Contains 264K+ studies, 8M+ samples, and petabytes of data from microarray, RNA-seq, ChIP-seq, ATAC-seq, and other high-throughput experiments. Provides standardized data submission formats (MINIML, SOFT), programmatic access via Entrez Programming Utilities (E-utilities) and GEOquery R package, bulk FTP downloads, and web-based search and retrieval. Supports data mining, meta-analysis, differential expression analysis, and cross-study comparisons. Includes curated datasets, series records with experimental design, platform annotations, and sample metadata. Use cases: gene expression analysis, biomarker discovery, disease mechanism research, drug response studies, and functional genomics research
- **GWAS Catalog** - NHGRI-EBI catalog of published genome-wide association studies with curated SNP-trait associations (thousands of studies, genome-wide significant associations p≤5×10⁻⁸), full summary statistics, REST API access for variant/trait/gene queries, and FTP downloads for genetic epidemiology and precision medicine research
- **HMDB (Human Metabolome Database)** - Comprehensive metabolomics resource with 220K+ metabolite entries, detailed chemical/biological data, concentration ranges, disease associations, pathways, and spectral data for metabolite identification and biomarker discovery
- **KEGG** - Kyoto Encyclopedia of Genes and Genomes, comprehensive database resource integrating genomic, chemical, and systemic functional information. Provides pathway databases (KEGG PATHWAY with 500+ reference pathways, metabolic pathways, signaling pathways, disease pathways), genome databases (KEGG GENES with gene catalogs from 5,000+ organisms), chemical databases (KEGG COMPOUND, KEGG DRUG, KEGG GLYCAN), and disease/drug databases (KEGG DISEASE, KEGG DRUG). Features pathway enrichment analysis, gene-to-pathway mapping, compound searches, molecular interaction networks, ortholog identification (KO - KEGG Orthology), ID conversion across databases, and visualization tools. Supports REST API access, KEGG Mapper for pathway mapping, and integration with bioinformatics tools. Use cases: pathway enrichment analysis, metabolic pathway reconstruction, drug target identification, comparative genomics, systems biology, and functional annotation of genes
- **Metabolomics Workbench** - NIH Common Fund metabolomics data repository with 4,200+ processed studies, standardized nomenclature (RefMet), mass spectrometry searches, and comprehensive REST API for accessing metabolite structures, study metadata, experimental results, and gene/protein-metabolite associations
- **OpenAlex** - Comprehensive open catalog of 240M+ scholarly works, authors, institutions, topics, sources, publishers, and funders. Provides complete bibliometric database for academic literature search, citation analysis, research trend tracking, author publication discovery, institution research output analysis, and open access paper identification. Features REST API with no authentication required (100k requests/day, 10 req/sec with email), advanced filtering (publication year, citations, open access status, topics, authors, institutions), aggregation/grouping capabilities, random sampling for research studies, batch ID lookups (DOI, ORCID, ROR, ISSN), and comprehensive metadata (titles, abstracts, citations, authorships, topics, funding). Supports literature reviews, bibliometric analysis, research output evaluation, citation network analysis, and academic database queries across all scientific domains
- **Open Targets** - Comprehensive therapeutic target identification and validation platform integrating genetics, omics, and chemical data (200M+ evidence strings, target-disease associations with scoring, tractability assessments, safety liabilities, known drugs from ChEMBL, GraphQL API) for drug target discovery, prioritization, evidence evaluation, drug repurposing, competitive intelligence, and mechanism research
- **NCBI Gene** - Comprehensive gene-specific database from NCBI providing curated information about genes from 500+ organisms. Contains gene nomenclature (official symbols, aliases, full names), genomic locations (chromosomal positions, exons, introns), sequences (genomic, mRNA, protein), gene function and phenotypes, pathways and interactions, orthologs and paralogs, variation data (SNPs, mutations), expression data, and cross-references to 200+ external databases (UniProt, Ensembl, HGNC, OMIM, Reactome). Supports programmatic access via E-utilities API (Entrez Programming Utilities) and NCBI Datasets API, bulk downloads, and web interface. Enables gene annotation, comparative genomics, variant interpretation, pathway analysis, and integration with other NCBI resources (PubMed, dbSNP, ClinVar). Use cases: gene information retrieval, variant annotation, functional genomics, disease gene discovery, and bioinformatics workflows
- **Protein Data Bank (PDB)** - Worldwide repository for 3D structural data of proteins, nucleic acids, and biological macromolecules. Contains 200K+ experimentally determined structures from X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Provides comprehensive structure information including atomic coordinates, experimental data, structure quality metrics, ligand binding sites, protein-protein interfaces, and metadata (authors, methods, citations). Features advanced search capabilities (by sequence, structure similarity, ligand, organism, resolution), REST API and FTP access, structure visualization tools, and integration with analysis software. Supports structure comparison, homology modeling, drug design, structural biology research, and educational use. Maintained by wwPDB consortium (RCSB PDB, PDBe, PDBj, BMRB). Use cases: structural biology research, drug discovery, protein engineering, molecular modeling, and structural bioinformatics
- **PubChem** - World's largest free chemical information database maintained by NCBI. Contains 110M+ unique chemical compounds, 270M+ bioactivity test results, 300M+ chemical structures, and 1M+ patents. Provides comprehensive compound information including chemical structures (2D/3D structures, SMILES, InChI), physicochemical properties (molecular weight, logP, H-bond donors/acceptors), bioactivity data (assays, targets, pathways), safety and toxicity data, literature references, and vendor information. Features REST API (PUG REST, PUG SOAP, PUG View), web interface with advanced search, bulk downloads, and integration with other NCBI resources. Supports chemical similarity searches, substructure searches, property-based filtering, and cheminformatics analysis. Use cases: drug discovery, chemical biology, lead identification, ADMET prediction, chemical database mining, and molecular property analysis
- **PubMed** - NCBI's comprehensive biomedical literature database containing 35M+ citations from MEDLINE, life science journals, and online books. Provides access to abstracts, full-text articles (when available), MeSH (Medical Subject Headings) terms, author information, publication dates, and citation networks. Features advanced search capabilities with Boolean operators, field tags (author, title, journal, MeSH terms, publication date), filters (article type, species, language, publication date range), and saved searches with email alerts. Supports programmatic access via E-utilities API (Entrez Programming Utilities), bulk downloads, citation export in multiple formats (RIS, BibTeX, MEDLINE), and integration with reference management software. Includes PubMed Central (PMC) for open-access full-text articles. Use cases: literature searches, systematic reviews, citation analysis, research discovery, and staying current with scientific publications
- **Reactome** - Curated pathway database for biological processes and molecular interactions (2,825+ human pathways, 16K+ reactions, 11K+ proteins) with pathway enrichment analysis, expression data analysis, and species comparison using Content Service and Analysis Service APIs
- **STRING** - Protein-protein interaction network database (5000+ genomes, 59.3M proteins, 20B+ interactions) with functional enrichment analysis, interaction partner discovery, and network visualization from experimental data, computational prediction, and text-mining
- **UniProt** - Universal Protein Resource for protein sequences, annotations, and functional information (UniProtKB/Swiss-Prot reviewed entries, TrEMBL unreviewed entries) with REST API access for search, retrieval, ID mapping, and batch operations across 200+ databases
- **USPTO** - United States Patent and Trademark Office data access including patent searches, trademark lookups, patent examination history (PEDS), office actions, assignments, citations, and litigation records; supports PatentSearch API (ElasticSearch-based patent search), TSDR (Trademark Status & Document Retrieval), Patent/Trademark Assignment APIs, and additional specialized APIs for comprehensive IP analysis
- **ZINC** - Free database of commercially-available compounds for virtual screening and drug discovery maintained by UCSF. Contains 230M+ purchasable compounds from 100+ vendors in ready-to-dock 3D formats (SDF, MOL2) with pre-computed conformers. Provides compound information including chemical structures, vendor information and pricing, physicochemical properties (molecular weight, logP, H-bond donors/acceptors, rotatable bonds), drug-likeness filters (Lipinski's Rule of Five, Veber rules), and substructure search capabilities. Features multiple compound subsets (drug-like, lead-like, fragment-like, natural products), downloadable subsets for specific screening campaigns, and integration with molecular docking software (AutoDock, DOCK, Glide). Supports structure-based and ligand-based virtual screening workflows. Use cases: virtual screening campaigns, lead identification, compound library design, high-throughput docking, and drug discovery research
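Most of the databases above are reachable through plain REST calls. As a minimal sketch (not an official client; the gene symbol, species, and printed fields are illustrative), a query against the Ensembl lookup endpoint might look like this:

```python
# Minimal sketch: look up basic gene annotation via the Ensembl REST API.
# The gene symbol and the fields printed below are illustrative examples.
import requests

ENSEMBL = "https://rest.ensembl.org"

def lookup_gene(symbol: str, species: str = "homo_sapiens") -> dict:
    """Return Ensembl's annotation record (ID, coordinates, biotype) for a gene symbol."""
    response = requests.get(
        f"{ENSEMBL}/lookup/symbol/{species}/{symbol}",
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    gene = lookup_gene("BRCA2")
    print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```

The other REST-style entries above (UniProt, PubChem, ClinicalTrials.gov, and similar) follow the same request/response pattern against their own endpoints, subject to each service's rate limits.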
## Scientific Integrations
### Laboratory Information Management Systems (LIMS) & R&D Platforms
- **Benchling Integration** - Toolkit for integrating with Benchling's R&D platform, providing programmatic access to laboratory data management including registry entities (DNA sequences, proteins), inventory systems (samples, containers, locations), electronic lab notebooks (entries, protocols), workflows (tasks, automation), and data exports using Python SDK and REST API
### Cloud Platforms for Genomics & Biomedical Data
- **DNAnexus Integration** - Comprehensive toolkit for working with the DNAnexus cloud platform for genomics and biomedical data analysis. Covers building and deploying apps/applets (Python/Bash), managing data objects (files, records, databases), running analyses and workflows, using the dxpy Python SDK, and configuring app metadata and dependencies (dxapp.json setup, system packages, Docker, assets). Enables processing of FASTQ/BAM/VCF files, bioinformatics pipelines, job execution, workflow orchestration, and platform operations including project management and permissions
### Laboratory Automation
- **Opentrons Integration** - Toolkit for creating, editing, and debugging Opentrons Python Protocol API v2 protocols for laboratory automation using Flex and OT-2 robots. Enables automated liquid handling, pipetting workflows, hardware module control (thermocycler, temperature, magnetic, heater-shaker, absorbance plate reader), labware management, and complex protocol development for biological and chemical experiments (see the minimal protocol sketch at the end of this section)
### Electronic Lab Notebooks (ELN)
- **LabArchives Integration** - Toolkit for interacting with LabArchives Electronic Lab Notebook (ELN) REST API. Provides programmatic access to notebooks (backup, retrieval, management), entries (creation, comments, attachments), user authentication, site reports and analytics, and third-party integrations (Protocols.io, GraphPad Prism, SnapGene, Geneious, Jupyter, REDCap). Includes Python scripts for configuration setup, notebook operations, and entry management. Supports multi-regional API endpoints (US, UK, Australia) and OAuth authentication
### Workflow Platforms & Cloud Execution
- **LatchBio Integration** - Integration with the Latch platform for building, deploying, and executing bioinformatics workflows. Provides comprehensive support for creating serverless bioinformatics pipelines using Python decorators, deploying Nextflow/Snakemake pipelines, managing cloud data (LatchFile, LatchDir) and structured Registry (Projects, Tables, Records), configuring computational resources (CPU, GPU, memory, storage), and using pre-built Latch Verified workflows (RNA-seq, AlphaFold, DESeq2, single-cell analysis, CRISPR editing). Enables automatic containerization, UI generation, workflow versioning, and execution on scalable cloud infrastructure with comprehensive data management
### Microscopy & Bio-image Data
- **OMERO Integration** - Toolkit for interacting with OMERO microscopy data management systems using Python. Provides comprehensive access to microscopy images stored in OMERO servers, including dataset and screening data retrieval, pixel data analysis, annotation and metadata management, regions of interest (ROIs) creation and analysis, batch processing, OMERO.scripts development, and OMERO.tables for structured data storage. Essential for researchers working with high-content screening data, multi-dimensional microscopy datasets, or collaborative image repositories
### Protocol Management & Sharing
- **Protocols.io Integration** - Integration with protocols.io API for managing scientific protocols. Enables programmatic access to protocol discovery (search by keywords, DOI, category), protocol lifecycle management (create, update, publish with DOI), step-by-step procedure documentation, collaborative development with workspaces and discussions, file management (upload data, images, documents), experiment tracking and documentation, and data export. Supports OAuth authentication, protocol PDF generation, materials management, threaded comments, workspace permissions, and institutional protocol repositories. Essential for protocol standardization, reproducibility, lab knowledge management, and scientific collaboration
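To make the Opentrons entry above concrete, here is a minimal Protocol API v2 sketch; the labware names, deck slots, pipette model, and volumes are illustrative placeholders rather than a validated protocol:

```python
# Minimal sketch of an Opentrons Python Protocol API v2 protocol (OT-2 style).
# Labware, slots, pipette, and volumes are placeholders, not a validated protocol.
from opentrons import protocol_api

metadata = {
    "protocolName": "Example transfer",
    "author": "Example Author",
    "apiLevel": "2.16",
}

def run(protocol: protocol_api.ProtocolContext):
    # Load a plate, a tip rack, and a single-channel pipette
    plate = protocol.load_labware("corning_96_wellplate_360ul_flat", 1)
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", 2)
    p300 = protocol.load_instrument("p300_single_gen2", "right", tip_racks=[tips])

    # Transfer 100 µL from well A1 to well B1
    p300.transfer(100, plate["A1"], plate["B1"])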
## Scientific Packages
### Bioinformatics & Genomics
- **AnnData** - Python package for handling annotated data matrices, specifically designed for single-cell genomics data. Provides efficient storage and manipulation of high-dimensional data with associated annotations (observations/cells and variables/genes). Key features include: HDF5-based h5ad file format for efficient I/O and compression, integration with pandas DataFrames for metadata, support for sparse matrices (scipy.sparse) for memory efficiency, layered data organization (X for main data matrix, obs for observation annotations, var for variable annotations, obsm/varm for multi-dimensional annotations, obsp/varp for pairwise matrices), and seamless integration with Scanpy, scvi-tools, and other single-cell analysis packages. Supports lazy loading, chunked operations, and conversion to/from other formats (CSV, HDF5, Zarr). Use cases: single-cell RNA-seq data management, multi-modal single-cell data (RNA+ATAC, CITE-seq), spatial transcriptomics, and any high-dimensional annotated data requiring efficient storage and manipulation
- **Arboreto** - Python package for efficient gene regulatory network (GRN) inference from single-cell RNA-seq data using ensemble tree-based methods. Implements GRNBoost2 (gradient boosting-based network inference) and GENIE3 (random forest-based inference) algorithms optimized for large-scale single-cell datasets. Key features include: parallel processing for scalability, support for sparse matrices and large datasets (millions of cells), integration with Scanpy/AnnData workflows, customizable hyperparameters, and output formats compatible with network analysis tools. Provides ranked lists of potential regulatory interactions (transcription factor-target gene pairs) with confidence scores. Use cases: identifying transcription factor-target relationships, reconstructing gene regulatory networks from single-cell data, understanding cell-type-specific regulatory programs, and inferring causal relationships in gene expression
- **BioPython** - Comprehensive Python library for computational biology and bioinformatics providing tools for sequence manipulation, database access, and biological data analysis. Key features include: sequence objects (Seq, SeqRecord, SeqIO) for DNA/RNA/protein sequences with biological alphabet validation, file format parsers (FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, SAM/BAM, VCF, GFF), NCBI database access (Entrez Programming Utilities for PubMed, GenBank, BLAST, taxonomy), BLAST integration (running searches, parsing results), sequence alignment (pairwise and multiple sequence alignment with Bio.Align), phylogenetics (tree construction and manipulation with Bio.Phylo), population genetics (Hardy-Weinberg, F-statistics), protein structure analysis (PDB parsing, structure calculations), and statistical analysis tools. Supports integration with NumPy, pandas, and other scientific Python libraries. Use cases: sequence analysis, database queries, phylogenetic analysis, sequence alignment, file format conversion, and general bioinformatics workflows
- **BioServices** - Python library providing unified programmatic access to 40+ biological web services and databases. Supports major bioinformatics resources including KEGG (pathway and compound data), UniProt (protein sequences and annotations), ChEBI (chemical entities), ChEMBL (bioactive molecules), Reactome (pathways), IntAct (protein interactions), BioModels (biological models), and many others. Features consistent API across different services, automatic result caching, error handling and retry logic, support for both REST and SOAP web services, and conversion of results to Python objects (dictionaries, lists, BioPython objects). Handles authentication, rate limiting, and API versioning. Use cases: automated data retrieval from multiple biological databases, building bioinformatics pipelines, database integration workflows, and programmatic access to biological web resources without manual web browsing
- **Cellxgene Census** - Python package for querying and analyzing large-scale single-cell RNA-seq data from the CZ CELLxGENE Discover census. Provides access to 50M+ cells across 1,000+ datasets with standardized annotations and metadata. Key features include: efficient data access using TileDB-SOMA format for scalable queries, integration with AnnData and Scanpy for downstream analysis, cell metadata filtering and querying, gene expression retrieval, and support for both human and mouse data. Enables subsetting datasets by cell type, tissue, disease, or other metadata before downloading, reducing data transfer and memory requirements. Supports local caching and batch operations. Use cases: large-scale single-cell analysis, cell-type discovery, cross-dataset comparisons, reference dataset construction, and exploratory analysis of public single-cell data
- **gget** - Command-line tool and Python package for efficient querying of genomic databases with a simple, unified interface. Provides fast access to Ensembl (gene information, sequences, orthologs, variants), UniProt (protein sequences and annotations), NCBI (BLAST searches, gene information), PDB (protein structures), COSMIC (cancer mutations), and other databases. Features include: single-command queries without complex API setup, automatic result formatting, batch query support, integration with pandas DataFrames, and support for both command-line and Python API usage. Optimized for speed and ease of use, making database queries accessible to users without extensive bioinformatics experience. Use cases: quick gene lookups, sequence retrieval, variant annotation, protein structure access, and rapid database queries in bioinformatics workflows
- **geniml** - Genomic interval machine learning toolkit providing unsupervised methods for building ML models on BED files. Key capabilities include Region2Vec (word2vec-style embeddings of genomic regions and region sets using tokenization and neural language modeling), BEDspace (joint embeddings of regions and metadata labels using StarSpace for cross-modal queries), scEmbed (Region2Vec applied to single-cell ATAC-seq data generating cell-level embeddings for clustering and annotation with scanpy integration), consensus peak building (four statistical methods CC/CCF/ML/HMM for creating reference universes from BED collections), and comprehensive utilities (BBClient for BED caching, BEDshift for genomic randomization preserving context, evaluation metrics for embedding quality, Text2BedNN for neural search backends). Part of BEDbase ecosystem. Supports Python API and CLI workflows, pre-trained models on Hugging Face, and integration with gtars for tokenization. Use cases: region similarity searches, dimension reduction of chromatin accessibility data, scATAC-seq clustering and cell-type annotation, metadata-aware genomic queries, universe construction for standardized references, and any ML task requiring genomic region feature vectors
- **gtars** - High-performance Rust toolkit for genomic interval analysis providing specialized tools for overlap detection using IGD (Integrated Genome Database) indexing, coverage track generation (uniwig module for WIG/BigWig formats), genomic tokenization for machine learning applications (TreeTokenizer for deep learning models), reference sequence management (refget protocol compliance), fragment processing for single-cell genomics (barcode-based splitting and cluster analysis), and fragment scoring against reference datasets. Offers Python bindings with NumPy integration, command-line tools (gtars-cli), and Rust library. Key modules include: tokenizers (convert genomic regions to ML tokens), overlaprs (efficient overlap computation), uniwig (ATAC-seq/ChIP-seq/RNA-seq coverage profiles), refget (GA4GH-compliant sequence digests), bbcache (BEDbase.org integration), scoring (fragment enrichment metrics), and fragsplit (single-cell fragment manipulation). Supports parallel processing, memory-mapped files, streaming for large datasets, and serves as foundation for geniml genomic ML package. Ideal for genomic ML preprocessing, regulatory element analysis, variant annotation, chromatin accessibility profiling, and computational genomics workflows
- **pysam** - Read, write, and manipulate genomic data files (SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences) with pileup analysis, coverage calculations, and bioinformatics workflows
- **PyDESeq2** - Python implementation of the DESeq2 differential gene expression analysis method for bulk RNA-seq data. Provides statistical methods for determining differential expression between experimental conditions using negative binomial generalized linear models. Key features include: size factor estimation for library size normalization, dispersion estimation and shrinkage, hypothesis testing with Wald test or likelihood ratio test, multiple testing correction (Benjamini-Hochberg FDR), results filtering and ranking, and integration with pandas DataFrames. Handles complex experimental designs, batch effects, and replicates. Produces fold-change estimates, p-values, and adjusted p-values for each gene. Use cases: identifying differentially expressed genes between conditions, RNA-seq experiment analysis, biomarker discovery, and gene expression studies requiring rigorous statistical analysis
- **Scanpy** - Comprehensive Python toolkit for single-cell RNA-seq data analysis built on AnnData. Provides end-to-end workflows for preprocessing (quality control, normalization, log transformation), dimensionality reduction (PCA, UMAP, t-SNE, ForceAtlas2), clustering (Leiden, Louvain, hierarchical clustering), marker gene identification, trajectory inference (PAGA, diffusion maps), and visualization. Key features include: efficient handling of large datasets (millions of cells) using sparse matrices, integration with scvi-tools for advanced analysis, support for multi-modal data (RNA+ATAC, CITE-seq), batch correction methods, and publication-quality plotting functions. Includes extensive documentation, tutorials, and integration with other single-cell tools. Supports GPU acceleration for certain operations. Use cases: single-cell RNA-seq analysis, cell-type identification, trajectory analysis, batch correction, and comprehensive single-cell genomics workflows
- **scvi-tools** - Probabilistic deep learning models for single-cell omics analysis. PyTorch-based framework providing variational autoencoders (VAEs) for dimensionality reduction, batch correction, differential expression, and data integration across modalities. Includes 25+ models: scVI/scANVI (RNA-seq integration and cell type annotation), totalVI (CITE-seq protein+RNA), MultiVI (multiome RNA+ATAC integration), PeakVI (ATAC-seq analysis), DestVI/Stereoscope/Tangram (spatial transcriptomics deconvolution), MethylVI (methylation), CytoVI (flow/mass cytometry), VeloVI (RNA velocity), contrastiveVI (perturbation studies), and Solo (doublet detection). Supports seamless integration with Scanpy/AnnData ecosystem, GPU acceleration, reference mapping (scArches), and probabilistic differential expression with uncertainty quantification
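As a minimal sketch of how the scvi-tools and Scanpy entries above fit together (the input file name, the `batch` column, and the training settings are illustrative assumptions, not recommendations):

```python
# Minimal sketch: batch-corrected latent embeddings with scvi-tools, clustered with Scanpy.
# Assumes an AnnData file with raw counts in .X and a "batch" column in .obs.
import scanpy as sc
import scvi

adata = sc.read_h5ad("pbmc_example.h5ad")  # placeholder file name

# Register the AnnData object and train an scVI model
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train(max_epochs=50)

# Use the learned latent space for neighbors, clustering, and embedding with Scanpy
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)   # requires the leidenalg package
sc.tl.umap(adata)
```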
### Data Management & Infrastructure
- **LaminDB** - Open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). Provides unified platform combining lakehouse architecture, lineage tracking, feature stores, biological ontologies (via Bionty plugin with 20+ ontologies: genes, proteins, cell types, tissues, diseases, pathways), LIMS, and ELN capabilities through a single Python API. Key features include: automatic data lineage tracking (code, inputs, outputs, environment), versioned artifacts (DataFrame, AnnData, SpatialData, Parquet, Zarr), schema validation and data curation with standardization/synonym mapping, queryable metadata with feature-based filtering, cross-registry traversal, and streaming for large datasets. Supports integrations with workflow managers (Nextflow, Snakemake, Redun), MLOps platforms (Weights & Biases, MLflow, HuggingFace, scVI-tools), cloud storage (S3, GCS, S3-compatible), array stores (TileDB-SOMA, DuckDB), and visualization (Vitessce). Deployment options: local SQLite, cloud storage with SQLite, or cloud storage with PostgreSQL for production. Use cases: scRNA-seq standardization and analysis, flow cytometry/spatial data management, multi-modal dataset integration, computational workflow tracking with reproducibility, biological ontology-based annotation, data lakehouse construction for unified queries, ML pipeline integration with experiment tracking, and FAIR-compliant dataset publishing
- **Modal** - Serverless cloud platform for running Python code with minimal configuration, specialized for AI/ML workloads and scientific computing. Execute functions on powerful GPUs (T4, L4, A10, A100, L40S, H100, H200, B200), scale automatically from zero to thousands of containers, and pay only for compute used. Key features include: declarative container image building with uv/pip/apt package management, automatic autoscaling with configurable limits and buffer containers, GPU acceleration with multi-GPU support (up to 8 GPUs per container), persistent storage via Volumes for model weights and datasets, secret management for API keys and credentials, scheduled jobs with cron expressions, web endpoints for deploying serverless APIs, parallel execution with `.map()` for batch processing, input concurrency for I/O-bound workloads, and resource configuration (CPU cores, memory, disk). Supports custom Docker images, integration with Hugging Face/Weights & Biases, FastAPI for web endpoints, and distributed training. Free tier includes $30/month credits. Use cases: ML model deployment and inference (LLMs, image generation, embeddings), GPU-accelerated training, batch processing large datasets in parallel, scheduled compute-intensive jobs, serverless API deployment with autoscaling, scientific computing requiring distributed compute or specialized hardware, and data pipeline automation
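A minimal sketch of offloading a function to Modal (the app name, image contents, GPU type, and the toy workload below are illustrative assumptions, and assume a recent Modal client):

```python
# Minimal sketch of a Modal app: a toy GPU-sized workload fanned out with .map().
import modal

app = modal.App("example-scientific-compute")  # placeholder app name

# Container image with the Python packages the function needs
image = modal.Image.debian_slim().pip_install("numpy")

@app.function(image=image, gpu="A10G", timeout=600)
def simulate(n: int) -> float:
    """Toy compute-heavy task that runs in Modal's cloud containers."""
    import numpy as np
    x = np.random.default_rng(0).normal(size=(n, n))
    return float(np.linalg.norm(x @ x.T))

@app.local_entrypoint()
def main():
    # .map() fans the calls out across containers in parallel
    results = list(simulate.map([512, 1024, 2048]))
    print(results)
```

Invoking `modal run <file>.py` would execute `main()` locally while the mapped calls run remotely in Modal's containers.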
### Cheminformatics & Drug Discovery
- **Datamol** - Python library for molecular manipulation and featurization built on RDKit with enhanced workflows and performance optimizations. Provides utilities for molecular I/O (reading/writing SMILES, SDF, MOL files), molecular standardization and sanitization, molecular transformations (tautomer enumeration, stereoisomer generation), molecular featurization (descriptors, fingerprints, graph representations), parallel processing for large datasets, and integration with machine learning pipelines. Features include: optimized RDKit operations, caching for repeated computations, molecular filtering and preprocessing, and seamless integration with pandas DataFrames. Designed for drug discovery and cheminformatics workflows requiring efficient processing of large compound libraries. Use cases: molecular preprocessing for ML models, compound library management, molecular similarity searches, and cheminformatics data pipelines
- **DeepChem** - Deep learning framework for molecular machine learning and drug discovery built on TensorFlow and PyTorch. Provides implementations of graph neural networks (GCN, GAT, MPNN, AttentiveFP) for molecular property prediction, molecular featurization (molecular graphs, fingerprints, descriptors), pre-trained models, and MoleculeNet benchmark suite (50+ datasets for molecular property prediction, toxicity, ADMET). Key features include: support for both TensorFlow and PyTorch backends, distributed training, hyperparameter optimization, model interpretation tools, and integration with RDKit. Includes datasets for quantum chemistry, toxicity prediction, ADMET properties, and binding affinity prediction. Use cases: molecular property prediction, drug discovery, ADMET prediction, toxicity screening, and molecular machine learning research
- **DiffDock** - State-of-the-art diffusion-based molecular docking method for predicting protein-ligand binding poses and binding affinities. Uses diffusion models to generate diverse, high-quality binding poses without requiring exhaustive search. Key features include: fast inference compared to traditional docking methods, generation of multiple diverse poses, confidence scoring for predictions, and support for flexible ligand docking. Provides pre-trained models and Python API for integration into drug discovery pipelines. Achieves superior performance on standard benchmarks (PDBbind, CASF) compared to traditional docking methods. Use cases: virtual screening, lead optimization, binding pose prediction, structure-based drug design, and initial pose generation for refinement with more expensive methods
- **MedChem** - Python library for medicinal chemistry analysis and drug-likeness assessment. Provides tools for calculating molecular descriptors, ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) property prediction, drug-likeness filters (Lipinski's Rule of Five, Veber rules, Egan rules, Muegge rules), molecular complexity metrics, and synthetic accessibility scoring. Features include: integration with RDKit, parallel processing for large datasets, and comprehensive property calculators. Supports filtering compound libraries based on drug-like properties, identifying potential ADMET issues early in drug discovery, and prioritizing compounds for further development. Use cases: lead optimization, compound library filtering, ADMET prediction, drug-likeness assessment, and medicinal chemistry analysis in drug discovery workflows
- **Molfeat** - Comprehensive Python library providing 100+ molecular featurizers for converting molecules into numerical representations suitable for machine learning. Includes molecular fingerprints (ECFP, MACCS, RDKit, Pharmacophore), molecular descriptors (2D/3D descriptors, constitutional, topological, electronic), graph-based representations (molecular graphs, line graphs), and pre-trained models (MolBERT, ChemBERTa, Uni-Mol embeddings). Features unified API across different featurizer types, caching for performance, parallel processing, and integration with popular ML frameworks (scikit-learn, PyTorch, TensorFlow). Supports both traditional cheminformatics descriptors and modern learned representations. Use cases: molecular property prediction, virtual screening, molecular similarity searches, and preparing molecular data for machine learning models
- **PyTDC** - Python library providing access to Therapeutics Data Commons (TDC), a collection of curated datasets and benchmarks for drug discovery and development. Includes datasets for ADMET prediction (absorption, distribution, metabolism, excretion, toxicity), drug-target interactions, drug-drug interactions, drug response prediction, molecular generation, and retrosynthesis. Features standardized data formats, data loaders with automatic preprocessing, benchmark tasks with evaluation metrics, leaderboards for model comparison, and integration with popular ML frameworks. Provides both single-molecule and drug-pair datasets, covering various stages of drug discovery from target identification to clinical outcomes. Use cases: benchmarking ML models for drug discovery, ADMET prediction model development, drug-target interaction prediction, and drug discovery research
- **RDKit** - Open-source cheminformatics toolkit for molecular informatics and drug discovery. Provides comprehensive functionality for molecular I/O (reading/writing SMILES, SDF, MOL, PDB files), molecular descriptors (200+ 2D and 3D descriptors), molecular fingerprints (Morgan, RDKit, MACCS, topological torsions), SMARTS pattern matching for substructure searches, molecular alignment and 3D coordinate generation, pharmacophore perception, reaction handling, and molecular drawing. Features a high-performance C++ core with Python bindings, support for large molecule sets, and extensive documentation. Widely used in the pharmaceutical industry and academic research. Use cases: molecular property calculation, virtual screening, molecular similarity searches, substructure matching, molecular visualization, and general cheminformatics workflows (a minimal usage sketch follows this list)
- **TorchDrug** - PyTorch-based machine learning platform for drug discovery with 40+ datasets, 20+ GNN models for molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning
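As referenced in the RDKit entry above, a minimal sketch of the descriptor, fingerprint, and substructure workflow (the SMILES strings are arbitrary examples):

```python
# RDKit sketch: parse SMILES, compute descriptors and Morgan fingerprints,
# and score similarity between two molecules.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
caffeine = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")

# 2D descriptors
print(Descriptors.MolWt(aspirin), Descriptors.TPSA(aspirin))

# Morgan (ECFP4-like) fingerprints and Tanimoto similarity
fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(caffeine, 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp1, fp2))

# SMARTS substructure match: does aspirin contain a carboxylic acid?
print(aspirin.HasSubstructMatch(Chem.MolFromSmarts("C(=O)[OH]")))
```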
### Proteomics & Mass Spectrometry
- **matchms** - Processing and similarity matching of mass spectrometry data with 40+ filters, spectral library matching (Cosine, Modified Cosine, Neutral Losses), metadata harmonization, molecular fingerprint comparison, and support for multiple file formats (MGF, MSP, mzML, JSON)
- **pyOpenMS** - Comprehensive mass spectrometry data analysis for proteomics and metabolomics (LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines like Comet, Mascot, MSGF+)
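A short pyOpenMS sketch for loading and inspecting an mzML run (the file path is a placeholder, not a file shipped with this repository):

```python
# pyOpenMS sketch: load an mzML run and inspect the first MS1 spectrum.
import pyopenms as oms

exp = oms.MSExperiment()
oms.MzMLFile().load("sample.mzML", exp)  # placeholder path

print("Number of spectra:", exp.getNrSpectra())
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        mz, intensity = spectrum.get_peaks()
        print("RT:", spectrum.getRT(), "peaks:", len(mz))
        break
```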
### Medical Imaging & Digital Pathology
- **histolab** - Digital pathology toolkit for whole slide image (WSI) processing and analysis. Provides automated tissue detection, tile extraction for deep learning pipelines, and preprocessing for gigapixel histopathology images. Key features include: multi-format WSI support (SVS, TIFF, NDPI), three tile extraction strategies (RandomTiler for sampling, GridTiler for complete coverage, ScoreTiler for quality-driven selection), automated tissue masks with customizable filters, built-in scorers (NucleiScorer, CellularityScorer), pyramidal image handling, visualization tools (thumbnails, mask overlays, tile previews), and H&E stain decomposition. Supports multiple tissue sections, artifact removal, pen annotation exclusion, and reproducible extraction with seeding. Use cases: creating training datasets for computational pathology, extracting informative tiles for tumor classification, whole-slide tissue characterization, quality assessment of histology samples, automated nuclei density analysis, and preprocessing for digital pathology deep learning workflows
- **PathML** - Comprehensive computational pathology toolkit for whole slide image analysis, tissue segmentation, and machine learning on pathology data. Provides end-to-end workflows for digital pathology research including data loading, preprocessing, feature extraction, and model deployment
- **pydicom** - Pure Python package for working with DICOM (Digital Imaging and Communications in Medicine) files. Provides comprehensive support for reading, writing, and manipulating medical imaging data from CT, MRI, X-ray, ultrasound, PET scans and other modalities. Key features include: pixel data extraction and manipulation with automatic decompression (JPEG/JPEG 2000/RLE), metadata access and modification with 1000+ standardized DICOM tags, image format conversion (PNG/JPEG/TIFF), anonymization tools for removing Protected Health Information (PHI), windowing and display transformations (VOI LUT application), multi-frame and 3D volume processing, DICOM sequence handling, and support for multiple transfer syntaxes. Use cases: medical image analysis, PACS system integration, radiology workflows, research data processing, DICOM anonymization, format conversion, image preprocessing for machine learning, multi-slice volume reconstruction, and clinical imaging pipelines
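A brief pydicom sketch covering the typical read/inspect/rescale path (the file path is a placeholder):

```python
# pydicom sketch: read a DICOM file, inspect metadata, and extract pixel data.
import numpy as np
import pydicom

ds = pydicom.dcmread("image.dcm")  # placeholder path

# Metadata access via standard DICOM keywords
print(ds.Modality, ds.get("PatientID", "anonymized"), ds.Rows, ds.Columns)

# Pixel data as a NumPy array (decompressed automatically when handlers are available)
pixels = ds.pixel_array.astype(np.float32)

# Apply rescale slope/intercept if present (e.g., CT Hounsfield units)
slope = float(getattr(ds, "RescaleSlope", 1.0))
intercept = float(getattr(ds, "RescaleIntercept", 0.0))
hu = pixels * slope + intercept
print(hu.min(), hu.max())
```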
### Healthcare AI & Clinical Machine Learning
- **NeuroKit2** - Comprehensive biosignal processing toolkit for analyzing physiological data including ECG, EEG, EDA, RSP, PPG, EMG, and EOG signals. Use this skill when processing cardiovascular signals, brain activity, electrodermal responses, respiratory patterns, muscle activity, or eye movements. Key features include: automated signal processing pipelines (cleaning, peak detection, delineation, quality assessment), heart rate variability analysis across time/frequency/nonlinear domains (SDNN, RMSSD, LF/HF, DFA, entropy measures), EEG analysis (frequency band power, microstates, source localization), autonomic nervous system assessment (sympathetic indices, respiratory sinus arrhythmia), comprehensive complexity measures (25+ entropy types, 15+ fractal dimensions, Lyapunov exponents), event-related and interval-related analysis modes, epoch creation and averaging for stimulus-locked responses, multi-signal integration with unified workflows, and extensive signal processing utilities (filtering, decomposition, peak correction, spectral analysis). Includes modular reference documentation across 12 specialized domains. Use cases: heart rate variability for cardiovascular health assessment, EEG microstates for consciousness studies, electrodermal activity for emotion research, respiratory variability analysis, psychophysiology experiments, affective computing, stress monitoring, sleep staging, autonomic dysfunction assessment, biofeedback applications, and multi-modal physiological signal integration for comprehensive human state monitoring
- **PyHealth** - Comprehensive healthcare AI toolkit for developing, testing, and deploying machine learning models with clinical data. Provides specialized tools for electronic health records (EHR), physiological signals, medical imaging, and clinical text analysis. Key features include: 10+ healthcare datasets (MIMIC-III/IV, eICU, OMOP, sleep EEG, COVID-19 CXR), 20+ predefined clinical prediction tasks (mortality, hospital readmission, length of stay, drug recommendation, sleep staging, EEG analysis), 33+ models (Logistic Regression, MLP, CNN, RNN, Transformer, GNN, plus healthcare-specific models like RETAIN, SafeDrug, GAMENet, StageNet), comprehensive data processing (sequence processors, signal processors, medical code translation between ICD-9/10, NDC, RxNorm, ATC systems), training/evaluation utilities (Trainer class, fairness metrics, calibration, uncertainty quantification), and interpretability tools (attention visualization, SHAP, ChEFER). 3x faster than pandas for healthcare data processing. Use cases: ICU mortality prediction, hospital readmission risk assessment, safe medication recommendation with drug-drug interaction constraints, sleep disorder diagnosis from EEG signals, medical code standardization and translation, clinical text to ICD coding, length of stay estimation, and any clinical ML application requiring interpretability, fairness assessment, and calibrated predictions for healthcare deployment
### Protein Engineering & Design
- **Adaptyv** - Cloud laboratory platform for automated protein testing and validation. Submit protein sequences via API or web interface and receive experimental results in approximately 21 days. Supports multiple assay types including binding assays (biolayer interferometry for protein-target interactions, KD/kon/koff measurements), expression testing (quantify protein expression levels in E. coli, mammalian, yeast, or insect cells), thermostability measurements (DSF and CD for Tm determination and thermal stability profiling), and enzyme activity assays (kinetic parameters, substrate specificity, inhibitor testing). Includes computational optimization tools for pre-screening sequences: NetSolP/SoluProt for solubility prediction, SolubleMPNN for sequence redesign to improve expression, ESM for sequence likelihood scoring, ipTM (AlphaFold-Multimer) for interface stability assessment, and pSAE for aggregation risk quantification. Platform features automated workflows from expression through purification to assay execution with quality control, webhook notifications for experiment completion, batch submission support for high-throughput screening, and comprehensive results with kinetic parameters, confidence metrics, and raw data access. Use cases: antibody affinity maturation, therapeutic protein developability assessment, enzyme engineering and optimization, protein stability improvement, AI-driven protein design validation, library screening for expression and function, lead optimization with experimental feedback, and integration of computational design with wet-lab validation in iterative design-build-test-learn cycles
- **ESM (Evolutionary Scale Modeling)** - State-of-the-art protein language models from EvolutionaryScale for protein design, structure prediction, and representation learning. Includes ESM3 (1.4B-98B parameter multimodal generative models for simultaneous reasoning across sequence, structure, and function with chain-of-thought generation, inverse folding, and function-conditioned design) and ESM C (300M-6B parameter efficient embedding models 3x faster than ESM2 for similarity analysis, classification, and feature extraction). Supports local inference with open weights and cloud-based Forge API for scalable batch processing. Use cases: novel protein design, structure prediction from sequence, sequence design from structure, protein embeddings, function annotation, variant generation, and directed evolution workflows
### Machine Learning & Deep Learning
- **aeon** - Comprehensive scikit-learn compatible Python toolkit for time series machine learning providing state-of-the-art algorithms across 7 domains: classification (13 algorithm categories including ROCKET variants, deep learning with InceptionTime/ResNet/FCN, distance-based with DTW/ERP/LCSS, shapelet-based, dictionary methods like BOSS/WEASEL, and hybrid ensembles HIVECOTE), regression (9 categories mirroring classification approaches), clustering (k-means/k-medoids with temporal distances, deep learning autoencoders, spectral methods), forecasting (ARIMA, ETS, Theta, Threshold Autoregressive, TCN, DeepAR), anomaly detection (STOMP/MERLIN matrix profile, clustering-based CBLOF/KMeans, isolation methods, copula-based COPOD), segmentation (ClaSP, FLUSS, HMM, binary segmentation), and similarity search (MASS algorithm, STOMP motif discovery, approximate nearest neighbors). Includes 40+ distance metrics (elastic: DTW/DDTW/WDTW/Shape-DTW, edit-based: ERP/EDR/LCSS/TWE/MSM, lock-step: Euclidean/Manhattan), extensive transformations (ROCKET/MiniRocket/MultiRocket for features, Catch22/TSFresh for statistics, SAX/PAA for symbolic representation, shapelet transforms, wavelets, matrix profile), 20+ deep learning architectures (FCN, ResNet, InceptionTime, TCN, autoencoders with attention mechanisms), comprehensive benchmarking tools (UCR/UEA archives with 100+ datasets, published results repository, statistical testing), and performance-optimized implementations using numba. Features progressive model complexity from fast baselines (MiniRocket: <1 second training, 0.95+ accuracy on many benchmarks) to state-of-the-art ensembles (HIVECOTE V2), GPU acceleration support, and extensive visualization utilities. Use cases: physiological signal classification (ECG, EEG), industrial sensor monitoring, financial forecasting, change point detection, pattern discovery, activity recognition from wearables, predictive maintenance, climate time series analysis, and any sequential data requiring specialized temporal modeling beyond standard ML
- **PufferLib** - High-performance reinforcement learning library achieving 1M-4M steps/second through optimized vectorization, native multi-agent support, and efficient PPO training (PuffeRL). Use this skill for RL training on any environment (Gymnasium, PettingZoo, Atari, Procgen), creating custom PufferEnv environments, developing policies (CNN, LSTM, multi-input architectures), optimizing parallel simulation performance, or scaling multi-agent systems. Includes Ocean suite (20+ environments), seamless framework integration with automatic space flattening, zero-copy vectorization with shared memory buffers, distributed training support, and comprehensive reference guides for training workflows, environment development, vectorization optimization, policy architectures, and third-party integrations
- **PyMC** - Comprehensive Python library for Bayesian statistical modeling and probabilistic programming. Provides intuitive syntax for building probabilistic models, advanced MCMC sampling algorithms (NUTS, Metropolis-Hastings, Slice sampling), variational inference methods (ADVI, SVGD), Gaussian processes, time series models (ARIMA, state space models), and model comparison tools (WAIC, LOO). Features include: automatic differentiation via PyTensor (the maintained successor to Aesara and Theano), GPU acceleration support, parallel sampling, model diagnostics and convergence checking, and integration with ArviZ for visualization and analysis. Supports hierarchical models, mixture models, survival analysis, and custom distributions. Use cases: Bayesian data analysis, uncertainty quantification, A/B testing, time series forecasting, hierarchical modeling, and probabilistic machine learning
- **PyMOO** - Python framework for multi-objective optimization using evolutionary algorithms. Provides implementations of state-of-the-art algorithms including NSGA-II, NSGA-III, MOEA/D, SPEA2, and reference-point based methods. Features include: support for constrained and unconstrained optimization, multiple problem types (continuous, discrete, mixed-variable), performance indicators (hypervolume, IGD, GD), visualization tools (Pareto front plots, convergence plots), and parallel evaluation support. Supports custom problem definitions, algorithm configuration, and result analysis. Designed for engineering design, parameter optimization, and any problem requiring optimization of multiple conflicting objectives simultaneously. Use cases: multi-objective optimization problems, Pareto-optimal solution finding, engineering design optimization, and research in evolutionary computation
- **PyTorch Lightning** - Deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. Automates training workflows (40+ tasks including epoch/batch iteration, optimizer steps, gradient management, checkpointing), supports multi-GPU/TPU training with DDP/FSDP/DeepSpeed strategies, includes LightningModule for model organization, Trainer for automation, LightningDataModule for data pipelines, callbacks for extensibility, and integrations with TensorBoard, Wandb, MLflow for experiment tracking
- **PennyLane** - Cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Enables building and training quantum circuits with automatic differentiation, seamless integration with PyTorch/JAX/NumPy, and device-independent execution across simulators and quantum hardware (IBM, Amazon Braket, Google, Rigetti, IonQ). Key features include: quantum circuit construction with QNodes (quantum functions with automatic differentiation), 100+ quantum gates and operations (Pauli, Hadamard, rotation, controlled gates), circuit templates and layers for common ansatze (StronglyEntanglingLayers, BasicEntanglerLayers, UCCSD for chemistry), gradient computation methods (parameter-shift rule for hardware, backpropagation for simulators, adjoint differentiation), quantum chemistry module (molecular Hamiltonian construction, VQE for ground state energy, differentiable Hartree-Fock solver), ML framework integration (TorchLayer for PyTorch models, JAX transformations, TensorFlow deprecated), built-in optimizers (Adam, GradientDescent, QNG, Rotosolve), measurement types (expectation values, probabilities, samples, state vectors), device ecosystem (default.qubit simulator, lightning.qubit for performance, hardware plugins for IBM/Braket/Cirq/Rigetti/IonQ), and Catalyst for just-in-time compilation with adaptive circuits. Supports variational quantum algorithms (VQE, QAOA), quantum neural networks, hybrid quantum-classical models, data encoding strategies (angle, amplitude, IQP embeddings), and pulse-level programming. Use cases: variational quantum eigensolver for molecular simulations, quantum circuit machine learning with gradient-based optimization, hybrid quantum-classical neural networks, quantum chemistry calculations with differentiable workflows, quantum algorithm prototyping with hardware-agnostic code, quantum machine learning research with automatic differentiation, and deploying quantum circuits across multiple quantum computing platforms
- **Qiskit** - World's most popular open-source quantum computing framework for building, optimizing, and executing quantum circuits with 13M+ downloads and 74% developer preference. Provides comprehensive tools for quantum algorithm development including circuit construction with 100+ quantum gates (Pauli, Hadamard, CNOT, rotation gates, controlled gates), circuit transpilation with 83x faster optimization than competitors producing circuits with 29% fewer two-qubit gates, primitives for execution (Sampler for bitstring measurements and probability distributions, Estimator for expectation values and observables), visualization tools (circuit diagrams in matplotlib/LaTeX, result histograms, Bloch sphere, state visualizations), backend-agnostic execution (local simulators including StatevectorSampler and Aer, IBM Quantum hardware with 100+ qubit systems, IonQ trapped ion, Amazon Braket multi-provider), session and batch modes for iterative and parallel workloads, error mitigation with configurable resilience levels (readout error correction, ZNE, PEC reducing sampling overhead by 100x), four-step patterns workflow (Map classical problems to quantum circuits, Optimize through transpilation, Execute with primitives, Post-process results), algorithm libraries including Qiskit Nature for quantum chemistry (molecular Hamiltonians, VQE for ground states, UCCSD ansatz, multiple fermion-to-qubit mappings), Qiskit Optimization for combinatorial problems (QAOA, portfolio optimization, MaxCut), and Qiskit Machine Learning (quantum kernels, VQC, QNN), support for Python/C/Rust with modular architecture, parameterized circuits for variational algorithms, quantum Fourier transform, Grover search, Shor's algorithm, pulse-level control, IBM Quantum Runtime for cloud execution with job management and queuing, and comprehensive documentation with textbook and tutorials. Use cases: variational quantum eigensolver for molecular ground state energy, QAOA for combinatorial optimization problems, quantum chemistry simulations with multiple ansatze and mappings, quantum machine learning with kernel methods and neural networks, hybrid quantum-classical algorithms, quantum algorithm research and prototyping across multiple hardware platforms, quantum circuit optimization and benchmarking, quantum error mitigation and characterization, quantum information science experiments, and production quantum computing workflows on real quantum hardware
- **QuTiP** - Quantum Toolbox in Python for simulating and analyzing quantum mechanical systems. Provides comprehensive tools for both closed (unitary) and open (dissipative) quantum systems including quantum states (kets, bras, density matrices, Fock states, coherent states), quantum operators (creation/annihilation operators, Pauli matrices, angular momentum operators, quantum gates), time evolution solvers (Schrödinger equation with sesolve, Lindblad master equation with mesolve, quantum trajectories with Monte Carlo mcsolve, Bloch-Redfield brmesolve, Floquet methods for periodic Hamiltonians), analysis tools (expectation values, entropy measures, fidelity, concurrence, correlation functions, steady state calculations), visualization (Bloch sphere with animations, Wigner functions, Q-functions, Fock distributions, matrix histograms), and advanced methods (Hierarchical Equations of Motion for non-Markovian dynamics, permutational invariance for identical particles, stochastic solvers, superoperators). Supports tensor products for composite systems, partial traces, time-dependent Hamiltonians, multiple dissipation channels, and parallel processing. Includes extensive documentation, tutorials, and examples. Use cases: quantum optics simulations (cavity QED, photon statistics), quantum computing (gate operations, circuit dynamics), open quantum systems (decoherence, dissipation), quantum information theory (entanglement dynamics, quantum channels), condensed matter physics (spin chains, many-body systems), and general quantum mechanics research and education
- **scikit-learn** - Industry-standard Python library for classical machine learning providing comprehensive supervised learning (classification: Logistic Regression, SVM, Decision Trees, Random Forests with 17+ variants, Gradient Boosting with XGBoost-compatible HistGradientBoosting, Naive Bayes, KNN, Neural Networks/MLP; regression: Linear, Ridge, Lasso, ElasticNet, SVR, ensemble methods), unsupervised learning (clustering: K-Means, DBSCAN, HDBSCAN, OPTICS, Agglomerative/Hierarchical, Spectral, Gaussian Mixture Models, BIRCH, MeanShift; dimensionality reduction: PCA, Kernel PCA, t-SNE, Isomap, LLE, NMF, TruncatedSVD, FastICA, LDA; outlier detection: IsolationForest, LocalOutlierFactor, OneClassSVM), data preprocessing (scaling: StandardScaler, MinMaxScaler, RobustScaler; encoding: OneHotEncoder, OrdinalEncoder, LabelEncoder; imputation: SimpleImputer, KNNImputer, IterativeImputer; feature engineering: PolynomialFeatures, KBinsDiscretizer, text vectorization with CountVectorizer/TfidfVectorizer), model evaluation (cross-validation: KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold; hyperparameter tuning: GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV; metrics: 30+ evaluation metrics for classification/regression/clustering including accuracy, precision, recall, F1, ROC-AUC, MSE, R², silhouette score), and Pipeline/ColumnTransformer for production-ready workflows. Features consistent API (fit/predict/transform), extensive documentation, integration with NumPy/pandas/SciPy, joblib persistence, and scikit-learn-compatible ecosystem (XGBoost, LightGBM, CatBoost, imbalanced-learn). Optimized implementations using Cython/OpenMP for performance. Use cases: predictive modeling, customer segmentation, anomaly detection, feature engineering, model selection/validation, text classification, image classification (with feature extraction), time series forecasting (with preprocessing), medical diagnosis, fraud detection, recommendation systems, and any tabular data ML task requiring interpretable models or established algorithms
- **scikit-survival** - Survival analysis and time-to-event modeling with censored data. Built on scikit-learn, provides Cox proportional hazards models (CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis with elastic net regularization), ensemble methods (Random Survival Forests, Gradient Boosting), Survival Support Vector Machines (linear and kernel), non-parametric estimators (Kaplan-Meier, Nelson-Aalen), competing risks analysis, and specialized evaluation metrics (concordance index, time-dependent AUC, Brier score). Handles right-censored data, integrates with scikit-learn pipelines, and supports feature selection and hyperparameter tuning via cross-validation
- **SHAP** - Model interpretability and explainability using Shapley values from game theory. Provides unified approach to explain any ML model with TreeExplainer (fast exact explanations for XGBoost/LightGBM/Random Forest), DeepExplainer (TensorFlow/PyTorch neural networks), KernelExplainer (model-agnostic), and LinearExplainer. Includes comprehensive visualizations (waterfall plots for individual predictions, beeswarm plots for global importance, scatter plots for feature relationships, bar/force/heatmap plots), supports model debugging, fairness analysis, feature engineering guidance, and production deployment
- **Stable Baselines3** - PyTorch-based reinforcement learning library providing reliable implementations of RL algorithms (PPO, SAC, DQN, TD3, DDPG, A2C, HER, RecurrentPPO). Use this skill for training RL agents on standard or custom Gymnasium environments, implementing callbacks for monitoring and control, using vectorized environments for parallel training, creating custom environments with proper Gymnasium API implementation, and integrating with deep RL workflows. Includes comprehensive training templates, evaluation utilities, algorithm selection guidance (on-policy vs off-policy, continuous vs discrete actions), support for multi-input policies (dict observations), goal-conditioned learning with HER, and integration with TensorBoard for experiment tracking
- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
- **Torch Geometric** - Graph Neural Networks for molecular and geometric data
- **Transformers** - State-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Provides 1M+ pre-trained models accessible via pipelines (text-classification, NER, QA, summarization, translation, text-generation, image-classification, object-detection, ASR, VQA), comprehensive training via Trainer API with distributed training and mixed precision, flexible text generation with multiple decoding strategies (greedy, beam search, sampling), and Auto classes for automatic architecture selection (BERT, GPT, T5, ViT, BART, etc.)
- **UMAP-learn** - Python implementation of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and manifold learning. Provides fast, scalable nonlinear dimensionality reduction that preserves both local and global structure of high-dimensional data. Key features include: support for both supervised and unsupervised dimensionality reduction, ability to handle mixed data types, integration with scikit-learn API, and efficient implementation using numba for performance. Produces low-dimensional embeddings (typically 2D or 3D) suitable for visualization and downstream analysis. Often outperforms t-SNE in preserving global structure while maintaining local neighborhoods. Use cases: data visualization, feature extraction, preprocessing for machine learning, single-cell data analysis, and exploratory data analysis of high-dimensional datasets
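A compact UMAP-learn sketch showing the scikit-learn-style fit/transform API on a small example dataset:

```python
# UMAP sketch: embed the scikit-learn digits dataset into 2D for visualization.
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import umap

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)
print(embedding.shape)
```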
### Materials Science & Chemistry
- **Astropy** - Comprehensive Python library for astronomy and astrophysics providing core functionality for astronomical research and data analysis. Includes coordinate system transformations (ICRS, Galactic, FK5, AltAz), physical units and quantities with automatic dimensional consistency, FITS file operations (reading, writing, manipulating headers and data), cosmological calculations (luminosity distance, lookback time, Hubble parameter, Planck/WMAP models), precise time handling across multiple time scales (UTC, TAI, TT, TDB) and formats (JD, MJD, ISO), table operations with unit support (FITS, CSV, HDF5, VOTable), WCS transformations between pixel and world coordinates, astronomical constants, modeling framework, visualization tools, and statistical functions. Use for celestial coordinate transformations, unit conversions, FITS image/table processing, cosmological distance calculations, barycentric time corrections, catalog cross-matching, and astronomical data analysis
- **COBRApy** - Python package for constraint-based reconstruction and analysis (COBRA) of metabolic networks. Provides tools for building, manipulating, and analyzing genome-scale metabolic models (GEMs). Key features include: flux balance analysis (FBA) for predicting optimal metabolic fluxes, flux variability analysis (FVA), gene knockout simulations, pathway analysis, model validation, and integration with other COBRA Toolbox formats (SBML, JSON). Supports various optimization objectives (biomass production, ATP production, metabolite production), constraint handling (reaction bounds, gene-protein-reaction associations), and model comparison. Includes utilities for model construction, gap filling, and model refinement. Use cases: metabolic engineering, systems biology, biotechnology applications, understanding cellular metabolism, and predicting metabolic phenotypes
- **Pymatgen** - Python Materials Genomics (pymatgen) library for materials science computation and analysis. Provides comprehensive tools for crystal structure manipulation, phase diagram construction, electronic structure analysis, and materials property calculations. Key features include: structure objects with symmetry analysis, space group determination, structure matching and comparison, phase diagram generation from formation energies, band structure and density of states analysis, defect calculations, surface and interface analysis, and integration with DFT codes (VASP, Quantum ESPRESSO, ABINIT). Supports Materials Project database integration, structure file I/O (CIF, POSCAR, VASP), and high-throughput materials screening workflows. Use cases: materials discovery, crystal structure analysis, phase stability prediction, electronic structure calculations, and computational materials science research
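A short pymatgen sketch illustrating structure construction and symmetry analysis (CsCl is used purely as a simple example):

```python
# pymatgen sketch: build a CsCl-type structure and inspect its symmetry.
from pymatgen.core import Lattice, Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

structure = Structure(
    Lattice.cubic(4.2),                      # lattice parameter in angstroms
    ["Cs", "Cl"],
    [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],      # fractional coordinates
)

sga = SpacegroupAnalyzer(structure)
print(structure.composition.reduced_formula)   # CsCl
print(sga.get_space_group_symbol())            # space group of the structure
structure.to(filename="CsCl.cif")              # write out a CIF file
```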
### Engineering & Simulation
- **FluidSim** - Object-oriented Python framework for high-performance computational fluid dynamics (CFD) simulations using pseudospectral methods with FFT. Provides solvers for periodic-domain equations including 2D/3D incompressible Navier-Stokes equations (with/without stratification), shallow water equations, and Föppl-von Kármán elastic plate equations. Key features include: Pythran/Transonic compilation for performance comparable to Fortran/C++, MPI parallelization for large-scale simulations, hierarchical parameter configuration with type safety, comprehensive output management (physical fields in HDF5, spatial means, energy/enstrophy spectra, spectral energy budgets), custom forcing mechanisms (time-correlated random forcing, proportional forcing, script-defined forcing), flexible initial conditions (noise, vortex, dipole, Taylor-Green, from file, in-script), online and offline visualization, and integration with ParaView/VisIt for 3D visualization. Supports workflow features including simulation restart/continuation, parametric studies with batch execution, cluster submission integration, and adaptive CFL-based time stepping. Use cases: 2D/3D turbulence studies with energy cascade analysis, stratified oceanic and atmospheric flows with buoyancy effects, geophysical flows with rotation (Coriolis effects), vortex dynamics and fundamental fluid mechanics research, high-resolution direct numerical simulation (DNS), parametric studies exploring parameter spaces, validation studies (Taylor-Green vortex), and any periodic-domain fluid dynamics research requiring HPC-grade performance with Python flexibility
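A hedged FluidSim sketch of the parameter-object workflow for the 2D Navier-Stokes solver; the attribute names follow the library's tutorials and the numerical values are purely illustrative, not recommendations:

```python
# FluidSim sketch (assumptions noted above): configure and run a small 2D simulation.
from fluidsim.solvers.ns2d.solver import Simul

params = Simul.create_default_params()
params.oper.nx = params.oper.ny = 128          # grid resolution
params.nu_2 = 1e-3                             # viscosity
params.time_stepping.t_end = 2.0               # simulated time
params.init_fields.type = "noise"              # random initial condition
params.output.periods_save.phys_fields = 0.5   # save physical fields periodically

sim = Simul(params)
sim.time_stepping.start()
```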
### Data Analysis & Visualization
- **Dask** - Parallel computing for larger-than-memory datasets with distributed DataFrames, Arrays, Bags, and Futures
- **Data Commons** - Programmatic access to public statistical data from global sources including census bureaus, health organizations, and environmental agencies. Provides unified Python API for querying demographic data, economic indicators, health statistics, and environmental datasets through a knowledge graph interface. Features three main endpoints: Observation (statistical time-series queries for population, GDP, unemployment rates, disease prevalence), Node (knowledge graph exploration for entity relationships and hierarchies), and Resolve (entity identification from names, coordinates, or Wikidata IDs). Seamless Pandas integration for DataFrames, relation expressions for hierarchical queries, data source filtering for consistency, and support for custom Data Commons instances
- **GeoPandas** - Python library extending pandas for working with geospatial vector data including shapefiles, GeoJSON, and GeoPackage files. Provides GeoDataFrame and GeoSeries data structures combining geometric data with tabular attributes for spatial analysis. Key features include: reading/writing spatial file formats (Shapefile, GeoJSON, GeoPackage, PostGIS, Parquet) with Arrow acceleration for 2-4x faster I/O, geometric operations (buffer, simplify, centroid, convex hull, affine transformations) through Shapely integration, spatial analysis (spatial joins with predicates like intersects/contains/within, nearest neighbor joins, overlay operations for union/intersection/difference, dissolve for aggregation, clipping), coordinate reference system (CRS) management (setting CRS, reprojecting between coordinate systems, UTM estimation), and visualization (static choropleth maps with matplotlib, interactive maps with folium, multi-layer mapping, classification schemes with mapclassify). Supports spatial indexing for performance, filtering during read operations (bbox, mask, SQL WHERE), and integration with cartopy for cartographic projections. Use cases: spatial data manipulation, buffer analysis, spatial joins between datasets, dissolving boundaries, calculating areas/distances in projected CRS, reprojecting coordinate systems, creating choropleth maps, converting between spatial file formats, PostGIS database integration, and geospatial data analysis workflows
- **Matplotlib** - Comprehensive Python plotting library for creating publication-quality static, animated, and interactive visualizations. Provides extensive customization options for creating figures, subplots, axes, and annotations. Key features include: support for multiple plot types (line, scatter, bar, histogram, contour, 3D, and many more), extensive customization (colors, fonts, styles, layouts), multiple backends (PNG, PDF, SVG, interactive backends), LaTeX integration for mathematical notation, and integration with NumPy and pandas. Includes specialized modules (pyplot for MATLAB-like interface, artist layer for fine-grained control, backend layer for rendering). Supports complex multi-panel figures, color maps, legends, and annotations. Use cases: scientific figure creation, data visualization, exploratory data analysis, publication graphics, and any application requiring high-quality plots
- **NetworkX** - Comprehensive toolkit for creating, analyzing, and visualizing complex networks and graphs. Supports four graph types (Graph, DiGraph, MultiGraph, MultiDiGraph) with nodes as any hashable objects and rich edge attributes. Provides 100+ algorithms including shortest paths (Dijkstra, Bellman-Ford, A*), centrality measures (degree, betweenness, closeness, eigenvector, PageRank), clustering (coefficients, triangles, transitivity), community detection (modularity-based, label propagation, Girvan-Newman), connectivity analysis (components, cuts, flows), tree algorithms (MST, spanning trees), matching, graph coloring, isomorphism, and traversal (DFS, BFS). Includes 50+ graph generators for classic (complete, cycle, wheel), random (Erdős-Rényi, Barabási-Albert, Watts-Strogatz, stochastic block model), lattice (grid, hexagonal, hypercube), and specialized networks. Supports I/O across formats (edge lists, GraphML, GML, JSON, Pajek, GEXF, DOT) with Pandas/NumPy/SciPy integration. Visualization capabilities include 8+ layout algorithms (spring/force-directed, circular, spectral, Kamada-Kawai), customizable node/edge appearance, interactive visualizations with Plotly/PyVis, and publication-quality figure generation. Use cases: social network analysis, biological networks (protein-protein interactions, gene regulatory networks, metabolic pathways), transportation systems, citation networks, knowledge graphs, web structure analysis, infrastructure networks, and any domain involving pairwise relationships requiring structural analysis or graph-based modeling
- **Polars** - High-performance DataFrame library written in Rust with Python bindings, designed for fast data manipulation and analysis. Provides lazy evaluation for query optimization, efficient memory usage, and parallel processing. Key features include: DataFrame operations (filtering, grouping, joining, aggregations), support for large datasets (larger than RAM), integration with pandas and NumPy, expression API for complex transformations, and support for multiple data formats (CSV, Parquet, JSON, Excel, Arrow). Features query optimization through lazy evaluation, automatic parallelization, and efficient memory management. Often 5-30x faster than pandas for many operations. Use cases: large-scale data processing, ETL pipelines, data analysis workflows, and high-performance data manipulation tasks
- **Plotly** - Interactive scientific and statistical data visualization library for Python with 40+ chart types. Provides both high-level API (Plotly Express) for quick visualizations and low-level API (graph objects) for fine-grained control. Key features include: comprehensive chart types (scatter, line, bar, histogram, box, violin, heatmap, contour, 3D plots, geographic maps, financial charts, statistical distributions, hierarchical charts), interactive features (hover tooltips, pan/zoom, legend toggling, animations, rangesliders, buttons/dropdowns), publication-quality output (static images in PNG/PDF/SVG via Kaleido, interactive HTML with embeddable figures), extensive customization (templates, themes, color scales, fonts, layouts, annotations, shapes), subplot support (multi-plot figures with shared axes), and Dash integration for building analytical web applications. Plotly Express offers one-line creation of complex visualizations with automatic color encoding, faceting, and trendlines. Graph objects provide precise control for specialized visualizations (candlestick charts, 3D surfaces, sankey diagrams, gauge charts). Supports pandas DataFrames, NumPy arrays, and various data formats. Use cases: scientific data visualization, statistical analysis, financial charting, interactive dashboards, publication figures, exploratory data analysis, and any application requiring interactive or publication-quality visualizations
- **Seaborn** - Statistical data visualization with dataset-oriented interface, automatic confidence intervals, publication-quality themes, colorblind-safe palettes, and comprehensive support for exploratory analysis, distribution comparisons, correlation matrices, regression plots, and multi-panel figures
- **SimPy** - Process-based discrete-event simulation framework for modeling systems with processes, queues, and resource contention (manufacturing, service operations, network traffic, logistics). Supports generator-based process definition, multiple resource types (Resource, PriorityResource, PreemptiveResource, Container, Store), event-driven scheduling, process interaction mechanisms (signaling, interruption, parallel/sequential execution), real-time simulation synchronized with wall-clock time, and comprehensive monitoring capabilities for utilization, wait times, and queue statistics
- **SymPy** - Symbolic mathematics in Python for exact computation using mathematical symbols rather than numerical approximations. Provides comprehensive support for symbolic algebra (simplification, expansion, factorization), calculus (derivatives, integrals, limits, series), equation solving (algebraic, differential, systems of equations), matrices and linear algebra (eigenvalues, decompositions, solving linear systems), physics (classical mechanics with Lagrangian/Hamiltonian formulations, quantum mechanics, vector analysis, units), number theory (primes, factorization, modular arithmetic, Diophantine equations), geometry (2D/3D analytic geometry), combinatorics (permutations, combinations, partitions, group theory), logic and sets, statistics (probability distributions, random variables), special functions (gamma, Bessel, orthogonal polynomials), and code generation (lambdify to NumPy/SciPy functions, C/Fortran code generation, LaTeX output for documentation). Emphasizes exact arithmetic using rational numbers and symbolic representations, supports assumptions for improved simplification (positive, real, integer), integrates seamlessly with NumPy/SciPy through lambdify for fast numerical evaluation, and enables symbolic-to-numeric pipelines for scientific computing workflows
- **Vaex** - High-performance Python library for lazy, out-of-core DataFrames to process and visualize tabular datasets larger than available RAM. Processes over a billion rows per second through memory-mapped files (HDF5, Apache Arrow), lazy evaluation, and virtual columns (zero memory overhead). Provides instant file opening, efficient aggregations across billions of rows, interactive visualizations without sampling, machine learning pipelines with transformers (scalers, encoders, PCA), and seamless integration with pandas/NumPy/Arrow. Includes comprehensive ML framework (vaex.ml) with feature scaling, categorical encoding, dimensionality reduction, and integration with scikit-learn/XGBoost/LightGBM/CatBoost. Supports distributed computing via Dask, asynchronous operations, and state management for production deployment. Use cases: processing gigabyte to terabyte datasets, fast statistical aggregations on massive data, visualizing billion-row datasets, ML pipelines on big data, converting between data formats, and working with astronomical, financial, or scientific large-scale datasets
- **ReportLab** - Python library for programmatic PDF generation and document creation. Provides comprehensive tools for creating PDFs from scratch including text formatting, tables, graphics, images, charts, and complex layouts. Key features include: high-level Platypus framework for document layout, low-level canvas API for precise control, support for fonts (TrueType, Type 1), vector graphics, image embedding, page templates, headers/footers, and multi-page documents. Supports barcodes, forms, encryption, and digital signatures. Can generate reports, invoices, certificates, and complex documents programmatically. Use cases: automated report generation, document creation, invoice generation, certificate printing, and any application requiring programmatic PDF creation
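A minimal ReportLab sketch using the Platypus framework to assemble a one-page PDF (the file name and contents are illustrative):

```python
# ReportLab sketch: generate a simple PDF report with Platypus flowables.
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer, Table

styles = getSampleStyleSheet()
doc = SimpleDocTemplate("report.pdf", pagesize=A4)

story = [
    Paragraph("Analysis Report", styles["Title"]),
    Spacer(1, 12),
    Paragraph("Summary of results generated programmatically.", styles["Normal"]),
    Spacer(1, 12),
    Table([["Sample", "Value"], ["A", "0.42"], ["B", "0.87"]]),
]
doc.build(story)
```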
### Phylogenetics & Trees
- **ETE Toolkit** - Python library for phylogenetic tree manipulation, visualization, and analysis. Provides comprehensive tools for working with phylogenetic trees including tree construction, manipulation (pruning, collapsing, rooting), tree comparison (Robinson-Foulds distance, tree reconciliation), annotation (node colors, labels, branch styles), and publication-quality visualization. Key features include: support for multiple tree formats (Newick, Nexus, PhyloXML), integration with phylogenetic software (PhyML, RAxML, FastTree), tree annotation with metadata, interactive tree visualization, and export to various image formats (PNG, PDF, SVG). Supports species trees, gene trees, and reconciliation analysis. Use cases: phylogenetic analysis, tree visualization, evolutionary biology research, comparative genomics, and teaching phylogenetics
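A small ETE Toolkit sketch (assuming the ete3 package) for loading, rooting, and pruning a Newick tree:

```python
# ete3 sketch: load a Newick tree, re-root it, print an ASCII view, and prune taxa.
from ete3 import Tree

t = Tree("((Human:0.6,Chimp:0.5):0.3,(Mouse:0.9,Rat:0.8):0.4);")
t.set_outgroup(t.search_nodes(name="Mouse")[0])   # root on the Mouse leaf

print(t.get_ascii(show_internal=False))
print("Human-Chimp distance:", t.get_distance("Human", "Chimp"))

# Prune to a subset of taxa, preserving branch lengths
t.prune(["Human", "Chimp", "Mouse"], preserve_branch_length=True)
print(t.write(format=1))   # Newick output
```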
### Genomics Tools
- **deepTools** - Comprehensive suite of Python tools for exploring and visualizing next-generation sequencing (NGS) data, particularly ChIP-seq, RNA-seq, and ATAC-seq experiments. Provides command-line tools and Python API for processing BAM and bigWig files. Key features include: quality control metrics (plotFingerprint, plotCorrelation), coverage track generation (bamCoverage for creating bigWig files), matrix generation for heatmaps (computeMatrix, plotHeatmap, plotProfile), comparative analysis (multiBigwigSummary, plotPCA), and efficient handling of large files. Supports normalization methods, binning options, and various visualization outputs. Designed for high-throughput analysis workflows and publication-quality figure generation. Use cases: ChIP-seq peak visualization, RNA-seq coverage analysis, ATAC-seq signal tracks, comparative genomics, and NGS data exploration
- **FlowIO** - Python library for reading and manipulating Flow Cytometry Standard (FCS) files, the standard format for flow cytometry data. Provides efficient parsing of FCS files (versions 2.0, 3.0, 3.1), access to event data (fluorescence intensities, scatter parameters), metadata extraction (keywords, parameters, acquisition settings), and conversion to pandas DataFrames or NumPy arrays. Features include: support for large FCS files, handling of multiple data segments, access to text segments and analysis segments, and integration with flow cytometry analysis workflows. Enables programmatic access to flow cytometry data for downstream analysis, visualization, and machine learning applications. Use cases: flow cytometry data analysis, high-throughput screening, immune cell profiling, and automated processing of FCS files
- **scikit-bio** - Python library for bioinformatics providing data structures, algorithms, and parsers for biological sequence analysis. Built on NumPy, SciPy, and pandas. Key features include: sequence objects (DNA, RNA, protein sequences) with biological alphabet validation, sequence alignment algorithms (local, global, semiglobal), phylogenetic tree manipulation, diversity metrics (alpha diversity, beta diversity, phylogenetic diversity), distance metrics for sequences and communities, file format parsers (FASTA, FASTQ, QIIME formats, Newick), and statistical analysis tools. Provides scikit-learn compatible transformers for machine learning workflows. Supports efficient processing of large sequence datasets. Use cases: sequence analysis, microbial ecology (16S rRNA analysis), metagenomics, phylogenetic analysis, and bioinformatics research requiring sequence manipulation and diversity calculations
- **Zarr** - Python library implementing the Zarr chunked, compressed N-dimensional array storage format. Provides efficient storage and access to large multi-dimensional arrays with chunking and compression. Key features include: support for NumPy-like arrays with chunked storage, multiple compression codecs (zlib, blosc, lz4, zstd), support for various data types, efficient partial array reading (only load needed chunks), support for both local filesystem and cloud storage (S3, GCS, Azure), and integration with NumPy, Dask, and Xarray. Enables working with arrays larger than available RAM through lazy loading and efficient chunk access. Supports parallel read/write operations and is optimized for cloud storage backends. Use cases: large-scale scientific data storage, cloud-based array storage, out-of-core array operations, and efficient storage of multi-dimensional datasets (genomics, imaging, climate data)
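A brief Zarr sketch showing chunked on-disk storage and partial reads (the store path is a placeholder):

```python
# Zarr sketch: write a chunked array to disk and read back a slice lazily.
import numpy as np
import zarr

# Create a 2D array stored in chunks; only touched chunks are materialized.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="float32",
)
z[0:1_000, 0:1_000] = np.random.rand(1_000, 1_000)

# Reading a slice loads only the chunks that overlap it.
block = z[500:600, 500:600]
print(block.shape, float(block.mean()))
```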
### Multi-omics & AI Agent Frameworks
- **BIOMNI** - Autonomous biomedical AI agent framework from Stanford SNAP lab for executing complex research tasks across genomics, drug discovery, molecular biology, and clinical analysis. Combines LLM reasoning with code execution and ~11GB of integrated biomedical databases (Ensembl, NCBI Gene, UniProt, PDB, AlphaFold, ClinVar, OMIM, HPO, PubMed, KEGG, Reactome, GO). Supports multiple LLM providers (Claude, GPT-4, Gemini, Groq, Bedrock). Includes A1 agent class for autonomous task decomposition, BiomniEval1 benchmark framework, and MCP server integration. Use cases: CRISPR screening design, single-cell RNA-seq analysis, ADMET prediction, GWAS interpretation, rare disease diagnosis, protein structure analysis, literature synthesis, and multi-omics integration
- **Denario** - Multiagent AI system for scientific research assistance that automates complete research workflows from data analysis through publication. Built on AG2 and LangGraph frameworks, orchestrates specialized agents for hypothesis generation, methodology development, computational analysis, and LaTeX paper writing. Supports multiple LLM providers (Google Vertex AI, OpenAI) with flexible pipeline stages allowing manual or automated inputs. Key features include: end-to-end research automation (data description → idea generation → methodology → results → paper), journal-specific formatting (APS and others), GUI interface via Streamlit, Docker deployment with LaTeX environment, reproducible research with version-controlled outputs, literature search integration, and integration with scientific Python stack (pandas, sklearn, scipy). Provides both programmatic Python API and web-based interface. Use cases: automated hypothesis generation from datasets, research methodology development, computational experiment execution with visualization, publication-ready manuscript generation, time-series analysis research, machine learning experiment automation, and accelerating the complete scientific research lifecycle from ideation to publication
- **HypoGeniC** - Automated hypothesis generation and testing using large language models to accelerate scientific discovery. Provides three frameworks: HypoGeniC (data-driven hypothesis generation from observational data), HypoRefine (synergistic approach combining literature insights with empirical patterns through an agentic system), and Union methods (mechanistic combination of literature and data-driven hypotheses). Features iterative refinement that improves hypotheses by learning from challenging examples, Redis caching for API cost reduction, and customizable YAML-based prompt templates. Includes command-line tools for generation (hypogenic_generation) and testing (hypogenic_inference). Research applications have demonstrated 14.19% accuracy improvement in AI-content detection and 7.44% in deception detection. Use cases: deception detection in reviews, AI-generated content identification, mental stress detection, exploratory research without existing literature, hypothesis-driven analysis in novel domains, and systematic exploration of competing explanations
### Scientific Communication & Publishing
- **Paper-2-Web** - Autonomous pipeline for transforming academic papers into multiple promotional formats using the Paper2All system. Converts LaTeX or PDF papers into: (1) Paper2Web - interactive, layout-aware academic homepages with responsive design, interactive figures, and mobile support; (2) Paper2Video - professional presentation videos with slides, narration, cursor movements, and optional talking-head generation using Hallo2; (3) Paper2Poster - print-ready conference posters with custom dimensions, professional layouts, and institution branding. Supports GPT-4/GPT-4.1 models, batch processing, QR code generation, multi-language content, and quality assessment metrics. Use cases: conference materials, video abstracts, preprint enhancement, research promotion, poster sessions, and academic website creation
- **Perplexity Search** - AI-powered web search using Perplexity models via LiteLLM and OpenRouter for real-time, web-grounded answers with source citations. Provides access to multiple Perplexity models: Sonar Pro (general-purpose, best cost-quality balance), Sonar Pro Search (most advanced agentic search with multi-step reasoning), Sonar (cost-effective for simple queries), Sonar Reasoning Pro (advanced step-by-step analysis), and Sonar Reasoning (basic reasoning). Key features include: single OpenRouter API key setup (no separate Perplexity account), real-time access to current information beyond training data cutoff, comprehensive query design guidance (domain-specific patterns, time constraints, source preferences), cost optimization strategies with usage monitoring, programmatic and CLI interfaces, batch processing support, and integration with other scientific skills. Installation uses uv pip for LiteLLM, with detailed setup, troubleshooting, and security documentation. Use cases: finding recent scientific publications and research, conducting literature searches across domains, verifying facts with source citations, accessing current developments in any field, comparing technologies and approaches, performing domain-specific research (biomedical, clinical, technical), supplementing PubMed searches with real-time web results, and discovering latest developments post-database indexing
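A hedged sketch of calling a Perplexity model through OpenRouter with LiteLLM; the model slug and environment variable here follow OpenRouter conventions and are assumptions, so check the skill's own documentation for the exact identifiers:

```python
# LiteLLM/OpenRouter sketch (assumed model slug and env var; see lead-in note).
import os
from litellm import completion

os.environ["OPENROUTER_API_KEY"] = "sk-or-..."  # placeholder; set securely in practice

response = completion(
    model="openrouter/perplexity/sonar-pro",  # assumed OpenRouter slug for Sonar Pro
    messages=[{
        "role": "user",
        "content": "Summarize recent peer-reviewed findings on CRISPR base editing safety, with sources.",
    }],
)
print(response.choices[0].message.content)
```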
### Document Processing & Conversion
- **MarkItDown** - Python utility for converting 20+ file formats to Markdown optimized for LLM processing. Converts Office documents (PDF, DOCX, PPTX, XLSX), images with OCR, audio with transcription, web content (HTML, YouTube transcripts, EPUB), and structured data (CSV, JSON, XML) while preserving document structure (headings, lists, tables, hyperlinks). Key features include: Azure Document Intelligence integration for enhanced PDF table extraction, LLM-powered image descriptions using GPT-4o, batch processing with ZIP archive support, modular installation for specific formats, streaming approach without temporary files, and plugin system for custom converters. Supports Python 3.10+. Use cases: preparing documents for RAG systems, extracting text from PDFs and Office files, transcribing audio to text, performing OCR on images and scanned documents, converting YouTube videos to searchable text, processing HTML and EPUB books, converting structured data to readable format, document analysis pipelines, and LLM training data preparation
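A minimal MarkItDown sketch for converting a document to Markdown (the input path is a placeholder):

```python
# MarkItDown sketch: convert a PDF to Markdown text for LLM pipelines.
from markitdown import MarkItDown

md = MarkItDown()                      # optionally pass an LLM client for image descriptions
result = md.convert("paper.pdf")       # also accepts DOCX, PPTX, XLSX, HTML, images, audio
print(result.text_content[:500])       # Markdown string preserving headings, lists, tables
```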
### Laboratory Automation & Equipment Control
- **PyLabRobot** - Hardware-agnostic, pure Python SDK for automated and autonomous laboratories. Provides unified interface for controlling liquid handling robots (Hamilton STAR/STARlet, Opentrons OT-2, Tecan EVO), plate readers (BMG CLARIOstar), heater shakers, incubators, centrifuges, pumps, and scales. Key features include: modular resource management system for plates, tips, and containers with hierarchical deck layouts and JSON serialization; comprehensive liquid handling operations (aspirate, dispense, transfer, serial dilutions, plate replication) with automatic tip and volume tracking; backend abstraction enabling hardware-agnostic protocols that work across different robots; ChatterboxBackend for protocol simulation and testing without hardware; browser-based visualizer for real-time 3D deck state visualization; cross-platform support (Windows, macOS, Linux, Raspberry Pi); and integration capabilities for multi-device workflows combining liquid handlers, analytical equipment, and material handling devices. Use cases: automated sample preparation, high-throughput screening, serial dilution protocols, plate reading workflows, laboratory protocol development and validation, robotic liquid handling automation, and reproducible laboratory automation with state tracking and persistence
### Tool Discovery & Research Platforms
- **ToolUniverse** - Unified ecosystem providing standardized access to 600+ scientific tools, models, datasets, and APIs across bioinformatics, cheminformatics, genomics, structural biology, and proteomics. Enables AI agents to function as research scientists through: (1) Tool Discovery - natural language, semantic, and keyword-based search for finding relevant scientific tools (Tool_Finder, Tool_Finder_LLM, Tool_Finder_Keyword); (2) Tool Execution - standardized AI-Tool Interaction Protocol for running tools with consistent interfaces; (3) Tool Composition - sequential and parallel workflow chaining for multi-step research pipelines; (4) Model Context Protocol (MCP) integration for Claude Desktop/Code. Supports drug discovery workflows (disease→targets→structures→screening→candidates), genomics analysis (expression→differential analysis→pathways), clinical genomics (variants→annotation→pathogenicity→disease associations), and cross-domain research. Use cases: accessing scientific databases (OpenTargets, PubChem, UniProt, PDB, ChEMBL, KEGG), protein structure prediction (AlphaFold), molecular docking, pathway enrichment, variant annotation, literature searches, and automated scientific workflows
## Scientific Thinking & Analysis
### Analysis & Methodology
- **Exploratory Data Analysis** - Comprehensive EDA toolkit with automated statistics, visualizations, and insights for any tabular dataset
- **Hypothesis Generation** - Structured frameworks for generating and evaluating scientific hypotheses
- **Literature Review** - Systematic literature search and review toolkit with support for multiple scientific databases (PubMed, bioRxiv, Google Scholar), citation management with multiple citation styles (APA, AMA, Vancouver, Chicago, IEEE, Nature, Science), citation verification and deduplication, search strategies (Boolean operators, MeSH terms, field tags), PDF report generation with formatted references, and comprehensive templates for conducting systematic reviews following PRISMA guidelines
- **Peer Review** - Comprehensive toolkit for conducting high-quality scientific peer review with structured evaluation of methodology, statistics, reproducibility, ethics, and presentation across all scientific disciplines
- **Scientific Brainstorming** - Conversational brainstorming partner for generating novel research ideas, exploring connections, challenging assumptions, and developing creative approaches through structured ideation workflows
- **Scientific Critical Thinking** - Tools and approaches for rigorous scientific reasoning and evaluation
- **Scientific Visualization** - Best practices and templates for creating publication-quality scientific figures with matplotlib and seaborn, including statistical plots with automatic confidence intervals, colorblind-safe palettes, multi-panel figures, heatmaps, and journal-specific formatting
- **Scientific Writing** - Comprehensive toolkit for writing, structuring, and formatting scientific research papers using IMRAD format, multiple citation styles (APA, AMA, Vancouver, Chicago, IEEE), reporting guidelines (CONSORT, STROBE, PRISMA), effective figures and tables, field-specific terminology, venue-specific structure expectations, and core writing principles for clarity, conciseness, and accuracy across all scientific disciplines
- **Statistical Analysis** - Comprehensive statistical testing, power analysis, and experimental design
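For example, an a priori power calculation for a two-sample t-test is a one-liner with statsmodels (effect size and targets below are placeholders):
```python
from statsmodels.stats.power import TTestIndPower

# Cohen's d = 0.5, alpha = 0.05, power = 0.8 -> roughly 64 participants per group
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} participants per group")
```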
### Document Processing
- **DOCX** - Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction
- **PDF** - PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms
- **PPTX** - Presentation creation, editing, and analysis with support for layouts, comments, and speaker notes
- **XLSX** - Spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization


@@ -1,19 +0,0 @@
# Scientific Thinking & Analysis
## Analysis & Methodology
- **Exploratory Data Analysis** - Comprehensive EDA toolkit with automated statistics, visualizations, and insights for any tabular dataset
- **Hypothesis Generation** - Structured frameworks for generating and evaluating scientific hypotheses
- **Peer Review** - Comprehensive toolkit for conducting high-quality scientific peer review with structured evaluation of methodology, statistics, reproducibility, ethics, and presentation across all scientific disciplines
- **Scientific Brainstorming** - Conversational brainstorming partner for generating novel research ideas, exploring connections, challenging assumptions, and developing creative approaches through structured ideation workflows
- **Scientific Critical Thinking** - Tools and approaches for rigorous scientific reasoning and evaluation
- **Scientific Visualization** - Best practices and templates for creating publication-quality scientific figures with matplotlib and seaborn, including statistical plots with automatic confidence intervals, colorblind-safe palettes, multi-panel figures, heatmaps, and journal-specific formatting
- **Scientific Writing** - Comprehensive toolkit for writing, structuring, and formatting scientific research papers using IMRAD format, multiple citation styles (APA, AMA, Vancouver, Chicago, IEEE), reporting guidelines (CONSORT, STROBE, PRISMA), effective figures and tables, field-specific terminology, venue-specific structure expectations, and core writing principles for clarity, conciseness, and accuracy across all scientific disciplines
- **Statistical Analysis** - Comprehensive statistical testing, power analysis, and experimental design
## Document Processing
- **DOCX** - Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction
- **PDF** - PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms
- **PPTX** - Presentation creation, editing, and analysis with support for layouts, comments, and speaker notes
- **XLSX** - Spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization


@@ -1,80 +0,0 @@
---
name: context-initialization
description: "Always Auto-invoked skill that creates/updates workspace AGENT.md to instruct the agent to always search for existing skills before attempting any scientific task."
---
# Context Initialization
## Overview
This skill automatically creates or updates an `AGENT.md` file in the workspace root that instructs the agent to use existing skills before attempting to solve scientific tasks. This ensures the agent uses documented patterns, authentication methods, working examples, and best practices from the repository's skills rather than inventing solutions from scratch.
## When to Use This Skill
This skill is automatically invoked when:
- The agent initializes in this workspace
- User begins any scientific task (database access, package usage, platform integration, or methodology)
- User mentions specific databases, packages, platforms, or research methods
- Any scientific data retrieval, analysis, or research task is started
**No manual invocation required** - this skill runs automatically.
## What This Skill Does
Creates or updates `AGENT.md` in the workspace root with instructions for the agent to:
1. **Search first**: Look for relevant skills across all skill categories before writing code
2. **Use existing patterns**: Apply documented API access patterns, workflows, and examples
3. **Follow best practices**: Use rate limits, authentication, configurations, and established methodologies
4. **Adapt examples**: Leverage working code examples from `scripts/` folders and reference documentation
**Important**: If `AGENT.md` already exists in the workspace, this skill will update it intelligently rather than overwriting it. This preserves any custom instructions or modifications while ensuring the essential skill-search directives are present.
## Skill Categories
This unified context initialization covers four major skill categories:
### Database Access Tasks
- Search `scientific-databases/` for 24+ database skills
- Use documented API endpoints and authentication patterns
- Apply working code examples and best practices
- Follow rate limits and error handling patterns
### Scientific Package Usage
- Search `scientific-packages/` for 40+ Python package skills
- Use installation instructions and API usage examples
- Apply best practices and common patterns
- Leverage working scripts and reference documentation
### Laboratory Platform Integration
- Search `scientific-integrations/` for 6+ platform integration skills
- Use authentication and setup instructions
- Apply API access patterns and platform-specific best practices
- Leverage working integration examples
### Scientific Analysis & Research Methods
- Search `scientific-thinking/` for methodology skills
- Use established data analysis frameworks (EDA, statistical analysis)
- Apply research methodologies (hypothesis generation, brainstorming, critical thinking)
- Leverage communication skills (scientific writing, visualization, peer review)
- Use document processing skills (DOCX, PDF, PPTX, XLSX)
## Implementation
When invoked, this skill manages the workspace `AGENT.md` file as follows:
- **If `AGENT.md` does not exist**: Creates a new file using the complete template from `references/AGENT.md`
- **If `AGENT.md` already exists**: Updates the file to ensure the essential skill-search directives are present, while preserving any existing custom content or modifications
The file includes sections instructing the agent to search for and use existing skills across all scientific task categories.
The complete reference template is available in `references/AGENT.md`.
## Benefits
By centralizing context initialization, this skill ensures:
- **Consistency**: The agent always uses the same approach across all skill types
- **Efficiency**: One initialization covers all scientific tasks
- **Maintainability**: Updates to the initialization strategy occur in one place
- **Completeness**: The agent is reminded to search across all available skill categories


@@ -1,72 +0,0 @@
# Reference: Complete Context Initialization Template
This is the complete reference template for what gets added to the workspace root `AGENT.md` file.
---
# Agent Scientific Skills - Working Instructions
## IMPORTANT: Use Available Skills First
Before attempting any scientific task, use available skills.
---
## Database Access Tasks
**Before writing any database access code, use available skills in this repository.**
This repository contains skills for 24+ scientific databases. Each skill includes:
- API endpoints and authentication patterns
- Working code examples
- Best practices and rate limits
- Example scripts
Always use available database skills before writing custom database access code.
---
## Scientific Package Usage
**Before writing analysis code with scientific packages, use available skills in this repository.**
This repository contains skills for 40+ scientific Python packages. Each skill includes:
- Installation instructions
- Complete API usage examples
- Best practices and common patterns
- Working scripts and reference documentation
Always use available package skills before writing custom analysis code.
---
## Laboratory Platform Integration
**Before writing any platform integration code, use available skills in this repository.**
This repository contains skills for 6+ laboratory platforms and cloud services. Each skill includes:
- Authentication and setup instructions
- API access patterns
- Working integration examples
- Platform-specific best practices
Always use available integration skills before writing custom platform code.
---
## Scientific Analysis & Research Methods
**Before attempting any analysis, writing, or research task, use available methodology skills in this repository.**
This repository contains skills for scientific methodologies including:
- Data analysis frameworks (EDA, statistical analysis)
- Research methodologies (hypothesis generation, brainstorming, critical thinking)
- Communication skills (scientific writing, visualization, peer review)
- Document processing (DOCX, PDF, PPTX, XLSX)
Always use available methodology skills before attempting scientific analysis or writing tasks.
---
*This file is auto-generated by the context-initialization skill. It ensures the agent uses available skills before attempting to solve scientific tasks from scratch.*


@@ -1,527 +0,0 @@
---
name: anndata
description: "Manipulate AnnData objects for single-cell genomics. Load/save .h5ad files, manage obs/var metadata, layers, embeddings (PCA/UMAP), concatenate datasets, for scRNA-seq workflows."
---
# AnnData
## Overview
AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. Work with AnnData objects for data creation, manipulation, file I/O, concatenation, and memory-efficient workflows.
## Core Capabilities
### 1. Creating and Structuring AnnData Objects
Create AnnData objects from various data sources and organize multi-dimensional annotations.
**Basic creation:**
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
# From dense or sparse arrays
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)
# With sparse matrix (memory-efficient)
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
```
**With metadata:**
```python
import pandas as pd
obs_meta = pd.DataFrame({
'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
'batch': ['batch1'] * 50 + ['batch2'] * 50
})
var_meta = pd.DataFrame({
'gene_name': [f'Gene_{i}' for i in range(2000)],
'highly_variable': np.random.choice([True, False], 2000)
})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```
**Understanding the structure:**
- **X**: Primary data matrix (observations × variables)
- **obs**: Row (observation) annotations as DataFrame
- **var**: Column (variable) annotations as DataFrame
- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
- **layers**: Alternative data matrices with same dimensions as X
- **uns**: Unstructured metadata dictionary
- **obsp/varp**: Pairwise relationship matrices (graphs)
### 2. Adding Annotations and Layers
Organize different data representations and metadata within a single object.
**Cell-level metadata (obs):**
```python
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)
```
**Gene-level metadata (var):**
```python
adata.var['highly_variable'] = gene_variance > threshold
adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])
```
**Embeddings (obsm/varm):**
```python
# Dimensionality reduction results
adata.obsm['X_pca'] = pca_coordinates # Shape: (n_obs, n_components)
adata.obsm['X_umap'] = umap_coordinates # Shape: (n_obs, 2)
adata.obsm['X_tsne'] = tsne_coordinates
# Gene loadings
adata.varm['PCs'] = principal_components # Shape: (n_vars, n_components)
```
**Alternative data representations (layers):**
```python
# Store multiple versions
adata.layers['counts'] = raw_counts
adata.layers['log1p'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - mean) / std
```
**Unstructured metadata (uns):**
```python
# Analysis parameters
adata.uns['preprocessing'] = {
'normalization': 'TPM',
'min_genes': 200,
'date': '2024-01-15'
}
# Results
adata.uns['pca'] = {'variance_ratio': variance_explained}
```
### 3. Subsetting and Views
Efficiently subset data while managing memory through views and copies.
**Subsetting operations:**
```python
# By observation/variable names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean masks
b_cells = adata[adata.obs.cell_type == 'B']
high_quality = adata[adata.obs.n_genes > 200]
# By position
first_cells = adata[:100, :]
top_genes = adata[:, :500]
# Combined conditions
filtered = adata[
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
adata.var.highly_variable
]
```
**Understanding views:**
- Subsetting returns **views** by default (memory-efficient, shares data with original)
- Modifying a view affects the original object
- Check with `adata.is_view`
- Convert to independent copy with `.copy()`
```python
# View (memory-efficient)
subset = adata[adata.obs.condition == 'treated']
print(subset.is_view) # True
# Independent copy
subset_copy = adata[adata.obs.condition == 'treated'].copy()
print(subset_copy.is_view) # False
```
### 4. File I/O and Backed Mode
Read and write data efficiently, with options for memory-limited environments.
**Writing data:**
```python
# Standard format with compression
adata.write('results.h5ad', compression='gzip')
# Alternative formats
adata.write_zarr('results.zarr') # For cloud storage
adata.write_loom('results.loom') # For compatibility
adata.write_csvs('results/') # As CSV files
```
**Reading data:**
```python
# Load into memory
adata = ad.read_h5ad('results.h5ad')
# Backed mode (disk-backed, memory-efficient)
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(adata.isbacked) # True
print(adata.filename) # Path to file
# Close file connection when done
adata.file.close()
```
**Reading from other formats:**
```python
# 10X format
adata = ad.read_mtx('matrix.mtx')
# CSV
adata = ad.read_csv('data.csv')
# Loom
adata = ad.read_loom('data.loom')
```
**Working with backed mode:**
```python
# Read in backed mode for large files
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Process in chunks
for chunk in adata.chunk_X(chunk_size=1000):
result = process_chunk(chunk)
# Load to memory if needed
adata_memory = adata.to_memory()
```
### 5. Concatenating Multiple Datasets
Combine multiple AnnData objects with control over how data is merged.
**Basic concatenation:**
```python
# Concatenate observations (most common)
combined = ad.concat([adata1, adata2, adata3], axis=0)
# Concatenate variables (rare)
combined = ad.concat([adata1, adata2], axis=1)
```
**Join strategies:**
```python
# Inner join: only shared variables (no missing data)
combined = ad.concat([adata1, adata2], join='inner')
# Outer join: all variables (fills missing with 0)
combined = ad.concat([adata1, adata2], join='outer')
```
**Tracking data sources:**
```python
# Add source labels
combined = ad.concat(
[adata1, adata2, adata3],
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'
# Make duplicate indices unique
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.
```
**Merge strategies for metadata:**
```python
# merge=None: exclude variable annotations (default)
combined = ad.concat([adata1, adata2], merge=None)
# merge='same': keep only identical annotations
combined = ad.concat([adata1, adata2], merge='same')
# merge='first': use first occurrence
combined = ad.concat([adata1, adata2], merge='first')
# merge='unique': keep annotations with single value
combined = ad.concat([adata1, adata2], merge='unique')
```
**Complete example:**
```python
# Load batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with full tracking
combined = ad.concat(
[batch1, batch2, batch3],
axis=0,
join='outer', # Keep all genes
merge='first', # Use first batch's annotations
label='batch_id', # Track source
keys=['b1', 'b2', 'b3'], # Custom labels
index_unique='-' # Make cell names unique
)
```
### 6. Data Conversion and Extraction
Convert between AnnData and other formats for interoperability.
**To DataFrame:**
```python
# Convert X to DataFrame
df = adata.to_df()
# Convert specific layer
df = adata.to_df(layer='log1p')
```
**Extract vectors:**
```python
# Get 1D arrays from data or annotations
gene_expression = adata.obs_vector('Gene_100')
cell_metadata = adata.obs_vector('n_genes')
```
**Transpose:**
```python
# Swap observations and variables
transposed = adata.T
```
### 7. Memory Optimization
Strategies for working with large datasets efficiently.
**Use sparse matrices:**
```python
from scipy.sparse import csr_matrix
# Check sparsity
density = (adata.X != 0).sum() / adata.X.size
if density < 0.3: # Less than 30% non-zero
adata.X = csr_matrix(adata.X)
```
**Convert strings to categoricals:**
```python
# Automatic conversion
adata.strings_to_categoricals()
# Manual conversion (more control)
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
**Use backed mode:**
```python
# Read without loading into memory
adata = ad.read_h5ad('large_file.h5ad', backed='r')
# Work with subsets
subset = adata[:1000, :500].copy() # Only this subset in memory
```
**Chunked processing:**
```python
# Process data in chunks
results = []
for chunk in adata.chunk_X(chunk_size=1000):
result = expensive_computation(chunk)
results.append(result)
```
## Common Workflows
### Single-Cell RNA-seq Analysis
Complete workflow from loading to analysis:
```python
import anndata as ad
import numpy as np
import pandas as pd
# 1. Load data
adata = ad.read_mtx('matrix.mtx')
adata.obs_names = pd.read_csv('barcodes.tsv', header=None)[0]
adata.var_names = pd.read_csv('genes.tsv', header=None)[0]
# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]
# 3. Filter genes
min_cells = 3
adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
# 4. Store raw counts
adata.layers['counts'] = adata.X.copy()
# 5. Normalize
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
adata.X = np.log1p(adata.X)
# 6. Feature selection
gene_var = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_var > np.percentile(gene_var, 90)
# 7. Dimensionality reduction (example with external tools)
# adata.obsm['X_pca'] = compute_pca(adata.X)
# adata.obsm['X_umap'] = compute_umap(adata.obsm['X_pca'])
# 8. Save results
adata.write('analyzed.h5ad', compression='gzip')
```
### Batch Integration
Combining multiple experimental batches:
```python
# Load batches
batches = [ad.read_h5ad(f'batch_{i}.h5ad') for i in range(3)]
# Concatenate with tracking
combined = ad.concat(
batches,
axis=0,
join='outer',
label='batch',
keys=['batch_0', 'batch_1', 'batch_2'],
index_unique='-'
)
# Add batch as numeric for correction algorithms
combined.obs['batch_numeric'] = combined.obs['batch'].cat.codes
# Perform batch correction (with external tools)
# corrected_pca = run_harmony(combined.obsm['X_pca'], combined.obs['batch'])
# combined.obsm['X_pca_corrected'] = corrected_pca
# Save integrated data
combined.write('integrated.h5ad', compression='gzip')
```
### Memory-Efficient Large Dataset Processing
Working with datasets too large for memory:
```python
# Read in backed mode
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
# Compute statistics in chunks
total = 0
for chunk in adata.chunk_X(chunk_size=1000):
total += chunk.sum()
mean_expression = total / (adata.n_obs * adata.n_vars)
# Work with subset
high_quality_cells = adata.obs.n_genes > 1000
subset = adata[high_quality_cells, :].copy()
# Close file
adata.file.close()
```
## Best Practices
### Data Organization
1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
3. **Use uns for metadata**: Analysis parameters, dates, version information
4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
### Subsetting
1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
3. **Validate after subsetting**: Check dimensions and data integrity
### File I/O
1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
3. **Close backed files**: Always close file connections when done
4. **Use backed mode for large files**: Don't load everything into memory if not needed
### Concatenation
1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
2. **Track sources**: Use `label` and `keys` to track data origin
3. **Handle duplicates**: Use `index_unique` to make observation names unique
4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
### Memory Management
1. **Use sparse matrices**: For data with <30% non-zero values
2. **Convert to categoricals**: For repeated string values
3. **Process in chunks**: For operations on very large matrices
4. **Use backed mode**: Read large files with `backed='r'`
### Naming Conventions
Follow these conventions for consistency:
- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
- **Variables**: Use snake_case like `highly_variable`, `gene_name`
## Reference Documentation
For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory:
1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters.
2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows.
3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns.
## When to Load References
Load reference files into context when:
- Implementing complex concatenation with specific merge strategies
- Troubleshooting errors or unexpected behavior
- Optimizing memory usage for large datasets
- Implementing complete analysis workflows
- Understanding nuances of specific API methods
To search within references without loading them:
```python
# Example: Search for information about backed mode
grep -r "backed mode" references/
```
## Common Error Patterns
### Memory Errors
**Problem**: "MemoryError: Unable to allocate array"
**Solution**: Use backed mode, sparse matrices, or process in chunks
### Dimension Mismatch
**Problem**: "ValueError: operands could not be broadcast together"
**Solution**: Use outer join in concatenation or align indices before operations
### View Modification
**Problem**: "ValueError: assignment destination is read-only"
**Solution**: Convert view to copy with `.copy()` before modification
### File Already Open
**Problem**: "OSError: Unable to open file (file is already open)"
**Solution**: Close previous file connection with `adata.file.close()`


@@ -1,218 +0,0 @@
# AnnData API Reference
## Core AnnData Class
The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.
### Core Attributes
| Attribute | Type | Description |
|-----------|------|-------------|
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) |
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) |
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
| **layers** | dict-like | Additional data matrices matching X's dimensions |
| **raw** | AnnData | Stores original versions of X and var before transformations |
### Dimensional Properties
- **n_obs**: Number of observations (sample count)
- **n_vars**: Number of variables/features
- **shape**: Tuple returning (n_obs, n_vars)
- **T**: Transposed view of the entire object
### State Properties
- **isbacked**: Boolean indicating disk-backed storage status
- **is_view**: Boolean identifying whether object is a view of another AnnData
- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode
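A quick way to inspect these properties on any object:
```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
print(adata.shape)         # (100, 2000) == (n_obs, n_vars)
print(adata.n_obs, adata.n_vars)
print(adata.T.shape)       # (2000, 100) -- transposed view
print(adata[:10].is_view)  # True: subsetting returns a view
print(adata.isbacked)      # False unless opened with backed='r'
```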
### Key Methods
#### Construction and Copying
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object
- **`copy(filename=None)`**: Create full copy, optionally stored on disk
#### Subsetting and Views
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default)
- **`.copy()`**: Convert view to independent object
#### Data Access
- **`to_df(layer=None)`**: Generate pandas DataFrame representation
- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations
- **`var_vector(k, layer=None)`**: Extract 1D array for a variable
- **`chunk_X(chunk_size)`**: Iterate over data matrix in chunks
- **`chunked_X(chunk_size)`**: Context manager for chunked iteration
#### Transformation
- **`transpose()`**: Return transposed object
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along observation axis
- **`to_memory(copy=False)`**: Load all backed arrays into RAM
#### File I/O
- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format
- **`write_zarr(store, ...)`**: Export hierarchical Zarr store
- **`write_loom(filename, ...)`**: Output .loom format file
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files
#### Data Management
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
- **`rename_categories(key, categories)`**: Update category labels in annotations
- **`obs_names_make_unique(join='-')`**: Append numeric suffixes to duplicate observation names
- **`var_names_make_unique(join='-')`**: Append numeric suffixes to duplicate variable names
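A small self-contained example of these helpers (values are illustrative):
```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.random.rand(4, 3))
adata.obs['cell_type'] = ['B', 'T', 'B', 'T']
adata.strings_to_categoricals()                        # object columns -> categorical dtype
adata.rename_categories('cell_type', ['B cell', 'T cell'])

adata.obs_names = ['c0', 'c0', 'c1', 'c2']             # deliberately duplicated
adata.obs_names_make_unique()                          # duplicates get numeric suffixes (c0, c0-1, ...)
adata.var_names_make_unique()
print(adata.obs_names.tolist())
```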
## Module-Level Functions
### Reading Functions
#### Native Formats
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
- **`read_zarr(store)`**: Access hierarchical Zarr array stores
#### Alternative Formats
- **`read_csv(filename, ...)`**: Import from CSV files
- **`read_excel(filename, ...)`**: Import from Excel files
- **`read_hdf(filename, key)`**: Read from HDF5 files
- **`read_loom(filename, ...)`**: Import from .loom files
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
- **`read_text(filename, ...)`**: Import from text files
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format
#### Element-Level Access
- **`read_elem(elem)`**: Retrieve specific components from storage
- **`sparse_dataset(group)`**: Generate backed sparse matrix classes
### Combining Operations
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
- **axis**: 0 (observations) or 1 (variables)
- **join**: 'inner' (intersection) or 'outer' (union)
- **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None)
- **label**: Column name for source tracking
- **keys**: Dataset identifiers for source annotation
- **index_unique**: Separator for making duplicate indices unique
### Writing Functions
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays
- **`write_elem(elem, ...)`**: Write individual components
### Experimental Features
- **`AnnCollection`**: Batch processing for large collections
- **`AnnLoader`**: PyTorch DataLoader integration
- **`concat_on_disk(*adatas, filename, ...)`**: Memory-efficient out-of-core concatenation
- **`read_lazy(filename)`**: Lazy loading with deferred computation
- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks
- **`write_dispatched(filename, ...)`**: Custom writing with callbacks
### Configuration
- **`settings`**: Package-wide configuration object
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes
## Common Usage Patterns
### Creating AnnData Objects
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
# From dense array
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)
# From sparse matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
# With metadata
import pandas as pd
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```
### Subsetting
```python
# By names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean mask
b_cells = adata[adata.obs.cell_type == 'B']
# By position
first_five = adata[:5, :100]
# Convert view to copy
adata_copy = adata[:5].copy()
```
### Adding Annotations
```python
# Cell-level metadata
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)
# Gene-level metadata
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)
# Embeddings
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))
# Alternative data representations
adata.layers['log_transformed'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
# Unstructured metadata
adata.uns['experiment_date'] = '2024-01-15'
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
```
### File I/O
```python
# Write to disk
adata.write('my_results.h5ad', compression='gzip')
# Read into memory
adata = ad.read_h5ad('my_results.h5ad')
# Read in backed mode (memory-efficient)
adata = ad.read_h5ad('my_results.h5ad', backed='r')
# Close file connection
adata.file.close()
```
### Concatenation
```python
# Combine multiple datasets
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))
# Simple concatenation
combined = ad.concat([adata1, adata2, adata3], axis=0)
# With source labels
combined = ad.concat(
[adata1, adata2, adata3],
axis=0,
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Inner join (only shared variables)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')
# Outer join (all variables, pad with zeros)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
```


@@ -1,478 +0,0 @@
# AnnData Concatenation Guide
## Overview
The `concat()` function combines multiple AnnData objects through two fundamental operations:
1. **Concatenation**: Stacking sub-elements in order
2. **Merging**: Combining collections into one result
## Basic Concatenation
### Syntax
```python
import anndata as ad
combined = ad.concat(
adatas, # List of AnnData objects
axis=0, # 0=observations, 1=variables
join='inner', # 'inner' or 'outer'
merge=None, # Merge strategy for non-concat axis
label=None, # Column name for source tracking
keys=None, # Dataset identifiers
index_unique=None, # Separator for unique indices
fill_value=None, # Fill value for missing data
pairwise=False # Include pairwise matrices
)
```
### Concatenating Observations (Cells)
```python
# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))
combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```
### Concatenating Variables (Genes)
```python
# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))
combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```
## Join Strategies
### Inner Join (Intersection)
Keeps only shared features across all objects.
```python
# Datasets with different genes
adata1 = ad.AnnData(
np.random.rand(100, 2000),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
np.random.rand(150, 1800),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)
# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```
**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise analysis
- You need a complete case analysis
**Trade-offs:**
- May lose many features
- Ensures no missing data
- Smaller result size
### Outer Join (Union)
Keeps all features from all objects, padding with fill values (default 0).
```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0
# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```
**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent
**Trade-offs:**
- Introduces zeros/missing values
- Larger result size
- May need imputation
## Merge Strategies
Merge strategies control how elements on the non-concatenation axis are combined.
### merge=None (Default)
Excludes all non-concatenation axis elements.
```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800
# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```
**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging
### merge='same'
Keeps only annotations with identical values across datasets.
```python
# Same annotation values
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000  # agrees with adata1 on the shared genes
# merge='same': keeps chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```
**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important
**Note:** Comparison occurs after index alignment - only shared indices need to match.
### merge='unique'
Includes annotations with a single possible value.
```python
# Unique values per gene
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]
# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```
**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific
### merge='first'
Takes the first occurrence of each annotation.
```python
# Different annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800
# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
# Uses descriptions from adata1
```
**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple resolution strategy
### merge='only'
Retains annotations appearing in exactly one object.
```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800
# merge='only': keeps both (no conflicts)
combined = ad.concat([adata1, adata2], merge='only')
```
**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata
## Source Tracking
### Using label
Add a categorical column to track data origin.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='batch'
)
# Creates obs['batch'] with values 0, 1, 2
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
```
### Using keys
Provide custom names for source tracking.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='study',
keys=['control', 'treatment_a', 'treatment_b']
)
# Creates obs['study'] with custom names
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
```
### Making Indices Unique
Append source identifiers to duplicate observation names.
```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
# index_unique adds suffix
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```
## Handling Different Attributes
### X Matrix and Layers
Follows join strategy. Missing values filled according to `fill_value`.
```python
# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()
# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```
### obs and var DataFrames
- **obs**: Concatenated along concatenation axis
- **var**: Handled by merge strategy
```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150
combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells
```
### obsm and varm
Multi-dimensional annotations follow same rules as layers.
```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)
combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```
### obsp and varp
Pairwise matrices excluded by default. Enable with `pairwise=True`.
```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)
# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp
# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in a padded block-diagonal matrix
```
### uns Dictionary
Merged recursively, applying merge strategy at any nesting depth.
```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
# merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], merge='same')
# combined.uns['experiment'] = {'lab': 'A'}
```
## Advanced Patterns
### Batch Integration Pipeline
```python
import anndata as ad
# Load batches
batches = [
ad.read_h5ad(f'batch_{i}.h5ad')
for i in range(5)
]
# Concatenate with tracking
combined = ad.concat(
batches,
axis=0,
join='outer',
merge='first',
label='batch_id',
keys=[f'batch_{i}' for i in range(5)],
index_unique='-'
)
# Add batch effects
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```
### Multi-Study Meta-Analysis
```python
# Different studies with varying gene coverage
studies = {
'study_a': ad.read_h5ad('study_a.h5ad'),
'study_b': ad.read_h5ad('study_b.h5ad'),
'study_c': ad.read_h5ad('study_c.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(studies.values()),
axis=0,
join='outer',
label='study',
keys=list(studies.keys()),
merge='unique',
fill_value=0
)
# Track coverage
for study in studies:
n_genes = studies[study].n_vars
combined.uns[f'{study}_n_genes'] = n_genes
```
### Incremental Concatenation
```python
# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
chunk_files = all_files[i:i+chunk_size]
chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
chunk_combined = ad.concat(chunk_adatas)
if result is None:
result = chunk_combined
else:
result = ad.concat([result, chunk_combined])
```
### Memory-Efficient On-Disk Concatenation
```python
# Experimental feature for large datasets
from anndata.experimental import concat_on_disk
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
files,
'combined.h5ad',
join='outer'
)
# Read result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```
## Troubleshooting
### Issue: Dimension Mismatch
```python
# Error: shapes don't match
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))
# Solution: use outer join
combined = ad.concat([adata1, adata2], join='outer')
```
### Issue: Memory Error
```python
# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]
# Solution: read and concatenate incrementally
result = None
for file in many_files:
adata = ad.read_h5ad(file)
if result is None:
result = adata
else:
result = ad.concat([result, adata])
del adata # Free memory
```
### Issue: Duplicate Indices
```python
# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
```
### Issue: Lost Annotations
```python
# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2
combined = ad.concat([adata1, adata2]) # merge=None by default
# Solution: use appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```
## Performance Tips
1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups for many datasets
4. **Choose inner join**: When possible, to reduce result size
5. **Use categoricals**: Convert string annotations before concatenating
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
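A compact sketch that combines several of these tips before a large concatenation:
```python
import anndata as ad
from scipy.sparse import csr_matrix

adatas = [ad.read_h5ad(f) for f in ['batch1.h5ad', 'batch2.h5ad']]
for a in adatas:
    a.var_names_make_unique()     # tip 1: pre-align / de-duplicate indices
    a.X = csr_matrix(a.X)         # tip 2: sparse before concatenating
    a.strings_to_categoricals()   # tip 5: categoricals before concatenating

combined = ad.concat(adatas, join='inner', label='batch', keys=['b1', 'b2'])
```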


@@ -1,438 +0,0 @@
# AnnData Workflows and Best Practices
## Common Workflows
### 1. Single-Cell RNA-seq Analysis Workflow
#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd
# Load from 10X format
adata = ad.read_mtx('matrix.mtx')
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]
# Or load from pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```
#### Quality Control
```python
# Calculate QC metrics
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]
# Filter genes
min_cells = 3
adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
```
#### Normalization and Preprocessing
```python
# Store raw counts
adata.layers['counts'] = adata.X.copy()
# Normalize
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']
# Identify highly variable genes
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```
#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)
# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}
# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```
#### Clustering
```python
# Store cluster assignments
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])
# Store cluster centroids (varm entries must have n_vars rows, e.g. shape (n_vars, n_clusters))
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids
```
#### Save Results
```python
# Save complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```
### 2. Batch Integration Workflow
```python
import anndata as ad
# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with batch labels
adata = ad.concat(
[batch1, batch2, batch3],
axis=0,
label='batch',
keys=['batch1', 'batch2', 'batch3'],
index_unique='-'
)
# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)
# Store corrected embeddings
# adata.obsm['X_pca_corrected'] = corrected_pca
# adata.obsm['X_umap_corrected'] = corrected_umap
```
### 3. Memory-Efficient Large Dataset Workflow
```python
import anndata as ad
# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")
# Work with chunks
for chunk in adata.chunk_X(chunk_size=1000):
# Process chunk
result = process_chunk(chunk)
# Close file when done
adata.file.close()
```
### 4. Multi-Dataset Comparison Workflow
```python
import anndata as ad
# Load datasets
datasets = {
'study1': ad.read_h5ad('study1.h5ad'),
'study2': ad.read_h5ad('study2.h5ad'),
'study3': ad.read_h5ad('study3.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(datasets.values()),
axis=0,
join='outer',
label='study',
keys=list(datasets.keys()),
merge='first'
)
# Handle missing data
combined.X[np.isnan(combined.X)] = 0
# Add dataset-specific metadata
combined.uns['datasets'] = {
'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```
## Best Practices
### Memory Management
#### Use Sparse Matrices
```python
from scipy.sparse import csr_matrix
# Convert to sparse if the matrix is mostly zeros
density = (adata.X != 0).sum() / adata.X.size
if density < 0.3: # Less than 30% non-zero
adata.X = csr_matrix(adata.X)
```
#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')
# Only load what you need
subset = adata[:1000, :500].copy() # Now in memory
```
#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()
# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
### Data Organization
#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```
#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates
# Gene loadings
adata.varm['PCs'] = principal_components
```
#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
'normalization': 'TPM',
'min_genes': 200,
'min_cells': 3,
'date': '2024-01-15'
}
# Store analysis results
adata.uns['differential_expression'] = {
'method': 't-test',
'p_value_threshold': 0.05
}
```
### Subsetting and Views
#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell'] # View
print(subset.is_view) # True
# Views are memory efficient but modifications affect original
subset.obs['new_column'] = value # Modifies original adata
# Create independent copy when needed
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```
#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()
# Good - chain operations
result = adata[
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
adata.var.highly_variable
].copy()
```
### File I/O
#### Use Compression
```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```
#### Choose the Right Format
```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')
# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')
# Loom for compatibility with other tools
adata.write_loom('data.loom')
```
#### Close File Connections
```python
# Use context manager pattern
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
# Work with data
process(adata)
finally:
adata.file.close()
```
### Concatenation
#### Choose Appropriate Join Strategy
```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')
# Outer join - all features (keeps all data, may introduce zeros)
combined = ad.concat([adata1, adata2], join='outer')
```
#### Track Data Sources
```python
# Add source labels
combined = ad.concat(
[adata1, adata2, adata3],
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Make indices unique
combined = ad.concat(
[adata1, adata2, adata3],
index_unique='-'
)
```
#### Handle Variable-Specific Metadata
```python
# Use merge strategy for var annotations
combined = ad.concat(
[adata1, adata2],
merge='same', # Keep only identical annotations
join='outer'
)
```
### Naming Conventions
#### Use Consistent Naming
```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']
# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']
# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```
#### Make Indices Unique
```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```
### Error Handling
#### Validate Data Structure
```python
# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"
# Check for NaN values
if np.isnan(adata.X).any():
print("Warning: NaN values detected")
# Check for negative values in count data
if (adata.X < 0).any():
print("Warning: Negative values in count data")
```
#### Handle Missing Data
```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
print("Warning: Missing cell type annotations")
# Fill or remove
adata = adata[~adata.obs['cell_type'].isna()]
```
## Common Pitfalls
### 1. Forgetting to Copy Views
```python
# BAD - modifies original
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data # Changes original adata!
# GOOD
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data # Only changes subset
```
### 2. Mixing Backed and In-Memory Operations
```python
# BAD - trying to modify backed data
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100 # Error: can't modify backed data
# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100 # Works
```
### 3. Not Using Categoricals for Metadata
```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000
# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```
### 4. Incorrect Concatenation Axis
```python
# Concatenating observations (cells)
combined = ad.concat([adata1, adata2], axis=0) # Correct
# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1) # Less common
```
### 5. Not Preserving Raw Data
```python
# BAD - loses original data
adata.X = normalized_data
# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```


@@ -1,415 +0,0 @@
---
name: arboreto
description: "Gene regulatory network inference with GRNBoost2/GENIE3 algorithms. Infer TF-target relationships from expression data, scalable with Dask, for scRNA-seq and GRN analysis."
---
# Arboreto - Gene Regulatory Network Inference
## Overview
Arboreto is a Python library for inferring gene regulatory networks (GRNs) from gene expression data using machine learning algorithms. It enables scalable GRN inference from single machines to multi-node clusters using Dask for distributed computing. The skill provides comprehensive support for both GRNBoost2 (fast gradient boosting) and GENIE3 (Random Forest) algorithms.
## When to Use This Skill
This skill should be used when:
- Inferring regulatory relationships between genes from expression data
- Analyzing single-cell or bulk RNA-seq data to identify transcription factor targets
- Building the GRN inference component of a pySCENIC pipeline
- Comparing GRNBoost2 and GENIE3 algorithm performance
- Setting up distributed computing for large-scale genomic analyses
- Troubleshooting arboreto installation or runtime issues
## Core Capabilities
### 1. Basic GRN Inference
For standard gene regulatory network inference tasks:
**Key considerations:**
- Expression data format: Rows = observations (cells/samples), Columns = genes
- If data has genes as rows, transpose it first: `expression_df.T`
- Always include `seed` parameter for reproducible results
- Transcription factor list is optional but recommended for focused analysis
**Typical workflow:**
```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
# Load expression data (ensure correct orientation)
expression_data = pd.read_csv('expression_data.tsv', sep='\t', index_col=0)
# Optional: Load TF names
tf_names = load_tf_names('transcription_factors.txt')
# Run inference
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=42 # For reproducibility
)
# Save results
network.to_csv('network_output.tsv', sep='\t', index=False)
```
**Output format:**
- DataFrame with columns: `['TF', 'target', 'importance']`
- Higher importance scores indicate stronger predicted regulatory relationships
- Typically sorted by importance (descending)
**Multiprocessing requirement:**
All arboreto code must include `if __name__ == '__main__':` protection due to Dask's multiprocessing requirements:
```python
if __name__ == '__main__':
# Arboreto code goes here
network = grnboost2(expression_data=expr_data, seed=42)
```
### 2. Algorithm Selection
**GRNBoost2 (Recommended for most cases):**
- ~10-100x faster than GENIE3
- Uses stochastic gradient boosting with early-stopping
- Best for: Large datasets (>10k observations), time-sensitive analyses
- Function: `arboreto.algo.grnboost2()`
**GENIE3:**
- Uses Random Forest regression
- More established, classical approach
- Best for: Small datasets, methodological comparisons, reproducing published results
- Function: `arboreto.algo.genie3()`
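Switching algorithms is a one-line change; `genie3` takes the same arguments and returns the same `['TF', 'target', 'importance']` DataFrame (reusing `expression_data` and `tf_names` from the workflow above):
```python
from arboreto.algo import genie3

# Same call signature and output columns as grnboost2
network = genie3(
    expression_data=expression_data,
    tf_names=tf_names,
    seed=42
)
```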
**When to compare both algorithms:**
Use the provided `compare_algorithms.py` script when:
- Validating results for critical analyses
- Benchmarking performance on new datasets
- Publishing research requiring methodological comparisons
### 3. Distributed Computing
**Local execution (default):**
Arboreto automatically creates a local Dask client. No configuration needed:
```python
network = grnboost2(expression_data=expr_data)
```
**Custom local cluster (recommended for better control):**
```python
from dask.distributed import Client, LocalCluster
# Configure cluster
cluster = LocalCluster(
n_workers=4,
threads_per_worker=2,
memory_limit='4GB',
diagnostics_port=8787 # Dashboard at http://localhost:8787
)
client = Client(cluster)
# Run inference
network = grnboost2(
expression_data=expr_data,
client_or_address=client
)
# Clean up
client.close()
cluster.close()
```
**Distributed cluster (multi-node):**
On scheduler node:
```bash
dask-scheduler --no-bokeh
```
On worker nodes:
```bash
dask-worker scheduler-address:8786 --local-dir /tmp
```
In Python:
```python
from dask.distributed import Client
client = Client('scheduler-address:8786')
network = grnboost2(expression_data=expr_data, client_or_address=client)
```
### 4. Data Preparation
**Common data format issues:**
1. **Transposed data** (genes as rows instead of columns):
```python
# If genes are rows, transpose
expression_data = pd.read_csv('data.tsv', sep='\t', index_col=0).T
```
2. **Missing gene names:**
```python
# Provide gene names if using numpy array
network = grnboost2(
expression_data=expr_array,
gene_names=['Gene1', 'Gene2', 'Gene3', ...],
seed=42
)
```
3. **Transcription factor specification:**
```python
# Option 1: Python list
tf_names = ['Sox2', 'Oct4', 'Nanog', 'Klf4']
# Option 2: Load from file (one TF per line)
from arboreto.utils import load_tf_names
tf_names = load_tf_names('tf_names.txt')
```
### 5. Reproducibility
Always specify a seed for consistent results:
```python
network = grnboost2(expression_data=expr_data, seed=42)
```
Without a seed, results will vary between runs due to algorithm randomness.
### 6. Result Interpretation
**Understanding the output:**
- `TF`: Transcription factor (regulator) gene
- `target`: Target gene being regulated
- `importance`: Strength of predicted regulatory relationship
**Typical post-processing:**
```python
# Filter by importance threshold
high_confidence = network[network['importance'] > 10]
# Get top N predictions
top_predictions = network.head(1000)
# Find all targets of a specific TF
sox2_targets = network[network['TF'] == 'Sox2']
# Count regulations per TF
tf_counts = network['TF'].value_counts()
```
## Installation
**Recommended (via conda):**
```bash
conda install -c bioconda arboreto
```
**Via pip:**
```bash
pip install arboreto
```
**From source:**
```bash
git clone https://github.com/tmoerman/arboreto.git
cd arboreto
pip install .
```
**Dependencies:**
- pandas
- numpy
- scikit-learn
- scipy
- dask
- distributed
## Troubleshooting
### Issue: Bokeh error when launching Dask scheduler
**Error:** `TypeError: got an unexpected keyword argument 'host'`
**Solutions:**
- Use `dask-scheduler --no-bokeh` to disable Bokeh
- Upgrade to Dask distributed >= 0.20.0
### Issue: Workers not connecting to scheduler
**Symptoms:** Worker processes start but fail to establish connections
**Solutions:**
- Remove `dask-worker-space` directory before restarting workers
- Specify adequate `local_dir` when creating cluster:
```python
cluster = LocalCluster(
worker_kwargs={'local_dir': '/tmp'}
)
```
### Issue: Memory errors with large datasets
**Solutions:**
- Increase worker memory limits: `memory_limit='8GB'`
- Distribute across more nodes
- Reduce dataset size through preprocessing (e.g., feature selection)
- Ensure expression matrix fits in available RAM
### Issue: Inconsistent results across runs
**Solution:** Always specify a `seed` parameter:
```python
network = grnboost2(expression_data=expr_data, seed=42)
```
### Issue: Import errors or missing dependencies
**Solution:** Use conda installation to handle numerical library dependencies:
```bash
conda create --name arboreto-env
conda activate arboreto-env
conda install -c bioconda arboreto
```
## Provided Scripts
This skill includes ready-to-use scripts for common workflows:
### scripts/basic_grn_inference.py
Command-line tool for standard GRN inference workflow.
**Usage:**
```bash
python scripts/basic_grn_inference.py expression_data.tsv \
-t tf_names.txt \
-o network.tsv \
-s 42 \
--transpose # if genes are rows
```
**Features:**
- Automatic data loading and validation
- Optional TF list specification
- Configurable output format
- Data transposition support
- Summary statistics
### scripts/distributed_inference.py
GRN inference with custom Dask cluster configuration.
**Usage:**
```bash
python scripts/distributed_inference.py expression_data.tsv \
-t tf_names.txt \
-w 8 \
-m 4GB \
--threads 2 \
--dashboard-port 8787
```
**Features:**
- Configurable worker count and memory limits
- Dask dashboard integration
- Thread configuration
- Resource monitoring
### scripts/compare_algorithms.py
Compare GRNBoost2 and GENIE3 side-by-side.
**Usage:**
```bash
python scripts/compare_algorithms.py expression_data.tsv \
-t tf_names.txt \
--top-n 100
```
**Features:**
- Runtime comparison
- Network statistics
- Prediction overlap analysis
- Top prediction comparison
## Reference Documentation
Detailed API documentation is available in [references/api_reference.md](references/api_reference.md), including:
- Complete parameter descriptions for all functions
- Data format specifications
- Distributed computing configuration
- Performance optimization tips
- Integration with pySCENIC
- Comprehensive examples
Load this reference when:
- Working with advanced Dask configurations
- Troubleshooting complex deployment scenarios
- Understanding algorithm internals
- Optimizing performance for specific use cases
## Integration with pySCENIC
Arboreto is the first step in the pySCENIC single-cell analysis pipeline:
1. **GRN Inference (arboreto)** ← This skill
- Input: Expression matrix
- Output: Regulatory network
2. **Regulon Prediction (pySCENIC)**
- Input: Network from arboreto
- Output: Refined regulons
3. **Cell Type Identification (pySCENIC)**
- Input: Regulons
- Output: Cell type scores
When working with pySCENIC, use arboreto to generate the initial network, then pass results to the pySCENIC pipeline.
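A hedged sketch of that handoff, assuming `expression_data` and `tf_names` are loaded as in the workflow above; the pySCENIC commands named in the comments are illustrative:
```python
from arboreto.algo import grnboost2

# Assumes expression_data and tf_names are loaded as shown earlier
network = grnboost2(expression_data=expression_data, tf_names=tf_names, seed=42)

# Persist adjacencies in the three-column TSV layout ('TF', 'target', 'importance')
# that the pySCENIC regulon-prediction step consumes
network.to_csv('adjacencies.tsv', sep='\t', index=False)
# Next steps (outside this skill): regulon prediction (e.g. `pyscenic ctx`)
# followed by per-cell scoring (e.g. `pyscenic aucell`).
```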
## Best Practices
1. **Always use seed parameter** for reproducible research
2. **Validate data orientation** (rows = observations, columns = genes)
3. **Specify TF list** when known to focus inference and improve speed
4. **Monitor with Dask dashboard** for distributed computing
5. **Save intermediate results** to avoid re-running long computations
6. **Filter results** by importance threshold for downstream analysis
7. **Use GRNBoost2 by default** unless specifically requiring GENIE3
8. **Include multiprocessing guard** (`if __name__ == '__main__':`) in all scripts
## Quick Reference
**Basic inference:**
```python
from arboreto.algo import grnboost2
network = grnboost2(expression_data=expr_df, seed=42)
```
**With TF specification:**
```python
network = grnboost2(expression_data=expr_df, tf_names=tf_list, seed=42)
```
**With custom Dask client:**
```python
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
network = grnboost2(expression_data=expr_df, client_or_address=client, seed=42)
client.close()
cluster.close()
```
**Load TF names:**
```python
from arboreto.utils import load_tf_names
tf_names = load_tf_names('transcription_factors.txt')
```
**Transpose data:**
```python
expression_df = pd.read_csv('data.tsv', sep='\t', index_col=0).T
```


@@ -1,271 +0,0 @@
# Arboreto API Reference
This document provides comprehensive API documentation for the arboreto package, a Python library for gene regulatory network (GRN) inference.
## Overview
Arboreto enables inference of gene regulatory networks from expression data using machine learning algorithms. It supports distributed computing via Dask for scalability from single machines to multi-node clusters.
**Current Version:** 0.1.5
**GitHub:** https://github.com/tmoerman/arboreto
**License:** BSD 3-Clause
## Core Algorithms
### GRNBoost2
The flagship algorithm for fast gene regulatory network inference using stochastic gradient boosting.
**Function:** `arboreto.algo.grnboost2()`
**Parameters:**
- `expression_data` (pandas.DataFrame or numpy.ndarray): Expression matrix where rows are observations (cells/samples) and columns are genes. Required.
- `gene_names` (list, optional): List of gene names matching column order. If None, uses DataFrame column names.
- `tf_names` (list, optional): List of transcription factor names to consider as regulators. If None, all genes are considered potential regulators.
- `seed` (int, optional): Random seed for reproducibility. Recommended when consistent results are needed across runs.
- `client_or_address` (dask.distributed.Client or str, optional): Custom Dask client or scheduler address for distributed computing. If None, creates a default local client.
- `verbose` (bool, optional): Enable verbose output for debugging.
**Returns:**
- pandas.DataFrame with columns `['TF', 'target', 'importance']` representing inferred regulatory links. Each row represents a regulatory relationship with an importance score.
**Algorithm Details:**
- Uses stochastic gradient boosting with early-stopping regularization
- Much faster than GENIE3, especially for large datasets (tens of thousands of observations)
- Extracts important features from trained regression models to identify regulatory relationships
- Recommended as the default choice for most use cases
**Example:**
```python
from arboreto.algo import grnboost2
import pandas as pd
# Load expression data
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t', index_col=0)
tf_list = ['TF1', 'TF2', 'TF3'] # Optional: specify TFs
# Run inference
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_list,
seed=42 # For reproducibility
)
# Save results
network.to_csv('output_network.tsv', sep='\t', index=False)
```
### GENIE3
Classical gene regulatory network inference using Random Forest regression.
**Function:** `arboreto.algo.genie3()`
**Parameters:**
Same as GRNBoost2 (see above).
**Returns:**
Same format as GRNBoost2 (see above).
**Algorithm Details:**
- Uses Random Forest or ExtraTrees regression models
- Serves as the blueprint for the multiple-regression strategy for GRN inference
- More computationally expensive than GRNBoost2
- Better suited for smaller datasets or when maximum accuracy is needed
**When to Use GENIE3 vs GRNBoost2:**
- **Use GRNBoost2:** For large datasets, faster results, or when computational resources are limited
- **Use GENIE3:** For smaller datasets, when following established protocols, or for comparison with published results
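A minimal GENIE3 call mirrors the GRNBoost2 example above (same parameters, same return format):
```python
from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_list,
    seed=42
)
```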
## Module Structure
### arboreto.algo
Primary module for typical users. Contains high-level inference functions.
**Main Functions:**
- `grnboost2()` - Fast GRN inference using gradient boosting
- `genie3()` - Classical GRN inference using Random Forest
### arboreto.core
Advanced module for power users. Contains low-level framework components for custom implementations.
**Use cases:**
- Custom inference pipelines
- Algorithm modifications
- Performance tuning
### arboreto.utils
Utility functions for common data processing tasks.
**Key Functions:**
- `load_tf_names(filename)` - Load transcription factor names from file
- Reads a text file with one TF name per line
- Returns a list of TF names
- Example: `tf_names = load_tf_names('transcription_factors.txt')`
## Data Format Requirements
### Input Format
**Expression Matrix:**
- **Format:** pandas DataFrame or numpy ndarray
- **Orientation:** Rows = observations (cells/samples), Columns = genes
- **Convention:** Follows scikit-learn format
- **Gene Names:** Column names (DataFrame) or separate `gene_names` parameter
- **Data Type:** Numeric (float or int)
**Common Mistake:** If data is transposed (genes as rows), use pandas to transpose:
```python
expression_df = pd.read_csv('data.tsv', sep='\t', index_col=0).T
```
**Transcription Factor List:**
- **Format:** Python list of strings or text file (one TF per line)
- **Optional:** If not provided, all genes are considered potential regulators
- **Example:** `['Sox2', 'Oct4', 'Nanog']`
### Output Format
**Network DataFrame:**
- **Columns:**
- `TF` (str): Transcription factor (regulator) gene name
- `target` (str): Target gene name
- `importance` (float): Importance score of the regulatory relationship
- **Interpretation:** Higher importance scores indicate stronger predicted regulatory relationships
- **Sorting:** Typically sorted by importance (descending) for prioritization
**Example Output:**
```
TF target importance
Sox2 Gene1 15.234
Oct4 Gene1 12.456
Sox2 Gene2 8.901
```
## Distributed Computing with Dask
### Local Execution (Default)
Arboreto automatically creates a local Dask client if none is provided:
```python
network = grnboost2(expression_data=expr_matrix, tf_names=tf_list)
```
### Custom Local Cluster
For better control over resources or multiple inferences:
```python
from dask.distributed import Client, LocalCluster
# Configure cluster
cluster = LocalCluster(
n_workers=4,
threads_per_worker=2,
memory_limit='4GB'
)
client = Client(cluster)
# Run inference
network = grnboost2(
expression_data=expr_matrix,
tf_names=tf_list,
client_or_address=client
)
# Clean up
client.close()
cluster.close()
```
### Distributed Cluster
For multi-node computation:
**On scheduler node:**
```bash
dask-scheduler --no-bokeh # Use --no-bokeh to avoid Bokeh errors
```
**On worker nodes:**
```bash
dask-worker scheduler-address:8786 --local-dir /tmp
```
**In Python script:**
```python
from dask.distributed import Client
client = Client('scheduler-address:8786')
network = grnboost2(
expression_data=expr_matrix,
tf_names=tf_list,
client_or_address=client
)
```
### Dask Dashboard
Monitor computation progress via the Dask dashboard:
```python
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(diagnostics_port=8787)
client = Client(cluster)
# Dashboard available at: http://localhost:8787
```
## Reproducibility
To ensure reproducible results across runs:
```python
network = grnboost2(
expression_data=expr_matrix,
tf_names=tf_list,
seed=42 # Fixed seed ensures identical results
)
```
**Note:** Without a seed parameter, results may vary slightly between runs due to randomness in the algorithms.
## Performance Considerations
### Memory Management
- Expression matrices should fit in memory (RAM)
- For very large datasets, consider:
- Using a machine with more RAM
- Distributing across multiple nodes
  - Preprocessing to reduce dimensionality (see the sketch below)
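A plain-pandas preprocessing sketch (not part of the arboreto API) that keeps only the most variable genes before inference; the 5,000-gene cutoff is illustrative:
```python
# Assumes expression_df is a pandas DataFrame (observations x genes)
n_top = 5000  # illustrative cutoff
top_genes = expression_df.var(axis=0).nlargest(n_top).index
expression_small = expression_df[top_genes]
```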
### Worker Configuration
- **Local execution:** Number of workers = number of CPU cores (default)
- **Custom cluster:** Balance workers and threads based on available resources
- **Distributed execution:** Ensure adequate `local_dir` space on worker nodes
### Algorithm Choice
- **GRNBoost2:** ~10-100x faster than GENIE3 for large datasets
- **GENIE3:** More established but slower, better for small datasets (<10k observations)
## Integration with pySCENIC
Arboreto is a core component of the pySCENIC pipeline for single-cell RNA sequencing analysis:
1. **GRN Inference (Arboreto):** Infer regulatory networks using GRNBoost2
2. **Regulon Prediction:** Prune network and identify regulons
3. **Cell Type Identification:** Score regulons across cells
For pySCENIC workflows, arboreto is typically used in the first step to generate the initial regulatory network.
## Common Issues and Solutions
See the main SKILL.md for troubleshooting guidance.


@@ -1,110 +0,0 @@
#!/usr/bin/env python3
"""
Basic GRN inference script using arboreto GRNBoost2.
This script demonstrates the standard workflow for gene regulatory network inference:
1. Load expression data
2. Optionally load transcription factor names
3. Run GRNBoost2 inference
4. Save results
Usage:
python basic_grn_inference.py <expression_file> [options]
Example:
python basic_grn_inference.py expression_data.tsv -t tf_names.txt -o network.tsv
"""
import argparse
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
def main():
parser = argparse.ArgumentParser(
description='Infer gene regulatory network using GRNBoost2'
)
parser.add_argument(
'expression_file',
help='Path to expression data file (TSV/CSV format)'
)
parser.add_argument(
'-t', '--tf-file',
help='Path to file containing transcription factor names (one per line)',
default=None
)
parser.add_argument(
'-o', '--output',
help='Output file path for network results',
default='network_output.tsv'
)
parser.add_argument(
'-s', '--seed',
type=int,
help='Random seed for reproducibility',
default=42
)
parser.add_argument(
'--sep',
help='Separator for input file (default: tab)',
default='\t'
)
parser.add_argument(
'--transpose',
action='store_true',
help='Transpose the expression matrix (use if genes are rows)'
)
args = parser.parse_args()
# Load expression data
print(f"Loading expression data from {args.expression_file}...")
expression_data = pd.read_csv(args.expression_file, sep=args.sep, index_col=0)
# Transpose if needed
if args.transpose:
print("Transposing expression matrix...")
expression_data = expression_data.T
print(f"Expression data shape: {expression_data.shape}")
print(f" Observations (rows): {expression_data.shape[0]}")
print(f" Genes (columns): {expression_data.shape[1]}")
# Load TF names if provided
tf_names = None
if args.tf_file:
print(f"Loading transcription factor names from {args.tf_file}...")
tf_names = load_tf_names(args.tf_file)
print(f" Found {len(tf_names)} transcription factors")
else:
print("No TF file provided. Using all genes as potential regulators.")
# Run GRNBoost2
print("\nRunning GRNBoost2 inference...")
print(" (This may take a while depending on dataset size)")
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=args.seed
)
print(f"\nInference complete!")
print(f" Total regulatory links inferred: {len(network)}")
print(f" Unique TFs: {network['TF'].nunique()}")
print(f" Unique targets: {network['target'].nunique()}")
# Save results
print(f"\nSaving results to {args.output}...")
network.to_csv(args.output, sep='\t', index=False)
# Display top 10 predictions
print("\nTop 10 predicted regulatory relationships:")
print(network.head(10).to_string(index=False))
print("\nDone!")
if __name__ == '__main__':
main()


@@ -1,205 +0,0 @@
#!/usr/bin/env python3
"""
Compare GRNBoost2 and GENIE3 algorithms on the same dataset.
This script runs both algorithms on the same expression data and compares:
- Runtime
- Number of predicted links
- Top predicted relationships
- Overlap between predictions
Usage:
python compare_algorithms.py <expression_file> [options]
Example:
python compare_algorithms.py expression_data.tsv -t tf_names.txt
"""
import argparse
import time
import pandas as pd
from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names
def compare_networks(network1, network2, name1, name2, top_n=100):
"""Compare two inferred networks."""
print(f"\n{'='*60}")
print("Network Comparison")
print(f"{'='*60}")
# Basic statistics
print(f"\n{name1} Statistics:")
print(f" Total links: {len(network1)}")
print(f" Unique TFs: {network1['TF'].nunique()}")
print(f" Unique targets: {network1['target'].nunique()}")
print(f" Importance range: [{network1['importance'].min():.3f}, {network1['importance'].max():.3f}]")
print(f"\n{name2} Statistics:")
print(f" Total links: {len(network2)}")
print(f" Unique TFs: {network2['TF'].nunique()}")
print(f" Unique targets: {network2['target'].nunique()}")
print(f" Importance range: [{network2['importance'].min():.3f}, {network2['importance'].max():.3f}]")
# Compare top predictions
print(f"\nTop {top_n} Predictions Overlap:")
# Create edge sets for top N predictions
top_edges1 = set(
zip(network1.head(top_n)['TF'], network1.head(top_n)['target'])
)
top_edges2 = set(
zip(network2.head(top_n)['TF'], network2.head(top_n)['target'])
)
# Calculate overlap
overlap = top_edges1 & top_edges2
only_net1 = top_edges1 - top_edges2
only_net2 = top_edges2 - top_edges1
overlap_pct = (len(overlap) / top_n) * 100
print(f" Shared edges: {len(overlap)} ({overlap_pct:.1f}%)")
print(f" Only in {name1}: {len(only_net1)}")
print(f" Only in {name2}: {len(only_net2)}")
# Show some example overlapping edges
if overlap:
print(f"\nExample overlapping predictions:")
for i, (tf, target) in enumerate(list(overlap)[:5], 1):
print(f" {i}. {tf} -> {target}")
def main():
parser = argparse.ArgumentParser(
description='Compare GRNBoost2 and GENIE3 algorithms'
)
parser.add_argument(
'expression_file',
help='Path to expression data file (TSV/CSV format)'
)
parser.add_argument(
'-t', '--tf-file',
help='Path to file containing transcription factor names (one per line)',
default=None
)
parser.add_argument(
'--grnboost2-output',
help='Output file path for GRNBoost2 results',
default='grnboost2_network.tsv'
)
parser.add_argument(
'--genie3-output',
help='Output file path for GENIE3 results',
default='genie3_network.tsv'
)
parser.add_argument(
'-s', '--seed',
type=int,
help='Random seed for reproducibility',
default=42
)
parser.add_argument(
'--sep',
help='Separator for input file (default: tab)',
default='\t'
)
parser.add_argument(
'--transpose',
action='store_true',
help='Transpose the expression matrix (use if genes are rows)'
)
parser.add_argument(
'--top-n',
type=int,
help='Number of top predictions to compare (default: 100)',
default=100
)
args = parser.parse_args()
# Load expression data
print(f"Loading expression data from {args.expression_file}...")
expression_data = pd.read_csv(args.expression_file, sep=args.sep, index_col=0)
# Transpose if needed
if args.transpose:
print("Transposing expression matrix...")
expression_data = expression_data.T
print(f"Expression data shape: {expression_data.shape}")
print(f" Observations (rows): {expression_data.shape[0]}")
print(f" Genes (columns): {expression_data.shape[1]}")
# Load TF names if provided
tf_names = None
if args.tf_file:
print(f"Loading transcription factor names from {args.tf_file}...")
tf_names = load_tf_names(args.tf_file)
print(f" Found {len(tf_names)} transcription factors")
else:
print("No TF file provided. Using all genes as potential regulators.")
# Run GRNBoost2
print("\n" + "="*60)
print("Running GRNBoost2...")
print("="*60)
start_time = time.time()
grnboost2_network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=args.seed
)
grnboost2_time = time.time() - start_time
print(f"GRNBoost2 completed in {grnboost2_time:.2f} seconds")
# Save GRNBoost2 results
grnboost2_network.to_csv(args.grnboost2_output, sep='\t', index=False)
print(f"Results saved to {args.grnboost2_output}")
# Run GENIE3
print("\n" + "="*60)
print("Running GENIE3...")
print("="*60)
start_time = time.time()
genie3_network = genie3(
expression_data=expression_data,
tf_names=tf_names,
seed=args.seed
)
genie3_time = time.time() - start_time
print(f"GENIE3 completed in {genie3_time:.2f} seconds")
# Save GENIE3 results
genie3_network.to_csv(args.genie3_output, sep='\t', index=False)
print(f"Results saved to {args.genie3_output}")
# Compare runtimes
print("\n" + "="*60)
print("Runtime Comparison")
print("="*60)
print(f"GRNBoost2: {grnboost2_time:.2f} seconds")
print(f"GENIE3: {genie3_time:.2f} seconds")
speedup = genie3_time / grnboost2_time
print(f"Speedup: {speedup:.2f}x (GRNBoost2 is {speedup:.2f}x faster)")
# Compare networks
compare_networks(
grnboost2_network,
genie3_network,
"GRNBoost2",
"GENIE3",
top_n=args.top_n
)
print("\n" + "="*60)
print("Comparison complete!")
print("="*60)
if __name__ == '__main__':
main()


@@ -1,157 +0,0 @@
#!/usr/bin/env python3
"""
Distributed GRN inference script using arboreto with custom Dask configuration.
This script demonstrates how to use arboreto with a custom Dask LocalCluster
for better control over computational resources.
Usage:
python distributed_inference.py <expression_file> [options]
Example:
python distributed_inference.py expression_data.tsv -t tf_names.txt -w 8 -m 4GB
"""
import argparse
import pandas as pd
from dask.distributed import Client, LocalCluster
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
def main():
parser = argparse.ArgumentParser(
description='Distributed GRN inference using GRNBoost2 with custom Dask cluster'
)
parser.add_argument(
'expression_file',
help='Path to expression data file (TSV/CSV format)'
)
parser.add_argument(
'-t', '--tf-file',
help='Path to file containing transcription factor names (one per line)',
default=None
)
parser.add_argument(
'-o', '--output',
help='Output file path for network results',
default='network_output.tsv'
)
parser.add_argument(
'-s', '--seed',
type=int,
help='Random seed for reproducibility',
default=42
)
parser.add_argument(
'-w', '--workers',
type=int,
help='Number of Dask workers',
default=4
)
parser.add_argument(
'-m', '--memory-limit',
help='Memory limit per worker (e.g., "4GB", "2000MB")',
default='4GB'
)
parser.add_argument(
'--threads',
type=int,
help='Threads per worker',
default=2
)
parser.add_argument(
'--dashboard-port',
type=int,
help='Port for Dask dashboard (default: 8787)',
default=8787
)
parser.add_argument(
'--sep',
help='Separator for input file (default: tab)',
default='\t'
)
parser.add_argument(
'--transpose',
action='store_true',
help='Transpose the expression matrix (use if genes are rows)'
)
args = parser.parse_args()
# Load expression data
print(f"Loading expression data from {args.expression_file}...")
expression_data = pd.read_csv(args.expression_file, sep=args.sep, index_col=0)
# Transpose if needed
if args.transpose:
print("Transposing expression matrix...")
expression_data = expression_data.T
print(f"Expression data shape: {expression_data.shape}")
print(f" Observations (rows): {expression_data.shape[0]}")
print(f" Genes (columns): {expression_data.shape[1]}")
# Load TF names if provided
tf_names = None
if args.tf_file:
print(f"Loading transcription factor names from {args.tf_file}...")
tf_names = load_tf_names(args.tf_file)
print(f" Found {len(tf_names)} transcription factors")
else:
print("No TF file provided. Using all genes as potential regulators.")
# Set up Dask cluster
print(f"\nSetting up Dask LocalCluster...")
print(f" Workers: {args.workers}")
print(f" Threads per worker: {args.threads}")
print(f" Memory limit per worker: {args.memory_limit}")
print(f" Dashboard: http://localhost:{args.dashboard_port}")
cluster = LocalCluster(
n_workers=args.workers,
threads_per_worker=args.threads,
memory_limit=args.memory_limit,
diagnostics_port=args.dashboard_port
)
client = Client(cluster)
print(f"\nDask cluster ready!")
print(f" Dashboard available at: {client.dashboard_link}")
# Run GRNBoost2
print("\nRunning GRNBoost2 inference with distributed computation...")
print(" (Monitor progress via the Dask dashboard)")
try:
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=args.seed,
client_or_address=client
)
print(f"\nInference complete!")
print(f" Total regulatory links inferred: {len(network)}")
print(f" Unique TFs: {network['TF'].nunique()}")
print(f" Unique targets: {network['target'].nunique()}")
# Save results
print(f"\nSaving results to {args.output}...")
network.to_csv(args.output, sep='\t', index=False)
# Display top 10 predictions
print("\nTop 10 predicted regulatory relationships:")
print(network.head(10).to_string(index=False))
print("\nDone!")
finally:
# Clean up Dask resources
print("\nClosing Dask cluster...")
client.close()
cluster.close()
if __name__ == '__main__':
main()


@@ -1,790 +0,0 @@
---
name: astropy
description: "Astronomy toolkit. FITS I/O, celestial coordinate transforms, cosmology calculations, time systems, WCS, units, astronomical tables, for astronomical data analysis and imaging."
---
# Astropy
## Overview
Astropy is the community standard Python library for astronomy, providing core functionality for astronomical data analysis and computation. This skill provides comprehensive guidance and tools for working with astropy's extensive capabilities across coordinate systems, file I/O, units and quantities, time systems, cosmology, modeling, and more.
## When to Use This Skill
This skill should be used when:
- Working with FITS files (reading, writing, inspecting, modifying)
- Performing coordinate transformations between astronomical reference frames
- Calculating cosmological distances, ages, or other quantities
- Handling astronomical time systems and conversions
- Working with physical units and dimensional analysis
- Processing astronomical data tables with specialized column types
- Fitting models to astronomical data
- Converting between pixel and world coordinates (WCS)
- Performing robust statistical analysis on astronomical data
- Visualizing astronomical images with proper scaling and stretching
## Core Capabilities
### 1. FITS File Operations
FITS (Flexible Image Transport System) is the standard file format in astronomy. Astropy provides comprehensive FITS support.
**Quick FITS Inspection**:
Use the included `scripts/fits_info.py` script for rapid file inspection:
```bash
python scripts/fits_info.py observation.fits
python scripts/fits_info.py observation.fits --detailed
python scripts/fits_info.py observation.fits --ext 1
```
**Common FITS workflows**:
```python
from astropy.io import fits
# Read FITS file
with fits.open('image.fits') as hdul:
hdul.info() # Display structure
data = hdul[0].data
header = hdul[0].header
# Write FITS file
fits.writeto('output.fits', data, header, overwrite=True)
# Quick access (less efficient for multiple operations)
data = fits.getdata('image.fits', ext=0)
header = fits.getheader('image.fits', ext=0)
# Update specific header keyword
fits.setval('image.fits', 'OBJECT', value='M31')
```
**Multi-extension FITS**:
```python
from astropy.io import fits
# Create multi-extension FITS
primary = fits.PrimaryHDU(primary_data)
image_ext = fits.ImageHDU(science_data, name='SCI')
error_ext = fits.ImageHDU(error_data, name='ERR')
hdul = fits.HDUList([primary, image_ext, error_ext])
hdul.writeto('multi_ext.fits', overwrite=True)
```
**Binary tables**:
```python
from astropy.io import fits
# Read binary table
with fits.open('catalog.fits') as hdul:
table_data = hdul[1].data
ra = table_data['RA']
dec = table_data['DEC']
# Better: use astropy.table for table operations (see section 5)
```
### 2. Coordinate Systems and Transformations
Astropy supports ~25 coordinate frames with seamless transformations.
**Quick Coordinate Conversion**:
Use the included `scripts/coord_convert.py` script:
```bash
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic --output sexagesimal
```
**Basic coordinate operations**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create coordinate (multiple input formats supported)
c = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
c = SkyCoord('00h42m44.3s +41d16m09s')
# Transform between frames
c_galactic = c.galactic
c_fk5 = c.fk5
print(f"Galactic: l={c_galactic.l.deg:.3f}, b={c_galactic.b.deg:.3f}")
```
**Working with coordinate arrays**:
```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u
# Arrays of coordinates
ra = np.array([10.1, 10.2, 10.3]) * u.degree
dec = np.array([40.1, 40.2, 40.3]) * u.degree
coords = SkyCoord(ra=ra, dec=dec, frame='icrs')
# Calculate separations
sep = coords[0].separation(coords[1])
print(f"Separation: {sep.to(u.arcmin)}")
# Position angle
pa = coords[0].position_angle(coords[1])
```
**Catalog matching**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
# Find nearest neighbors
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
# Filter by separation threshold
max_sep = 1 * u.arcsec
matched = sep2d < max_sep
```
**Horizontal coordinates (Alt/Az)**:
```python
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
from astropy.time import Time
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
obstime = Time('2023-01-01 03:00:00')
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
altaz_frame = AltAz(obstime=obstime, location=location)
target_altaz = target.transform_to(altaz_frame)
print(f"Alt: {target_altaz.alt.deg:.2f}°, Az: {target_altaz.az.deg:.2f}°")
```
**Available coordinate frames**:
- `icrs` - International Celestial Reference System (default, preferred)
- `fk5`, `fk4` - Fifth/Fourth Fundamental Katalog
- `galactic` - Galactic coordinates
- `supergalactic` - Supergalactic coordinates
- `altaz` - Horizontal (altitude-azimuth) coordinates
- `gcrs`, `cirs`, `itrs` - Earth-based systems
- Ecliptic frames: `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic`
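Any of these frames can also be reached explicitly with `transform_to`, by frame name or frame class:
```python
from astropy.coordinates import SkyCoord, BarycentricMeanEcliptic
import astropy.units as u

c = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
c_sgal = c.transform_to('supergalactic')            # by frame name
c_ecl = c.transform_to(BarycentricMeanEcliptic())   # by frame class
```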
### 3. Units and Quantities
Physical units are fundamental to astronomical calculations. Astropy's units system provides dimensional analysis and automatic conversions.
**Basic unit operations**:
```python
import astropy.units as u
# Create quantities
distance = 5.2 * u.parsec
velocity = 300 * u.km / u.s
time = 10 * u.year
# Convert units
distance_ly = distance.to(u.lightyear)
velocity_mps = velocity.to(u.m / u.s)
# Arithmetic with units
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
```
**Working with arrays**:
```python
import numpy as np
import astropy.units as u
wavelengths = np.array([400, 500, 600]) * u.nm
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
```
**Important equivalencies**:
- `u.spectral()` - Convert wavelength ↔ frequency ↔ energy
- `u.doppler_optical(rest)` - Optical Doppler velocity
- `u.doppler_radio(rest)` - Radio Doppler velocity
- `u.doppler_relativistic(rest)` - Relativistic Doppler
- `u.temperature()` - Temperature unit conversions
- `u.brightness_temperature(freq)` - Brightness temperature
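For example, converting an observed wavelength to an optical Doppler velocity relative to a rest wavelength:
```python
import astropy.units as u

rest_wavelength = 656.3 * u.nm   # H-alpha
observed = 656.5 * u.nm
velocity = observed.to(u.km / u.s,
                       equivalencies=u.doppler_optical(rest_wavelength))
```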
**Physical constants**:
```python
from astropy import constants as const
print(const.c) # Speed of light
print(const.G) # Gravitational constant
print(const.M_sun) # Solar mass
print(const.R_sun) # Solar radius
print(const.L_sun) # Solar luminosity
```
**Performance tip**: Use the `<<` operator for fast unit assignment to arrays:
```python
# Fast
result = large_array << u.m
# Slower
result = large_array * u.m
```
### 4. Time Systems
Astronomical time systems require high precision and multiple time scales.
**Creating time objects**:
```python
from astropy.time import Time
import astropy.units as u
# Various input formats
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
t2 = Time(2459945.5, format='jd', scale='utc')
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
# Convert formats
print(t1.jd) # Julian Date
print(t1.mjd) # Modified Julian Date
print(t1.unix) # Unix timestamp
print(t1.iso) # ISO format
# Convert time scales
print(t1.tai) # International Atomic Time
print(t1.tt) # Terrestrial Time
print(t1.tdb) # Barycentric Dynamical Time
```
**Time arithmetic**:
```python
from astropy.time import Time, TimeDelta
import astropy.units as u
t1 = Time('2023-01-01T00:00:00')
dt = TimeDelta(1*u.day)
t2 = t1 + dt
diff = t2 - t1
print(diff.to(u.hour))
# Array of times
times = t1 + np.arange(10) * u.day
```
**Astronomical time calculations**:
```python
from astropy.time import Time
from astropy.coordinates import SkyCoord, EarthLocation
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
t = Time('2023-01-01T00:00:00')
# Local sidereal time
lst = t.sidereal_time('apparent', longitude=location.lon)
# Barycentric correction
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
ltt = t.light_travel_time(target, location=location)
t_bary = t.tdb + ltt
```
**Available time scales**:
- `utc` - Coordinated Universal Time
- `tai` - International Atomic Time
- `tt` - Terrestrial Time
- `tcb`, `tcg` - Barycentric/Geocentric Coordinate Time
- `tdb` - Barycentric Dynamical Time
- `ut1` - Universal Time
### 5. Data Tables
Astropy tables provide astronomy-specific enhancements over pandas.
**Creating and manipulating tables**:
```python
from astropy.table import Table
import astropy.units as u
# Create table
t = Table()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
# Column metadata
t['flux'].description = 'Flux at 1.4 GHz'
t['flux'].format = '.2f'
# Add calculated column
t['flux_mJy'] = t['flux'].to(u.mJy)
# Filter and sort
bright = t[t['flux'] > 1.0 * u.Jy]
t.sort('flux')
```
**Table I/O**:
```python
from astropy.table import Table
# Read (format auto-detected from extension)
t = Table.read('data.fits')
t = Table.read('data.csv', format='ascii.csv')
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units!
t = Table.read('data.votable', format='votable')
# Write
t.write('output.fits', overwrite=True)
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
```
**Advanced operations**:
```python
from astropy.table import Table, join, vstack, hstack
import numpy as np
# Join tables (like SQL)
joined = join(table1, table2, keys='id')
# Stack tables
combined_rows = vstack([t1, t2])
combined_cols = hstack([t1, t2])
# Grouping and aggregation
t.group_by('category').groups.aggregate(np.mean)
```
**Tables with astronomical objects**:
```python
from astropy.table import Table
from astropy.coordinates import SkyCoord
from astropy.time import Time
import astropy.units as u
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
t = Table([coords, times], names=['coords', 'obstime'])
print(t['coords'][0].ra) # Access coordinate properties
```
### 6. Cosmological Calculations
Quick cosmology calculations using standard models.
**Using the cosmology calculator**:
```bash
python scripts/cosmo_calc.py 0.5 1.0 1.5
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
python scripts/cosmo_calc.py 0.5 --verbose
python scripts/cosmo_calc.py --convert 1000 --from luminosity_distance
```
**Programmatic usage**:
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
# Calculate distances
z = 1.5
d_L = cosmo.luminosity_distance(z)
d_A = cosmo.angular_diameter_distance(z)
d_C = cosmo.comoving_distance(z)
# Time calculations
age = cosmo.age(z)
lookback = cosmo.lookback_time(z)
# Hubble parameter
H_z = cosmo.H(z)
print(f"At z={z}:")
print(f" Luminosity distance: {d_L:.2f}")
print(f" Age of universe: {age:.2f}")
```
**Convert observables**:
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
z = 1.5
# Angular size to physical size
d_A = cosmo.angular_diameter_distance(z)
angular_size = 1 * u.arcsec
physical_size = (angular_size.to(u.radian) * d_A).to(u.kpc, u.dimensionless_angles())
# Flux to luminosity
flux = 1e-17 * u.erg / u.s / u.cm**2
d_L = cosmo.luminosity_distance(z)
luminosity = flux * 4 * np.pi * d_L**2
# Find redshift for given distance
from astropy.cosmology import z_at_value
z = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
```
**Available cosmologies**:
- `Planck18`, `Planck15`, `Planck13` - Planck satellite parameters
- `WMAP9`, `WMAP7`, `WMAP5` - WMAP satellite parameters
- Custom: `FlatLambdaCDM(H0=70*u.km/u.s/u.Mpc, Om0=0.3)`
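A custom cosmology behaves exactly like the built-in ones:
```python
from astropy.cosmology import FlatLambdaCDM
import astropy.units as u

cosmo = FlatLambdaCDM(H0=70 * u.km / u.s / u.Mpc, Om0=0.3)
print(cosmo.luminosity_distance(1.5))  # distance at z = 1.5
print(cosmo.age(0))                    # present-day age of this universe
```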
### 7. Model Fitting
Fit mathematical models to astronomical data.
**1D fitting example**:
```python
from astropy.modeling import models, fitting
import numpy as np
# Generate data
x = np.linspace(0, 10, 100)
y_data = 10 * np.exp(-0.5 * ((x - 5) / 1)**2) + np.random.normal(0, 0.5, x.shape)
# Create and fit model
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y_data)
# Results
print(f"Amplitude: {g_fit.amplitude.value:.3f}")
print(f"Mean: {g_fit.mean.value:.3f}")
print(f"Stddev: {g_fit.stddev.value:.3f}")
# Evaluate fitted model
y_fit = g_fit(x)
```
**Common 1D models**:
- `Gaussian1D` - Gaussian profile
- `Lorentz1D` - Lorentzian profile
- `Voigt1D` - Voigt profile
- `Moffat1D` - Moffat profile (PSF modeling)
- `Polynomial1D` - Polynomial
- `PowerLaw1D` - Power law
- `BlackBody` - Blackbody spectrum
**Common 2D models**:
- `Gaussian2D` - 2D Gaussian
- `Moffat2D` - 2D Moffat (stellar PSF)
- `AiryDisk2D` - Airy disk (diffraction pattern)
- `Disk2D` - Circular disk
**Fitting with constraints**:
```python
from astropy.modeling import models, fitting
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
# Set bounds
g.amplitude.bounds = (0, None) # Positive only
g.mean.bounds = (4, 6) # Constrain center
# Fix parameters
g.stddev.fixed = True
# Compound models
model = models.Gaussian1D() + models.Polynomial1D(degree=1)
```
**Available fitters**:
- `LinearLSQFitter` - Linear least squares (fast, for linear models)
- `LevMarLSQFitter` - Levenberg-Marquardt (most common)
- `SimplexLSQFitter` - Downhill simplex
- `SLSQPLSQFitter` - Sequential Least Squares with constraints
### 8. World Coordinate System (WCS)
Transform between pixel and world coordinates in images.
**Basic WCS usage**:
```python
from astropy.io import fits
from astropy.wcs import WCS
# Read FITS with WCS
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
# Pixel to world
ra, dec = wcs.pixel_to_world_values(100, 200)
# World to pixel
x, y = wcs.world_to_pixel_values(ra, dec)
# Using SkyCoord (more powerful)
from astropy.coordinates import SkyCoord
import astropy.units as u
coord = SkyCoord(ra=150*u.deg, dec=-30*u.deg)
x, y = wcs.world_to_pixel(coord)
```
**Plotting with WCS**:
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
import matplotlib.pyplot as plt
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
# Create figure with WCS projection
fig = plt.figure()
ax = fig.add_subplot(111, projection=wcs)
# Plot with coordinate grid
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
stretch=LogStretch())
ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
# Coordinate labels and grid
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5)
```
### 9. Statistics and Data Processing
Robust statistical tools for astronomical data.
**Sigma clipping** (remove outliers):
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
# Remove outliers
clipped = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics on cleaned data
mean, median, std = sigma_clipped_stats(data, sigma=3)
# Use clipped data
background = median
signal = data - background
snr = signal / std
```
**Other statistical functions**:
```python
from astropy.stats import mad_std, biweight_location, biweight_scale
# Robust standard deviation
std_robust = mad_std(data)
# Robust central location
center = biweight_location(data)
# Robust scale
scale = biweight_scale(data)
```
### 10. Visualization
Display astronomical images with proper scaling.
**Image normalization and stretching**:
```python
from astropy.visualization import (ImageNormalize, MinMaxInterval,
PercentileInterval, ZScaleInterval,
SqrtStretch, LogStretch, PowerStretch,
AsinhStretch)
import matplotlib.pyplot as plt
# Common combination: percentile interval + sqrt stretch
norm = ImageNormalize(data,
interval=PercentileInterval(99),
stretch=SqrtStretch())
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```
**Available intervals** (determine min/max):
- `MinMaxInterval()` - Use actual min/max
- `PercentileInterval(percentile)` - Clip to percentile (e.g., 99%)
- `ZScaleInterval()` - IRAF's zscale algorithm
- `ManualInterval(vmin, vmax)` - Specify manually
**Available stretches** (nonlinear scaling):
- `LinearStretch()` - Linear (default)
- `SqrtStretch()` - Square root (common for images)
- `LogStretch()` - Logarithmic (for high dynamic range)
- `PowerStretch(power)` - Power law
- `AsinhStretch()` - Arcsinh (good for wide range)
## Bundled Resources
### scripts/
**`fits_info.py`** - Comprehensive FITS file inspection tool
```bash
python scripts/fits_info.py observation.fits
python scripts/fits_info.py observation.fits --detailed
python scripts/fits_info.py observation.fits --ext 1
```
**`coord_convert.py`** - Batch coordinate transformation utility
```bash
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic
```
**`cosmo_calc.py`** - Cosmological calculator
```bash
python scripts/cosmo_calc.py 0.5 1.0 1.5
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
```
### references/
**`module_overview.md`** - Comprehensive reference of all astropy subpackages, classes, and methods. Consult this for detailed API information, available functions, and module capabilities.
**`common_workflows.md`** - Complete working examples for common astronomical data analysis tasks. Contains full code examples for FITS operations, coordinate transformations, cosmology, modeling, and complete analysis pipelines.
## Best Practices
1. **Use context managers for FITS files**:
```python
with fits.open('file.fits') as hdul:
# Work with file
```
2. **Prefer astropy.table over raw FITS tables** for better unit/metadata support
3. **Use SkyCoord for coordinates** (high-level interface) rather than low-level frame classes
4. **Always attach units** to quantities when possible for dimensional safety
5. **Use ECSV format** for saving tables when you want to preserve units and metadata
6. **Vectorize coordinate operations** rather than looping for performance
7. **Use `memmap=True`** when opening large FITS files to save memory (see the snippet after this list)
8. **Install Bottleneck** package for faster statistics operations
9. **Pre-compute composite units** for repeated operations to improve performance
10. **Consult `references/module_overview.md`** for detailed module information and **`references/common_workflows.md`** for complete working examples
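For item 7, memory mapping is enabled directly in `fits.open`:
```python
from astropy.io import fits

# Data arrays are memory-mapped and read from disk only as accessed
with fits.open('large_image.fits', memmap=True) as hdul:
    data = hdul[0].data
```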
## Common Patterns
### Pattern: FITS → Process → FITS
```python
from astropy.io import fits
from astropy.stats import sigma_clipped_stats
# Read
with fits.open('input.fits') as hdul:
data = hdul[0].data
header = hdul[0].header
# Process
mean, median, std = sigma_clipped_stats(data, sigma=3)
processed = (data - median) / std
# Write
fits.writeto('output.fits', processed, header, overwrite=True)
```
### Pattern: Catalog Matching
```python
from astropy.coordinates import SkyCoord
from astropy.table import Table
import astropy.units as u
# Load catalogs
cat1 = Table.read('catalog1.fits')
cat2 = Table.read('catalog2.fits')
# Create coordinate objects
coords1 = SkyCoord(ra=cat1['RA'], dec=cat1['DEC'], unit=u.degree)
coords2 = SkyCoord(ra=cat2['RA'], dec=cat2['DEC'], unit=u.degree)
# Match
idx, sep2d, dist3d = coords1.match_to_catalog_sky(coords2)
# Filter by separation
max_sep = 1 * u.arcsec
matched_mask = sep2d < max_sep
# Create matched catalog
matched_cat1 = cat1[matched_mask]
matched_cat2 = cat2[idx[matched_mask]]
```
### Pattern: Time Series Analysis
```python
from astropy.time import Time
from astropy.timeseries import TimeSeries
import astropy.units as u
# Create time series
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
flux = [1.2, 2.3, 1.8] * u.Jy
ts = TimeSeries(time=times)
ts['flux'] = flux
# Fold on period
period = 1.5 * u.day
folded = ts.fold(period=period)
```
### Pattern: Image Display with WCS
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection=wcs)
norm = ImageNormalize(data, interval=PercentileInterval(99),
stretch=SqrtStretch())
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5, linestyle='solid')
plt.colorbar(im, ax=ax)
```
## Installation Note
Ensure astropy is installed in the Python environment:
```bash
pip install astropy
```
For additional performance and features:
```bash
pip install astropy[all] # Includes optional dependencies
```
## Additional Resources
- Official documentation: https://docs.astropy.org
- Tutorials: https://learn.astropy.org
- API reference: Consult `references/module_overview.md` in this skill
- Working examples: Consult `references/common_workflows.md` in this skill


@@ -1,618 +0,0 @@
# Common Astropy Workflows
This document describes frequently used workflows when working with astronomical data using astropy.
## 1. Working with FITS Files
### Basic FITS Reading
```python
from astropy.io import fits
import numpy as np
# Open and examine structure
with fits.open('observation.fits') as hdul:
hdul.info()
# Access primary HDU
primary_hdr = hdul[0].header
primary_data = hdul[0].data
# Access extension
ext_data = hdul[1].data
ext_hdr = hdul[1].header
# Read specific header keywords
object_name = primary_hdr['OBJECT']
exposure = primary_hdr['EXPTIME']
```
### Writing FITS Files
```python
# Create new FITS file
from astropy.io import fits
import numpy as np
# Create data
data = np.random.random((100, 100))
# Create primary HDU
hdu = fits.PrimaryHDU(data)
hdu.header['OBJECT'] = 'M31'
hdu.header['EXPTIME'] = 300.0
# Write to file
hdu.writeto('output.fits', overwrite=True)
# Multi-extension FITS
hdul = fits.HDUList([
fits.PrimaryHDU(data1),
fits.ImageHDU(data2, name='SCI'),
fits.ImageHDU(data3, name='ERR')
])
hdul.writeto('multi_ext.fits', overwrite=True)
```
### FITS Table Operations
```python
from astropy.io import fits
# Read binary table
with fits.open('catalog.fits') as hdul:
table_data = hdul[1].data
# Access columns
ra = table_data['RA']
dec = table_data['DEC']
mag = table_data['MAG']
# Filter data
bright = table_data[table_data['MAG'] < 15]
# Write binary table
from astropy.table import Table
import astropy.units as u
t = Table([ra, dec, mag], names=['RA', 'DEC', 'MAG'])
t['RA'].unit = u.degree
t['DEC'].unit = u.degree
t.write('output_catalog.fits', format='fits', overwrite=True)
```
## 2. Coordinate Transformations
### Basic Coordinate Creation and Transformation
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create from RA/Dec
c = SkyCoord(ra=10.68458*u.degree, dec=41.26917*u.degree, frame='icrs')
# Alternative creation methods
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
c = SkyCoord('00h42m44.3s +41d16m09s')
# Transform to different frames
c_gal = c.galactic
c_fk5 = c.fk5
print(f"Galactic: l={c_gal.l.deg}, b={c_gal.b.deg}")
```
### Coordinate Arrays and Separations
```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create array of coordinates
ra_array = np.array([10.1, 10.2, 10.3]) * u.degree
dec_array = np.array([40.1, 40.2, 40.3]) * u.degree
coords = SkyCoord(ra=ra_array, dec=dec_array, frame='icrs')
# Calculate separations
c1 = SkyCoord(ra=10*u.degree, dec=40*u.degree)
c2 = SkyCoord(ra=11*u.degree, dec=41*u.degree)
sep = c1.separation(c2)
print(f"Separation: {sep.to(u.arcmin)}")
# Position angle
pa = c1.position_angle(c2)
```
### Catalog Matching
```python
from astropy.coordinates import SkyCoord, match_coordinates_sky
import astropy.units as u
# Two catalogs of coordinates
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
# Find nearest neighbors
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
# Filter by separation threshold
max_sep = 1 * u.arcsec
matched = sep2d < max_sep
matching_indices = idx[matched]
```
### Horizontal Coordinates (Alt/Az)
```python
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
from astropy.time import Time
import astropy.units as u
# Observer location
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
# Observation time
obstime = Time('2023-01-01 03:00:00')
# Target coordinate
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
# Transform to Alt/Az
altaz_frame = AltAz(obstime=obstime, location=location)
target_altaz = target.transform_to(altaz_frame)
print(f"Altitude: {target_altaz.alt.deg}")
print(f"Azimuth: {target_altaz.az.deg}")
```
## 3. Units and Quantities
### Basic Unit Operations
```python
import astropy.units as u
# Create quantities
distance = 5.2 * u.parsec
time = 10 * u.year
velocity = 300 * u.km / u.s
# Unit conversion
distance_ly = distance.to(u.lightyear)
velocity_mps = velocity.to(u.m / u.s)
# Arithmetic with units
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
# Compose/decompose units
composite = (1 * u.kg * u.m**2 / u.s**2)
print(composite.decompose()) # Base SI units
print(composite.compose()) # Known compound units (Joule)
```
### Working with Arrays
```python
import numpy as np
import astropy.units as u
# Quantity arrays
wavelengths = np.array([400, 500, 600]) * u.nm
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
# Mathematical operations preserve units
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
```
### Custom Units and Equivalencies
```python
import astropy.units as u
# Define custom unit
beam = u.def_unit('beam', 1.5e-10 * u.steradian)
# Register for session
u.add_enabled_units([beam])
# Use in calculations
flux_per_beam = 1.5 * u.Jy / beam
# Doppler equivalencies
rest_wavelength = 656.3 * u.nm # H-alpha
observed = 656.5 * u.nm
velocity = observed.to(u.km/u.s,
equivalencies=u.doppler_optical(rest_wavelength))
```
## 4. Time Handling
### Time Creation and Conversion
```python
from astropy.time import Time
import astropy.units as u
# Create time objects
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
t2 = Time(2459945.5, format='jd', scale='utc')
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
# Convert formats
print(t1.jd) # Julian Date
print(t1.mjd) # Modified Julian Date
print(t1.unix) # Unix timestamp
print(t1.iso) # ISO format
# Convert time scales
print(t1.tai) # Convert to TAI
print(t1.tt) # Convert to TT
print(t1.tdb) # Convert to TDB
```
### Time Arithmetic
```python
from astropy.time import Time, TimeDelta
import astropy.units as u
import numpy as np
t1 = Time('2023-01-01T00:00:00')
dt = TimeDelta(1*u.day)
# Add time delta
t2 = t1 + dt
# Difference between times
diff = t2 - t1
print(diff.to(u.hour))
# Array of times
times = t1 + np.arange(10) * u.day
```
### Sidereal Time and Astronomical Calculations
```python
from astropy.time import Time
from astropy.coordinates import EarthLocation
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
t = Time('2023-01-01T00:00:00')
# Local sidereal time
lst = t.sidereal_time('apparent', longitude=location.lon)
# Light travel time correction
from astropy.coordinates import SkyCoord
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
ltt_bary = t.light_travel_time(target, location=location)
t_bary = t + ltt_bary
```
## 5. Tables and Data Management
### Creating and Manipulating Tables
```python
from astropy.table import Table, Column
import astropy.units as u
import numpy as np
# Create table
t = Table()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
# Add column metadata
t['flux'].description = 'Flux at 1.4 GHz'
t['flux'].format = '.2f'
# Add new column
t['flux_mJy'] = t['flux'].to(u.mJy)
# Filter rows
bright = t[t['flux'] > 1.0 * u.Jy]
# Sort
t.sort('flux')
```
### Table I/O
```python
from astropy.table import Table
# Read various formats
t = Table.read('data.fits')
t = Table.read('data.csv', format='ascii.csv')
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units
t = Table.read('data.votable', format='votable')
# Write various formats
t.write('output.fits', overwrite=True)
t.write('output.csv', format='ascii.csv', overwrite=True)
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
t.write('output.votable', format='votable', overwrite=True)
```
### Advanced Table Operations
```python
from astropy.table import Table, join, vstack, hstack
import numpy as np
# Join tables
t1 = Table([[1, 2], ['a', 'b']], names=['id', 'val1'])
t2 = Table([[1, 2], ['c', 'd']], names=['id', 'val2'])
joined = join(t1, t2, keys='id')
# Stack tables vertically
combined = vstack([t1, t2])
# Stack horizontally
combined = hstack([t1, t2])
# Grouping (requires a table with a key column, e.g. 'category')
t3 = Table([[1, 1, 2], [2.0, 4.0, 6.0]], names=['category', 'value'])
means = t3.group_by('category').groups.aggregate(np.mean)
```
### Tables with Astronomical Objects
```python
from astropy.table import Table
from astropy.coordinates import SkyCoord
from astropy.time import Time
import astropy.units as u
# Table with SkyCoord column
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
t = Table([coords, times], names=['coords', 'obstime'])
# Access individual coordinates
print(t['coords'][0].ra)
print(t['coords'][0].dec)
```
## 6. Cosmological Calculations
### Distance Calculations
```python
from astropy.cosmology import Planck18, FlatLambdaCDM
import astropy.units as u
import numpy as np
# Use built-in cosmology
cosmo = Planck18
# Redshifts
z = np.linspace(0, 5, 50)
# Calculate distances
comoving_dist = cosmo.comoving_distance(z)
angular_diam_dist = cosmo.angular_diameter_distance(z)
luminosity_dist = cosmo.luminosity_distance(z)
# Age of universe
age_at_z = cosmo.age(z)
lookback_time = cosmo.lookback_time(z)
# Hubble parameter
H_z = cosmo.H(z)
```
### Converting Observables
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
z = 1.5
# Angular diameter distance
d_A = cosmo.angular_diameter_distance(z)
# Convert angular size to physical size
angular_size = 1 * u.arcsec
physical_size = (angular_size.to(u.radian) * d_A).to(
    u.kpc, equivalencies=u.dimensionless_angles())
# Convert flux to luminosity
flux = 1e-17 * u.erg / u.s / u.cm**2
d_L = cosmo.luminosity_distance(z)
luminosity = (flux * 4 * np.pi * d_L**2).to(u.erg / u.s)
# Find redshift for given distance
from astropy.cosmology import z_at_value
z_result = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
```
### Custom Cosmology
```python
from astropy.cosmology import FlatLambdaCDM
import astropy.units as u
# Define custom cosmology
my_cosmo = FlatLambdaCDM(H0=70 * u.km/u.s/u.Mpc,
Om0=0.3,
Tcmb0=2.725 * u.K)
# Use it for calculations
print(my_cosmo.age(0))
print(my_cosmo.luminosity_distance(1.5))
```
## 7. Model Fitting
### Fitting 1D Models
```python
from astropy.modeling import models, fitting
import numpy as np
import matplotlib.pyplot as plt
# Generate data with noise
x = np.linspace(0, 10, 100)
true_model = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
y = true_model(x) + np.random.normal(0, 0.5, x.shape)
# Create and fit model
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y)
# Plot results
plt.plot(x, y, 'o', label='Data')
plt.plot(x, g_fit(x), label='Fit')
plt.legend()
# Get fitted parameters
print(f"Amplitude: {g_fit.amplitude.value}")
print(f"Mean: {g_fit.mean.value}")
print(f"Stddev: {g_fit.stddev.value}")
```
### Fitting with Constraints
```python
from astropy.modeling import models, fitting
# Set parameter bounds
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
g.amplitude.bounds = (0, None) # Positive only
g.mean.bounds = (4, 6) # Constrain center
g.stddev.fixed = True # Fix width
# Tie parameters (for multi-component models)
g1 = models.Gaussian1D(amplitude=10, mean=5, stddev=1, name='g1')
g2 = models.Gaussian1D(amplitude=5, mean=6, stddev=1, name='g2')
# Compound model: parameters are exposed as amplitude_0, stddev_0 (g1)
# and amplitude_1, stddev_1 (g2)
model = g1 + g2
model.stddev_1.tied = lambda m: m.stddev_0  # force g2 to share g1's width
```
### 2D Image Fitting
```python
from astropy.modeling import models, fitting
import numpy as np
# Create 2D data
y, x = np.mgrid[0:100, 0:100]
z = models.Gaussian2D(amplitude=100, x_mean=50, y_mean=50,
x_stddev=5, y_stddev=5)(x, y)
z += np.random.normal(0, 5, z.shape)
# Fit 2D Gaussian
g_init = models.Gaussian2D(amplitude=90, x_mean=48, y_mean=48,
x_stddev=4, y_stddev=4)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y, z)
# Get parameters
print(f"Center: ({g_fit.x_mean.value}, {g_fit.y_mean.value})")
print(f"Width: ({g_fit.x_stddev.value}, {g_fit.y_stddev.value})")
```
## 8. Image Processing and Visualization
### Image Display with Proper Scaling
```python
from astropy.io import fits
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
# Read FITS image
data = fits.getdata('image.fits')
# Apply normalization
norm = ImageNormalize(data,
interval=PercentileInterval(99),
stretch=SqrtStretch())
# Display
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```
### WCS Plotting
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
import matplotlib.pyplot as plt
# Read FITS with WCS
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
# Create figure with WCS projection
fig = plt.figure()
ax = fig.add_subplot(111, projection=wcs)
# Plot with coordinate grid
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
stretch=LogStretch())
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
# Add coordinate labels
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5)
plt.colorbar(im)
```
### Sigma Clipping and Statistics
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
import numpy as np
# Data with outliers
data = np.random.normal(100, 15, 1000)
data[0:50] = np.random.normal(200, 10, 50) # Add outliers
# Sigma clipping
clipped = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics on clipped data
mean, median, std = sigma_clipped_stats(data, sigma=3)
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Std: {std:.2f}")
print(f"Clipped {clipped.mask.sum()} values")
```
## 9. Complete Analysis Example
### Photometry Pipeline
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.stats import sigma_clipped_stats
from astropy.visualization import ImageNormalize, LogStretch
import astropy.units as u
import numpy as np
# Read FITS file
hdu = fits.open('observation.fits')[0]
data = hdu.data
header = hdu.header
wcs = WCS(header)
# Calculate background statistics
mean, median, std = sigma_clipped_stats(data, sigma=3.0)
print(f"Background: {median:.2f} +/- {std:.2f}")
# Subtract background
data_sub = data - median
# Known source coordinates
source_coord = SkyCoord(ra='10:42:30', dec='+41:16:09', unit=(u.hourangle, u.deg))
# Convert to pixel coordinates
x_pix, y_pix = wcs.world_to_pixel(source_coord)
# Simple aperture photometry
aperture_radius = 10 # pixels
y, x = np.ogrid[:data.shape[0], :data.shape[1]]
mask = (x - x_pix)**2 + (y - y_pix)**2 <= aperture_radius**2
aperture_sum = np.sum(data_sub[mask])
npix = np.sum(mask)
print(f"Source position: ({x_pix:.1f}, {y_pix:.1f})")
print(f"Aperture sum: {aperture_sum:.2f}")
print(f"S/N: {aperture_sum / (std * np.sqrt(npix)):.2f}")
```
This workflow document provides practical examples for common astronomical data analysis tasks using astropy.


@@ -1,340 +0,0 @@
# Astropy Module Overview
This document provides a comprehensive reference of all major astropy subpackages and their capabilities.
## Core Data Structures
### astropy.units
**Purpose**: Handle physical units and dimensional analysis in computations.
**Key Classes**:
- `Quantity` - Combines numerical values with units
- `Unit` - Represents physical units
**Common Operations**:
```python
import astropy.units as u
distance = 5 * u.meter
time = 2 * u.second
velocity = distance / time # Returns Quantity in m/s
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
```
**Equivalencies**:
- `u.spectral()` - Convert wavelength ↔ frequency
- `u.doppler_optical()`, `u.doppler_radio()` - Velocity conversions
- `u.temperature()` - Temperature unit conversions
- `u.pixel_scale()` - Pixel to physical units
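**Illustrative sketch** (the 0.2 arcsec/pixel plate scale is an arbitrary example value):
```python
import astropy.units as u
# Celsius <-> Kelvin requires the temperature equivalency (affine conversion)
t_K = (25 * u.deg_C).to(u.K, equivalencies=u.temperature())
# Pixels <-> angle on the sky, given a plate scale
plate_scale = 0.2 * u.arcsec / u.pixel
fov = (1024 * u.pixel).to(u.arcmin, equivalencies=u.pixel_scale(plate_scale))
```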
### astropy.constants
**Purpose**: Provide physical and astronomical constants.
**Common Constants**:
- `c` - Speed of light
- `G` - Gravitational constant
- `h` - Planck constant
- `M_sun`, `R_sun`, `L_sun` - Solar mass, radius, luminosity
- `M_earth`, `R_earth` - Earth mass, radius
- `pc`, `au` - Parsec, astronomical unit
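**Illustrative sketch** (constants are `Quantity` objects; the Schwarzschild-radius calculation is only an example):
```python
from astropy import constants as const
import astropy.units as u
print(const.c)              # speed of light, with value, unit, and uncertainty
print(const.G.uncertainty)  # uncertainty on G
# Schwarzschild radius of the Sun (~3 km)
r_s = (2 * const.G * const.M_sun / const.c**2).to(u.km)
print(r_s)
```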
### astropy.time
**Purpose**: Represent and manipulate times and dates with astronomical precision.
**Time Scales**:
- `UTC` - Coordinated Universal Time
- `TAI` - International Atomic Time
- `TT` - Terrestrial Time
- `TCB`, `TCG` - Barycentric/Geocentric Coordinate Time
- `TDB` - Barycentric Dynamical Time
- `UT1` - Universal Time
**Common Formats**:
- `iso`, `isot` - ISO 8601 strings
- `jd`, `mjd` - Julian/Modified Julian Date
- `unix`, `gps` - Unix/GPS timestamps
- `datetime` - Python datetime objects
**Example**:
```python
from astropy.time import Time
t = Time('2023-01-01T00:00:00', format='isot', scale='utc')
print(t.mjd) # Modified Julian Date
print(t.jd) # Julian Date
print(t.tt) # Convert to TT scale
```
### astropy.table
**Purpose**: Work with tabular data optimized for astronomical applications.
**Key Features**:
- Native support for astropy Quantity, Time, and SkyCoord columns
- Multi-dimensional columns
- Metadata preservation (units, descriptions, formats)
- Advanced operations: joins, grouping, binning
- File I/O via unified interface
**Example**:
```python
from astropy.table import Table
import astropy.units as u
t = Table()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
```
## Coordinates and World Coordinate Systems
### astropy.coordinates
**Purpose**: Represent and transform celestial coordinates.
**Primary Interface**: `SkyCoord` - High-level class for sky positions
**Coordinate Frames**:
- `ICRS` - International Celestial Reference System (default)
- `FK5`, `FK4` - Fifth/Fourth Fundamental Katalog
- `Galactic`, `Supergalactic` - Galactic coordinates
- `AltAz` - Horizontal (altitude-azimuth) coordinates
- `GCRS`, `CIRS`, `ITRS` - Earth-based systems
- `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic` - Ecliptic coordinates
**Common Operations**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create coordinate
c = SkyCoord(ra=10.625*u.degree, dec=41.2*u.degree, frame='icrs')
# Transform to galactic
c_gal = c.galactic
# Calculate separation
c2 = SkyCoord(ra=11*u.degree, dec=42*u.degree, frame='icrs')
sep = c.separation(c2)
# Match catalogs
idx, sep2d, dist3d = c.match_to_catalog_sky(catalog_coords)
```
### astropy.wcs
**Purpose**: Handle World Coordinate System transformations for astronomical images.
**Key Class**: `WCS` - Maps between pixel and world coordinates
**Common Use Cases**:
- Convert pixel coordinates to RA/Dec
- Convert RA/Dec to pixel coordinates
- Handle distortion corrections (SIP, lookup tables)
**Example**:
```python
from astropy.wcs import WCS
from astropy.io import fits
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
# Pixel to world
ra, dec = wcs.pixel_to_world_values(100, 200)
# World to pixel
x, y = wcs.world_to_pixel_values(ra, dec)
```
## File I/O
### astropy.io.fits
**Purpose**: Read and write FITS (Flexible Image Transport System) files.
**Key Classes**:
- `HDUList` - Container for all HDUs in a file
- `PrimaryHDU` - Primary header data unit
- `ImageHDU` - Image extension
- `BinTableHDU` - Binary table extension
- `Header` - FITS header keywords
**Common Operations**:
```python
from astropy.io import fits
# Read FITS file
with fits.open('file.fits') as hdul:
hdul.info() # Display structure
header = hdul[0].header
data = hdul[0].data
# Write FITS file
fits.writeto('output.fits', data, header)
# Update header keyword
fits.setval('file.fits', 'OBJECT', value='M31')
```
### astropy.io.ascii
**Purpose**: Read and write ASCII tables in various formats.
**Supported Formats**:
- Basic, CSV, tab-delimited
- CDS/MRT (Machine Readable Tables)
- IPAC, Daophot, SExtractor
- LaTeX tables
- HTML tables
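**Illustrative sketch** (filenames are placeholders; the format is usually auto-detected on read):
```python
from astropy.io import ascii
# Auto-detect the format of a plain-text table
data = ascii.read('photometry.txt')
# Read an IPAC table explicitly
ipac = ascii.read('catalog.tbl', format='ipac')
# Write out as a LaTeX table
ascii.write(data, 'table.tex', format='latex', overwrite=True)
```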
### astropy.io.votable
**Purpose**: Handle Virtual Observatory (VO) table format.
### astropy.io.misc
**Purpose**: Additional formats including HDF5, Parquet, and YAML.
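**Illustrative sketch** (goes through the unified `Table` I/O layer; requires `h5py`/`pyarrow`, and the paths and dataset name are placeholders):
```python
from astropy.table import Table
# HDF5: `path` names the dataset inside the file
t = Table.read('archive.hdf5', path='observations', format='hdf5')
# Parquet (recent astropy with pyarrow installed)
t.write('observations.parquet', format='parquet', overwrite=True)
```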
## Scientific Calculations
### astropy.cosmology
**Purpose**: Perform cosmological calculations.
**Common Models**:
- `FlatLambdaCDM` - Flat universe with cosmological constant (most common)
- `LambdaCDM` - Universe with cosmological constant
- `Planck18`, `Planck15`, `Planck13` - Pre-defined Planck parameters
- `WMAP9`, `WMAP7`, `WMAP5` - Pre-defined WMAP parameters
**Common Methods**:
```python
from astropy.cosmology import FlatLambdaCDM, Planck18
import astropy.units as u
cosmo = FlatLambdaCDM(H0=70, Om0=0.3)
# Or use built-in: cosmo = Planck18
z = 1.5
print(cosmo.age(z)) # Age of universe at z
print(cosmo.luminosity_distance(z)) # Luminosity distance
print(cosmo.angular_diameter_distance(z)) # Angular diameter distance
print(cosmo.comoving_distance(z)) # Comoving distance
print(cosmo.H(z)) # Hubble parameter at z
```
### astropy.modeling
**Purpose**: Framework for model evaluation and fitting.
**Model Categories**:
- 1D models: Gaussian1D, Lorentz1D, Voigt1D, Polynomial1D
- 2D models: Gaussian2D, Disk2D, Moffat2D
- Physical models: BlackBody, Drude1D, NFW
- Polynomial models: Chebyshev, Legendre
**Common Fitters**:
- `LinearLSQFitter` - Linear least squares
- `LevMarLSQFitter` - Levenberg-Marquardt
- `SimplexLSQFitter` - Downhill simplex
**Example**:
```python
from astropy.modeling import models, fitting
# Create model
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
# Fit to data
fitter = fitting.LevMarLSQFitter()
fitted_model = fitter(g, x_data, y_data)
```
### astropy.convolution
**Purpose**: Convolve and filter astronomical data.
**Common Kernels**:
- `Gaussian2DKernel` - 2D Gaussian smoothing
- `Box2DKernel` - 2D boxcar smoothing
- `Tophat2DKernel` - 2D tophat filter
- Custom kernels via arrays
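**Illustrative sketch** (unlike `scipy.ndimage`, astropy's `convolve` interpolates over NaNs by default):
```python
from astropy.convolution import convolve, Gaussian2DKernel
import numpy as np
image = np.random.normal(size=(100, 100))
image[40, 40] = np.nan                  # bad pixel
kernel = Gaussian2DKernel(x_stddev=2)   # 2-pixel sigma
smoothed = convolve(image, kernel)      # NaN is interpolated over
```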
### astropy.stats
**Purpose**: Statistical tools for astronomical data analysis.
**Key Functions**:
- `sigma_clip()` - Remove outliers via sigma clipping
- `sigma_clipped_stats()` - Compute mean, median, std with clipping
- `mad_std()` - Median Absolute Deviation
- `biweight_location()`, `biweight_scale()` - Robust statistics
- `circmean()`, `circstd()` - Circular statistics
**Example**:
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
# Remove outliers
filtered_data = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics
mean, median, std = sigma_clipped_stats(data, sigma=3)
```
## Data Processing
### astropy.nddata
**Purpose**: Handle N-dimensional datasets with metadata.
**Key Class**: `NDData` - Container for array data with units, uncertainty, mask, and WCS
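**Illustrative sketch** (the data, uncertainty, and mask values are made up):
```python
from astropy.nddata import NDData, StdDevUncertainty
import astropy.units as u
import numpy as np
data = np.random.random((50, 50))
ndd = NDData(data,
             uncertainty=StdDevUncertainty(0.05 * np.ones_like(data)),
             unit=u.electron / u.s,
             mask=data < 0.01)
```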
### astropy.timeseries
**Purpose**: Work with time series data.
**Key Classes**:
- `TimeSeries` - Time-indexed data table
- `BinnedTimeSeries` - Time-binned data
**Common Operations**:
- Period finding (Lomb-Scargle)
- Folding time series
- Binning data
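**Illustrative sketch** (the 7.3-day signal is synthetic):
```python
from astropy.timeseries import TimeSeries, LombScargle
from astropy.time import Time
import astropy.units as u
import numpy as np
times = Time('2023-01-01') + np.arange(200) * 0.5 * u.day
flux = 1 + 0.1 * np.sin(2 * np.pi * times.jd / 7.3)
ts = TimeSeries(time=times, data={'flux': flux})
# Lomb-Scargle period search
frequency, power = LombScargle(ts.time.jd, ts['flux']).autopower()
best_period = 1 / frequency[np.argmax(power)]  # in days
# Fold on the recovered period
folded = ts.fold(period=best_period * u.day)
```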
### astropy.visualization
**Purpose**: Display astronomical data effectively.
**Key Features**:
- Image normalization (LogStretch, PowerStretch, SqrtStretch, etc.)
- Interval scaling (MinMaxInterval, PercentileInterval, ZScaleInterval)
- WCSAxes for plotting with coordinate overlays
- RGB image creation with stretching
- Astronomical colormaps
**Example**:
```python
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
norm = ImageNormalize(data, interval=PercentileInterval(99),
stretch=SqrtStretch())
plt.imshow(data, norm=norm, origin='lower')
```
## Utilities
### astropy.samp
**Purpose**: Simple Application Messaging Protocol for inter-application communication.
**Use Case**: Connect Python scripts with other astronomical tools (e.g., DS9, TOPCAT).
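**Illustrative sketch** (assumes a SAMP hub such as TOPCAT is already running; the table URL is a placeholder):
```python
from astropy.samp import SAMPIntegratedClient
client = SAMPIntegratedClient()
client.connect()
# Broadcast a VOTable to all connected SAMP applications
client.notify_all({'samp.mtype': 'table.load.votable',
                   'samp.params': {'url': 'file:///tmp/catalog.xml',
                                   'name': 'my catalog'}})
client.disconnect()
```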
## Module Import Patterns
**Standard imports**:
```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.time import Time
from astropy.io import fits
from astropy.table import Table
from astropy import constants as const
```
## Performance Tips
1. **Pre-compute composite units** for repeated operations
2. **Use `<<` operator** for fast unit assignments: `array << u.m` instead of `array * u.m`
3. **Vectorize operations** rather than looping over coordinates/times
4. **Use memmap=True** when opening large FITS files
5. **Install Bottleneck** for faster stats operations
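**Illustrative sketch** for tips 2 and 4 (the filename is a placeholder):
```python
import numpy as np
import astropy.units as u
from astropy.io import fits
# Tip 2: `<<` attaches a unit without copying the underlying array
arr = np.arange(1_000_000, dtype=float)
q = arr << u.m
# Tip 4: memory-map large files so only the accessed section is read
with fits.open('large_mosaic.fits', memmap=True) as hdul:
    cutout = hdul[0].data[1000:1100, 1000:1100]
```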


@@ -1,226 +0,0 @@
#!/usr/bin/env python3
"""
Coordinate conversion utility for astronomical coordinates.
This script provides batch coordinate transformations between different
astronomical coordinate systems using astropy.
"""
import sys
import argparse
from astropy.coordinates import SkyCoord
import astropy.units as u
def convert_coordinates(coords_input, input_frame='icrs', output_frame='galactic',
input_format='decimal', output_format='decimal'):
"""
Convert astronomical coordinates between different frames.
Parameters
----------
coords_input : list of tuples or str
Input coordinates as (lon, lat) pairs or strings
input_frame : str
Input coordinate frame (icrs, fk5, galactic, etc.)
output_frame : str
Output coordinate frame
input_format : str
Format of input coordinates ('decimal', 'sexagesimal', 'hourangle')
output_format : str
Format for output display ('decimal', 'sexagesimal', 'hourangle')
Returns
-------
list
Converted coordinates
"""
results = []
for coord in coords_input:
try:
# Parse input coordinate
if input_format == 'decimal':
if isinstance(coord, str):
parts = coord.split()
lon, lat = float(parts[0]), float(parts[1])
else:
lon, lat = coord
c = SkyCoord(lon*u.degree, lat*u.degree, frame=input_frame)
elif input_format == 'sexagesimal':
c = SkyCoord(coord, frame=input_frame, unit=(u.hourangle, u.deg))
elif input_format == 'hourangle':
if isinstance(coord, str):
parts = coord.split()
lon, lat = parts[0], parts[1]
else:
lon, lat = coord
c = SkyCoord(lon, lat, frame=input_frame, unit=(u.hourangle, u.deg))
# Transform to output frame
if output_frame == 'icrs':
c_out = c.icrs
elif output_frame == 'fk5':
c_out = c.fk5
elif output_frame == 'fk4':
c_out = c.fk4
elif output_frame == 'galactic':
c_out = c.galactic
elif output_frame == 'supergalactic':
c_out = c.supergalactic
else:
c_out = c.transform_to(output_frame)
results.append(c_out)
except Exception as e:
print(f"Error converting coordinate {coord}: {e}", file=sys.stderr)
results.append(None)
return results
def format_output(coords, frame, output_format='decimal'):
"""Format coordinates for display."""
output = []
for c in coords:
if c is None:
output.append("ERROR")
continue
if frame in ['icrs', 'fk5', 'fk4']:
lon_name, lat_name = 'RA', 'Dec'
lon = c.ra
lat = c.dec
elif frame == 'galactic':
lon_name, lat_name = 'l', 'b'
lon = c.l
lat = c.b
elif frame == 'supergalactic':
lon_name, lat_name = 'sgl', 'sgb'
lon = c.sgl
lat = c.sgb
else:
lon_name, lat_name = 'lon', 'lat'
lon = c.spherical.lon
lat = c.spherical.lat
if output_format == 'decimal':
out_str = f"{lon.degree:12.8f} {lat.degree:+12.8f}"
elif output_format == 'sexagesimal':
if frame in ['icrs', 'fk5', 'fk4']:
out_str = f"{lon.to_string(unit=u.hourangle, sep=':', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
else:
out_str = f"{lon.to_string(unit=u.degree, sep=':', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
elif output_format == 'hourangle':
out_str = f"{lon.to_string(unit=u.hourangle, sep=' ', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=' ', pad=True)}"
output.append(out_str)
return output
def main():
"""Main function for command-line usage."""
parser = argparse.ArgumentParser(
description='Convert astronomical coordinates between different frames',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Supported frames: icrs, fk5, fk4, galactic, supergalactic
Input formats:
decimal : Degrees (e.g., "10.68 41.27")
sexagesimal : HMS/DMS (e.g., "00:42:44.3 +41:16:09")
hourangle : Hours and degrees (e.g., "10.5h 41.5d")
Examples:
%(prog)s --from icrs --to galactic "10.68 41.27"
%(prog)s --from icrs --to galactic --input decimal --output sexagesimal "150.5 -30.2"
%(prog)s --from galactic --to icrs "120.5 45.3"
%(prog)s --file coords.txt --from icrs --to galactic
"""
)
parser.add_argument('coordinates', nargs='*',
help='Coordinates to convert (lon lat pairs)')
parser.add_argument('-f', '--from', dest='input_frame', default='icrs',
help='Input coordinate frame (default: icrs)')
parser.add_argument('-t', '--to', dest='output_frame', default='galactic',
help='Output coordinate frame (default: galactic)')
parser.add_argument('-i', '--input', dest='input_format', default='decimal',
choices=['decimal', 'sexagesimal', 'hourangle'],
help='Input format (default: decimal)')
parser.add_argument('-o', '--output', dest='output_format', default='decimal',
choices=['decimal', 'sexagesimal', 'hourangle'],
help='Output format (default: decimal)')
parser.add_argument('--file', dest='input_file',
help='Read coordinates from file (one per line)')
parser.add_argument('--header', action='store_true',
help='Print header line with coordinate names')
args = parser.parse_args()
# Get coordinates from file or command line
if args.input_file:
try:
with open(args.input_file, 'r') as f:
coords = [line.strip() for line in f if line.strip()]
except FileNotFoundError:
print(f"Error: File '{args.input_file}' not found.", file=sys.stderr)
sys.exit(1)
else:
if not args.coordinates:
print("Error: No coordinates provided.", file=sys.stderr)
parser.print_help()
sys.exit(1)
# Combine pairs of arguments
if args.input_format == 'decimal':
coords = []
i = 0
while i < len(args.coordinates):
if i + 1 < len(args.coordinates):
coords.append(f"{args.coordinates[i]} {args.coordinates[i+1]}")
i += 2
else:
print(f"Warning: Odd number of coordinates, skipping last value",
file=sys.stderr)
break
else:
coords = args.coordinates
# Convert coordinates
converted = convert_coordinates(coords,
input_frame=args.input_frame,
output_frame=args.output_frame,
input_format=args.input_format,
output_format=args.output_format)
# Format and print output
formatted = format_output(converted, args.output_frame, args.output_format)
# Print header if requested
if args.header:
if args.output_frame in ['icrs', 'fk5', 'fk4']:
if args.output_format == 'decimal':
print(f"{'RA (deg)':>12s} {'Dec (deg)':>13s}")
else:
print(f"{'RA':>25s} {'Dec':>26s}")
elif args.output_frame == 'galactic':
if args.output_format == 'decimal':
print(f"{'l (deg)':>12s} {'b (deg)':>13s}")
else:
print(f"{'l':>25s} {'b':>26s}")
for line in formatted:
print(line)
if __name__ == '__main__':
main()


@@ -1,250 +0,0 @@
#!/usr/bin/env python3
"""
Cosmological calculator using astropy.cosmology.
This script provides quick calculations of cosmological distances,
ages, and other quantities for given redshifts.
"""
import sys
import argparse
import numpy as np
from astropy.cosmology import FlatLambdaCDM, Planck18, Planck15, WMAP9
import astropy.units as u
def calculate_cosmology(redshifts, cosmology='Planck18', H0=None, Om0=None):
"""
Calculate cosmological quantities for given redshifts.
Parameters
----------
redshifts : array-like
Redshift values
cosmology : str
Cosmology to use ('Planck18', 'Planck15', 'WMAP9', 'custom')
H0 : float, optional
Hubble constant for custom cosmology (km/s/Mpc)
Om0 : float, optional
Matter density parameter for custom cosmology
Returns
-------
dict
Dictionary containing calculated quantities
"""
# Select cosmology
if cosmology == 'Planck18':
cosmo = Planck18
elif cosmology == 'Planck15':
cosmo = Planck15
elif cosmology == 'WMAP9':
cosmo = WMAP9
elif cosmology == 'custom':
if H0 is None or Om0 is None:
raise ValueError("Must provide H0 and Om0 for custom cosmology")
cosmo = FlatLambdaCDM(H0=H0 * u.km/u.s/u.Mpc, Om0=Om0)
else:
raise ValueError(f"Unknown cosmology: {cosmology}")
z = np.atleast_1d(redshifts)
results = {
'redshift': z,
'cosmology': str(cosmo),
'luminosity_distance': cosmo.luminosity_distance(z),
'angular_diameter_distance': cosmo.angular_diameter_distance(z),
'comoving_distance': cosmo.comoving_distance(z),
'comoving_volume': cosmo.comoving_volume(z),
'age': cosmo.age(z),
'lookback_time': cosmo.lookback_time(z),
'H': cosmo.H(z),
'scale_factor': 1.0 / (1.0 + z)
}
return results, cosmo
def print_results(results, verbose=False, csv=False):
"""Print calculation results."""
z = results['redshift']
if csv:
# CSV output
print("z,D_L(Mpc),D_A(Mpc),D_C(Mpc),Age(Gyr),t_lookback(Gyr),H(km/s/Mpc)")
for i in range(len(z)):
print(f"{z[i]:.6f},"
f"{results['luminosity_distance'][i].value:.6f},"
f"{results['angular_diameter_distance'][i].value:.6f},"
f"{results['comoving_distance'][i].value:.6f},"
f"{results['age'][i].value:.6f},"
f"{results['lookback_time'][i].value:.6f},"
f"{results['H'][i].value:.6f}")
else:
# Formatted table output
if verbose:
print(f"\nCosmology: {results['cosmology']}")
print("-" * 80)
print(f"\n{'z':>8s} {'D_L':>12s} {'D_A':>12s} {'D_C':>12s} "
f"{'Age':>10s} {'t_lb':>10s} {'H(z)':>10s}")
print(f"{'':>8s} {'(Mpc)':>12s} {'(Mpc)':>12s} {'(Mpc)':>12s} "
f"{'(Gyr)':>10s} {'(Gyr)':>10s} {'(km/s/Mpc)':>10s}")
print("-" * 80)
for i in range(len(z)):
print(f"{z[i]:8.4f} "
f"{results['luminosity_distance'][i].value:12.3f} "
f"{results['angular_diameter_distance'][i].value:12.3f} "
f"{results['comoving_distance'][i].value:12.3f} "
f"{results['age'][i].value:10.4f} "
f"{results['lookback_time'][i].value:10.4f} "
f"{results['H'][i].value:10.4f}")
if verbose:
print("\nLegend:")
print(" z : Redshift")
print(" D_L : Luminosity distance")
print(" D_A : Angular diameter distance")
print(" D_C : Comoving distance")
print(" Age : Age of universe at z")
print(" t_lb : Lookback time to z")
print(" H(z) : Hubble parameter at z")
def convert_quantity(value, quantity_type, cosmo, to_redshift=False):
"""
Convert between redshift and cosmological quantity.
Parameters
----------
value : float
Value to convert
quantity_type : str
Type of quantity ('luminosity_distance', 'age', etc.)
cosmo : Cosmology
Cosmology object
to_redshift : bool
If True, convert quantity to redshift; else convert z to quantity
"""
from astropy.cosmology import z_at_value
if to_redshift:
# Convert quantity to redshift
if quantity_type == 'luminosity_distance':
z = z_at_value(cosmo.luminosity_distance, value * u.Mpc)
elif quantity_type == 'age':
z = z_at_value(cosmo.age, value * u.Gyr)
elif quantity_type == 'lookback_time':
z = z_at_value(cosmo.lookback_time, value * u.Gyr)
elif quantity_type == 'comoving_distance':
z = z_at_value(cosmo.comoving_distance, value * u.Mpc)
else:
raise ValueError(f"Unknown quantity type: {quantity_type}")
return z
else:
# Convert redshift to quantity
if quantity_type == 'luminosity_distance':
return cosmo.luminosity_distance(value)
elif quantity_type == 'age':
return cosmo.age(value)
elif quantity_type == 'lookback_time':
return cosmo.lookback_time(value)
elif quantity_type == 'comoving_distance':
return cosmo.comoving_distance(value)
else:
raise ValueError(f"Unknown quantity type: {quantity_type}")
def main():
"""Main function for command-line usage."""
parser = argparse.ArgumentParser(
description='Calculate cosmological quantities for given redshifts',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Available cosmologies: Planck18, Planck15, WMAP9, custom
Examples:
%(prog)s 0.5 1.0 1.5
%(prog)s 0.5 --cosmology Planck15
%(prog)s 0.5 --cosmology custom --H0 70 --Om0 0.3
%(prog)s --range 0 3 0.5
%(prog)s 0.5 --verbose
%(prog)s 0.5 1.0 --csv
%(prog)s --convert 1000 --from luminosity_distance --cosmology Planck18
"""
)
parser.add_argument('redshifts', nargs='*', type=float,
help='Redshift values to calculate')
parser.add_argument('-c', '--cosmology', default='Planck18',
choices=['Planck18', 'Planck15', 'WMAP9', 'custom'],
help='Cosmology to use (default: Planck18)')
parser.add_argument('--H0', type=float,
help='Hubble constant for custom cosmology (km/s/Mpc)')
parser.add_argument('--Om0', type=float,
help='Matter density parameter for custom cosmology')
parser.add_argument('-r', '--range', nargs=3, type=float, metavar=('START', 'STOP', 'STEP'),
help='Generate redshift range (start stop step)')
parser.add_argument('-v', '--verbose', action='store_true',
help='Print verbose output with cosmology details')
parser.add_argument('--csv', action='store_true',
help='Output in CSV format')
parser.add_argument('--convert', type=float,
help='Convert a quantity to redshift')
parser.add_argument('--from', dest='from_quantity',
choices=['luminosity_distance', 'age', 'lookback_time', 'comoving_distance'],
help='Type of quantity to convert from')
args = parser.parse_args()
# Handle conversion mode
if args.convert is not None:
if args.from_quantity is None:
print("Error: Must specify --from when using --convert", file=sys.stderr)
sys.exit(1)
# Get cosmology
if args.cosmology == 'Planck18':
cosmo = Planck18
elif args.cosmology == 'Planck15':
cosmo = Planck15
elif args.cosmology == 'WMAP9':
cosmo = WMAP9
elif args.cosmology == 'custom':
if args.H0 is None or args.Om0 is None:
print("Error: Must provide --H0 and --Om0 for custom cosmology",
file=sys.stderr)
sys.exit(1)
cosmo = FlatLambdaCDM(H0=args.H0 * u.km/u.s/u.Mpc, Om0=args.Om0)
z = convert_quantity(args.convert, args.from_quantity, cosmo, to_redshift=True)
print(f"\n{args.from_quantity.replace('_', ' ').title()} = {args.convert}")
print(f"Redshift z = {z:.6f}")
print(f"(using {args.cosmology} cosmology)")
return
# Get redshifts
if args.range:
start, stop, step = args.range
redshifts = np.arange(start, stop + step/2, step)
elif args.redshifts:
redshifts = np.array(args.redshifts)
else:
print("Error: No redshifts provided.", file=sys.stderr)
parser.print_help()
sys.exit(1)
# Calculate
try:
results, cosmo = calculate_cosmology(redshifts, args.cosmology,
H0=args.H0, Om0=args.Om0)
print_results(results, verbose=args.verbose, csv=args.csv)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()


@@ -1,189 +0,0 @@
#!/usr/bin/env python3
"""
Quick FITS file inspection tool.
This script provides a convenient way to inspect FITS file structure,
headers, and basic statistics without writing custom code each time.
"""
import sys
from pathlib import Path
from astropy.io import fits
import numpy as np
def print_fits_info(filename, detailed=False, ext=None):
"""
Print comprehensive information about a FITS file.
Parameters
----------
filename : str
Path to FITS file
detailed : bool
If True, print detailed statistics for each HDU
ext : int or str, optional
Specific extension to examine in detail
"""
print(f"\n{'='*70}")
print(f"FITS File: {filename}")
print(f"{'='*70}\n")
try:
with fits.open(filename) as hdul:
# Print file structure
print("File Structure:")
print("-" * 70)
hdul.info()
print()
# If specific extension requested
if ext is not None:
print(f"\nDetailed view of extension: {ext}")
print("-" * 70)
hdu = hdul[ext]
print_hdu_details(hdu, detailed=True)
return
# Print header and data info for each HDU
for i, hdu in enumerate(hdul):
print(f"\n{'='*70}")
print(f"HDU {i}: {hdu.name}")
print(f"{'='*70}")
print_hdu_details(hdu, detailed=detailed)
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
sys.exit(1)
except Exception as e:
print(f"Error reading FITS file: {e}")
sys.exit(1)
def print_hdu_details(hdu, detailed=False):
"""Print details for a single HDU."""
# Header information
print("\nHeader Information:")
print("-" * 70)
# Key header keywords
important_keywords = ['SIMPLE', 'BITPIX', 'NAXIS', 'EXTEND',
'OBJECT', 'TELESCOP', 'INSTRUME', 'OBSERVER',
'DATE-OBS', 'EXPTIME', 'FILTER', 'AIRMASS',
'RA', 'DEC', 'EQUINOX', 'CTYPE1', 'CTYPE2']
header = hdu.header
for key in important_keywords:
if key in header:
value = header[key]
comment = header.comments[key]
print(f" {key:12s} = {str(value):20s} / {comment}")
# NAXIS keywords
if 'NAXIS' in header:
naxis = header['NAXIS']
for i in range(1, naxis + 1):
key = f'NAXIS{i}'
if key in header:
print(f" {key:12s} = {str(header[key]):20s} / {header.comments[key]}")
# Data information
if hdu.data is not None:
print("\nData Information:")
print("-" * 70)
data = hdu.data
print(f" Data type: {data.dtype}")
print(f" Shape: {data.shape}")
# For image data
if hasattr(data, 'ndim') and data.ndim >= 1:
try:
# Calculate statistics
finite_data = data[np.isfinite(data)]
if len(finite_data) > 0:
print(f" Min: {np.min(finite_data):.6g}")
print(f" Max: {np.max(finite_data):.6g}")
print(f" Mean: {np.mean(finite_data):.6g}")
print(f" Median: {np.median(finite_data):.6g}")
print(f" Std: {np.std(finite_data):.6g}")
# Count special values
n_nan = np.sum(np.isnan(data))
n_inf = np.sum(np.isinf(data))
if n_nan > 0:
print(f" NaN values: {n_nan}")
if n_inf > 0:
print(f" Inf values: {n_inf}")
except Exception as e:
print(f" Could not calculate statistics: {e}")
# For table data
if hasattr(data, 'columns'):
print(f"\n Table Columns ({len(data.columns)}):")
for col in data.columns:
print(f" {col.name:20s} {col.format:10s} {col.unit or ''}")
if detailed:
print(f"\n First few rows:")
print(data[:min(5, len(data))])
else:
print("\n No data in this HDU")
# WCS information if present
try:
from astropy.wcs import WCS
wcs = WCS(hdu.header)
if wcs.has_celestial:
print("\nWCS Information:")
print("-" * 70)
print(f" Has celestial WCS: Yes")
print(f" CTYPE: {wcs.wcs.ctype}")
if wcs.wcs.crval is not None:
print(f" CRVAL: {wcs.wcs.crval}")
if wcs.wcs.crpix is not None:
print(f" CRPIX: {wcs.wcs.crpix}")
if wcs.wcs.cdelt is not None:
print(f" CDELT: {wcs.wcs.cdelt}")
except Exception:
pass
def main():
"""Main function for command-line usage."""
import argparse
parser = argparse.ArgumentParser(
description='Inspect FITS file structure and contents',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s image.fits
%(prog)s image.fits --detailed
%(prog)s image.fits --ext 1
%(prog)s image.fits --ext SCI
"""
)
parser.add_argument('filename', help='FITS file to inspect')
parser.add_argument('-d', '--detailed', action='store_true',
help='Show detailed statistics for each HDU')
parser.add_argument('-e', '--ext', type=str, default=None,
help='Show details for specific extension only (number or name)')
args = parser.parse_args()
# Convert extension to int if numeric
ext = args.ext
if ext is not None:
try:
ext = int(ext)
except ValueError:
pass # Keep as string for extension name
print_fits_info(args.filename, detailed=args.detailed, ext=ext)
if __name__ == '__main__':
main()


@@ -1,375 +0,0 @@
---
name: biomni
description: "AI agent for autonomous biomedical task execution. CRISPR design, scRNA-seq, ADMET, GWAS, structure prediction, multi-omics, with automated planning/code generation, for complex workflows."
---
# Biomni
## Overview
Biomni is a general-purpose biomedical AI agent that autonomously executes research tasks across diverse biomedical subfields. Use Biomni to combine large language model reasoning with retrieval-augmented planning and code-based execution for scientific productivity and hypothesis generation. The system operates with an ~11GB biomedical knowledge base covering molecular, genomic, and clinical domains.
## Quick Start
Initialize and use the Biomni agent with these basic steps:
```python
from biomni.agent import A1
# Initialize agent with data path and LLM model
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Execute a biomedical research task
agent.go("Your biomedical task description")
```
The agent will autonomously decompose the task, retrieve relevant biomedical knowledge, generate and execute code, and provide results.
## Installation and Setup
### Environment Preparation
1. **Set up the conda environment:**
- Follow instructions in `biomni_env/README.md` from the repository
- Activate the environment: `conda activate biomni_e1`
2. **Install the package:**
```bash
pip install biomni --upgrade
```
Or install from source:
```bash
git clone https://github.com/snap-stanford/biomni.git
cd biomni
pip install -e .
```
3. **Configure API keys:**
Set up credentials via environment variables or `.env` file:
```bash
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here" # Optional
```
4. **Data initialization:**
On first use, the agent will automatically download the ~11GB biomedical knowledge base.
### LLM Provider Configuration
Biomni supports multiple LLM providers. Configure the default provider using:
```python
from biomni.config import default_config
# Set the default LLM model
default_config.llm = "claude-sonnet-4-20250514" # Anthropic
# default_config.llm = "gpt-4" # OpenAI
# default_config.llm = "azure/gpt-4" # Azure OpenAI
# default_config.llm = "gemini/gemini-pro" # Google Gemini
# Set timeout (optional)
default_config.timeout_seconds = 1200
# Set data path (optional)
default_config.data_path = "./custom/data/path"
```
Refer to `references/llm_providers.md` for detailed configuration options for each provider.
## Core Biomedical Research Tasks
### 1. CRISPR Screening and Design
Execute CRISPR screening tasks including guide RNA design, off-target analysis, and screening experiment planning:
```python
agent.go("Design a CRISPR screening experiment to identify genes involved in cancer cell resistance to drug X")
```
The agent will:
- Retrieve relevant gene databases
- Design guide RNAs with specificity analysis
- Plan experimental controls and readout strategies
- Generate analysis code for screening results
### 2. Single-Cell RNA-seq Analysis
Perform comprehensive scRNA-seq analysis workflows:
```python
agent.go("Analyze this 10X Genomics scRNA-seq dataset, identify cell types, and find differentially expressed genes between clusters")
```
Capabilities include:
- Quality control and preprocessing
- Dimensionality reduction and clustering
- Cell type annotation using marker databases
- Differential expression analysis
- Pathway enrichment analysis
### 3. Molecular Property Prediction (ADMET)
Predict absorption, distribution, metabolism, excretion, and toxicity properties:
```python
agent.go("Predict ADMET properties for these drug candidates: [SMILES strings]")
```
The agent handles:
- Molecular descriptor calculation
- Property prediction using integrated models
- Toxicity screening
- Drug-likeness assessment
### 4. Genomic Analysis
Execute genomic data analysis tasks:
```python
agent.go("Perform GWAS analysis to identify SNPs associated with disease phenotype in this cohort")
```
Supports:
- Genome-wide association studies (GWAS)
- Variant calling and annotation
- Population genetics analysis
- Functional genomics integration
### 5. Protein Structure and Function
Analyze protein sequences and structures:
```python
agent.go("Predict the structure of this protein sequence and identify potential binding sites")
```
Capabilities:
- Sequence analysis and domain identification
- Structure prediction integration
- Binding site prediction
- Protein-protein interaction analysis
### 6. Disease Diagnosis and Classification
Perform disease classification from multi-omics data:
```python
agent.go("Build a classifier to diagnose disease X from patient RNA-seq and clinical data")
```
### 7. Systems Biology and Pathway Analysis
Analyze biological pathways and networks:
```python
agent.go("Identify dysregulated pathways in this differential expression dataset")
```
### 8. Drug Discovery and Repurposing
Support drug discovery workflows:
```python
agent.go("Identify FDA-approved drugs that could be repurposed for treating disease Y based on mechanism of action")
```
## Advanced Features
### Custom Configuration per Agent
Override global configuration for specific agent instances:
```python
agent = A1(
path='./project_data',
llm='gpt-4o',
timeout=1800
)
```
### Conversation History and Reporting
Save execution traces as formatted PDF reports:
```python
# After executing tasks
agent.save_conversation_history(
output_path='./reports/experiment_log.pdf',
format='pdf'
)
```
Requires one of: WeasyPrint, markdown2pdf, or Pandoc.
### Model Context Protocol (MCP) Integration
Extend agent capabilities with external tools:
```python
# Add MCP-compatible tools
agent.add_mcp(config_path='./mcp_config.json')
```
MCP enables integration with:
- Laboratory information management systems (LIMS)
- Specialized bioinformatics databases
- Custom analysis pipelines
- External computational resources
### Using Biomni-R0 (Specialized Reasoning Model)
Deploy the 32B parameter Biomni-R0 model for enhanced biological reasoning:
```bash
# Install SGLang
pip install "sglang[all]"
# Deploy Biomni-R0
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--port 30000 \
--trust-remote-code
```
Then configure the agent:
```python
from biomni.config import default_config
default_config.llm = "openai/biomni-r0"
default_config.api_base = "http://localhost:30000/v1"
```
Biomni-R0 provides specialized reasoning for:
- Complex multi-step biological workflows
- Hypothesis generation and evaluation
- Experimental design optimization
- Literature-informed analysis
## Best Practices
### Task Specification
Provide clear, specific task descriptions:
✅ **Good:** "Analyze this scRNA-seq dataset (file: data.h5ad) to identify T cell subtypes, then perform differential expression analysis comparing activated vs. resting T cells"
❌ **Vague:** "Analyze my RNA-seq data"
### Data Organization
Structure data directories for efficient retrieval:
```
project/
├── data/ # Biomni knowledge base
├── raw_data/ # Your experimental data
├── results/ # Analysis outputs
└── reports/ # Generated reports
```
### Iterative Refinement
Use iterative task execution for complex analyses:
```python
# Step 1: Exploratory analysis
agent.go("Load and perform initial QC on the proteomics dataset")
# Step 2: Based on results, refine analysis
agent.go("Based on the QC results, remove low-quality samples and normalize using method X")
# Step 3: Downstream analysis
agent.go("Perform differential abundance analysis with adjusted parameters")
```
### Security Considerations
**CRITICAL:** Biomni executes LLM-generated code with full system privileges. For production use:
1. **Use sandboxed environments:** Deploy in Docker containers or VMs with restricted permissions
2. **Validate sensitive operations:** Review code before execution for file access, network calls, or credential usage
3. **Limit data access:** Restrict agent access to only necessary data directories
4. **Monitor execution:** Log all executed code for audit trails
Never run Biomni with:
- Unrestricted file system access
- Direct access to sensitive credentials
- Network access to production systems
- Elevated system privileges
### Model Selection Guidelines
Choose models based on task complexity:
- **Claude Sonnet 4:** Recommended for most biomedical tasks, excellent biological reasoning
- **GPT-4/GPT-4o:** Strong general capabilities, good for diverse tasks
- **Biomni-R0:** Specialized for complex biological reasoning, multi-step workflows
- **Smaller models:** Use for simple, well-defined tasks to reduce cost
## Evaluation and Benchmarking
Biomni-Eval1 benchmark contains 433 evaluation instances across 10 biological tasks:
- GWAS analysis
- Disease diagnosis
- Gene detection and classification
- Molecular property prediction
- Pathway analysis
- Protein function prediction
- Drug response prediction
- Variant interpretation
- Cell type annotation
- Biomarker discovery
Use the benchmark to:
- Evaluate custom agent configurations
- Compare LLM providers for specific tasks
- Validate analysis pipelines
## Troubleshooting
### Common Issues
**Issue:** Data download fails or times out
**Solution:** Manually download the knowledge base or increase timeout settings
**Issue:** Package dependency conflicts
**Solution:** Some optional dependencies cannot be installed by default due to conflicts. Install specific packages manually and uncomment relevant code sections as documented in the repository
**Issue:** LLM API errors
**Solution:** Verify API key configuration, check rate limits, ensure sufficient credits
**Issue:** Memory errors with large datasets
**Solution:** Process data in chunks, use data subsampling, or deploy on higher-memory instances
### Getting Help
For detailed troubleshooting:
- Review the Biomni GitHub repository issues
- Check `references/api_reference.md` for detailed API documentation
- Consult `references/task_examples.md` for comprehensive task patterns
## Resources
### references/
Detailed reference documentation for advanced usage:
- **api_reference.md:** Complete API documentation for A1 agent, configuration objects, and utility functions
- **llm_providers.md:** Comprehensive guide for configuring all supported LLM providers (Anthropic, OpenAI, Azure, Gemini, Groq, Ollama, AWS Bedrock)
- **task_examples.md:** Extensive collection of biomedical task examples with code patterns
### scripts/
Helper scripts for common operations:
- **setup_environment.py:** Automated environment setup and validation
- **generate_report.py:** Enhanced PDF report generation with custom formatting
Load reference documentation as needed:
```python
# Claude can read reference files when needed for detailed information
# Example: "Check references/llm_providers.md for Azure OpenAI configuration"
```


@@ -1,635 +0,0 @@
# Biomni API Reference
This document provides comprehensive API documentation for the Biomni biomedical AI agent system.
## Core Classes
### A1 Agent
The primary agent class for executing biomedical research tasks.
#### Initialization
```python
from biomni.agent import A1
agent = A1(
path='./data', # Path to biomedical knowledge base
llm='claude-sonnet-4-20250514', # LLM model identifier
timeout=None, # Optional timeout in seconds
verbose=True # Enable detailed logging
)
```
**Parameters:**
- `path` (str, required): Directory path where the biomedical knowledge base is stored or will be downloaded. First-time initialization will download ~11GB of data.
- `llm` (str, optional): LLM model identifier. Defaults to the value in `default_config.llm`. Supports multiple providers (see LLM Providers section).
- `timeout` (int, optional): Maximum execution time in seconds for agent operations. Overrides `default_config.timeout_seconds`.
- `verbose` (bool, optional): Enable verbose logging for debugging. Default: True.
**Returns:** A1 agent instance ready for task execution.
#### Methods
##### `go(task_description: str) -> None`
Execute a biomedical research task autonomously.
```python
agent.go("Analyze this scRNA-seq dataset and identify cell types")
```
**Parameters:**
- `task_description` (str, required): Natural language description of the biomedical task to execute. Be specific about:
- Data location and format
- Desired analysis or output
- Any specific methods or parameters
- Expected results format
**Behavior:**
1. Decomposes the task into executable steps
2. Retrieves relevant biomedical knowledge from the data lake
3. Generates and executes Python/R code
4. Provides results and visualizations
5. Handles errors and retries with refinement
**Notes:**
- Executes code with system privileges - use in sandboxed environments
- Long-running tasks may require timeout adjustments
- Intermediate results are displayed during execution
##### `save_conversation_history(output_path: str, format: str = 'pdf') -> None`
Export conversation history and execution trace as a formatted report.
```python
agent.save_conversation_history(
output_path='./reports/analysis_log.pdf',
format='pdf'
)
```
**Parameters:**
- `output_path` (str, required): File path for the output report
- `format` (str, optional): Output format. Options: 'pdf', 'markdown'. Default: 'pdf'
**Requirements:**
- For PDF: Install one of: WeasyPrint, markdown2pdf, or Pandoc
```bash
pip install weasyprint # Recommended
# or
pip install markdown2pdf
# or install Pandoc system-wide
```
**Report Contents:**
- Task description and parameters
- Retrieved biomedical knowledge
- Generated code with execution traces
- Results, visualizations, and outputs
- Timestamps and execution metadata
##### `add_mcp(config_path: str) -> None`
Add Model Context Protocol (MCP) tools to extend agent capabilities.
```python
agent.add_mcp(config_path='./mcp_tools_config.json')
```
**Parameters:**
- `config_path` (str, required): Path to MCP configuration JSON file
**MCP Configuration Format:**
```json
{
"tools": [
{
"name": "tool_name",
"endpoint": "http://localhost:8000/tool",
"description": "Tool description for LLM",
"parameters": {
"param1": "string",
"param2": "integer"
}
}
]
}
```
**Use Cases:**
- Connect to laboratory information systems
- Integrate proprietary databases
- Access specialized computational resources
- Link to institutional data repositories
## Configuration
### default_config
Global configuration object for Biomni settings.
```python
from biomni.config import default_config
```
#### Attributes
##### `llm: str`
Default LLM model identifier for all agent instances.
```python
default_config.llm = "claude-sonnet-4-20250514"
```
**Supported Models:**
**Anthropic:**
- `claude-sonnet-4-20250514` (Recommended)
- `claude-opus-4-20250514`
- `claude-3-5-sonnet-20241022`
- `claude-3-opus-20240229`
**OpenAI:**
- `gpt-4o`
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`
**Azure OpenAI:**
- `azure/gpt-4`
- `azure/<deployment-name>`
**Google Gemini:**
- `gemini/gemini-pro`
- `gemini/gemini-1.5-pro`
**Groq:**
- `groq/llama-3.1-70b-versatile`
- `groq/mixtral-8x7b-32768`
**Ollama (Local):**
- `ollama/llama3`
- `ollama/mistral`
- `ollama/<model-name>`
**AWS Bedrock:**
- `bedrock/anthropic.claude-v2`
- `bedrock/anthropic.claude-3-sonnet`
**Custom/Biomni-R0:**
- `openai/biomni-r0` (requires local SGLang deployment)
##### `timeout_seconds: int`
Default timeout for agent operations in seconds.
```python
default_config.timeout_seconds = 1200 # 20 minutes
```
**Recommended Values:**
- Simple tasks (QC, basic analysis): 300-600 seconds
- Medium tasks (differential expression, clustering): 600-1200 seconds
- Complex tasks (full pipelines, ML models): 1200-3600 seconds
- Very complex tasks: 3600+ seconds
##### `data_path: str`
Default path to biomedical knowledge base.
```python
default_config.data_path = "/path/to/biomni/data"
```
**Storage Requirements:**
- Initial download: ~11GB
- Extracted size: ~15GB
- Additional working space: ~5-10GB recommended
##### `api_base: str`
Custom API endpoint for LLM providers (advanced usage).
```python
# For local Biomni-R0 deployment
default_config.api_base = "http://localhost:30000/v1"
# For custom OpenAI-compatible endpoints
default_config.api_base = "https://your-endpoint.com/v1"
```
##### `max_retries: int`
Number of retry attempts for failed operations.
```python
default_config.max_retries = 3
```
#### Methods
##### `reset() -> None`
Reset all configuration values to system defaults.
```python
default_config.reset()
```
## Database Query System
Biomni includes a retrieval-augmented generation (RAG) system for querying the biomedical knowledge base.
### Query Functions
#### `query_genes(query: str, top_k: int = 10) -> List[Dict]`
Query gene information from integrated databases.
```python
from biomni.database import query_genes
results = query_genes(
query="genes involved in p53 pathway",
top_k=20
)
```
**Parameters:**
- `query` (str): Natural language or gene identifier query
- `top_k` (int): Number of results to return
**Returns:** List of dictionaries containing:
- `gene_symbol`: Official gene symbol
- `gene_name`: Full gene name
- `description`: Functional description
- `pathways`: Associated biological pathways
- `go_terms`: Gene Ontology annotations
- `diseases`: Associated diseases
- `similarity_score`: Relevance score (0-1)
#### `query_proteins(query: str, top_k: int = 10) -> List[Dict]`
Query protein information from UniProt and other sources.
```python
from biomni.database import query_proteins
results = query_proteins(
query="kinase proteins in cell cycle",
top_k=15
)
```
**Returns:** List of dictionaries with protein metadata:
- `uniprot_id`: UniProt accession
- `protein_name`: Protein name
- `function`: Functional annotation
- `domains`: Protein domains
- `subcellular_location`: Cellular localization
- `similarity_score`: Relevance score
#### `query_drugs(query: str, top_k: int = 10) -> List[Dict]`
Query drug and compound information.
```python
from biomni.database import query_drugs
results = query_drugs(
query="FDA approved cancer drugs targeting EGFR",
top_k=10
)
```
**Returns:** Drug information including:
- `drug_name`: Common name
- `drugbank_id`: DrugBank identifier
- `indication`: Therapeutic indication
- `mechanism`: Mechanism of action
- `targets`: Molecular targets
- `approval_status`: Regulatory status
- `smiles`: Chemical structure (SMILES notation)
#### `query_diseases(query: str, top_k: int = 10) -> List[Dict]`
Query disease information from clinical databases.
```python
from biomni.database import query_diseases
results = query_diseases(
query="autoimmune diseases affecting joints",
top_k=10
)
```
**Returns:** Disease data:
- `disease_name`: Standard disease name
- `disease_id`: Ontology identifier
- `symptoms`: Clinical manifestations
- `associated_genes`: Genetic associations
- `prevalence`: Epidemiological data
#### `query_pathways(query: str, top_k: int = 10) -> List[Dict]`
Query biological pathways from KEGG, Reactome, and other sources.
```python
from biomni.database import query_pathways
results = query_pathways(
query="immune response signaling pathways",
top_k=15
)
```
**Returns:** Pathway information:
- `pathway_name`: Pathway name
- `pathway_id`: Database identifier
- `genes`: Genes in pathway
- `description`: Functional description
- `source`: Database source (KEGG, Reactome, etc.)
## Data Structures
### TaskResult
Result object returned by complex agent operations.
```python
class TaskResult:
success: bool # Whether task completed successfully
output: Any # Task output (varies by task)
code: str # Generated code
execution_time: float # Execution time in seconds
error: Optional[str] # Error message if failed
metadata: Dict # Additional metadata
```
### BiomedicalEntity
Base class for biomedical entities in the knowledge base.
```python
class BiomedicalEntity:
entity_id: str # Unique identifier
entity_type: str # Type (gene, protein, drug, etc.)
name: str # Entity name
description: str # Description
attributes: Dict # Additional attributes
references: List[str] # Literature references
```
## Utility Functions
### `download_data(path: str, force: bool = False) -> None`
Manually download or update the biomedical knowledge base.
```python
from biomni.utils import download_data
download_data(
path='./data',
force=True # Force re-download
)
```
### `validate_environment() -> Dict[str, bool]`
Check if the environment is properly configured.
```python
from biomni.utils import validate_environment
status = validate_environment()
# Returns: {
# 'conda_env': True,
# 'api_keys': True,
# 'data_available': True,
# 'dependencies': True
# }
```
### `list_available_models() -> List[str]`
Get a list of available LLM models based on configured API keys.
```python
from biomni.utils import list_available_models
models = list_available_models()
# Returns: ['claude-sonnet-4-20250514', 'gpt-4o', ...]
```
## Error Handling
### Common Exceptions
#### `BiomniConfigError`
Raised when configuration is invalid or incomplete.
```python
from biomni.exceptions import BiomniConfigError
try:
agent = A1(path='./data')
except BiomniConfigError as e:
print(f"Configuration error: {e}")
```
#### `BiomniExecutionError`
Raised when code generation or execution fails.
```python
from biomni.exceptions import BiomniExecutionError
try:
agent.go("invalid task")
except BiomniExecutionError as e:
print(f"Execution failed: {e}")
# Access failed code: e.code
# Access error details: e.details
```
#### `BiomniDataError`
Raised when knowledge base or data access fails.
```python
from biomni.exceptions import BiomniDataError
try:
results = query_genes("unknown query format")
except BiomniDataError as e:
print(f"Data access error: {e}")
```
#### `BiomniTimeoutError`
Raised when operations exceed timeout limit.
```python
from biomni.exceptions import BiomniTimeoutError
try:
agent.go("very complex long-running task")
except BiomniTimeoutError as e:
print(f"Task timed out after {e.duration} seconds")
# Partial results may be available: e.partial_results
```
## Best Practices
### Efficient Knowledge Retrieval
Pre-query databases for relevant context before complex tasks:
```python
from biomni.database import query_genes, query_pathways
# Gather relevant biological context first
genes = query_genes("cell cycle genes", top_k=50)
pathways = query_pathways("cell cycle regulation", top_k=20)
# Then execute task with enriched context
agent.go(f"""
Analyze the cell cycle progression in this dataset.
Focus on these genes: {[g['gene_symbol'] for g in genes]}
Consider these pathways: {[p['pathway_name'] for p in pathways]}
""")
```
### Error Recovery
Implement robust error handling for production workflows:
```python
from biomni.exceptions import BiomniExecutionError, BiomniTimeoutError
max_attempts = 3
for attempt in range(max_attempts):
try:
agent.go("complex biomedical task")
break
except BiomniTimeoutError:
# Increase timeout and retry
default_config.timeout_seconds *= 2
print(f"Timeout, retrying with {default_config.timeout_seconds}s timeout")
except BiomniExecutionError as e:
# Refine task based on error
print(f"Execution failed: {e}, refining task...")
# Optionally modify task description
else:
print("Task failed after max attempts")
```
### Memory Management
For large-scale analyses, manage memory explicitly:
```python
import gc
# Process datasets in chunks
for chunk_id in range(num_chunks):
agent.go(f"Process data chunk {chunk_id} located at data/chunk_{chunk_id}.h5ad")
# Force garbage collection between chunks
gc.collect()
# Save intermediate results
agent.save_conversation_history(f"./reports/chunk_{chunk_id}.pdf")
```
### Reproducibility
Ensure reproducible analyses by:
1. **Fixing random seeds:**
```python
agent.go("Set random seed to 42 for all analyses, then perform clustering...")
```
2. **Logging configuration:**
```python
import json
from datetime import datetime
config_log = {
'llm': default_config.llm,
'timeout': default_config.timeout_seconds,
'data_path': default_config.data_path,
'timestamp': datetime.now().isoformat()
}
with open('config_log.json', 'w') as f:
json.dump(config_log, f, indent=2)
```
3. **Saving execution traces:**
```python
# Always save detailed reports
agent.save_conversation_history('./reports/full_analysis.pdf')
```
## Performance Optimization
### Model Selection Strategy
Choose models based on task characteristics:
```python
# For exploratory, simple tasks
default_config.llm = "gpt-3.5-turbo" # Fast, cost-effective
# For standard biomedical analyses
default_config.llm = "claude-sonnet-4-20250514" # Recommended
# For complex reasoning and hypothesis generation
default_config.llm = "claude-opus-4-20250514" # Highest quality
# For specialized biological reasoning
default_config.llm = "openai/biomni-r0" # Requires local deployment
```
### Timeout Tuning
Set appropriate timeouts based on task complexity:
```python
# Quick queries and simple analyses
agent = A1(path='./data', timeout=300)
# Standard workflows
agent = A1(path='./data', timeout=1200)
# Full pipelines with ML training
agent = A1(path='./data', timeout=3600)
```
### Caching and Reuse
Reuse agent instances for multiple related tasks:
```python
# Create agent once
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Execute multiple related tasks
tasks = [
"Load and QC the scRNA-seq dataset",
"Perform clustering with resolution 0.5",
"Identify marker genes for each cluster",
"Annotate cell types based on markers"
]
for task in tasks:
agent.go(task)
# Save complete workflow
agent.save_conversation_history('./reports/full_workflow.pdf')
```

View File

@@ -1,649 +0,0 @@
# LLM Provider Configuration Guide
This document provides comprehensive configuration instructions for all LLM providers supported by Biomni.
## Overview
Biomni supports multiple LLM providers through a unified interface. Configure providers using:
- Environment variables
- `.env` files
- Runtime configuration via `default_config`
## Quick Reference Table
| Provider | Recommended For | API Key Required | Cost | Setup Complexity |
|----------|----------------|------------------|------|------------------|
| Anthropic Claude | Most biomedical tasks | Yes | Medium | Easy |
| OpenAI | General tasks | Yes | Medium-High | Easy |
| Azure OpenAI | Enterprise deployment | Yes | Varies | Medium |
| Google Gemini | Multimodal tasks | Yes | Medium | Easy |
| Groq | Fast inference | Yes | Low | Easy |
| Ollama | Local/offline use | No | Free | Medium |
| AWS Bedrock | AWS ecosystem | Yes | Varies | Hard |
| Biomni-R0 | Complex biological reasoning | No | Free | Hard |
## Anthropic Claude (Recommended)
### Overview
Claude models from Anthropic provide excellent biological reasoning capabilities and are the recommended choice for most Biomni tasks.
### Setup
1. **Obtain API Key:**
- Sign up at https://console.anthropic.com/
- Navigate to API Keys section
- Generate a new key
2. **Configure Environment:**
**Option A: Environment Variable**
```bash
export ANTHROPIC_API_KEY="sk-ant-api03-..."
```
**Option B: .env File**
```bash
# .env file in project root
ANTHROPIC_API_KEY=sk-ant-api03-...
```
3. **Set Model in Code:**
```python
from biomni.config import default_config
# Claude Sonnet 4 (Recommended)
default_config.llm = "claude-sonnet-4-20250514"
# Claude Opus 4 (Most capable)
default_config.llm = "claude-opus-4-20250514"
# Claude 3.5 Sonnet (Previous version)
default_config.llm = "claude-3-5-sonnet-20241022"
```
### Available Models
| Model | Context Window | Strengths | Best For |
|-------|---------------|-----------|----------|
| `claude-sonnet-4-20250514` | 200K tokens | Balanced performance, cost-effective | Most biomedical tasks |
| `claude-opus-4-20250514` | 200K tokens | Highest capability, complex reasoning | Difficult multi-step analyses |
| `claude-3-5-sonnet-20241022` | 200K tokens | Fast, reliable | Standard workflows |
| `claude-3-opus-20240229` | 200K tokens | Strong reasoning | Legacy support |
### Advanced Configuration
```python
from biomni.config import default_config
# Use Claude with custom parameters
default_config.llm = "claude-sonnet-4-20250514"
default_config.timeout_seconds = 1800
# Optional: Custom API endpoint (for proxy/enterprise)
default_config.api_base = "https://your-proxy.com/v1"
```
### Cost Estimation
Approximate costs per 1M tokens (as of January 2025):
- Input: $3-15 depending on model
- Output: $15-75 depending on model
For a typical biomedical analysis (~50K tokens total): $0.50-$2.00
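As a rough back-of-the-envelope check using the per-million-token rates quoted above (actual provider pricing varies):
```python
def estimate_cost(input_tokens, output_tokens,
                  input_rate_per_m=3.0, output_rate_per_m=15.0):
    """Estimate API cost in USD from token counts and per-million-token rates."""
    return (input_tokens / 1e6) * input_rate_per_m + (output_tokens / 1e6) * output_rate_per_m

# ~50K-token analysis: 40K input + 10K output at the lower Claude rates
print(f"${estimate_cost(40_000, 10_000):.2f}")              # ~$0.27
# The same analysis at the higher (Opus-tier) rates
print(f"${estimate_cost(40_000, 10_000, 15.0, 75.0):.2f}")  # ~$1.35
```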
## OpenAI
### Overview
OpenAI's GPT models provide strong general capabilities suitable for diverse biomedical tasks.
### Setup
1. **Obtain API Key:**
- Sign up at https://platform.openai.com/
- Navigate to API Keys
- Create new secret key
2. **Configure Environment:**
```bash
export OPENAI_API_KEY="sk-proj-..."
```
Or in `.env`:
```
OPENAI_API_KEY=sk-proj-...
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "gpt-4o" # Recommended
# default_config.llm = "gpt-4" # Previous flagship
# default_config.llm = "gpt-4-turbo" # Fast variant
# default_config.llm = "gpt-3.5-turbo" # Budget option
```
### Available Models
| Model | Context Window | Strengths | Cost |
|-------|---------------|-----------|------|
| `gpt-4o` | 128K tokens | Fast, multimodal | Medium |
| `gpt-4-turbo` | 128K tokens | Fast inference | Medium |
| `gpt-4` | 8K tokens | Reliable | High |
| `gpt-3.5-turbo` | 16K tokens | Fast, cheap | Low |
### Cost Optimization
```python
# For exploratory analysis (budget-conscious)
default_config.llm = "gpt-3.5-turbo"
# For production analysis (quality-focused)
default_config.llm = "gpt-4o"
```
## Azure OpenAI
### Overview
Azure-hosted OpenAI models for enterprise users requiring data residency and compliance.
### Setup
1. **Azure Prerequisites:**
- Active Azure subscription
- Azure OpenAI resource created
- Model deployment configured
2. **Environment Variables:**
```bash
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"
```
3. **Configuration:**
```python
from biomni.config import default_config
# Option 1: Use deployment name
default_config.llm = "azure/your-deployment-name"
# Option 2: Specify endpoint explicitly
default_config.llm = "azure/gpt-4"
default_config.api_base = "https://your-resource.openai.azure.com/"
```
### Deployment Setup
Azure OpenAI requires explicit model deployments:
1. Navigate to Azure OpenAI Studio
2. Create deployment for desired model (e.g., GPT-4)
3. Note the deployment name
4. Use deployment name in Biomni configuration
### Example Configuration
```python
from biomni.config import default_config
import os
# Set Azure credentials
os.environ['AZURE_OPENAI_API_KEY'] = 'your-key'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://your-resource.openai.azure.com/'
# Configure Biomni to use Azure deployment
default_config.llm = "azure/gpt-4-biomni" # Your deployment name
default_config.api_base = os.environ['AZURE_OPENAI_ENDPOINT']
```
## Google Gemini
### Overview
Google's Gemini models offer multimodal capabilities and competitive performance.
### Setup
1. **Obtain API Key:**
- Visit https://makersuite.google.com/app/apikey
- Create new API key
2. **Environment Configuration:**
```bash
export GEMINI_API_KEY="your-key"
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "gemini/gemini-1.5-pro"
# Or: default_config.llm = "gemini/gemini-pro"
```
### Available Models
| Model | Context Window | Strengths |
|-------|---------------|-----------|
| `gemini/gemini-1.5-pro` | 1M tokens | Very large context, multimodal |
| `gemini/gemini-pro` | 32K tokens | Balanced performance |
### Use Cases
Gemini excels at:
- Tasks requiring very large context windows
- Multimodal analysis (when incorporating images)
- Serving as a cost-effective alternative to GPT-4
```python
# For tasks with large context requirements
default_config.llm = "gemini/gemini-1.5-pro"
default_config.timeout_seconds = 2400 # May need longer timeout
```
## Groq
### Overview
Groq provides ultra-fast inference with open-source models, ideal for rapid iteration.
### Setup
1. **Get API Key:**
- Sign up at https://console.groq.com/
- Generate API key
2. **Configure:**
```bash
export GROQ_API_KEY="gsk_..."
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "groq/llama-3.1-70b-versatile"
# Or: default_config.llm = "groq/mixtral-8x7b-32768"
```
### Available Models
| Model | Context Window | Speed | Quality |
|-------|---------------|-------|---------|
| `groq/llama-3.1-70b-versatile` | 32K tokens | Very Fast | Good |
| `groq/mixtral-8x7b-32768` | 32K tokens | Very Fast | Good |
| `groq/llama-3-70b-8192` | 8K tokens | Ultra Fast | Moderate |
### Best Practices
```python
# For rapid prototyping and testing
default_config.llm = "groq/llama-3.1-70b-versatile"
default_config.timeout_seconds = 600 # Groq is fast
# Note: Quality may be lower than GPT-4/Claude for complex tasks
# Recommended for: QC, simple analyses, testing workflows
```
## Ollama (Local Deployment)
### Overview
Run LLMs entirely locally for offline use, data privacy, or cost savings.
### Setup
1. **Install Ollama:**
```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
```
2. **Pull Models:**
```bash
ollama pull llama3 # Meta Llama 3 (8B)
ollama pull mixtral # Mixtral (47B)
ollama pull codellama # Code-specialized
ollama pull medllama # Medical domain (if available)
```
3. **Start Ollama Server:**
```bash
ollama serve # Runs on http://localhost:11434
```
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "ollama/llama3"
default_config.api_base = "http://localhost:11434"
```
### Hardware Requirements
Minimum recommendations:
- **8B models:** 16GB RAM, CPU inference acceptable
- **70B models:** 64GB RAM, GPU highly recommended
- **Storage:** 5-50GB per model
### Model Selection
```python
# Fast, local, good for testing
default_config.llm = "ollama/llama3"
# Better quality (requires more resources)
default_config.llm = "ollama/mixtral"
# Code generation tasks
default_config.llm = "ollama/codellama"
```
### Advantages & Limitations
**Advantages:**
- Complete data privacy
- No API costs
- Offline operation
- Unlimited usage
**Limitations:**
- Lower quality than GPT-4/Claude for complex tasks
- Requires significant hardware
- Slower inference (especially on CPU)
- May struggle with specialized biomedical knowledge
## AWS Bedrock
### Overview
AWS-managed LLM service offering multiple model providers.
### Setup
1. **AWS Prerequisites:**
- AWS account with Bedrock access
- Model access enabled in Bedrock console
- AWS credentials configured
2. **Configure AWS Credentials:**
```bash
# Option 1: AWS CLI
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
```
3. **Enable Model Access:**
- Navigate to AWS Bedrock console
- Request access to desired models
- Wait for approval (may take hours/days)
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
# Or: default_config.llm = "bedrock/anthropic.claude-v2"
```
### Available Models
Bedrock provides access to:
- Anthropic Claude models
- Amazon Titan models
- AI21 Jurassic models
- Cohere Command models
- Meta Llama models
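To confirm which of these model families your AWS account can actually see, the boto3 `bedrock` control-plane client can list them (this assumes Bedrock is enabled in the configured region):
```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.list_foundation_models()
for model in response["modelSummaries"]:
    print(f"{model['providerName']}: {model['modelId']}")
```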
### IAM Permissions
Required IAM policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:*::foundation-model/*"
}
]
}
```
### Example Configuration
```python
from biomni.config import default_config
import boto3
# Verify AWS credentials
session = boto3.Session()
credentials = session.get_credentials()
print(f"AWS Access Key: {credentials.access_key[:8]}...")
# Configure Biomni
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
default_config.timeout_seconds = 1800
```
## Biomni-R0 (Local Specialized Model)
### Overview
Biomni-R0 is a 32B-parameter reasoning model specifically trained for biological problem-solving. It provides the highest quality for complex biomedical reasoning but requires local deployment.
### Setup
1. **Hardware Requirements:**
- GPU with 48GB+ VRAM (e.g., A100, H100)
- Or multi-GPU setup (2x 24GB)
- 100GB+ storage for model weights
2. **Install Dependencies:**
```bash
pip install "sglang[all]"
pip install flashinfer # Optional but recommended
```
3. **Deploy Model:**
```bash
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--mem-fraction-static 0.8
```
For multi-GPU:
```bash
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--tp 2 # Tensor parallelism across 2 GPUs
```
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "openai/biomni-r0"
default_config.api_base = "http://localhost:30000/v1"
default_config.timeout_seconds = 2400 # Longer for complex reasoning
```
### When to Use Biomni-R0
Biomni-R0 excels at:
- Multi-step biological reasoning
- Complex experimental design
- Hypothesis generation and evaluation
- Literature-informed analysis
- Tasks requiring deep biological knowledge
```python
# For complex biological reasoning tasks
default_config.llm = "openai/biomni-r0"
agent.go("""
Design a comprehensive CRISPR screening experiment to identify synthetic
lethal interactions with TP53 mutations in cancer cells, including:
1. Rationale and hypothesis
2. Guide RNA library design strategy
3. Experimental controls
4. Statistical analysis plan
5. Expected outcomes and validation approach
""")
```
### Performance Comparison
| Model | Speed | Biological Reasoning | Code Quality | Cost |
|-------|-------|---------------------|--------------|------|
| GPT-4 | Fast | Good | Excellent | Medium |
| Claude Sonnet 4 | Fast | Excellent | Excellent | Medium |
| Biomni-R0 | Moderate | Outstanding | Good | Free (local) |
## Multi-Provider Strategy
### Intelligent Model Selection
Use different models for different task types:
```python
from biomni.agent import A1
from biomni.config import default_config
# Strategy 1: Task-based selection
def get_agent_for_task(task_complexity):
if task_complexity == "simple":
default_config.llm = "gpt-3.5-turbo"
default_config.timeout_seconds = 300
elif task_complexity == "medium":
default_config.llm = "claude-sonnet-4-20250514"
default_config.timeout_seconds = 1200
else: # complex
default_config.llm = "openai/biomni-r0"
default_config.timeout_seconds = 2400
return A1(path='./data')
# Strategy 2: Fallback on failure
def execute_with_fallback(task):
models = [
"claude-sonnet-4-20250514",
"gpt-4o",
"claude-opus-4-20250514"
]
for model in models:
try:
default_config.llm = model
agent = A1(path='./data')
agent.go(task)
return
except Exception as e:
print(f"Failed with {model}: {e}, trying next...")
raise Exception("All models failed")
```
### Cost Optimization Strategy
```python
# Phase 1: Rapid prototyping with cheap models
default_config.llm = "gpt-3.5-turbo"
agent.go("Quick exploratory analysis of dataset structure")
# Phase 2: Detailed analysis with high-quality models
default_config.llm = "claude-sonnet-4-20250514"
agent.go("Comprehensive differential expression analysis with pathway enrichment")
# Phase 3: Complex reasoning with specialized models
default_config.llm = "openai/biomni-r0"
agent.go("Generate biological hypotheses based on multi-omics integration")
```
## Troubleshooting
### Common Issues
**Issue: "API key not found"**
- Verify environment variable is set: `echo $ANTHROPIC_API_KEY`
- Check `.env` file exists and is in correct location
- Try setting key programmatically: `os.environ['ANTHROPIC_API_KEY'] = 'key'`
**Issue: "Rate limit exceeded"**
- Implement exponential backoff and retry (see the sketch after this list)
- Upgrade API tier if available
- Switch to alternative provider temporarily
**Issue: "Model not found"**
- Verify model identifier is correct
- Check API key has access to requested model
- For Azure: ensure deployment exists with exact name
**Issue: "Timeout errors"**
- Increase `default_config.timeout_seconds`
- Break complex tasks into smaller steps
- Consider using faster model for initial phases
**Issue: "Connection refused (Ollama/Biomni-R0)"**
- Verify local server is running
- Check port is not blocked by firewall
- Confirm `api_base` URL is correct
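Returning to the rate-limit case above, a minimal backoff sketch. Biomni does not document a dedicated rate-limit exception, so the string check below is a placeholder for whatever error your provider's client raises:
```python
import time

def go_with_backoff(agent, task, max_attempts=5, base_delay=2.0):
    """Retry agent.go() with exponential backoff on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return agent.go(task)
        except Exception as e:  # replace with your provider's rate-limit exception
            if "rate limit" not in str(e).lower() or attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Rate limited, retrying in {delay:.0f}s...")
            time.sleep(delay)
```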
### Testing Configuration
```python
from biomni.utils import list_available_models, validate_environment
# Check environment setup
status = validate_environment()
print("Environment Status:", status)
# List available models based on configured keys
models = list_available_models()
print("Available Models:", models)
# Test specific model
try:
from biomni.agent import A1
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
agent.go("Print 'Configuration successful!'")
except Exception as e:
print(f"Configuration test failed: {e}")
```
## Best Practices Summary
1. **For most users:** Start with Claude Sonnet 4 or GPT-4o
2. **For cost sensitivity:** Use GPT-3.5-turbo for exploration, Claude Sonnet 4 for production
3. **For privacy/offline:** Deploy Ollama locally
4. **For complex reasoning:** Use Biomni-R0 if hardware available
5. **For enterprise:** Consider Azure OpenAI or AWS Bedrock
6. **For speed:** Use Groq for rapid iteration
7. **Always:**
- Set appropriate timeouts
- Implement error handling and retries
- Log model and configuration for reproducibility
- Test configuration before production use

File diff suppressed because it is too large

View File

@@ -1,381 +0,0 @@
#!/usr/bin/env python3
"""
Enhanced PDF Report Generation for Biomni
This script provides advanced PDF report generation with custom formatting,
styling, and metadata for Biomni analysis results.
"""
import argparse
import sys
from pathlib import Path
from datetime import datetime
from typing import Optional, Dict, Any
def generate_markdown_report(
title: str,
sections: list,
metadata: Optional[Dict[str, Any]] = None,
output_path: str = "report.md"
) -> str:
"""
Generate a formatted markdown report.
Args:
title: Report title
sections: List of dicts with 'heading' and 'content' keys
metadata: Optional metadata dict (author, date, etc.)
output_path: Path to save markdown file
Returns:
Path to generated markdown file
"""
md_content = []
# Title
md_content.append(f"# {title}\n")
# Metadata
if metadata:
md_content.append("---\n")
for key, value in metadata.items():
md_content.append(f"**{key}:** {value} \n")
md_content.append("---\n\n")
# Sections
for section in sections:
heading = section.get('heading', 'Section')
content = section.get('content', '')
level = section.get('level', 2) # Default to h2
md_content.append(f"{'#' * level} {heading}\n\n")
md_content.append(f"{content}\n\n")
# Write to file
output = Path(output_path)
output.write_text('\n'.join(md_content))
return str(output)
def convert_to_pdf_weasyprint(
markdown_path: str,
output_path: str,
css_style: Optional[str] = None
) -> bool:
"""
Convert markdown to PDF using WeasyPrint.
Args:
markdown_path: Path to markdown file
output_path: Path for output PDF
css_style: Optional CSS stylesheet path
Returns:
True if successful, False otherwise
"""
try:
import markdown
from weasyprint import HTML, CSS
# Read markdown
with open(markdown_path, 'r') as f:
md_content = f.read()
# Convert to HTML
html_content = markdown.markdown(
md_content,
extensions=['tables', 'fenced_code', 'codehilite']
)
# Wrap in HTML template
html_template = f"""
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Biomni Report</title>
<style>
body {{
font-family: 'Helvetica', 'Arial', sans-serif;
line-height: 1.6;
color: #333;
max-width: 800px;
margin: 40px auto;
padding: 20px;
}}
h1 {{
color: #2c3e50;
border-bottom: 3px solid #3498db;
padding-bottom: 10px;
}}
h2 {{
color: #34495e;
margin-top: 30px;
border-bottom: 1px solid #bdc3c7;
padding-bottom: 5px;
}}
h3 {{
color: #7f8c8d;
}}
code {{
background-color: #f4f4f4;
padding: 2px 6px;
border-radius: 3px;
font-family: 'Courier New', monospace;
}}
pre {{
background-color: #f4f4f4;
padding: 15px;
border-radius: 5px;
overflow-x: auto;
}}
table {{
border-collapse: collapse;
width: 100%;
margin: 20px 0;
}}
th, td {{
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}}
th {{
background-color: #3498db;
color: white;
}}
tr:nth-child(even) {{
background-color: #f9f9f9;
}}
.metadata {{
background-color: #ecf0f1;
padding: 15px;
border-radius: 5px;
margin: 20px 0;
}}
</style>
</head>
<body>
{html_content}
</body>
</html>
"""
# Generate PDF
pdf = HTML(string=html_template)
# Add custom CSS if provided
stylesheets = []
if css_style and Path(css_style).exists():
stylesheets.append(CSS(filename=css_style))
pdf.write_pdf(output_path, stylesheets=stylesheets)
return True
except ImportError:
print("Error: WeasyPrint not installed. Install with: pip install weasyprint")
return False
except Exception as e:
print(f"Error generating PDF: {e}")
return False
def convert_to_pdf_pandoc(markdown_path: str, output_path: str) -> bool:
"""
Convert markdown to PDF using Pandoc.
Args:
markdown_path: Path to markdown file
output_path: Path for output PDF
Returns:
True if successful, False otherwise
"""
try:
import subprocess
# Check if pandoc is installed
result = subprocess.run(
['pandoc', '--version'],
capture_output=True,
text=True
)
if result.returncode != 0:
print("Error: Pandoc not installed")
return False
# Convert with pandoc
result = subprocess.run(
[
'pandoc',
markdown_path,
'-o', output_path,
'--pdf-engine=pdflatex',
'-V', 'geometry:margin=1in',
'--toc'
],
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Pandoc error: {result.stderr}")
return False
return True
except FileNotFoundError:
print("Error: Pandoc not found. Install from https://pandoc.org/")
return False
except Exception as e:
print(f"Error: {e}")
return False
def create_biomni_report(
conversation_history: list,
output_path: str = "biomni_report.pdf",
method: str = "weasyprint"
) -> bool:
"""
Create a formatted PDF report from Biomni conversation history.
Args:
conversation_history: List of conversation turns
output_path: Output PDF path
method: Conversion method ('weasyprint' or 'pandoc')
Returns:
True if successful
"""
# Prepare report sections
metadata = {
'Date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
'Tool': 'Biomni AI Agent',
'Report Type': 'Analysis Summary'
}
sections = []
# Executive Summary
sections.append({
'heading': 'Executive Summary',
'level': 2,
'content': 'This report contains the complete analysis workflow executed by the Biomni biomedical AI agent.'
})
# Conversation history
for i, turn in enumerate(conversation_history, 1):
sections.append({
'heading': f'Task {i}: {turn.get("task", "Analysis")}',
'level': 2,
'content': f'**Input:**\n```\n{turn.get("input", "")}\n```\n\n**Output:**\n{turn.get("output", "")}'
})
# Generate markdown
md_path = output_path.replace('.pdf', '.md')
generate_markdown_report(
title="Biomni Analysis Report",
sections=sections,
metadata=metadata,
output_path=md_path
)
# Convert to PDF
if method == 'weasyprint':
success = convert_to_pdf_weasyprint(md_path, output_path)
elif method == 'pandoc':
success = convert_to_pdf_pandoc(md_path, output_path)
else:
print(f"Unknown method: {method}")
return False
if success:
print(f"✓ Report generated: {output_path}")
print(f" Markdown: {md_path}")
else:
print("✗ Failed to generate PDF")
print(f" Markdown available: {md_path}")
return success
def main():
"""CLI for report generation."""
parser = argparse.ArgumentParser(
description='Generate formatted PDF reports for Biomni analyses'
)
parser.add_argument(
'input',
type=str,
help='Input markdown file or conversation history'
)
parser.add_argument(
'-o', '--output',
type=str,
default='biomni_report.pdf',
help='Output PDF path (default: biomni_report.pdf)'
)
parser.add_argument(
'-m', '--method',
type=str,
choices=['weasyprint', 'pandoc'],
default='weasyprint',
help='Conversion method (default: weasyprint)'
)
parser.add_argument(
'--css',
type=str,
help='Custom CSS stylesheet path'
)
args = parser.parse_args()
# Check if input is markdown or conversation history
input_path = Path(args.input)
if not input_path.exists():
print(f"Error: Input file not found: {args.input}")
return 1
# If input is markdown, convert directly
if input_path.suffix == '.md':
if args.method == 'weasyprint':
success = convert_to_pdf_weasyprint(
str(input_path),
args.output,
args.css
)
else:
success = convert_to_pdf_pandoc(str(input_path), args.output)
return 0 if success else 1
# Otherwise, assume it's conversation history (JSON)
try:
import json
with open(input_path) as f:
history = json.load(f)
success = create_biomni_report(
history,
args.output,
args.method
)
return 0 if success else 1
except json.JSONDecodeError:
print("Error: Input file is not valid JSON or markdown")
return 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,230 +0,0 @@
#!/usr/bin/env python3
"""
Biomni Environment Setup and Validation Script
This script helps users set up and validate their Biomni environment,
including checking dependencies, API keys, and data availability.
"""
import os
import sys
import subprocess
from pathlib import Path
from typing import Any, Dict, List, Tuple
def check_python_version() -> Tuple[bool, str]:
"""Check if Python version is compatible."""
version = sys.version_info
if version.major == 3 and version.minor >= 8:
return True, f"Python {version.major}.{version.minor}.{version.micro}"
else:
return False, f"Python {version.major}.{version.minor} - requires Python 3.8+"
def check_conda_env() -> Tuple[bool, str]:
"""Check if running in biomni conda environment."""
conda_env = os.environ.get('CONDA_DEFAULT_ENV', None)
if conda_env == 'biomni_e1':
return True, f"Conda environment: {conda_env}"
else:
return False, f"Not in biomni_e1 environment (current: {conda_env})"
def check_package_installed(package: str) -> bool:
"""Check if a Python package is installed."""
try:
__import__(package)
return True
except ImportError:
return False
def check_dependencies() -> Tuple[bool, List[str]]:
"""Check for required and optional dependencies."""
required = ['biomni']
optional = ['weasyprint', 'markdown2pdf']
missing_required = [pkg for pkg in required if not check_package_installed(pkg)]
missing_optional = [pkg for pkg in optional if not check_package_installed(pkg)]
messages = []
success = len(missing_required) == 0
if missing_required:
messages.append(f"Missing required packages: {', '.join(missing_required)}")
messages.append("Install with: pip install biomni --upgrade")
else:
messages.append("Required packages: ✓")
if missing_optional:
messages.append(f"Missing optional packages: {', '.join(missing_optional)}")
messages.append("For PDF reports, install: pip install weasyprint")
return success, messages
def check_api_keys() -> Tuple[bool, Dict[str, bool]]:
"""Check which API keys are configured."""
api_keys = {
'ANTHROPIC_API_KEY': os.environ.get('ANTHROPIC_API_KEY'),
'OPENAI_API_KEY': os.environ.get('OPENAI_API_KEY'),
'GEMINI_API_KEY': os.environ.get('GEMINI_API_KEY'),
'GROQ_API_KEY': os.environ.get('GROQ_API_KEY'),
}
configured = {key: bool(value) for key, value in api_keys.items()}
has_any = any(configured.values())
return has_any, configured
def check_data_directory(data_path: str = './data') -> Tuple[bool, str]:
"""Check if Biomni data directory exists and has content."""
path = Path(data_path)
if not path.exists():
return False, f"Data directory not found at {data_path}"
# Check if directory has files (data has been downloaded)
files = list(path.glob('*'))
if len(files) == 0:
return False, f"Data directory exists but is empty. Run agent once to download."
# Rough size check (should be ~11GB)
total_size = sum(f.stat().st_size for f in path.rglob('*') if f.is_file())
size_gb = total_size / (1024**3)
if size_gb < 1:
return False, f"Data directory exists but seems incomplete ({size_gb:.1f} GB)"
return True, f"Data directory: {data_path} ({size_gb:.1f} GB) ✓"
def check_disk_space(required_gb: float = 20) -> Tuple[bool, str]:
"""Check if sufficient disk space is available."""
try:
import shutil
stat = shutil.disk_usage('.')
free_gb = stat.free / (1024**3)
if free_gb >= required_gb:
return True, f"Disk space: {free_gb:.1f} GB available ✓"
else:
return False, f"Low disk space: {free_gb:.1f} GB (need {required_gb} GB)"
except Exception as e:
return False, f"Could not check disk space: {e}"
def test_biomni_import() -> Tuple[bool, str]:
"""Test if Biomni can be imported and initialized."""
try:
from biomni.agent import A1
from biomni.config import default_config
return True, "Biomni import successful ✓"
except ImportError as e:
return False, f"Cannot import Biomni: {e}"
except Exception as e:
return False, f"Biomni import error: {e}"
def suggest_fixes(results: Dict[str, Tuple[bool, Any]]) -> List[str]:
"""Generate suggestions for fixing issues."""
suggestions = []
if not results['python'][0]:
suggestions.append("➜ Upgrade Python to 3.8 or higher")
if not results['conda'][0]:
suggestions.append("➜ Activate biomni environment: conda activate biomni_e1")
if not results['dependencies'][0]:
suggestions.append("➜ Install Biomni: pip install biomni --upgrade")
if not results['api_keys'][0]:
suggestions.append("➜ Set API key: export ANTHROPIC_API_KEY='your-key'")
suggestions.append(" Or create .env file with API keys")
if not results['data'][0]:
suggestions.append("➜ Data will auto-download on first agent.go() call")
if not results['disk_space'][0]:
suggestions.append("➜ Free up disk space (need ~20GB total)")
return suggestions
def main():
"""Run all environment checks and display results."""
print("=" * 60)
print("Biomni Environment Validation")
print("=" * 60)
print()
# Run all checks
results = {}
print("Checking Python version...")
results['python'] = check_python_version()
print(f" {results['python'][1]}")
print()
print("Checking conda environment...")
results['conda'] = check_conda_env()
print(f" {results['conda'][1]}")
print()
print("Checking dependencies...")
results['dependencies'] = check_dependencies()
for msg in results['dependencies'][1]:
print(f" {msg}")
print()
print("Checking API keys...")
results['api_keys'] = check_api_keys()
has_keys, key_status = results['api_keys']
for key, configured in key_status.items():
status = "✓" if configured else "✗"
print(f" {key}: {status}")
print()
print("Checking Biomni data directory...")
results['data'] = check_data_directory()
print(f" {results['data'][1]}")
print()
print("Checking disk space...")
results['disk_space'] = check_disk_space()
print(f" {results['disk_space'][1]}")
print()
print("Testing Biomni import...")
results['biomni_import'] = test_biomni_import()
print(f" {results['biomni_import'][1]}")
print()
# Summary
print("=" * 60)
all_passed = all(result[0] for result in results.values())
if all_passed:
print("✓ All checks passed! Environment is ready.")
print()
print("Quick start:")
print(" from biomni.agent import A1")
print(" agent = A1(path='./data', llm='claude-sonnet-4-20250514')")
print(" agent.go('Your biomedical task')")
else:
print("⚠ Some checks failed. See suggestions below:")
print()
suggestions = suggest_fixes(results)
for suggestion in suggestions:
print(suggestion)
print("=" * 60)
return 0 if all_passed else 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,530 +0,0 @@
---
name: pyopenms
description: "Mass spectrometry toolkit (OpenMS Python). Process mzML/mzXML, peak picking, feature detection, peptide ID, proteomics/metabolomics workflows, for LC-MS/MS analysis."
---
# pyOpenMS
## Overview
pyOpenMS is an open-source Python library for mass spectrometry data analysis in proteomics and metabolomics. Process LC-MS/MS data, perform peptide identification, detect and quantify features, and integrate with common proteomics tools (Comet, Mascot, MSGF+, Percolator, MSstats) using Python bindings to the OpenMS C++ library.
## When to Use This Skill
This skill should be used when:
- Processing mass spectrometry data (mzML, mzXML files)
- Performing peak picking and feature detection in LC-MS data
- Conducting peptide and protein identification workflows
- Quantifying metabolites or proteins
- Integrating proteomics or metabolomics tools into Python pipelines
- Working with OpenMS tools and file formats
## Core Capabilities
### 1. File I/O and Data Import/Export
Handle diverse mass spectrometry file formats efficiently:
**Supported Formats:**
- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
- **FASTA**: Protein/peptide sequence databases
- **mzTab**: Standardized reporting format for identification and quantification
- **mzIdentML**: Peptide and protein identification data
- **TraML**: Transition lists for targeted experiments
- **pepXML/protXML**: Search engine results
**Reading mzML Files:**
```python
import pyopenms as oms
# Load MS data
exp = oms.MSExperiment()
oms.MzMLFile().load("input_data.mzML", exp)
# Access basic information
print(f"Number of spectra: {exp.getNrSpectra()}")
print(f"Number of chromatograms: {exp.getNrChromatograms()}")
```
**Writing mzML Files:**
```python
# Save processed data
oms.MzMLFile().store("output_data.mzML", exp)
```
**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
### 2. MS Data Structures and Manipulation
Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
**MSSpectrum** - Individual mass spectrum:
```python
# Create spectrum with metadata
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2) # Retention time in seconds
spectrum.setMSLevel(2) # MS2 spectrum
# Set peak data (m/z, intensity arrays)
mz_array = [100.5, 200.3, 300.7, 400.2]
intensity_array = [1000, 5000, 3000, 2000]
spectrum.set_peaks((mz_array, intensity_array))
# Add precursor information for MS2
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])
```
**MSExperiment** - Complete LC-MS/MS run:
```python
# Create experiment and add spectra
exp = oms.MSExperiment()
exp.addSpectrum(spectrum)
# Access spectra
first_spectrum = exp.getSpectrum(0)
for spec in exp:
print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
```
**MSChromatogram** - Extracted ion chromatogram:
```python
# Create chromatogram
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000])) # RT, intensity
exp.addChromatogram(chrom)
```
**Efficient Peak Access:**
```python
# Get peaks as numpy arrays for fast processing
mz_array, intensity_array = spectrum.get_peaks()
# Modify and set back
intensity_array *= 2 # Double all intensities
spectrum.set_peaks((mz_array, intensity_array))
```
### 3. Chemistry and Peptide Handling
Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
**Molecular Formulas and Mass Calculations:**
```python
# Create empirical formula
formula = oms.EmpiricalFormula("C6H12O6") # Glucose
print(f"Monoisotopic mass: {formula.getMonoWeight()}")
print(f"Average mass: {formula.getAverageWeight()}")
# Formula arithmetic
water = oms.EmpiricalFormula("H2O")
dehydrated = formula - water
# Isotope-specific formulas
heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
```
**Isotopic Distributions:**
```python
# Generate coarse isotope pattern (unit mass resolution)
coarse_gen = oms.CoarseIsotopePatternGenerator()
pattern = coarse_gen.run(formula)
# Generate fine structure (high resolution)
fine_gen = oms.FineIsotopePatternGenerator(0.01) # 0.01 Da resolution
fine_pattern = fine_gen.run(formula)
```
**Amino Acids and Residues:**
```python
# Access residue information
res_db = oms.ResidueDB()
leucine = res_db.getResidue("Leucine")
print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
print(f"L formula: {leucine.getFormula()}")
print(f"L pKa: {leucine.getPka()}")
```
**Peptide Sequences:**
```python
# Create peptide sequence
peptide = oms.AASequence.fromString("PEPTIDE")
print(f"Peptide mass: {peptide.getMonoWeight()}")
print(f"Formula: {peptide.getFormula()}")
# Add modifications
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
print(f"Modified mass: {modified.getMonoWeight()}")
# Theoretical fragmentation
ions = []
for i in range(1, peptide.size()):
b_ion = peptide.getPrefix(i)
y_ion = peptide.getSuffix(i)
ions.append(('b', i, b_ion.getMonoWeight()))
ions.append(('y', i, y_ion.getMonoWeight()))
```
**Protein Digestion:**
```python
# Enzymatic digestion
dig = oms.ProteaseDigestion()
dig.setEnzyme("Trypsin")
dig.setMissedCleavages(2)
protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein_seq, peptides)
for pep in peptides:
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
```
**Modifications:**
```python
# Access modification database
mod_db = oms.ModificationsDB()
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
print(f"Residues: {oxidation.getResidues()}")
```
### 4. Signal Processing and Filtering
Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
**Spectral Smoothing:**
```python
# Gaussian smoothing
gauss_filter = oms.GaussFilter()
params = gauss_filter.getParameters()
params.setValue("gaussian_width", 0.2)
gauss_filter.setParameters(params)
gauss_filter.filterExperiment(exp)
# Savitzky-Golay filter
sg_filter = oms.SavitzkyGolayFilter()
sg_filter.filterExperiment(exp)
```
**Peak Filtering:**
```python
# Keep only N largest peaks per spectrum
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100) # Keep top 100 peaks
n_largest.setParameters(params)
n_largest.filterExperiment(exp)
# Threshold filtering
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0) # Remove peaks below 1000 intensity
threshold_filter.setParameters(params)
threshold_filter.filterExperiment(exp)
# Window-based filtering
window_filter = oms.WindowMower()
params = window_filter.getParameters()
params.setValue("windowsize", 50.0) # 50 m/z windows
params.setValue("peakcount", 10) # Keep 10 highest per window
window_filter.setParameters(params)
window_filter.filterExperiment(exp)
```
**Spectrum Normalization:**
```python
normalizer = oms.Normalizer()
normalizer.filterExperiment(exp)
```
**MS Level Filtering:**
```python
# Keep only MS2 spectra
exp.filterMSLevel(2)
# Filter by retention time range
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
# Filter by m/z range
exp.filterMZ(400.0, 1500.0) # Keep m/z between 400-1500
```
### 5. Feature Detection and Quantification
Detect and quantify features in LC-MS data:
**Peak Picking (Centroiding):**
```python
# Convert profile data to centroid
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0)
picker.setParameters(params)
exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)
```
**Feature Detection:**
```python
# Detect features across LC-MS runs
feature_finder = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
feature_finder.run(exp, features, params)
print(f"Found {features.size()} features")
for feature in features:
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
f"Intensity: {feature.getIntensity():.0f}")
```
**Feature Linking (Map Alignment):**
```python
# Link features across multiple samples
feature_grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()
# Provide multiple feature maps from different samples
feature_maps = [features1, features2, features3]
feature_grouper.group(feature_maps, consensus_map)
```
### 6. Peptide Identification Workflows
Integrate with search engines and process identification results:
**Database Searching:**
```python
# Prepare parameters for search engine
params = oms.Param()
params.setValue("database", "uniprot_human.fasta")
params.setValue("precursor_mass_tolerance", 10.0) # ppm
params.setValue("fragment_mass_tolerance", 0.5) # Da
params.setValue("enzyme", "Trypsin")
params.setValue("missed_cleavages", 2)
# Variable modifications
params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
# Fixed modifications
params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
```
**FDR Control:**
```python
# False discovery rate estimation
fdr = oms.FalseDiscoveryRate()
fdr_threshold = 0.01 # 1% FDR
# Apply to peptide identifications
protein_ids = []
peptide_ids = []
oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
fdr.apply(protein_ids, peptide_ids)
```
### 7. Metabolomics Workflows
Analyze small molecule data:
**Adduct Detection:**
```python
# Common metabolite adducts and their monoisotopic m/z shifts (singly charged)
adduct_shifts = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
    "[M+Cl]-": 34.969402,
}
# Feature annotation with adducts
for feature in features:
    mz = feature.getMZ()
    # Calculate the neutral mass implied by each adduct hypothesis
    for adduct, shift in adduct_shifts.items():
        neutral_mass = mz - shift
        print(f"{adduct}: neutral mass {neutral_mass:.4f}")
```
**Isotope Pattern Matching:**
```python
# Compare experimental to theoretical isotope patterns
experimental_pattern = [] # Extract from feature
theoretical = coarse_gen.run(formula)
# Calculate similarity score
similarity = compare_isotope_patterns(experimental_pattern, theoretical)
```
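The snippet above leaves `compare_isotope_patterns` undefined. One possible sketch is a cosine similarity over the first few isotope intensities, assuming the experimental pattern is a list of `(mz, intensity)` tuples already aligned with the theoretical peaks by isotope index:
```python
import numpy as np

def compare_isotope_patterns(experimental, theoretical_dist, n_peaks=4):
    """Cosine similarity between experimental and theoretical isotope intensities."""
    exp_int = np.array([intensity for _, intensity in experimental[:n_peaks]], dtype=float)
    theo_int = np.array(
        [p.getIntensity() for p in theoretical_dist.getContainer()[:n_peaks]], dtype=float
    )
    k = min(len(exp_int), len(theo_int))
    if k == 0:
        return 0.0
    a, b = exp_int[:k], theo_int[:k]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```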
### 8. Quality Control and Visualization
Monitor data quality and visualize results:
**Basic Statistics:**
```python
# Calculate TIC (Total Ion Current)
tic_values = []
rt_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
tic = sum(spectrum.get_peaks()[1]) # Sum intensities
tic_values.append(tic)
rt_values.append(spectrum.getRT())
# Base peak chromatogram
bpc_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
bpc_values.append(max_intensity)
```
**Plotting (with pyopenms.plotting or matplotlib):**
```python
import matplotlib.pyplot as plt
# Plot TIC
plt.figure(figsize=(10, 4))
plt.plot(rt_values, tic_values)
plt.xlabel('Retention Time (s)')
plt.ylabel('Total Ion Current')
plt.title('TIC')
plt.show()
# Plot single spectrum
spectrum = exp.getSpectrum(0)
mz, intensity = spectrum.get_peaks()
plt.stem(mz, intensity, basefmt=' ')
plt.xlabel('m/z')
plt.ylabel('Intensity')
plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
plt.show()
```
## Common Workflows
### Complete LC-MS/MS Processing Pipeline
```python
import pyopenms as oms
# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw_data.mzML", exp)
# 2. Filter and smooth
exp.filterMSLevel(1) # Keep only MS1 for feature detection
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)
# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)
# 4. Feature detection
ff = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
ff.run(exp_centroid, features, oms.Param())
# 5. Export results
oms.FeatureXMLFile().store("features.featureXML", features)
print(f"Detected {features.size()} features")
```
### Theoretical Peptide Mass Calculation
```python
# Calculate masses for peptide with modifications
peptide = oms.AASequence.fromString("PEPTIDEK")
print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
# With modification
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
# Calculate for different charge states
for z in [1, 2, 3]:
mz = (peptide.getMonoWeight() + z * 1.007276) / z
print(f"[M+{z}H]^{z}+: {mz:.4f}")
```
## Installation
Ensure pyOpenMS is installed before using this skill:
```bash
# Via conda (recommended)
conda install -c bioconda pyopenms
# Via pip
pip install pyopenms
```
## Integration with Other Tools
pyOpenMS integrates seamlessly with:
- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
- **Post-processing**: Percolator, MSstats, Epiphany
- **Metabolomics**: SIRIUS, CSI:FingerID
- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
- **Visualization**: Matplotlib, Seaborn for plotting
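As an example of the pandas integration, recent pyOpenMS releases expose `get_df()` on `FeatureMap` (and `MSExperiment`) for conversion to a DataFrame; this sketch assumes such a build with pandas installed:
```python
import pyopenms as oms

features = oms.FeatureMap()
oms.FeatureXMLFile().load("features.featureXML", features)

# Convert to a pandas DataFrame for downstream analysis
df = features.get_df()
print(df.head())
print(f"{len(df)} features exported")
```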
## Resources
### references/
Detailed documentation on core concepts:
- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
Load these references when needing detailed information about specific pyOpenMS capabilities.
## Best Practices
1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
4. **Memory**: For large datasets, process spectra iteratively rather than loading entire experiments (see the sketch after this list)
5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
6. **Modifications**: Use standard modification names from UniMod database
7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
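For point 4 above, one way to keep memory bounded is indexed on-disk access, which loads spectra lazily instead of reading the whole run into memory. A sketch assuming an indexed mzML file:
```python
import pyopenms as oms

# Open the file without loading all spectra into memory
od_exp = oms.OnDiscMSExperiment()
od_exp.openFile("large_run.mzML")  # hypothetical path; must be an indexed mzML

ms1_count = 0
for i in range(od_exp.getNrSpectra()):
    spec = od_exp.getSpectrum(i)  # loaded on demand, one spectrum at a time
    if spec.getMSLevel() == 1:
        ms1_count += 1
print(f"MS1 spectra: {ms1_count}")
```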
## Common Patterns
**Algorithm Application Pattern:**
```python
# 1. Instantiate algorithm
algorithm = oms.SomeAlgorithm()
# 2. Get and configure parameters
params = algorithm.getParameters()
params.setValue("parameter_name", value)
algorithm.setParameters(params)
# 3. Apply to data
algorithm.filterExperiment(exp) # or .process(), .run(), depending on algorithm
```
**File I/O Pattern:**
```python
# Read
data_container = oms.DataContainer() # MSExperiment, FeatureMap, etc.
oms.FileHandler().load("input.format", data_container)
# Process
# ... manipulate data_container ...
# Write
oms.FileHandler().store("output.format", data_container)
```
## Getting Help
- **Documentation**: https://pyopenms.readthedocs.io/
- **API Reference**: Browse class documentation for detailed method signatures
- **OpenMS Website**: https://www.openms.org/
- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues

View File

@@ -1,643 +0,0 @@
# pyOpenMS Algorithms Reference
This document provides comprehensive coverage of algorithms available in pyOpenMS for signal processing, feature detection, and quantification.
## Algorithm Usage Pattern
Most pyOpenMS algorithms follow a consistent pattern:
```python
import pyopenms as oms
# 1. Instantiate algorithm
algorithm = oms.AlgorithmName()
# 2. Get parameters
params = algorithm.getParameters()
# 3. Modify parameters
params.setValue("parameter_name", value)
# 4. Set parameters back
algorithm.setParameters(params)
# 5. Apply to data
algorithm.filterExperiment(exp) # or .process(), .run(), etc.
```
## Signal Processing Algorithms
### Smoothing Filters
#### GaussFilter - Gaussian Smoothing
Applies Gaussian smoothing to reduce noise.
```python
gauss = oms.GaussFilter()
# Configure parameters
params = gauss.getParameters()
params.setValue("gaussian_width", 0.2) # Gaussian width (larger = more smoothing)
params.setValue("ppm_tolerance", 10.0) # PPM tolerance for spacing
params.setValue("use_ppm_tolerance", "true")
gauss.setParameters(params)
# Apply to experiment
gauss.filterExperiment(exp)
# Or apply to single spectrum
spectrum_smoothed = oms.MSSpectrum()
gauss.filter(spectrum, spectrum_smoothed)
```
**Key Parameters:**
- `gaussian_width`: Width of Gaussian kernel (default: 0.2 Da)
- `ppm_tolerance`: Tolerance in ppm for spacing
- `use_ppm_tolerance`: Whether to use ppm instead of absolute spacing
#### SavitzkyGolayFilter
Applies Savitzky-Golay smoothing (polynomial fitting).
```python
sg_filter = oms.SavitzkyGolayFilter()
params = sg_filter.getParameters()
params.setValue("frame_length", 11) # Window size (must be odd)
params.setValue("polynomial_order", 3) # Polynomial degree
sg_filter.setParameters(params)
sg_filter.filterExperiment(exp)
```
**Key Parameters:**
- `frame_length`: Size of smoothing window (must be odd)
- `polynomial_order`: Degree of polynomial (typically 2-4)
### Peak Filtering
#### NLargest - Keep Top N Peaks
Retains only the N most intense peaks per spectrum.
```python
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100) # Keep top 100 peaks
params.setValue("threshold", 0.0) # Optional minimum intensity
n_largest.setParameters(params)
n_largest.filterExperiment(exp)
```
**Key Parameters:**
- `n`: Number of peaks to keep per spectrum
- `threshold`: Minimum absolute intensity threshold
#### ThresholdMower - Intensity Threshold Filtering
Removes peaks below a specified intensity threshold.
```python
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0) # Absolute intensity threshold
threshold_filter.setParameters(params)
threshold_filter.filterExperiment(exp)
```
**Key Parameters:**
- `threshold`: Absolute intensity cutoff
#### WindowMower - Window-Based Peak Selection
Divides m/z range into windows and keeps top N peaks per window.
```python
window_mower = oms.WindowMower()
params = window_mower.getParameters()
params.setValue("windowsize", 50.0) # Window size in Da (or Thomson)
params.setValue("peakcount", 10) # Peaks to keep per window
params.setValue("movetype", "jump") # "jump" or "slide"
window_mower.setParameters(params)
window_mower.filterExperiment(exp)
```
**Key Parameters:**
- `windowsize`: Size of m/z window (Da)
- `peakcount`: Number of peaks to retain per window
- `movetype`: "jump" (non-overlapping) or "slide" (overlapping windows)
#### BernNorm - Bernoulli Normalization
Statistical normalization based on Bernoulli distribution.
```python
bern_norm = oms.BernNorm()
params = bern_norm.getParameters()
params.setValue("threshold", 0.7) # Threshold for normalization
bern_norm.setParameters(params)
bern_norm.filterExperiment(exp)
```
### Spectrum Normalization
#### Normalizer
Normalizes spectrum intensities to unit total intensity or maximum intensity.
```python
normalizer = oms.Normalizer()
params = normalizer.getParameters()
params.setValue("method", "to_one") # "to_one" or "to_TIC"
normalizer.setParameters(params)
normalizer.filterExperiment(exp)
```
**Methods:**
- `to_one`: Normalize max peak to 1.0
- `to_TIC`: Normalize to total ion current = 1.0
#### Scaler
Scales intensities by a constant factor.
```python
scaler = oms.Scaler()
params = scaler.getParameters()
params.setValue("scaling", 1000.0) # Scaling factor
scaler.setParameters(params)
scaler.filterExperiment(exp)
```
## Centroiding and Peak Picking
### PeakPickerHiRes - High-Resolution Peak Picking
Converts profile spectra to centroid mode for high-resolution data.
```python
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0) # S/N threshold
params.setValue("spacing_difference", 1.5) # Peak spacing factor
params.setValue("sn_win_len", 20.0) # S/N window length
params.setValue("sn_bin_count", 30) # Bins for S/N estimation
params.setValue("ms1_only", "false") # Process only MS1
params.setValue("ms_levels", [1, 2]) # MS levels to process
picker.setParameters(params)
# Pick peaks
exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)
```
**Key Parameters:**
- `signal_to_noise`: Minimum signal-to-noise ratio
- `spacing_difference`: Minimum spacing between peaks
- `ms_levels`: List of MS levels to process
### PeakPickerWavelet - Wavelet-Based Peak Picking
Uses continuous wavelet transform for peak detection.
```python
wavelet_picker = oms.PeakPickerWavelet()
params = wavelet_picker.getParameters()
params.setValue("signal_to_noise", 1.0)
params.setValue("peak_width", 0.15) # Expected peak width (Da)
wavelet_picker.setParameters(params)
wavelet_picker.pickExperiment(exp, exp_centroided)
```
## Feature Detection
### FeatureFinder Algorithms
Feature finders detect 2D features (m/z and RT) in LC-MS data.
#### FeatureFinderMultiplex
For multiplex labeling experiments (SILAC, dimethyl labeling).
```python
ff = oms.FeatureFinderMultiplex()
params = ff.getParameters()
params.setValue("algorithm:labels", "[]") # Empty for label-free
params.setValue("algorithm:charge", "2:4") # Charge range
params.setValue("algorithm:rt_typical", 40.0) # Expected feature RT width
params.setValue("algorithm:rt_min", 2.0) # Minimum RT width
params.setValue("algorithm:mz_tolerance", 10.0) # m/z tolerance (ppm)
params.setValue("algorithm:intensity_cutoff", 1000.0) # Minimum intensity
ff.setParameters(params)
# Run feature detection
features = oms.FeatureMap()
ff.run(exp, features, oms.Param())
print(f"Found {features.size()} features")
```
**Key Parameters:**
- `algorithm:charge`: Charge state range to consider
- `algorithm:rt_typical`: Expected peak width in RT dimension
- `algorithm:mz_tolerance`: Mass tolerance in ppm
- `algorithm:intensity_cutoff`: Minimum intensity threshold
#### FeatureFinderCentroided
For centroided data, identifies isotope patterns and traces over RT.
```python
ff_centroided = oms.FeatureFinderCentroided()
params = ff_centroided.getParameters()
params.setValue("mass_trace:mz_tolerance", 10.0) # ppm
params.setValue("mass_trace:min_spectra", 5) # Min consecutive spectra
params.setValue("isotopic_pattern:charge_low", 1)
params.setValue("isotopic_pattern:charge_high", 4)
params.setValue("seed:min_score", 0.5)
ff_centroided.setParameters(params)
features = oms.FeatureMap()
seeds = oms.FeatureMap() # Optional seed features
ff_centroided.run(exp, features, params, seeds)
```
#### FeatureFinderIdentification
Uses peptide identifications to guide feature detection.
```python
ff_id = oms.FeatureFinderIdentification()
params = ff_id.getParameters()
params.setValue("extract:mz_window", 10.0) # ppm
params.setValue("extract:rt_window", 60.0) # seconds
params.setValue("detect:peak_width", 30.0) # Expected peak width
ff_id.setParameters(params)
# Requires peptide identifications
protein_ids = []
peptide_ids = []
features = oms.FeatureMap()
ff_id.run(exp, protein_ids, peptide_ids, features)
```
## Charge and Isotope Deconvolution
### Decharging and Charge State Deconvolution
#### FeatureDeconvolution
Resolves charge states and combines features.
```python
deconv = oms.FeatureDeconvolution()
params = deconv.getParameters()
params.setValue("charge_min", 1)
params.setValue("charge_max", 4)
params.setValue("q_value", 0.01) # FDR threshold
deconv.setParameters(params)
features_deconv = oms.FeatureMap()
consensus_map = oms.ConsensusMap()
deconv.compute(features, features_deconv, consensus_map)
```
## Map Alignment
### MapAlignmentAlgorithm
Aligns retention times across multiple LC-MS runs.
#### MapAlignmentAlgorithmPoseClustering
Pose clustering-based RT alignment.
```python
aligner = oms.MapAlignmentAlgorithmPoseClustering()
params = aligner.getParameters()
params.setValue("max_num_peaks_considered", 1000)
params.setValue("pairfinder:distance_MZ:max_difference", 0.3) # Da
params.setValue("pairfinder:distance_RT:max_difference", 60.0) # seconds
aligner.setParameters(params)
# Align multiple feature maps
feature_maps = [features1, features2, features3]
transformations = []
# Choose a reference map (e.g., the first one)
aligner.setReference(feature_maps[0])
# Align the others to the reference and apply the RT transformation
for fm in feature_maps[1:]:
    transformation = oms.TransformationDescription()
    aligner.align(fm, transformation)
    transformations.append(transformation)
    transformer = oms.MapAlignmentTransformer()
    transformer.transformRetentionTimes(fm, transformation, True)  # True = keep original RTs as meta value
```
## Feature Linking
### FeatureGroupingAlgorithm
Links features across samples to create consensus features.
#### FeatureGroupingAlgorithmQT
Quality threshold-based feature linking.
```python
grouper = oms.FeatureGroupingAlgorithmQT()
params = grouper.getParameters()
params.setValue("distance_RT:max_difference", 60.0) # seconds
params.setValue("distance_MZ:max_difference", 10.0) # ppm
params.setValue("distance_MZ:unit", "ppm")
grouper.setParameters(params)
# Create consensus map
consensus_map = oms.ConsensusMap()
# Group features from multiple samples
feature_maps = [features1, features2, features3]
grouper.group(feature_maps, consensus_map)
print(f"Created {consensus_map.size()} consensus features")
```
#### FeatureGroupingAlgorithmKD
KD-tree based linking (faster for large datasets).
```python
grouper_kd = oms.FeatureGroupingAlgorithmKD()
params = grouper_kd.getParameters()
params.setValue("mz_unit", "ppm")
params.setValue("mz_tolerance", 10.0)
params.setValue("rt_tolerance", 30.0)
grouper_kd.setParameters(params)
consensus_map = oms.ConsensusMap()
grouper_kd.group(feature_maps, consensus_map)
```
## Chromatographic Analysis
### ElutionPeakDetection
Detects elution peaks in chromatograms.
```python
epd = oms.ElutionPeakDetection()
params = epd.getParameters()
params.setValue("chrom_peak_snr", 3.0) # Signal-to-noise threshold
params.setValue("chrom_fwhm", 5.0) # Expected FWHM (seconds)
epd.setParameters(params)
# Apply to chromatograms
for chrom in exp.getChromatograms():
peaks = epd.detectPeaks(chrom)
```
### MRMFeatureFinderScoring
Scoring and peak picking for targeted (MRM/SRM) experiments.
```python
mrm_finder = oms.MRMFeatureFinderScoring()
params = mrm_finder.getParameters()
params.setValue("TransitionGroupPicker:min_peak_width", 2.0)
params.setValue("TransitionGroupPicker:recalculate_peaks", "true")
params.setValue("TransitionGroupPicker:PeakPickerMRM:signal_to_noise", 1.0)
mrm_finder.setParameters(params)
# Assumed inputs: chrom_exp (extracted chromatograms), targets (a TargetedExperiment of transitions),
# transformation (RT TransformationDescription), and swath_maps (list of SWATH maps)
features = oms.FeatureMap()
mrm_finder.pickExperiment(chrom_exp, features, targets, transformation, swath_maps)
```
## Quantification
### ProteinInference
Infers proteins from peptide identifications.
```python
protein_inference = oms.BasicProteinInferenceAlgorithm()
# Apply to identification results
protein_inference.run(peptide_ids, protein_ids)
```
### IsobaricQuantification
Quantification for isobaric labeling (TMT, iTRAQ).
```python
# For TMT/iTRAQ quantification
iso_quant = oms.IsobaricQuantification()
params = iso_quant.getParameters()
params.setValue("channel_116_description", "Sample1")
params.setValue("channel_117_description", "Sample2")
# ... configure all channels
iso_quant.setParameters(params)
# Run quantification
quant_method = oms.IsobaricQuantitationMethod.TMT_10PLEX
quant_info = oms.IsobaricQuantifierStatistics()
iso_quant.quantify(exp, quant_info)
```
## Data Processing
### BaselineFiltering
Removes baseline from spectra.
```python
baseline = oms.TopHatFilter()
params = baseline.getParameters()
params.setValue("struc_elem_length", 3.0) # Structuring element size
params.setValue("struc_elem_unit", "Thomson")
baseline.setParameters(params)
baseline.filterExperiment(exp)
```
### SpectraMerger
Merges consecutive similar spectra.
```python
merger = oms.SpectraMerger()
params = merger.getParameters()
params.setValue("mz_binning_width", 0.05) # Binning width (Da)
params.setValue("sort_blocks", "RT_ascending")
merger.setParameters(params)
merger.mergeSpectra(exp)
```
## Quality Control
### Basic QC Metrics
Compute simple quality-control metrics for a loaded experiment.
```python
# Calculate basic QC metrics
import numpy as np

def calculate_qc_metrics(exp):
    metrics = {
        'n_spectra': exp.getNrSpectra(),
        'n_ms1': sum(1 for s in exp if s.getMSLevel() == 1),
        'n_ms2': sum(1 for s in exp if s.getMSLevel() == 2),
        'rt_range': (exp.getMinRT(), exp.getMaxRT()),
        'mz_range': (exp.getMinMZ(), exp.getMaxMZ()),
    }
    # Total ion current (TIC) per MS1 spectrum
    tics = []
    for spectrum in exp:
        if spectrum.getMSLevel() == 1:
            mz, intensity = spectrum.get_peaks()
            tics.append(sum(intensity))
    metrics['median_tic'] = np.median(tics)
    metrics['mean_tic'] = np.mean(tics)
    return metrics
```
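Example usage of the helper above on a loaded experiment:
```python
metrics = calculate_qc_metrics(exp)
for name, value in metrics.items():
    print(f"{name}: {value}")
```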
## FDR Control
### FalseDiscoveryRate
Estimates and controls false discovery rate.
```python
fdr = oms.FalseDiscoveryRate()
params = fdr.getParameters()
params.setValue("add_decoy_peptides", "false")
params.setValue("add_decoy_proteins", "false")
fdr.setParameters(params)
# Apply to identifications
fdr.apply(protein_ids, peptide_ids)
# Filter by FDR threshold (after apply(), the peptide hit scores are q-values)
fdr_threshold = 0.01
filtered_peptides = [p for p in peptide_ids
                     if p.getHits() and p.getHits()[0].getScore() <= fdr_threshold]
```
## Algorithm Selection Guide
### When to Use Which Algorithm
**For Smoothing:**
- Use `GaussFilter` for general-purpose smoothing
- Use `SavitzkyGolayFilter` for preserving peak shapes
**For Peak Picking:**
- Use `PeakPickerHiRes` for high-resolution Orbitrap/FT-ICR data
- Use `PeakPickerWavelet` for lower-resolution TOF data
**For Feature Detection:**
- Use `FeatureFinderCentroided` for label-free proteomics (DDA)
- Use `FeatureFinderMultiplex` for SILAC/dimethyl labeling
- Use `FeatureFinderIdentification` when you have ID information
- Use `MRMFeatureFinderScoring` for targeted (MRM/SRM) experiments
**For Feature Linking:**
- Use `FeatureGroupingAlgorithmQT` for small-medium datasets (<10 samples)
- Use `FeatureGroupingAlgorithmKD` for large datasets (>10 samples)
## Parameter Tuning Tips
1. **S/N Thresholds**: Start with 1-3 for clean data, increase for noisy data
2. **m/z Tolerance**: Use 5-10 ppm for high-resolution instruments, 0.5-1 Da for low-res
3. **RT Tolerance**: Typically 30-60 seconds depending on chromatographic stability
4. **Peak Width**: Measure from real data - varies by instrument and gradient length
5. **Charge States**: Set based on expected analytes (1-2 for metabolites, 2-4 for peptides)
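As a starting point, these tips might translate into settings like the following (illustrative values for a high-resolution DDA run of tryptic peptides; adjust to your own data):
```python
# Assumed: high-resolution DDA data of tryptic peptides
picker = oms.PeakPickerHiRes()
p = picker.getParameters()
p.setValue("signal_to_noise", 2.0)             # tip 1: 1-3 for reasonably clean data
picker.setParameters(p)

ff = oms.FeatureFinderMultiplex()
p = ff.getParameters()
p.setValue("algorithm:mz_tolerance", 5.0)      # tip 2: 5-10 ppm for high-resolution instruments
p.setValue("algorithm:rt_typical", 30.0)       # tip 4: measured from representative peaks
p.setValue("algorithm:charge", "2:4")          # tip 5: typical peptide charge states
ff.setParameters(p)
```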
## Common Algorithm Workflows
### Complete Proteomics Workflow
```python
# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw.mzML", exp)
# 2. Smooth
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)
# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)
# 4. Feature detection
ff = oms.FeatureFinderCentroided()
features = oms.FeatureMap()
ff.run(exp_centroid, features, oms.Param(), oms.FeatureMap())
# 5. Save results
oms.FeatureXMLFile().store("features.featureXML", features)
```
### Multi-Sample Quantification
```python
# Load multiple samples
feature_maps = []
for filename in ["sample1.mzML", "sample2.mzML", "sample3.mzML"]:
exp = oms.MSExperiment()
oms.MzMLFile().load(filename, exp)
# Process and detect features
features = detect_features(exp) # Your processing function
feature_maps.append(features)
# Align retention times
align_feature_maps(feature_maps) # Implement alignment
# Link features
grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()
grouper.group(feature_maps, consensus_map)
# Export quantification matrix
export_quant_matrix(consensus_map)
```
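The `export_quant_matrix` helper above is left to the user; a minimal sketch (hypothetical column layout, one intensity column per linked map) could look like this:
```python
import pandas as pd

def export_quant_matrix(consensus_map, out_path="quant_matrix.csv"):
    """Write one row per consensus feature with per-sample intensities."""
    rows = []
    for cf in consensus_map:
        row = {"mz": cf.getMZ(), "rt": cf.getRT()}
        for handle in cf.getFeatureList():
            row[f"sample_{handle.getMapIndex()}"] = handle.getIntensity()
        rows.append(row)
    pd.DataFrame(rows).to_csv(out_path, index=False)
```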


@@ -1,715 +0,0 @@
# pyOpenMS Chemistry Reference
This document provides comprehensive coverage of chemistry-related functionality in pyOpenMS, including elements, isotopes, molecular formulas, amino acids, peptides, proteins, and modifications.
## Elements and Isotopes
### ElementDB - Element Database
Access atomic and isotopic data for all elements.
```python
import pyopenms as oms
# Get element database instance
element_db = oms.ElementDB()
# Get element by symbol
carbon = element_db.getElement("C")
nitrogen = element_db.getElement("N")
oxygen = element_db.getElement("O")
# Element properties
print(f"Carbon monoisotopic weight: {carbon.getMonoWeight()}")
print(f"Carbon average weight: {carbon.getAverageWeight()}")
print(f"Atomic number: {carbon.getAtomicNumber()}")
print(f"Symbol: {carbon.getSymbol()}")
print(f"Name: {carbon.getName()}")
```
### Isotope Information
```python
# Get isotope distribution for an element
isotopes = carbon.getIsotopeDistribution()
# Access specific isotope
c12 = element_db.getElement("C", 12) # Carbon-12
c13 = element_db.getElement("C", 13) # Carbon-13
print(f"C-12 abundance: {isotopes.getContainer()[0].getIntensity()}")
print(f"C-13 abundance: {isotopes.getContainer()[1].getIntensity()}")
# Isotope mass
print(f"C-12 mass: {c12.getMonoWeight()}")
print(f"C-13 mass: {c13.getMonoWeight()}")
```
### Constants
```python
# Physical constants
avogadro = oms.Constants.AVOGADRO
electron_mass = oms.Constants.ELECTRON_MASS_U
proton_mass = oms.Constants.PROTON_MASS_U
print(f"Avogadro's number: {avogadro}")
print(f"Electron mass: {electron_mass} u")
print(f"Proton mass: {proton_mass} u")
```
## Empirical Formulas
### EmpiricalFormula - Molecular Formulas
Represent and manipulate molecular formulas.
#### Creating Formulas
```python
# From string
glucose = oms.EmpiricalFormula("C6H12O6")
water = oms.EmpiricalFormula("H2O")
ammonia = oms.EmpiricalFormula("NH3")
# From element composition
formula = oms.EmpiricalFormula()
formula.setCharge(1) # Set charge state
```
#### Formula Arithmetic
```python
# Addition
sucrose = oms.EmpiricalFormula("C12H22O11")
hydrolyzed = sucrose + water # Hydrolysis adds water
# Subtraction
dehydrated = glucose - water # Dehydration removes water
# Multiplication
three_waters = water * 3 # 3 H2O = H6O3
# Division
formula_half = sucrose / 2 # Half the formula
```
#### Mass Calculations
```python
# Monoisotopic mass
mono_mass = glucose.getMonoWeight()
print(f"Glucose monoisotopic mass: {mono_mass:.6f} Da")
# Average mass
avg_mass = glucose.getAverageWeight()
print(f"Glucose average mass: {avg_mass:.6f} Da")
# Mass difference
mass_diff = (glucose - water).getMonoWeight()
```
#### Elemental Composition
```python
# Get element counts
formula = oms.EmpiricalFormula("C6H12O6")
# Access individual elements
n_carbon = formula.getNumberOf(element_db.getElement("C"))
n_hydrogen = formula.getNumberOf(element_db.getElement("H"))
n_oxygen = formula.getNumberOf(element_db.getElement("O"))
print(f"C: {n_carbon}, H: {n_hydrogen}, O: {n_oxygen}")
# String representation
print(f"Formula: {formula.toString()}")
```
#### Isotope-Specific Formulas
```python
# Specify specific isotopes using parentheses
labeled_glucose = oms.EmpiricalFormula("(13)C6H12O6") # All carbons are C-13
partially_labeled = oms.EmpiricalFormula("C5(13)CH12O6") # One C-13
# Deuterium labeling
deuterated = oms.EmpiricalFormula("C6D12O6")  # all hydrogens replaced by deuterium
```
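A quick sanity check of the resulting mass shift (a uniform 13C6 label should add roughly 6 × 1.00336 Da):
```python
light = oms.EmpiricalFormula("C6H12O6")
heavy = oms.EmpiricalFormula("(13)C6H12O6")
print(f"13C6 label mass shift: {heavy.getMonoWeight() - light.getMonoWeight():.4f} Da")
```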
#### Charge States
```python
# Set charge
formula = oms.EmpiricalFormula("C6H12O6")
formula.setCharge(1) # Positive charge
# Get charge
charge = formula.getCharge()
# Calculate m/z for charged molecule
mz = formula.getMonoWeight() / abs(charge) if charge != 0 else formula.getMonoWeight()
```
### Isotope Pattern Generation
Generate theoretical isotope patterns for formulas.
#### CoarseIsotopePatternGenerator
For unit mass resolution (low-resolution instruments).
```python
# Create generator
coarse_gen = oms.CoarseIsotopePatternGenerator()
# Generate pattern
formula = oms.EmpiricalFormula("C6H12O6")
pattern = coarse_gen.run(formula)
# Access isotope peaks
iso_dist = pattern.getContainer()
for peak in iso_dist:
mass = peak.getMZ()
abundance = peak.getIntensity()
print(f"m/z: {mass:.4f}, Abundance: {abundance:.4f}")
```
#### FineIsotopePatternGenerator
For high-resolution instruments (hyperfine structure).
```python
# Create generator with resolution
fine_gen = oms.FineIsotopePatternGenerator(0.01)  # stop condition on isotopic probability (not a mass resolution)
# Generate fine pattern
fine_pattern = fine_gen.run(formula)
# Access fine isotope structure
for peak in fine_pattern.getContainer():
print(f"m/z: {peak.getMZ():.6f}, Abundance: {peak.getIntensity():.6f}")
```
#### Isotope Pattern Matching
```python
# Compare an experimental isotope pattern to the theoretical one for a formula
def compare_isotope_patterns(experimental_mz, experimental_int, formula):
    """Cosine similarity between experimental and theoretical isotope intensities.
    Assumes both patterns are ordered by increasing isotope index."""
    # Generate theoretical pattern with as many isotopes as were measured
    coarse_gen = oms.CoarseIsotopePatternGenerator(len(experimental_int))
    theoretical = coarse_gen.run(formula)
    theo_int = [p.getIntensity() for p in theoretical.getContainer()]
    # Normalize both patterns to their most intense peak
    exp_norm = [i / max(experimental_int) for i in experimental_int]
    theo_norm = [i / max(theo_int) for i in theo_int]
    # Cosine similarity over the overlapping isotopes
    n = min(len(exp_norm), len(theo_norm))
    dot = sum(e * t for e, t in zip(exp_norm[:n], theo_norm[:n]))
    norm_e = sum(e * e for e in exp_norm[:n]) ** 0.5
    norm_t = sum(t * t for t in theo_norm[:n]) ** 0.5
    return dot / (norm_e * norm_t) if norm_e and norm_t else 0.0
```
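Example call with made-up inputs (the formula, m/z values, and intensities below are purely illustrative):
```python
formula = oms.EmpiricalFormula("C34H54N10O14")   # hypothetical composition
exp_mz = [767.39, 768.39, 769.40]                # assumed experimental isotope peaks
exp_int = [100.0, 42.0, 11.0]
print(f"Pattern similarity: {compare_isotope_patterns(exp_mz, exp_int, formula):.3f}")
```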
## Amino Acids and Residues
### Residue - Amino Acid Representation
Access properties of amino acids.
```python
# Get residue database
res_db = oms.ResidueDB()
# Get specific residue
leucine = res_db.getResidue("Leucine")
# Or by one-letter code
leu = res_db.getResidue("L")
# Residue properties
print(f"Name: {leucine.getName()}")
print(f"Three-letter code: {leucine.getThreeLetterCode()}")
print(f"One-letter code: {leucine.getOneLetterCode()}")
print(f"Monoisotopic mass: {leucine.getMonoWeight():.6f}")
print(f"Average mass: {leucine.getAverageWeight():.6f}")
# Chemical formula
formula = leucine.getFormula()
print(f"Formula: {formula.toString()}")
# pKa values
print(f"pKa (N-term): {leucine.getPka()}")
print(f"pKa (C-term): {leucine.getPkb()}")
print(f"pKa (side chain): {leucine.getPkc()}")
# Side chain basicity/acidity
print(f"Basicity: {leucine.getBasicity()}")
print(f"Hydrophobicity: {leucine.getHydrophobicity()}")
```
### All Standard Amino Acids
```python
# Iterate over all residues
for residue_name in ["Alanine", "Cysteine", "Aspartic acid", "Glutamic acid",
"Phenylalanine", "Glycine", "Histidine", "Isoleucine",
"Lysine", "Leucine", "Methionine", "Asparagine",
"Proline", "Glutamine", "Arginine", "Serine",
"Threonine", "Valine", "Tryptophan", "Tyrosine"]:
res = res_db.getResidue(residue_name)
print(f"{res.getOneLetterCode()}: {res.getMonoWeight():.4f} Da")
```
### Internal Residues vs. Termini
```python
# Get internal residue mass (no terminal groups)
internal_mass = leucine.getInternalToFull()
# Get residue with N-terminal modification
n_terminal = res_db.getResidue("L[1]") # With NH2
# Get residue with C-terminal modification
c_terminal = res_db.getResidue("L[2]") # With COOH
```
## Peptide Sequences
### AASequence - Amino Acid Sequences
Represent and manipulate peptide sequences.
#### Creating Sequences
```python
# From string
peptide = oms.AASequence.fromString("PEPTIDE")
longer = oms.AASequence.fromString("MKTAYIAKQRQISFVK")
# Empty sequence
empty_seq = oms.AASequence()
```
#### Sequence Properties
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# Length
length = peptide.size()
print(f"Length: {length} residues")
# Mass
mono_mass = peptide.getMonoWeight()
avg_mass = peptide.getAverageWeight()
print(f"Monoisotopic mass: {mono_mass:.6f} Da")
print(f"Average mass: {avg_mass:.6f} Da")
# Formula
formula = peptide.getFormula()
print(f"Formula: {formula.toString()}")
# String representation
seq_str = peptide.toString()
print(f"Sequence: {seq_str}")
```
#### Accessing Individual Residues
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# Access by index
first_aa = peptide[0] # Returns Residue object
print(f"First amino acid: {first_aa.getOneLetterCode()}")
# Iterate
for i in range(peptide.size()):
residue = peptide[i]
print(f"Position {i}: {residue.getOneLetterCode()}")
```
#### Modifications
Add post-translational modifications (PTMs) to sequences.
```python
# Modifications in sequence string
# Format: AA(ModificationName)
oxidized_met = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
phospho = oms.AASequence.fromString("PEPTIDES(Phospho)T(Phospho)")
# Multiple modifications
multi_mod = oms.AASequence.fromString("M(Oxidation)PEPTIDEK(Acetyl)")
# N-terminal modifications
n_term_acetyl = oms.AASequence.fromString("(Acetyl)PEPTIDE")
# C-terminal modifications
c_term_amide = oms.AASequence.fromString("PEPTIDE(Amidated)")
# Check mass change
unmodified = oms.AASequence.fromString("PEPTIDE")
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
mass_diff = modified.getMonoWeight() - unmodified.getMonoWeight()
print(f"Mass shift from oxidation: {mass_diff:.6f} Da")
```
#### Sequence Manipulation
```python
# Prefix (N-terminal fragment)
prefix = peptide.getPrefix(3) # First 3 residues
print(f"Prefix: {prefix.toString()}")
# Suffix (C-terminal fragment)
suffix = peptide.getSuffix(3) # Last 3 residues
print(f"Suffix: {suffix.toString()}")
# Subsequence
subseq = peptide.getSubsequence(2, 4)  # 4 residues starting at index 2
print(f"Subsequence: {subseq.toString()}")
```
#### Theoretical Fragmentation
Generate theoretical fragment ions for MS/MS.
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# b-ions (N-terminal fragments)
# Note: AASequence.getMonoWeight() returns the neutral peptide mass including the terminal H2O,
# so subtract one water to obtain the neutral b-fragment mass
water_mass = oms.EmpiricalFormula("H2O").getMonoWeight()
b_ions = []
for i in range(1, peptide.size()):
    b_fragment = peptide.getPrefix(i)
    b_mass = b_fragment.getMonoWeight() - water_mass
    b_ions.append(('b', i, b_mass))
    print(f"b{i}: {b_mass:.4f}")
# y-ions (C-terminal fragments)
y_ions = []
for i in range(1, peptide.size()):
y_fragment = peptide.getSuffix(i)
y_mass = y_fragment.getMonoWeight()
y_ions.append(('y', i, y_mass))
print(f"y{i}: {y_mass:.4f}")
# a-ions (b - CO)
a_ions = []
CO_mass = 27.994915 # CO loss
for ion_type, position, mass in b_ions:
a_mass = mass - CO_mass
a_ions.append(('a', position, a_mass))
# c-ions (b + NH3)
NH3_mass = 17.026549 # NH3 gain
c_ions = []
for ion_type, position, mass in b_ions:
c_mass = mass + NH3_mass
c_ions.append(('c', position, c_mass))
# z-ions (y - NH3)
z_ions = []
for ion_type, position, mass in y_ions:
z_mass = mass - NH3_mass
z_ions.append(('z', position, z_mass))
```
#### Calculate m/z for Charge States
```python
peptide = oms.AASequence.fromString("PEPTIDE")
proton_mass = 1.007276
# [M+H]+
mz_1 = peptide.getMonoWeight() + proton_mass
print(f"[M+H]+: {mz_1:.4f}")
# [M+2H]2+
mz_2 = (peptide.getMonoWeight() + 2 * proton_mass) / 2
print(f"[M+2H]2+: {mz_2:.4f}")
# [M+3H]3+
mz_3 = (peptide.getMonoWeight() + 3 * proton_mass) / 3
print(f"[M+3H]3+: {mz_3:.4f}")
# General formula for any charge
def calculate_mz(sequence, charge):
proton_mass = 1.007276
return (sequence.getMonoWeight() + charge * proton_mass) / charge
for z in range(1, 5):
print(f"[M+{z}H]{z}+: {calculate_mz(peptide, z):.4f}")
```
## Protein Digestion
### ProteaseDigestion - Enzymatic Cleavage
Simulate enzymatic protein digestion.
#### Basic Digestion
```python
# Create digestion object
dig = oms.ProteaseDigestion()
# Set enzyme
dig.setEnzyme("Trypsin") # Cleaves after K, R
# Other common enzymes:
# - "Trypsin" (K, R)
# - "Lys-C" (K)
# - "Arg-C" (R)
# - "Asp-N" (D)
# - "Glu-C" (E, D)
# - "Chymotrypsin" (F, Y, W, L)
# Set missed cleavages
dig.setMissedCleavages(0) # No missed cleavages
dig.setMissedCleavages(2) # Allow up to 2 missed cleavages
# Perform digestion
protein = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein, peptides)
# Print results
for pep in peptides:
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
```
#### Advanced Digestion Options
```python
# Get enzyme specificity
specificity = dig.getSpecificity()
# oms.EnzymaticDigestion.SPEC_FULL (both termini)
# oms.EnzymaticDigestion.SPEC_SEMI (one terminus)
# oms.EnzymaticDigestion.SPEC_NONE (no specificity)
# Set specificity for semi-tryptic search
dig.setSpecificity(oms.EnzymaticDigestion.SPEC_SEMI)
# Get cleavage sites
cleavage_residues = dig.getEnzyme().getCutAfterResidues()
restriction_residues = dig.getEnzyme().getRestriction()
```
#### Filter Peptides by Properties
```python
# Filter by mass range
min_mass = 600.0
max_mass = 4000.0
filtered = [p for p in peptides if min_mass <= p.getMonoWeight() <= max_mass]
# Filter by length
min_length = 6
max_length = 30
length_filtered = [p for p in peptides if min_length <= p.size() <= max_length]
# Combine filters
valid_peptides = [p for p in peptides
if min_mass <= p.getMonoWeight() <= max_mass
and min_length <= p.size() <= max_length]
```
## Modifications
### ModificationsDB - Modification Database
Access and apply post-translational modifications.
#### Accessing Modifications
```python
# Get modifications database
mod_db = oms.ModificationsDB()
# Get specific modification
oxidation = mod_db.getModification("Oxidation")
phospho = mod_db.getModification("Phospho")
acetyl = mod_db.getModification("Acetyl")
# Modification properties
print(f"Name: {oxidation.getFullName()}")
print(f"Mass difference: {oxidation.getDiffMonoMass():.6f} Da")
print(f"Formula: {oxidation.getDiffFormula().toString()}")
# Affected residues
print(f"Residues: {oxidation.getResidues()}") # e.g., ['M']
# Specificity (N-term, C-term, anywhere)
print(f"Term specificity: {oxidation.getTermSpecificity()}")
```
#### Common Modifications
```python
# Oxidation (M)
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation: +{oxidation.getDiffMonoMass():.4f} Da")
# Phosphorylation (S, T, Y)
phospho = mod_db.getModification("Phospho")
print(f"Phospho: +{phospho.getDiffMonoMass():.4f} Da")
# Carbamidomethylation (C) - common alkylation
carbamido = mod_db.getModification("Carbamidomethyl")
print(f"Carbamidomethyl: +{carbamido.getDiffMonoMass():.4f} Da")
# Acetylation (K, N-term)
acetyl = mod_db.getModification("Acetyl")
print(f"Acetyl: +{acetyl.getDiffMonoMass():.4f} Da")
# Deamidation (N, Q)
deamid = mod_db.getModification("Deamidated")
print(f"Deamidation: +{deamid.getDiffMonoMass():.4f} Da")
```
#### Searching Modifications
```python
# Search modifications by mass
mass_tolerance = 0.01 # Da
target_mass = 15.9949 # Oxidation
# Get all modifications
all_mods = []
mod_db.getAllSearchModifications(all_mods)
# Find matching modifications
matching = []
for mod_name in all_mods:
mod = mod_db.getModification(mod_name)
if abs(mod.getDiffMonoMass() - target_mass) < mass_tolerance:
matching.append(mod)
print(f"Match: {mod.getFullName()} ({mod.getDiffMonoMass():.4f} Da)")
```
#### Variable vs. Fixed Modifications
```python
# In search engines, specify:
# Fixed modifications: applied to all occurrences
fixed_mods = ["Carbamidomethyl (C)"]
# Variable modifications: optionally present
variable_mods = ["Oxidation (M)", "Phospho (S)", "Phospho (T)", "Phospho (Y)"]
```
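A small sketch of applying a fixed modification programmatically, assuming plain sequence strings without pre-existing modifications (helper name is illustrative):
```python
def apply_fixed_carbamidomethyl(sequence_str):
    """Insert a fixed Carbamidomethyl modification after every cysteine."""
    return oms.AASequence.fromString(sequence_str.replace("C", "C(Carbamidomethyl)"))

pep = apply_fixed_carbamidomethyl("PEPTCIDEK")
print(pep.toString(), pep.getMonoWeight())
```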
## Ribonucleotides (RNA)
### Ribonucleotide - RNA Building Blocks
```python
# Get ribonucleotide database
ribo_db = oms.RibonucleotideDB()
# Get specific ribonucleotide
adenine = ribo_db.getRibonucleotide("A")
uracil = ribo_db.getRibonucleotide("U")
guanine = ribo_db.getRibonucleotide("G")
cytosine = ribo_db.getRibonucleotide("C")
# Properties
print(f"Adenine mono mass: {adenine.getMonoWeight()}")
print(f"Formula: {adenine.getFormula().toString()}")
# Modified ribonucleotides
modified_ribo = ribo_db.getRibonucleotide("m6A") # N6-methyladenosine
```
## Practical Examples
### Calculate Peptide Mass with Modifications
```python
def calculate_peptide_mz(sequence_str, charge):
"""Calculate m/z for a peptide sequence string with modifications."""
peptide = oms.AASequence.fromString(sequence_str)
proton_mass = 1.007276
mz = (peptide.getMonoWeight() + charge * proton_mass) / charge
return mz
# Examples
print(calculate_peptide_mz("PEPTIDE", 2)) # Unmodified [M+2H]2+
print(calculate_peptide_mz("PEPTIDEM(Oxidation)", 2)) # With oxidation
print(calculate_peptide_mz("(Acetyl)PEPTIDEK(Acetyl)", 2)) # Acetylated
```
### Generate Complete Fragment Ion Series
```python
def generate_fragment_ions(sequence_str, charge_states=[1, 2]):
"""Generate comprehensive fragment ion list."""
peptide = oms.AASequence.fromString(sequence_str)
    proton_mass = 1.007276
    water_mass = oms.EmpiricalFormula("H2O").getMonoWeight()
    fragments = []
    for i in range(1, peptide.size()):
        # b and y ions (neutral b fragment = prefix peptide mass minus H2O)
        b_frag = peptide.getPrefix(i)
        y_frag = peptide.getSuffix(i)
        for z in charge_states:
            b_mz = (b_frag.getMonoWeight() - water_mass + z * proton_mass) / z
            y_mz = (y_frag.getMonoWeight() + z * proton_mass) / z
fragments.append({
'type': 'b',
'position': i,
'charge': z,
'mz': b_mz
})
fragments.append({
'type': 'y',
'position': i,
'charge': z,
'mz': y_mz
})
return fragments
# Usage
ions = generate_fragment_ions("PEPTIDE", charge_states=[1, 2])
for ion in ions:
print(f"{ion['type']}{ion['position']}^{ion['charge']}+: {ion['mz']:.4f}")
```
### Digest Protein and Calculate Peptide Masses
```python
def digest_and_calculate(protein_seq_str, enzyme="Trypsin", missed_cleavages=2,
min_mass=600, max_mass=4000):
"""Digest protein and return valid peptides with masses."""
dig = oms.ProteaseDigestion()
dig.setEnzyme(enzyme)
dig.setMissedCleavages(missed_cleavages)
protein = oms.AASequence.fromString(protein_seq_str)
peptides = []
dig.digest(protein, peptides)
results = []
for pep in peptides:
mass = pep.getMonoWeight()
if min_mass <= mass <= max_mass:
results.append({
'sequence': pep.toString(),
'mass': mass,
'length': pep.size()
})
return results
# Usage
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
peptides = digest_and_calculate(protein)
for pep in peptides:
print(f"{pep['sequence']}: {pep['mass']:.2f} Da ({pep['length']} aa)")
```


@@ -1,560 +0,0 @@
# pyOpenMS Data Structures Reference
This document provides comprehensive coverage of core data structures in pyOpenMS for representing mass spectrometry data.
## Core Hierarchy
```
MSExperiment # Top-level: Complete LC-MS/MS run
├── MSSpectrum[] # Collection of mass spectra
│ ├── Peak1D[] # Individual m/z, intensity pairs
│ └── SpectrumSettings # Metadata (RT, MS level, precursor)
└── MSChromatogram[] # Collection of chromatograms
├── ChromatogramPeak[] # RT, intensity pairs
└── ChromatogramSettings # Metadata
```
## MSSpectrum
Represents a single mass spectrum (1-dimensional peak data).
### Creation and Basic Properties
```python
import pyopenms as oms
# Create empty spectrum
spectrum = oms.MSSpectrum()
# Set metadata
spectrum.setRT(123.45) # Retention time in seconds
spectrum.setMSLevel(1) # MS level (1 for MS1, 2 for MS2, etc.)
spectrum.setNativeID("scan=1234") # Native ID from file
# Additional metadata
spectrum.setDriftTime(15.2) # Ion mobility drift time
spectrum.setName("MyScan") # Optional name
```
### Peak Data Management
**Setting Peaks (Method 1 - Lists):**
```python
mz_values = [100.5, 200.3, 300.7, 400.2, 500.1]
intensity_values = [1000, 5000, 3000, 2000, 1500]
spectrum.set_peaks((mz_values, intensity_values))
```
**Setting Peaks (Method 2 - NumPy arrays):**
```python
import numpy as np
mz_array = np.array([100.5, 200.3, 300.7, 400.2, 500.1])
intensity_array = np.array([1000, 5000, 3000, 2000, 1500])
spectrum.set_peaks((mz_array, intensity_array))
```
**Retrieving Peaks:**
```python
# Get as numpy arrays (efficient)
mz_array, intensity_array = spectrum.get_peaks()
# Check number of peaks
n_peaks = spectrum.size()
# Get individual peak (slower)
for i in range(spectrum.size()):
peak = spectrum[i]
mz = peak.getMZ()
intensity = peak.getIntensity()
```
### Precursor Information (for MS2/MSn spectra)
```python
# Create precursor
precursor = oms.Precursor()
precursor.setMZ(456.789) # Precursor m/z
precursor.setCharge(2) # Precursor charge
precursor.setIntensity(50000) # Precursor intensity
precursor.setIsolationWindowLowerOffset(1.5) # Lower isolation window
precursor.setIsolationWindowUpperOffset(1.5) # Upper isolation window
# Set activation method
activation = oms.Activation()
activation.setActivationEnergy(35.0) # Collision energy
activation.setMethod(oms.Activation.ActivationMethod.CID)
precursor.setActivation(activation)
# Assign to spectrum
spectrum.setPrecursors([precursor])
# Retrieve precursor information
precursors = spectrum.getPrecursors()
if len(precursors) > 0:
prec = precursors[0]
print(f"Precursor m/z: {prec.getMZ()}")
print(f"Precursor charge: {prec.getCharge()}")
```
### Spectrum Metadata Access
```python
# Check if spectrum is sorted by m/z
is_sorted = spectrum.isSorted()
# Sort spectrum by m/z
spectrum.sortByPosition()
# Sort by intensity
spectrum.sortByIntensity()
# Clear all peaks
spectrum.clear(False) # False = keep metadata, True = clear everything
# Get retention time
rt = spectrum.getRT()
# Get MS level
ms_level = spectrum.getMSLevel()
```
### Spectrum Types and Modes
```python
# Set spectrum type
spectrum.setType(oms.SpectrumSettings.SpectrumType.CENTROID) # or PROFILE
# Get spectrum type
spec_type = spectrum.getType()
if spec_type == oms.SpectrumSettings.SpectrumType.CENTROID:
print("Centroid spectrum")
elif spec_type == oms.SpectrumSettings.SpectrumType.PROFILE:
print("Profile spectrum")
```
### Data Processing Annotations
```python
# Add processing information
processing = oms.DataProcessing()
processing.setMetaValue("smoothing", "gaussian")
spectrum.setDataProcessing([processing])
```
## MSExperiment
Represents a complete LC-MS/MS experiment containing multiple spectra and chromatograms.
### Creation and Population
```python
# Create empty experiment
exp = oms.MSExperiment()
# Add spectra
spectrum1 = oms.MSSpectrum()
spectrum1.setRT(100.0)
spectrum1.set_peaks(([100, 200], [1000, 2000]))
spectrum2 = oms.MSSpectrum()
spectrum2.setRT(200.0)
spectrum2.set_peaks(([100, 200], [1500, 2500]))
exp.addSpectrum(spectrum1)
exp.addSpectrum(spectrum2)
# Add chromatograms
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.0, 11.5], [1000, 5000, 3000]))
exp.addChromatogram(chrom)
```
### Accessing Spectra and Chromatograms
```python
# Get number of spectra and chromatograms
n_spectra = exp.getNrSpectra()
n_chroms = exp.getNrChromatograms()
# Access by index
first_spectrum = exp.getSpectrum(0)
last_spectrum = exp.getSpectrum(exp.getNrSpectra() - 1)
# Iterate over all spectra
for spectrum in exp:
rt = spectrum.getRT()
ms_level = spectrum.getMSLevel()
n_peaks = spectrum.size()
print(f"RT: {rt:.2f}s, MS{ms_level}, Peaks: {n_peaks}")
# Get all spectra as list
spectra = exp.getSpectra()
# Access chromatograms
chrom = exp.getChromatogram(0)
```
### Filtering Operations
```python
# Filter by MS level
exp.filterMSLevel(1) # Keep only MS1 spectra
exp.filterMSLevel(2) # Keep only MS2 spectra
# Filter by retention time range
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
# Filter by m/z range (all spectra)
exp.filterMZ(300.0, 1500.0) # Keep m/z between 300-1500
# Filter by scan number
exp.filterScanNumber(100, 200) # Keep scans 100-200
```
### Metadata and Properties
```python
# Set experiment metadata
exp.setMetaValue("operator", "John Doe")
exp.setMetaValue("instrument", "Q Exactive HF")
# Get metadata
operator = exp.getMetaValue("operator")
# Get RT range
rt_range = exp.getMinRT(), exp.getMaxRT()
# Get m/z range
mz_range = exp.getMinMZ(), exp.getMaxMZ()
# Clear all data
exp.clear(False) # False = keep metadata
```
### Sorting and Organization
```python
# Sort spectra by retention time
exp.sortSpectra()
# Update ranges (call after modifications)
exp.updateRanges()
# Check if experiment is empty
is_empty = exp.empty()
# Reset (clear everything)
exp.reset()
```
## MSChromatogram
Represents an extracted or reconstructed chromatogram (retention time vs. intensity).
### Creation and Basic Usage
```python
# Create chromatogram
chrom = oms.MSChromatogram()
# Set peaks (RT, intensity pairs)
rt_values = [10.0, 10.5, 11.0, 11.5, 12.0]
intensity_values = [1000, 5000, 8000, 6000, 2000]
chrom.set_peaks((rt_values, intensity_values))
# Get peaks
rt_array, int_array = chrom.get_peaks()
# Get size
n_points = chrom.size()
```
### Chromatogram Types
```python
# Set chromatogram type
chrom.setChromatogramType(oms.ChromatogramSettings.ChromatogramType.SELECTED_ION_CURRENT_CHROMATOGRAM)
# Other types:
# - TOTAL_ION_CURRENT_CHROMATOGRAM
# - BASEPEAK_CHROMATOGRAM
# - SELECTED_ION_CURRENT_CHROMATOGRAM
# - SELECTED_REACTION_MONITORING_CHROMATOGRAM
```
### Metadata
```python
# Set native ID
chrom.setNativeID("TIC")
# Set name
chrom.setName("Total Ion Current")
# Access
native_id = chrom.getNativeID()
name = chrom.getName()
```
### Precursor and Product Information (for SRM/MRM)
```python
# For targeted experiments
precursor = oms.Precursor()
precursor.setMZ(456.7)
chrom.setPrecursor(precursor)
product = oms.Product()
product.setMZ(789.4)
chrom.setProduct(product)
```
## Peak1D and ChromatogramPeak
Individual peak data points.
### Peak1D (for mass spectra)
```python
# Create individual peak
peak = oms.Peak1D()
peak.setMZ(456.789)
peak.setIntensity(10000)
# Access
mz = peak.getMZ()
intensity = peak.getIntensity()
# Set position and intensity
peak.setPosition([456.789])
peak.setIntensity(10000)
```
### ChromatogramPeak (for chromatograms)
```python
# Create chromatogram peak
chrom_peak = oms.ChromatogramPeak()
chrom_peak.setRT(125.5)
chrom_peak.setIntensity(5000)
# Access
rt = chrom_peak.getRT()
intensity = chrom_peak.getIntensity()
```
## FeatureMap and Feature
For quantification results.
### Feature
Represents a detected LC-MS feature (peptide or metabolite signal).
```python
# Create feature
feature = oms.Feature()
# Set properties
feature.setMZ(456.789)
feature.setRT(123.45)
feature.setIntensity(1000000)
feature.setCharge(2)
feature.setWidth(15.0) # RT width in seconds
# Set quality score
feature.setOverallQuality(0.95)
# Access
mz = feature.getMZ()
rt = feature.getRT()
intensity = feature.getIntensity()
charge = feature.getCharge()
```
### FeatureMap
Collection of features.
```python
# Create feature map
feature_map = oms.FeatureMap()
# Add features
feature1 = oms.Feature()
feature1.setMZ(456.789)
feature1.setRT(123.45)
feature1.setIntensity(1000000)
feature_map.push_back(feature1)
# Get size
n_features = feature_map.size()
# Iterate
for feature in feature_map:
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}")
# Access by index
first_feature = feature_map[0]
# Clear
feature_map.clear()
```
## PeptideIdentification and ProteinIdentification
For identification results.
### PeptideIdentification
```python
# Create peptide identification
pep_id = oms.PeptideIdentification()
pep_id.setRT(123.45)
pep_id.setMZ(456.789)
# Create peptide hit
hit = oms.PeptideHit()
hit.setSequence(oms.AASequence.fromString("PEPTIDE"))
hit.setCharge(2)
hit.setScore(25.5)
hit.setRank(1)
# Add to identification
pep_id.setHits([hit])
pep_id.setHigherScoreBetter(True)
pep_id.setScoreType("XCorr")
# Access
hits = pep_id.getHits()
for hit in hits:
seq = hit.getSequence().toString()
score = hit.getScore()
print(f"Sequence: {seq}, Score: {score}")
```
### ProteinIdentification
```python
# Create protein identification
prot_id = oms.ProteinIdentification()
# Create protein hit
prot_hit = oms.ProteinHit()
prot_hit.setAccession("P12345")
prot_hit.setSequence("MKTAYIAKQRQISFVK...")
prot_hit.setScore(100.5)
# Add to identification
prot_id.setHits([prot_hit])
prot_id.setScoreType("Mascot Score")
prot_id.setHigherScoreBetter(True)
# Search parameters
search_params = oms.ProteinIdentification.SearchParameters()
search_params.db = "uniprot_human.fasta"
search_params.enzyme = "Trypsin"
prot_id.setSearchParameters(search_params)
```
## ConsensusMap and ConsensusFeature
For linking features across multiple samples.
### ConsensusFeature
```python
# Create consensus feature
cons_feature = oms.ConsensusFeature()
cons_feature.setMZ(456.789)
cons_feature.setRT(123.45)
cons_feature.setIntensity(5000000) # Combined intensity
# Access linked features
for handle in cons_feature.getFeatureList():
map_index = handle.getMapIndex()
feature_index = handle.getIndex()
intensity = handle.getIntensity()
```
### ConsensusMap
```python
# Create consensus map
consensus_map = oms.ConsensusMap()
# Add consensus features
consensus_map.push_back(cons_feature)
# Iterate
for cons_feat in consensus_map:
mz = cons_feat.getMZ()
rt = cons_feat.getRT()
n_features = cons_feat.size() # Number of linked features
```
## Best Practices
1. **Use numpy arrays** for peak data when possible - much faster than individual peak access
2. **Sort spectra** by position (m/z) before searching or filtering
3. **Update ranges** after modifying MSExperiment: `exp.updateRanges()`
4. **Check MS level** before processing - different algorithms for MS1 vs MS2
5. **Validate precursor info** for MS2 spectra - ensure charge and m/z are set
6. **Use appropriate containers** - MSExperiment for raw data, FeatureMap for quantification
7. **Clear metadata carefully** - use `clear(False)` to preserve metadata when clearing peaks
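To illustrate practice 1, compare bulk access with per-peak access (sketch; the numpy form is typically much faster on large spectra):
```python
# Fast: a single call returning numpy arrays
mz, intensity = spectrum.get_peaks()
total = intensity.sum()

# Slow: one Python-level call per peak
total_slow = 0.0
for i in range(spectrum.size()):
    total_slow += spectrum[i].getIntensity()
```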
## Common Patterns
### Create MS2 Spectrum with Precursor
```python
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2)
spectrum.setMSLevel(2)
spectrum.set_peaks(([100, 200, 300], [1000, 5000, 3000]))
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])
```
### Extract MS1 Spectra from Experiment
```python
ms1_exp = oms.MSExperiment()
for spectrum in exp:
if spectrum.getMSLevel() == 1:
ms1_exp.addSpectrum(spectrum)
```
### Calculate Total Ion Current (TIC)
```python
import numpy as np

tic_values = []
rt_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
mz, intensity = spectrum.get_peaks()
tic = np.sum(intensity)
tic_values.append(tic)
rt_values.append(spectrum.getRT())
```
### Find Spectrum Closest to RT
```python
target_rt = 125.5
closest_spectrum = None
min_diff = float('inf')
for spectrum in exp:
diff = abs(spectrum.getRT() - target_rt)
if diff < min_diff:
min_diff = diff
closest_spectrum = spectrum
```


@@ -1,780 +0,0 @@
---
name: scikit-learn
description: "ML toolkit. Classification, regression, clustering, PCA, preprocessing, pipelines, GridSearch, cross-validation, RandomForest, SVM, for general machine learning workflows."
---
# Scikit-learn: Machine Learning in Python
## Overview
Scikit-learn is Python's premier machine learning library, offering simple and efficient tools for predictive data analysis. Apply this skill for classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and hyperparameter optimization.
## When to Use This Skill
This skill should be used when:
- Building classification models (spam detection, image recognition, medical diagnosis)
- Creating regression models (price prediction, forecasting, trend analysis)
- Performing clustering analysis (customer segmentation, pattern discovery)
- Reducing dimensionality (PCA, t-SNE for visualization)
- Preprocessing data (scaling, encoding, imputation)
- Evaluating model performance (cross-validation, metrics)
- Tuning hyperparameters (grid search, random search)
- Creating machine learning pipelines
- Detecting anomalies or outliers
- Implementing ensemble methods
## Core Machine Learning Workflow
### Standard ML Pipeline
Follow this general workflow for supervised learning tasks:
1. **Data Preparation**
- Load and explore data
- Split into train/test sets
- Handle missing values
- Encode categorical features
- Scale/normalize features
2. **Model Selection**
- Start with baseline model
- Try more complex models
- Use domain knowledge to guide selection
3. **Model Training**
- Fit model on training data
- Use pipelines to prevent data leakage
- Apply cross-validation
4. **Model Evaluation**
- Evaluate on test set
- Use appropriate metrics
- Analyze errors
5. **Model Optimization**
- Tune hyperparameters
- Feature engineering
- Ensemble methods
6. **Deployment**
- Save model using joblib
- Create prediction pipeline
- Monitor performance
### Classification Quick Start
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# Create pipeline (prevents data leakage)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Split data (use stratify for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train
pipeline.fit(X_train, y_train)
# Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Cross-validation for robust evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Regression Quick Start
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train
pipeline.fit(X_train, y_train)
# Evaluate
y_pred = pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")
```
## Algorithm Selection Guide
### Classification Algorithms
**Start with baseline**: LogisticRegression
- Fast, interpretable, works well for linearly separable data
- Good for high-dimensional data (text classification)
**General-purpose**: RandomForestClassifier
- Handles non-linear relationships
- Robust to outliers
- Provides feature importance
- Good default choice
**Best performance**: HistGradientBoostingClassifier
- State-of-the-art for tabular data
- Fast on large datasets (>10K samples)
- Often wins Kaggle competitions
**Special cases**:
- **Small datasets (<1K)**: SVC with RBF kernel
- **Very large datasets (>100K)**: SGDClassifier or LinearSVC
- **Interpretability critical**: LogisticRegression or DecisionTreeClassifier
- **Probabilistic predictions**: GaussianNB or calibrated models
- **Text classification**: LogisticRegression with TfidfVectorizer
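To apply the baseline-first advice above, a quick comparison might look like this (sketch; assumes `X` and `y` are already loaded):
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

candidates = {
    "baseline (logistic regression)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient boosting": HistGradientBoostingClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```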
### Regression Algorithms
**Start with baseline**: LinearRegression or Ridge
- Fast, interpretable
- Works well when relationships are linear
**General-purpose**: RandomForestRegressor
- Handles non-linear relationships
- Robust to outliers
- Good default choice
**Best performance**: HistGradientBoostingRegressor
- State-of-the-art for tabular data
- Fast on large datasets
**Special cases**:
- **Regularization needed**: Ridge (L2) or Lasso (L1 + feature selection)
- **Very large datasets**: SGDRegressor
- **Outliers present**: HuberRegressor or RANSAC
### Clustering Algorithms
**Known number of clusters**: KMeans
- Fast and scalable
- Assumes spherical clusters
**Unknown number of clusters**: DBSCAN or HDBSCAN
- Handles arbitrary shapes
- Automatic outlier detection
**Hierarchical relationships**: AgglomerativeClustering
- Creates hierarchy of clusters
- Good for visualization (dendrograms)
**Soft clustering (probabilities)**: GaussianMixture
- Provides cluster probabilities
- Handles elliptical clusters
### Dimensionality Reduction
**Preprocessing/feature extraction**: PCA
- Fast and efficient
- Linear transformation
- ALWAYS standardize first
**Visualization only**: t-SNE or UMAP
- Preserves local structure
- Non-linear
- DO NOT use for preprocessing
**Sparse data (text)**: TruncatedSVD
- Works with sparse matrices
- Latent Semantic Analysis
**Non-negative data**: NMF
- Interpretable components
- Topic modeling
## Working with Different Data Types
### Numeric Features
**Continuous features**:
1. Check distribution
2. Handle outliers (remove, clip, or use RobustScaler)
3. Scale using StandardScaler (most algorithms) or MinMaxScaler (neural networks)
**Count data**:
1. Consider log transformation or sqrt
2. Scale after transformation
**Skewed data**:
1. Use PowerTransformer (Yeo-Johnson or Box-Cox)
2. Or QuantileTransformer for stronger normalization
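For example (sketch; `X_counts`, `X_outliers`, and `X_skewed` are hypothetical numeric arrays standing in for your own columns):
```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler

X_counts_t = np.log1p(X_counts)                                               # count data
X_outliers_t = RobustScaler().fit_transform(X_outliers)                       # outlier-heavy columns
X_skewed_t = PowerTransformer(method="yeo-johnson").fit_transform(X_skewed)   # skewed columns
```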
### Categorical Features
**Low cardinality (<10 categories)**:
```python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=True)
```
**High cardinality (>10 categories)**:
```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
# Uses target statistics, prevents leakage with cross-fitting
```
**Ordinal relationships**:
```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
```
### Text Data
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
('classifier', MultinomialNB())
])
text_pipeline.fit(X_train_text, y_train)
```
### Mixed Data Types
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']
# Separate preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])
# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
```
## Model Evaluation
### Classification Metrics
**Balanced datasets**: Use accuracy or F1-score
**Imbalanced datasets**: Use balanced_accuracy, F1-weighted, or ROC-AUC
```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
balanced_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
# ROC-AUC requires probabilities
y_proba = model.predict_proba(X_test)
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
```
**Cost-sensitive**: Define custom scorer or adjust decision threshold
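One way to encode costs as a custom scorer (sketch; the 5:1 false-negative penalty is an arbitrary example for binary labels):
```python
from sklearn.metrics import confusion_matrix, make_scorer

def negative_cost(y_true, y_pred, fn_cost=5, fp_cost=1):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(fn_cost * fn + fp_cost * fp)   # higher (less negative) is better

cost_scorer = make_scorer(negative_cost)
# scores = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
```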
**Comprehensive report**:
```python
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```
### Regression Metrics
**Standard use**: RMSE and R²
```python
from sklearn.metrics import root_mean_squared_error, r2_score
rmse = root_mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```
**Outliers present**: Use MAE (robust to outliers)
```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```
**Percentage errors matter**: Use MAPE
```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```
### Cross-Validation
**Standard approach** (5-10 folds):
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
**Imbalanced classes** (use stratification):
```python
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
```
**Time series** (respect temporal order):
```python
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```
**Multiple metrics**:
```python
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
scores = results[f'test_{metric}']
print(f"{metric}: {scores.mean():.3f}")
```
## Hyperparameter Tuning
### Grid Search (Exhaustive)
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1, # Use all CPU cores
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
```
### Random Search (Faster)
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(100, 1000),
'max_depth': randint(5, 50),
'min_samples_split': randint(2, 20),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=100, # Number of combinations to try
cv=5,
scoring='f1_weighted',
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
```
### Pipeline Hyperparameter Tuning
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
# Use double underscore for nested parameters
param_grid = {
'svm__C': [0.1, 1, 10, 100],
'svm__kernel': ['rbf', 'linear'],
'svm__gamma': ['scale', 'auto', 0.001, 0.01]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
```
## Feature Engineering and Selection
### Feature Importance
```python
# Tree-based models have built-in feature importance
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importances
}).sort_values('importance', ascending=False)
# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance
result = permutation_importance(
model, X_test, y_test,
n_repeats=10,
random_state=42,
n_jobs=-1
)
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': result.importances_mean,
'std': result.importances_std
}).sort_values('importance', ascending=False)
```
### Feature Selection Methods
**Univariate selection**:
```python
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = selector.get_support(indices=True)
```
**Recursive Feature Elimination**:
```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
selector = RFECV(
RandomForestClassifier(n_estimators=100),
step=1,
cv=5,
n_jobs=-1
)
X_selected = selector.fit_transform(X, y)
print(f"Optimal features: {selector.n_features_}")
```
**Model-based selection**:
```python
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(
RandomForestClassifier(n_estimators=100),
threshold='median' # or '0.5*mean', or specific value
)
X_selected = selector.fit_transform(X, y)
```
### Polynomial Features
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('ridge', Ridge())
])
pipeline.fit(X_train, y_train)
```
## Common Patterns and Best Practices
### Always Use Pipelines
Pipelines prevent data leakage and ensure proper workflow:
**Correct**:
```python
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
**Wrong** (data leakage):
```python
scaler = StandardScaler().fit(X) # Fit on all data!
X_train, X_test = train_test_split(scaler.transform(X))
```
### Stratify for Imbalanced Classes
```python
# Always use stratify for classification with imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
```
### Scale When Necessary
**Scale for**: SVM, Neural Networks, KNN, Linear Models with regularization, PCA, Gradient Descent
**Don't scale for**: Tree-based models (Random Forest, Gradient Boosting), Naive Bayes
### Handle Missing Values
```python
from sklearn.impute import SimpleImputer
# Numeric: use median (robust to outliers)
imputer = SimpleImputer(strategy='median')
# Categorical: use constant value or most_frequent
imputer = SimpleImputer(strategy='constant', fill_value='missing')
```
### Use Appropriate Metrics
- **Balanced classification**: accuracy, F1
- **Imbalanced classification**: balanced_accuracy, F1-weighted, ROC-AUC
- **Regression with outliers**: MAE instead of RMSE
- **Cost-sensitive**: custom scorer
### Set Random States
```python
# For reproducibility
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
```
### Use Parallel Processing
```python
# Use all CPU cores
model = RandomForestClassifier(n_jobs=-1)
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)
```
## Unsupervised Learning
### Clustering Workflow
```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Always scale for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Elbow method to find optimal k
inertias = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X_scaled)
inertias.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X_scaled, labels))
# Plot and choose k
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia')
ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('k')
ax2.set_ylabel('Silhouette Score')
plt.show()
# Fit final model
optimal_k = 5 # Based on elbow/silhouette
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels = kmeans.fit_predict(X_scaled)
```
### Dimensionality Reduction
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# ALWAYS scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Specify variance to retain
pca = PCA(n_components=0.95) # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
# Visualize explained variance
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```
### Visualization with t-SNE
```python
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# Reduce to 50 dimensions with PCA first (faster)
pca = PCA(n_components=min(50, X.shape[1]))
X_pca = pca.fit_transform(X_scaled)
# Apply t-SNE (only for visualization!)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)
# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar()
plt.title('t-SNE Visualization')
plt.show()
```
## Saving and Loading Models
```python
import joblib
# Save model or pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')
# Load
loaded_model = joblib.load('model.pkl')
loaded_pipeline = joblib.load('pipeline.pkl')
# Use loaded model
predictions = loaded_model.predict(X_new)
```
## Reference Documentation
This skill includes comprehensive reference files:
- **`references/supervised_learning.md`**: Detailed coverage of all classification and regression algorithms, parameters, use cases, and selection guidelines
- **`references/preprocessing.md`**: Complete guide to data preprocessing including scaling, encoding, imputation, transformations, and best practices
- **`references/model_evaluation.md`**: In-depth coverage of cross-validation strategies, metrics, hyperparameter tuning, and validation techniques
- **`references/unsupervised_learning.md`**: Comprehensive guide to clustering, dimensionality reduction, anomaly detection, and evaluation methods
- **`references/pipelines_and_composition.md`**: Complete guide to Pipeline, ColumnTransformer, FeatureUnion, custom transformers, and composition patterns
- **`references/quick_reference.md`**: Quick lookup guide with code snippets, common patterns, and decision trees for algorithm selection
Read these files when you need to:
- Get detailed parameter explanations for specific algorithms
- Compare multiple algorithms for a task
- Understand evaluation metrics in depth
- Build complex preprocessing workflows
- Troubleshoot common issues
Example search patterns:
```bash
# To find information about specific algorithms
grep -r "GradientBoosting" references/
# To find preprocessing techniques
grep -r "OneHotEncoder" references/preprocessing.md
# To find evaluation metrics
grep -r "f1_score" references/model_evaluation.md
```
## Common Pitfalls to Avoid
1. **Data leakage**: Always use pipelines, fit only on training data
2. **Not scaling**: Scale for distance-based algorithms (SVM, KNN, Neural Networks)
3. **Wrong metrics**: Use appropriate metrics for imbalanced data
4. **Not using cross-validation**: Single train-test split can be misleading
5. **Forgetting stratification**: Stratify for imbalanced classification
6. **Using t-SNE for preprocessing**: t-SNE is for visualization only!
7. **Not setting random_state**: Results won't be reproducible
8. **Ignoring class imbalance**: Use stratification, appropriate metrics, or resampling
9. **PCA without scaling**: Components will be dominated by high-variance features
10. **Testing on training data**: Always evaluate on held-out test set


@@ -1,601 +0,0 @@
# Model Evaluation and Selection in scikit-learn
## Overview
Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:
1. **Estimator score methods**: Built-in evaluation (accuracy for classifiers, R² for regressors)
2. **Scoring parameter**: Used in cross-validation and hyperparameter tuning
3. **Metric functions**: Specialized evaluation in `sklearn.metrics`
## Cross-Validation
Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."
### Basic Cross-Validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Cross-Validation Strategies
#### For i.i.d. Data
**KFold**: Standard k-fold cross-validation
- Splits data into k equal folds
- Each fold used once as test set
- `n_splits`: Number of folds (typically 5 or 10)
```python
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```
**RepeatedKFold**: Repeats KFold with different randomization
- More robust estimation
- Computationally expensive
**LeaveOneOut (LOO)**: Each sample is a test set
- Maximum training data usage
- Very computationally expensive
- High variance in estimates
- Use only for small datasets (<1000 samples)
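Minimal construction sketches for both strategies; pass the resulting `cv` object to `cross_val_score` as shown above:
```python
from sklearn.model_selection import RepeatedKFold, LeaveOneOut

# Repeated 5-fold CV with 3 different shufflings (15 fits in total)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

# Leave-one-out: one fold per sample (only for small datasets)
cv = LeaveOneOut()
```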
**ShuffleSplit**: Random train/test splits
- Flexible train/test sizes
- Can control number of iterations
- Good for quick evaluation
```python
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```
#### For Imbalanced Classes
**StratifiedKFold**: Preserves class proportions in each fold
- Essential for imbalanced datasets
- Default for classification in cross_val_score()
```python
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
**StratifiedShuffleSplit**: Stratified random splits
#### For Grouped Data
Use when samples are not independent (e.g., multiple measurements from same subject).
**GroupKFold**: Groups don't appear in both train and test
```python
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
```
**StratifiedGroupKFold**: Combines stratification with group separation
**LeaveOneGroupOut**: Each group becomes a test set
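Sketches for these grouped strategies, assuming `model`, `X`, `y`, and a per-sample `groups` array are defined:
```python
from sklearn.model_selection import StratifiedGroupKFold, LeaveOneGroupOut, cross_val_score

# Preserve class proportions while keeping each group in a single fold
cv = StratifiedGroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)

# One fold per group (e.g. leave-one-subject-out)
cv = LeaveOneGroupOut()
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
```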
#### For Time Series
**TimeSeriesSplit**: Expanding window approach
- Successive training sets are supersets
- Respects temporal ordering
- No data leakage from future to past
```python
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
# Train on indices 0 to t, test on t+1 to t+k
pass
```
### Cross-Validation Functions
**cross_val_score**: Returns array of scores
```python
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
```
**cross_validate**: Returns multiple metrics and timing
```python
results = cross_validate(
model, X, y, cv=5,
scoring=['accuracy', 'f1_weighted', 'roc_auc'],
return_train_score=True,
return_estimator=True # Returns fitted estimators
)
print(results['test_accuracy'])
print(results['fit_time'])
```
**cross_val_predict**: Returns predictions for model blending/visualization
```python
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
# Use for confusion matrix, error analysis, etc.
```
## Hyperparameter Tuning
### GridSearchCV
Exhaustively searches all parameter combinations.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1, # Use all CPU cores
verbose=2
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
# Use best model
best_model = grid_search.best_estimator_
```
**When to use**:
- Small parameter spaces
- When computational resources allow
- When exhaustive search is desired
### RandomizedSearchCV
Samples parameter combinations from distributions.
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(100, 1000),
'max_depth': randint(5, 50),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=100, # Number of parameter settings sampled
cv=5,
scoring='f1_weighted',
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
```
**When to use**:
- Large parameter spaces
- When budget is limited
- Often finds good parameters faster than GridSearchCV
**Advantage**: "Budget can be chosen independent of the number of parameters and possible values"
### Successive Halving
**HalvingGridSearchCV** and **HalvingRandomSearchCV**: Tournament-style selection
**How it works**:
1. Start with many candidates, minimal resources
2. Eliminate poor performers
3. Increase resources for remaining candidates
4. Repeat until best candidates found
**When to use**:
- Large parameter spaces
- Expensive model training
- When many parameter combinations are clearly inferior
```python
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
halving_search = HalvingGridSearchCV(
estimator,
param_grid,
factor=3, # Proportion of candidates eliminated each round
cv=5
)
```
## Classification Metrics
### Accuracy-Based Metrics
**Accuracy**: Proportion of correct predictions
```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```
**When to use**: Balanced datasets only
**When NOT to use**: Imbalanced datasets (misleading)
**Balanced Accuracy**: Average recall per class
```python
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_true, y_pred)
```
**When to use**: Imbalanced datasets, ensures all classes matter equally
### Precision, Recall, F-Score
**Precision**: Of predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- Answers: "How reliable are positive predictions?"
**Recall** (Sensitivity): Of actual positives, how many are predicted positive
- Formula: TP / (TP + FN)
- Answers: "How complete is positive detection?"
**F1-Score**: Harmonic mean of precision and recall
- Formula: 2 * (precision * recall) / (precision + recall)
- Balanced measure when both precision and recall are important
```python
from sklearn.metrics import precision_recall_fscore_support, f1_score
precision, recall, f1, support = precision_recall_fscore_support(
y_true, y_pred, average='weighted'
)
# Or individually
f1 = f1_score(y_true, y_pred, average='weighted')
```
**Averaging strategies for multiclass**:
- `binary`: Binary classification only
- `micro`: Calculate globally (total TP, FP, FN)
- `macro`: Calculate per class, unweighted mean (all classes equal)
- `weighted`: Calculate per class, weighted by support (class frequency)
- `samples`: For multilabel classification
**When to use**:
- `macro`: When all classes equally important (even rare ones)
- `weighted`: When class frequency matters
- `micro`: When overall performance across all samples matters
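For example, the same predictions can yield quite different scores depending on the averaging strategy (a small sketch assuming `y_true` and `y_pred` are defined):
```python
from sklearn.metrics import f1_score

f1_macro = f1_score(y_true, y_pred, average='macro')        # every class counts equally
f1_weighted = f1_score(y_true, y_pred, average='weighted')  # weighted by class frequency
f1_micro = f1_score(y_true, y_pred, average='micro')        # global TP/FP/FN counts
print(f"macro={f1_macro:.3f}  weighted={f1_weighted:.3f}  micro={f1_micro:.3f}")
```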
### Confusion Matrix
Shows true positives, false positives, true negatives, false negatives.
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
```
### ROC Curve and AUC
**ROC (Receiver Operating Characteristic)**: Plot of true positive rate vs false positive rate at different thresholds
**AUC (Area Under Curve)**: Measures overall ability to discriminate between classes
- 1.0 = perfect classifier
- 0.5 = random classifier
- <0.5 = worse than random
```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Requires probability predictions
y_proba = model.predict_proba(X_test)[:, 1] # Probabilities for positive class
auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```
**Multiclass ROC**: Use `multi_class='ovr'` (one-vs-rest) or `'ovo'` (one-vs-one)
```python
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
```
### Log Loss
Measures probability calibration quality.
```python
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_proba)
```
**When to use**: When probability quality matters, not just class predictions
**Lower is better**: Perfect predictions have log loss of 0
### Classification Report
Comprehensive summary of precision, recall, f1-score per class.
```python
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```
## Regression Metrics
### Mean Squared Error (MSE)
Average squared difference between predictions and true values.
```python
from sklearn.metrics import mean_squared_error, root_mean_squared_error
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # Root MSE; newer scikit-learn versions removed the squared=False argument
```
**Characteristics**:
- Penalizes large errors heavily (squared term)
- Same units as target² (use RMSE for same units as target)
- Lower is better
### Mean Absolute Error (MAE)
Average absolute difference between predictions and true values.
```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```
**Characteristics**:
- More robust to outliers than MSE
- Same units as target
- More interpretable
- Lower is better
**MSE vs MAE**: Use MAE when outliers shouldn't dominate the metric
### R² Score (Coefficient of Determination)
Proportion of variance explained by the model.
```python
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
```
**Interpretation**:
- 1.0 = perfect predictions
- 0.0 = model as good as mean
- <0.0 = model worse than mean (possible!)
- Higher is better
**Note**: Can be negative for models that perform worse than predicting the mean.
### Mean Absolute Percentage Error (MAPE)
Percentage-based error metric.
```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```
**When to use**: When relative errors matter more than absolute errors
**Warning**: Undefined when true values are zero
### Median Absolute Error
Median of absolute errors (robust to outliers).
```python
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
```
### Max Error
Maximum residual error.
```python
from sklearn.metrics import max_error
max_err = max_error(y_true, y_pred)
```
**When to use**: When worst-case performance matters
## Custom Scoring Functions
Create custom scorers for GridSearchCV and cross_val_score:
```python
from sklearn.metrics import make_scorer, fbeta_score
# F2 score (weights recall higher than precision)
f2_scorer = make_scorer(fbeta_score, beta=2)
# Custom function
def custom_metric(y_true, y_pred):
# Your custom logic
return score
custom_scorer = make_scorer(custom_metric, greater_is_better=True)
# Use in cross-validation or grid search
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
```
## Scoring Parameter Options
Common scoring strings for `scoring` parameter:
**Classification**:
- `'accuracy'`, `'balanced_accuracy'`
- `'precision'`, `'recall'`, `'f1'` (add `_macro`, `_micro`, `_weighted` for multiclass)
- `'roc_auc'`, `'roc_auc_ovr'`, `'roc_auc_ovo'`
- `'neg_log_loss'` (negated log loss so that higher is better)
- `'jaccard'` (Jaccard similarity)
**Regression**:
- `'r2'`
- `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`
- `'neg_mean_absolute_error'`
- `'neg_mean_absolute_percentage_error'`
- `'neg_median_absolute_error'`
**Note**: Many metrics are negated (neg_*) so GridSearchCV can maximize them.
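A short sketch showing a negated scorer and how to convert it back to a positive error (assuming `model`, `X`, `y` form a regression setup):
```python
from sklearn.model_selection import cross_val_score

neg_mse = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse = -neg_mse  # flip the sign to recover the actual MSE per fold
print(f"Mean MSE: {mse.mean():.3f}")
```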
## Validation Strategies
### Train-Test Split
Simple single split.
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # For classification with imbalanced classes
)
```
**When to use**: Large datasets, quick evaluation
**Parameters**:
- `test_size`: Proportion for test (typically 0.2-0.3)
- `stratify`: Preserves class proportions
- `random_state`: Reproducibility
### Train-Validation-Test Split
Three-way split for hyperparameter tuning.
```python
# First split: train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval, test_size=0.2, random_state=42
)
# Or use GridSearchCV with train+val, then evaluate on test
```
**When to use**: Model selection and final evaluation
**Strategy**:
1. Train: Model training
2. Validation: Hyperparameter tuning
3. Test: Final, unbiased evaluation (touch only once!)
### Learning Curves
Diagnose bias vs variance issues.
```python
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt
train_sizes, train_scores, val_scores = learning_curve(
model, X, y,
cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy',
n_jobs=-1
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
```
**Interpretation**:
- Large gap between train and validation: **Overfitting** (high variance)
- Both scores low: **Underfitting** (high bias)
- Scores converging but low: Need better features or more complex model
- Validation score still improving: More data would help
## Best Practices
### Metric Selection Guidelines
**Classification - Balanced classes**:
- Accuracy or F1-score
**Classification - Imbalanced classes**:
- Balanced accuracy
- F1-score (weighted or macro)
- ROC-AUC
- Precision-Recall curve
**Classification - Cost-sensitive**:
- Custom scorer with cost matrix
- Adjust threshold on probabilities
**Regression - Typical use**:
- RMSE (sensitive to outliers)
- R² (proportion of variance explained)
**Regression - Outliers present**:
- MAE (robust to outliers)
- Median absolute error
**Regression - Percentage errors matter**:
- MAPE
### Cross-Validation Guidelines
**Number of folds**:
- 5-10 folds typical
- More folds = more computation, less variance in estimate
- LeaveOneOut only for small datasets
**Stratification**:
- Always use for classification with imbalanced classes
- Use StratifiedKFold by default for classification
**Grouping**:
- Always use when samples are not independent
- Time series: Always use TimeSeriesSplit
**Nested cross-validation**:
- For unbiased performance estimate when doing hyperparameter tuning
- Outer loop: Performance estimation
- Inner loop: Hyperparameter selection
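A minimal nested cross-validation sketch (inner GridSearchCV for tuning, outer loop for the performance estimate), assuming `X` and `y` are a classification dataset:
```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter selection; outer loop: unbiased performance estimate
tuned_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```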
### Avoiding Common Pitfalls
1. **Data leakage**: Fit preprocessors only on training data within each CV fold (use Pipeline!)
2. **Test set leakage**: Never use test set for model selection
3. **Improper metric**: Use metrics appropriate for problem (balanced_accuracy for imbalanced data)
4. **Multiple testing**: More models evaluated = higher chance of random good results
5. **Temporal leakage**: For time series, use TimeSeriesSplit, not random splits
6. **Target leakage**: Features shouldn't contain information not available at prediction time


@@ -1,679 +0,0 @@
# Pipelines and Composite Estimators in scikit-learn
## Overview
Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."
## Pipeline Basics
### Creating Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression())
])
# Method 2: Using make_pipeline (auto-generates names)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(
StandardScaler(),
PCA(n_components=10),
LogisticRegression()
)
```
### Using Pipelines
```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)
# Access steps
pipeline.named_steps['scaler']
pipeline.steps[0] # Returns ('scaler', StandardScaler(...))
pipeline[0] # Returns StandardScaler(...) object
pipeline['scaler'] # Returns StandardScaler(...) object
# Get final estimator
pipeline[-1] # Returns LogisticRegression(...) object
```
### Pipeline Rules
**All steps except the last must be transformers** (have `fit()` and `transform()` methods).
**The final step** can be:
- Predictor (classifier/regressor) with `fit()` and `predict()`
- Transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`
### Pipeline Benefits
1. **Convenience**: Single `fit()` and `predict()` call
2. **Prevents data leakage**: Ensures proper fit/transform on train/test
3. **Joint parameter selection**: Tune all steps together with GridSearchCV
4. **Reproducibility**: Encapsulates entire workflow
## Accessing and Setting Parameters
### Nested Parameters
Access step parameters using `stepname__parameter` syntax:
```python
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression())
])
# Grid search over pipeline parameters
param_grid = {
'scaler__with_mean': [True, False],
'clf__C': [0.1, 1.0, 10.0],
'clf__penalty': ['l1', 'l2']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
### Setting Parameters
```python
# Set parameters
pipeline.set_params(clf__C=10.0, scaler__with_std=False)
# Get parameters
params = pipeline.get_params()
```
## Caching Intermediate Results
Cache fitted transformers to avoid recomputation:
```python
from tempfile import mkdtemp
from shutil import rmtree
# Create cache directory
cachedir = mkdtemp()
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('clf', LogisticRegression())
], memory=cachedir)
# When doing grid search, scaler and PCA only fit once per fold
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Clean up cache
rmtree(cachedir)
# Or use joblib for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```
**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search over final estimator parameters only
- Multiple experiments with same preprocessing
## ColumnTransformer
Apply different transformations to different columns (essential for heterogeneous data).
### Basic Usage
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define which transformations for which columns
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['age', 'income', 'credit_score']),
('cat', OneHotEncoder(), ['country', 'occupation'])
],
remainder='drop' # What to do with remaining columns
)
X_transformed = preprocessor.fit_transform(X)
```
### Column Selection Methods
```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])
# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])
# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])
# Method 4: Slice
('num', StandardScaler(), slice(0, 3))
# Method 5: make_column_selector (by dtype or pattern)
from sklearn.compose import make_column_selector as selector
preprocessor = ColumnTransformer([
('num', StandardScaler(), selector(dtype_include='number')),
('cat', OneHotEncoder(), selector(dtype_include='object'))
])
# Select by pattern
selector(pattern='.*_score$') # All columns ending with '_score'
```
### Remainder Parameter
Controls what happens to columns not specified:
```python
# Drop remaining columns (default)
remainder='drop'
# Pass through remaining columns unchanged
remainder='passthrough'
# Apply transformer to remaining columns
remainder=StandardScaler()
```
### Full Pipeline with ColumnTransformer
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
# Separate preprocessing for numeric and categorical
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Complete pipeline
clf = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Grid search over preprocessing and model parameters
param_grid = {
'preprocessor__num__imputer__strategy': ['mean', 'median'],
'preprocessor__cat__onehot__max_categories': [10, 20, None],
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
## FeatureUnion
Combine multiple transformer outputs by concatenating features side-by-side.
```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# Combine PCA and feature selection
combined_features = FeatureUnion([
('pca', PCA(n_components=10)),
('univ_select', SelectKBest(k=5))
])
X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)
# In a pipeline
pipeline = Pipeline([
('features', combined_features),
('classifier', LogisticRegression())
])
```
### FeatureUnion with Transformers on Different Data
```python
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np
def get_numeric_data(X):
return X[:, :3] # First 3 columns
def get_text_data(X):
return X[:, 3] # 4th column (text)
from sklearn.feature_extraction.text import TfidfVectorizer
combined = FeatureUnion([
('numeric_features', Pipeline([
('selector', FunctionTransformer(get_numeric_data)),
('scaler', StandardScaler())
])),
('text_features', Pipeline([
('selector', FunctionTransformer(get_text_data)),
('tfidf', TfidfVectorizer())
]))
])
```
**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.
## Common Pipeline Patterns
### Classification Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(f_classif, k=10)),
('classifier', SVC(kernel='rbf'))
])
```
### Regression Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipeline = Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('ridge', Ridge(alpha=1.0))
])
```
### Text Classification Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=1000)),
('classifier', MultinomialNB())
])
# Works directly with text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```
### Image Processing Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=100)),
('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```
### Dimensionality Reduction + Clustering
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('kmeans', KMeans(n_clusters=5))
])
labels = pipeline.fit_predict(X)
```
## Custom Transformers
### Using FunctionTransformer
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Log transformation
log_transformer = FunctionTransformer(np.log1p)
# Custom function
def custom_transform(X):
# Your transformation logic
return X_transformed
custom_transformer = FunctionTransformer(custom_transform)
# In pipeline
pipeline = Pipeline([
('log', log_transformer),
('scaler', StandardScaler()),
('model', LinearRegression())
])
```
### Creating Custom Transformer Class
```python
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
def __init__(self, parameter=1.0):
self.parameter = parameter
def fit(self, X, y=None):
# Learn parameters from X
self.learned_param_ = X.mean() # Example
return self
def transform(self, X):
# Transform X using learned parameters
return X * self.parameter - self.learned_param_
# Optional: for pipelines that need inverse transform
def inverse_transform(self, X):
return (X + self.learned_param_) / self.parameter
# Use in pipeline
pipeline = Pipeline([
('custom', CustomTransformer(parameter=2.0)),
('model', LinearRegression())
])
```
**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use trailing underscore for learned attributes (`learned_param_`)
- Constructor parameters should be stored as attributes
### Transformer for Pandas DataFrames
```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class DataFrameTransformer(BaseEstimator, TransformerMixin):
def __init__(self, columns=None):
self.columns = columns
def fit(self, X, y=None):
return self
def transform(self, X):
if isinstance(X, pd.DataFrame):
if self.columns:
return X[self.columns].values
return X.values
return X
```
## Visualization
### Display Pipeline in Jupyter
```python
from sklearn import set_config
# Enable HTML display
set_config(display='diagram')
# Now displaying the pipeline shows interactive diagram
pipeline
```
### Print Pipeline Structure
```python
from sklearn.utils import estimator_html_repr
# Get HTML representation
html = estimator_html_repr(pipeline)
# Or just print
print(pipeline)
```
## Advanced Patterns
### Conditional Transformations
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
def conditional_scale(X, scale=True):
if scale:
return StandardScaler().fit_transform(X)
return X
pipeline = Pipeline([
('conditional_scaler', FunctionTransformer(
conditional_scale,
kw_args={'scale': True}
)),
('model', LogisticRegression())
])
```
### Multiple Preprocessing Paths
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
# Numeric: impute + scale
('num_standard', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
]), ['age', 'income']),
# Numeric: impute + log + scale
('num_skewed', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('log', FunctionTransformer(np.log1p)),
('scaler', StandardScaler())
]), ['price', 'revenue']),
# Categorical: impute + one-hot
('cat', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
]), ['category', 'region']),
# Text: TF-IDF
('text', TfidfVectorizer(), 'description')
])
```
### Feature Engineering Pipeline
```python
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureEngineer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
X = X.copy()
# Add engineered features
X['age_income_ratio'] = X['age'] / (X['income'] + 1)
X['total_score'] = X['score1'] + X['score2'] + X['score3']
return X
pipeline = Pipeline([
('engineer', FeatureEngineer()),
('preprocessor', preprocessor),
('model', RandomForestClassifier())
])
```
## Best Practices
### Always Use Pipelines When
1. **Preprocessing is needed**: Scaling, encoding, imputation
2. **Cross-validation**: Ensures proper fit/transform split
3. **Hyperparameter tuning**: Joint optimization of preprocessing and model
4. **Production deployment**: Single object to serialize
5. **Multiple steps**: Any workflow with >1 step
### Pipeline Do's
- ✅ Fit pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug issues
- ✅ Use remainder='passthrough' when appropriate
### Pipeline Don'ts
- ❌ Fit preprocessing on full dataset before split (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip using pipelines for "just one preprocessing step"
### Data Leakage Prevention
```python
# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X) # Fit on all data
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ✅ CORRECT - No leakage with pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train) # Scaler fits only on train
y_pred = pipeline.predict(X_test) # Scaler transforms only on test
# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on train folds, transforms on test fold
```
### Debugging Pipelines
```python
# Examine intermediate outputs
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('model', LogisticRegression())
])
# Fit pipeline
pipeline.fit(X_train, y_train)
# Get output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)
# Get output after PCA
X_pca = pipeline[:-1].transform(X_train) # All steps except last
# Or build partial pipeline
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```
### Saving and Loading Pipelines
```python
import joblib
# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load pipeline
pipeline = joblib.load('model_pipeline.pkl')
# Use loaded pipeline
y_pred = pipeline.predict(X_new)
```
## Common Errors and Solutions
**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features not encoded
- **Solution**: Add OneHotEncoder or OrdinalEncoder to pipeline
**Error**: `All intermediate steps should be transformers`
- **Cause**: Non-transformer in non-final position
- **Solution**: Ensure only last step is predictor
**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Ensure consistent column handling, use `handle_unknown='ignore'` in OneHotEncoder
**Error**: Different results in cross-validation vs train-test split
- **Cause**: Data leakage (fitting preprocessing on all data)
- **Solution**: Always use Pipeline for preprocessing
**Error**: Pipeline too slow during grid search
- **Solution**: Use caching with `memory` parameter


@@ -1,413 +0,0 @@
# Data Preprocessing in scikit-learn
## Overview
Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.
## Standardization and Scaling
### StandardScaler
Removes mean and scales to unit variance (z-score normalization).
**Formula**: `z = (x - μ) / σ`
**Use cases**:
- Most ML algorithms (especially SVM, neural networks, PCA)
- When features have different units or scales
- When assuming Gaussian-like distribution
**Important**: Fit only on training data, then transform both train and test sets.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use same parameters
```
### MinMaxScaler
Scales features to a specified range, typically [0, 1].
**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`
**Use cases**:
- When bounded range is needed
- Neural networks (often prefer [0, 1] range)
- When distribution is not Gaussian
- Image pixel values
**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)
**Warning**: Sensitive to outliers since it uses min/max.
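A minimal usage sketch following the same fit-on-train, transform-on-test pattern as above:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training min/max
```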
### MaxAbsScaler
Scales to [-1, 1] by dividing by maximum absolute value.
**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When sign of values is meaningful
**Advantage**: Doesn't shift/center the data, preserves zero entries.
### RobustScaler
Uses median and interquartile range (IQR) instead of mean and standard deviation.
**Formula**: `X_scaled = (X - median) / IQR`
**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- Robust statistics preferred
**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)
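Usage sketches for the two scalers above:
```python
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

# Sparse-friendly scaling to [-1, 1]
maxabs = MaxAbsScaler()
X_train_ma = maxabs.fit_transform(X_train)

# Outlier-robust scaling using the median and IQR
robust = RobustScaler(quantile_range=(25.0, 75.0))
X_train_rs = robust.fit_transform(X_train)
X_test_rs = robust.transform(X_test)
```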
## Normalization
### normalize() function and Normalizer
Scales individual samples (rows) to unit norm, not features (columns).
**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight
**Norms**:
- `l1`: Manhattan norm (sum of absolutes = 1)
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
- `max`: Maximum absolute value = 1
**Key difference from scalers**: Operates on rows (samples), not columns (features).
```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2')
X_normalized = normalizer.fit_transform(X)
```
## Encoding Categorical Features
### OrdinalEncoder
Converts categories to integers (0 to n_categories - 1).
**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integers)
**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data
```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
```
### OneHotEncoder
Creates binary columns for each category.
**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When category relationships shouldn't be assumed
**Parameters**:
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit number of categories
**High cardinality handling**:
```python
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
# Groups categories appearing < 100 times into 'infrequent' category
```
**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.
### TargetEncoder
Uses target statistics to encode categories.
**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When linear relationships with target are expected
- Often improves performance over one-hot encoding
**How it works**:
- Replaces category with mean of target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories
**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy
**Warning**: Only for supervised learning. Requires target variable.
```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```
### LabelEncoder
Encodes target labels into integers 0 to n_classes - 1.
**Use cases**: Encoding target variable for classification (not features!)
**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.
### Binarizer
Converts numeric values to binary (0 or 1) based on threshold.
**Use cases**: Creating binary features from continuous values
## Non-linear Transformations
### QuantileTransformer
Maps features to uniform or normal distribution using rank transformation.
**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When normal distribution is desired
**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))
**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.
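A minimal sketch mapping features to an approximately normal distribution:
```python
from sklearn.preprocessing import QuantileTransformer

qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_train_qt = qt.fit_transform(X_train)
X_test_qt = qt.transform(X_test)
```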
### PowerTransformer
Applies parametric monotonic transformation to make data more Gaussian.
**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Only positive values
**Use cases**:
- Skewed distributions
- When Gaussian assumption is important
- Variance stabilization
**Advantage**: Less radical than QuantileTransformer, preserves more of original relationships.
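A minimal sketch using the default Yeo-Johnson method:
```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_train_pt = pt.fit_transform(X_train)
X_test_pt = pt.transform(X_test)
```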
## Discretization
### KBinsDiscretizer
Bins continuous features into discrete intervals.
**Strategies**:
- `uniform`: Equal-width bins
- `quantile`: Equal-frequency bins
- `kmeans`: K-means clustering to determine bins
**Encoding**:
- `ordinal`: Integer encoding (0 to n_bins - 1)
- `onehot`: One-hot encoding
- `onehot-dense`: Dense one-hot encoding
**Use cases**:
- Making linear models handle non-linear relationships
- Reducing noise in features
- Making features more interpretable
```python
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = disc.fit_transform(X)
```
## Feature Generation
### PolynomialFeatures
Generates polynomial and interaction features.
**Parameters**:
- `degree`: Polynomial degree
- `interaction_only`: Only multiplicative interactions (no x²)
- `include_bias`: Include constant feature
**Use cases**:
- Adding non-linearity to linear models
- Feature engineering
- Polynomial regression
**Warning**: The number of generated features grows rapidly: (n + d)! / (d! · n!) for n input features and degree d.
```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```
### SplineTransformer
Generates B-spline basis functions.
**Use cases**:
- Smooth non-linear transformations
- Alternative to PolynomialFeatures (less oscillation at boundaries)
- Generalized additive models (GAMs)
**Parameters**:
- `n_knots`: Number of knots
- `degree`: Spline degree
- `knots`: Knot positions ('uniform', 'quantile', or array)
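A sketch of the common spline-basis-plus-linear-model pattern (a simple GAM-style setup):
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('splines', SplineTransformer(n_knots=5, degree=3)),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
```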
## Missing Value Handling
### SimpleImputer
Imputes missing values with various strategies.
**Strategies**:
- `mean`: Mean of column (numeric only)
- `median`: Median of column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with constant value
**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing (np.nan, None, specific value)
```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
```
### KNNImputer
Imputes using k-nearest neighbors.
**Use cases**: When relationships between features should inform imputation
**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'
### IterativeImputer
Models each feature with missing values as function of other features.
**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher quality imputation (but slower)
**Parameters**:
- `estimator`: Estimator for regression (default: BayesianRidge)
- `max_iter`: Maximum iterations
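Usage sketches for both imputers (note that IterativeImputer still requires the experimental import):
```python
from sklearn.impute import KNNImputer
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
X_knn = knn_imputer.fit_transform(X)

iter_imputer = IterativeImputer(max_iter=10, random_state=42)
X_iter = iter_imputer.fit_transform(X)
```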
## Function Transformers
### FunctionTransformer
Applies custom function to data.
**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np
log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.transform(X)
```
## Best Practices
### Feature Scaling Guidelines
**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/Logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms
**Don't need to scale**:
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
### Pipeline Integration
Always use preprocessing within pipelines to prevent data leakage:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train) # Scaler fit only on train data
y_pred = pipeline.predict(X_test) # Scaler transform only on test data
```
### Common Transformations by Data Type
**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)
**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation
**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder
**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with min_frequency parameter
**Categorical - Ordinal**:
- OrdinalEncoder
**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
### Data Leakage Prevention
1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
```python
# WRONG - data leakage
scaler = StandardScaler().fit(X_full)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# CORRECT - no leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
## Preprocessing Checklist
Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline


@@ -1,625 +0,0 @@
# Scikit-learn Quick Reference
## Essential Imports
```python
# Core
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
# Preprocessing
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler,
OneHotEncoder, OrdinalEncoder, LabelEncoder,
PolynomialFeatures
)
from sklearn.impute import SimpleImputer
# Models - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier,
HistGradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import (
RandomForestRegressor,
GradientBoostingRegressor,
HistGradientBoostingRegressor
)
# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
# Dimensionality Reduction
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.manifold import TSNE
# Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report,
mean_squared_error, r2_score, mean_absolute_error
)
```
## Basic Workflow Template
### Classification
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
```
### Regression
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```
### With Pipeline (Recommended)
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Split and train
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
# Evaluate
score = pipeline.score(X_test, y_test)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Test accuracy: {score:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```
## Common Preprocessing Patterns
### Numeric Data
```python
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
```
### Categorical Data
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
```
### Mixed Data with ColumnTransformer
```python
from sklearn.compose import ColumnTransformer
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
```
## Model Selection Cheat Sheet
### Quick Decision Tree
```
Is it supervised?
├─ Yes
│ ├─ Predicting categories? → Classification
│ │ ├─ Start with: LogisticRegression (baseline)
│ │ ├─ Then try: RandomForestClassifier
│ │ └─ Best performance: HistGradientBoostingClassifier
│ └─ Predicting numbers? → Regression
│ ├─ Start with: LinearRegression/Ridge (baseline)
│ ├─ Then try: RandomForestRegressor
│ └─ Best performance: HistGradientBoostingRegressor
└─ No
├─ Grouping similar items? → Clustering
│ ├─ Know # clusters: KMeans
│ └─ Unknown # clusters: DBSCAN or HDBSCAN
├─ Reducing dimensions?
│ ├─ For preprocessing: PCA
│ └─ For visualization: t-SNE or UMAP
└─ Finding outliers? → IsolationForest or LocalOutlierFactor
```
### Algorithm Selection by Data Size
- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC
### When to Scale Features
**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm
**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
## Hyperparameter Tuning
### GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
```
### RandomizedSearchCV (Faster)
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(100, 1000),
'max_depth': randint(5, 50),
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50, # Number of combinations to try
cv=5,
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
```
### Pipeline with GridSearchCV
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
param_grid = {
'svm__C': [0.1, 1, 10],
'svm__kernel': ['rbf', 'linear'],
'svm__gamma': ['scale', 'auto']
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
```
## Cross-Validation
### Basic Cross-Validation
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Multiple Metrics
```python
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
scores = results[f'test_{metric}']
print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Custom CV Strategies
```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# For time series
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```
## Common Metrics
### Classification
```python
from sklearn.metrics import (
accuracy_score, balanced_accuracy_score,
precision_score, recall_score, f1_score,
confusion_matrix, classification_report,
roc_auc_score
)
# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
# Comprehensive report
print(classification_report(y_true, y_pred))
# ROC AUC (requires probabilities)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
```
### Regression
```python
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score
)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # or root_mean_squared_error() in newer scikit-learn; squared=False is deprecated
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
```
## Feature Engineering
### Polynomial Features
```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```
### Feature Selection
```python
from sklearn.feature_selection import (
SelectKBest, f_classif,
RFE,
SelectFromModel
)
# Univariate selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Recursive feature elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
# Model-based selection
selector = SelectFromModel(
RandomForestClassifier(n_estimators=100),
threshold='median'
)
X_selected = selector.fit_transform(X, y)
```
### Feature Importance
```python
# Tree-based models
model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_
# Visualize
import numpy as np
import matplotlib.pyplot as plt
indices = np.argsort(importances)[::-1]
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), np.array(feature_names)[indices], rotation=90)
plt.show()
# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
```
## Clustering
### K-Means
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Always scale for k-means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit k-means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)
# Evaluate
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")
```
### Elbow Method
```python
inertias = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X_scaled)
inertias.append(kmeans.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
```
### DBSCAN
```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# -1 indicates noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```
## Dimensionality Reduction
### PCA
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Specify n_components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Or specify variance to retain
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Components needed: {pca.n_components_}")
```
### t-SNE (Visualization Only)
```python
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# Reduce to 50 dimensions with PCA first (recommended)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)
# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.show()
```
## Saving and Loading Models
```python
import joblib
# Save model
joblib.dump(model, 'model.pkl')
# Save pipeline
joblib.dump(pipeline, 'pipeline.pkl')
# Load
model = joblib.load('model.pkl')
pipeline = joblib.load('pipeline.pkl')
# Use loaded model
y_pred = model.predict(X_new)
```
## Common Pitfalls and Solutions
### Data Leakage
**Wrong**: Fit on all data before split
```python
scaler = StandardScaler().fit(X)
X_train, X_test = train_test_split(scaler.transform(X))
```
**Correct**: Use pipeline or fit only on train
```python
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
pipeline.fit(X_train, y_train)
```
### Not Scaling
**Wrong**: Using SVM without scaling
```python
svm = SVC()
svm.fit(X_train, y_train)
```
**Correct**: Scale for SVM
```python
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
pipeline.fit(X_train, y_train)
```
### Wrong Metric for Imbalanced Data
**Wrong**: Using accuracy for 99:1 imbalance
```python
accuracy = accuracy_score(y_true, y_pred) # Can be misleading
```
**Correct**: Use appropriate metrics
```python
f1 = f1_score(y_true, y_pred, average='weighted')
balanced_acc = balanced_accuracy_score(y_true, y_pred)
```
### Not Using Stratification
**Wrong**: Random split for imbalanced data
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
**Correct**: Stratify for imbalanced classes
```python
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y
)
```
## Performance Tips
1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
2. **Use HistGradientBoosting** for large datasets (>10K samples)
3. **Use MiniBatchKMeans** for large clustering tasks
4. **Use IncrementalPCA** for data that doesn't fit in memory
5. **Use sparse matrices** for high-dimensional sparse data (text)
6. **Cache transformers** in pipelines during grid search (see the sketch after this list)
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
8. **Reduce dimensionality** with PCA before applying expensive algorithms
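To illustrate tip 6, here is a minimal sketch of transformer caching via the `memory` argument of `Pipeline`, so repeated fits during a grid search reuse cached preprocessing; the data names, cache location, and grid values are illustrative:
```python
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

cache_dir = mkdtemp()  # temporary directory for cached transformer fits
pipeline = Pipeline(
    [('scaler', StandardScaler()), ('pca', PCA(n_components=10)), ('svm', SVC())],
    memory=cache_dir  # fitted transformers are cached and reused across grid candidates
)
param_grid = {'svm__C': [0.1, 1, 10]}  # only the final step varies, so scaler/PCA are not refit
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
rmtree(cache_dir)  # remove the cache when finished
```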


@@ -1,261 +0,0 @@
# Supervised Learning in scikit-learn
## Overview
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
## Linear Models
### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination
### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training
**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
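A minimal illustrative sketch of these baselines, assuming `X_train`/`y_train` from a prior split; note that `LogisticRegression` exposes `C` (the inverse of regularization strength) rather than `alpha`:
```python
from sklearn.linear_model import LogisticRegression, Ridge

# Classification baseline: smaller C = stronger regularization
clf = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000)
clf.fit(X_train, y_train)

# Regression baseline: higher alpha = stronger L2 shrinkage
reg = Ridge(alpha=1.0, fit_intercept=True)
# reg.fit(X_train, y_train_continuous)  # assumes a continuous target
```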
## Support Vector Machines (SVM)
- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection
**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)
**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets when a linear kernel is sufficient.
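A short sketch of an RBF-kernel SVC inside a scaling pipeline (SVMs are scale-sensitive; data names are assumed):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svc = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma='scale')  # lower C = more regularization
)
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))
```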
## Decision Trees
- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
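A hedged example of a pruned tree using these controls (data names assumed):
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # cap depth to control overfitting
    min_samples_leaf=10,   # require enough samples in each leaf
    ccp_alpha=0.001,       # cost-complexity pruning strength
    random_state=42
)
tree.fit(X_train, y_train)
print(f"Depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
```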
## Ensemble Methods
### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant
**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)
### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets
**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving
**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
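A sketch of histogram-based boosting with built-in early stopping; parameter values are illustrative:
```python
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(
    learning_rate=0.1,
    early_stopping=True,      # hold out a validation fraction internally
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42
)
hgb.fit(X_train, y_train)
print(f"Boosting iterations used: {hgb.n_iter_}")
```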
### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting
**Use cases**: Boosting weak learners, less prone to overfitting than other methods
**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier
### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
**Use cases**: Reducing variance of unstable models, parallel ensemble creation
**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether to use replacement
### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
**Use cases**: Combining diverse models, leveraging different model strengths
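A minimal sketch combining heterogeneous base models with soft voting and stacking (data names assumed):
```python
from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('nb', GaussianNB())
]
# Soft voting averages predicted probabilities across models
voting = VotingClassifier(estimators=estimators, voting='soft')
# Stacking trains a meta-learner on out-of-fold base predictions
stacking = StackingClassifier(estimators=estimators,
                              final_estimator=LogisticRegression(max_iter=1000))
voting.fit(X_train, y_train)
stacking.fit(X_train, y_train)
```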
## Neural Networks
- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor
**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving
**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
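A short sketch of an MLP wrapped in a scaling pipeline, as the note above requires (data names assumed):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

mlp = make_pipeline(
    StandardScaler(),  # scaling is critical for MLPs
    MLPClassifier(hidden_layer_sizes=(100, 50),
                  activation='relu',
                  alpha=1e-4,
                  early_stopping=True,
                  random_state=42)
)
mlp.fit(X_train, y_train)
```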
## Nearest Neighbors
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids
**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
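A minimal sketch of distance-weighted k-NN with scaling (data names assumed):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

knn = make_pipeline(
    StandardScaler(),  # neighbor distances are scale-sensitive
    KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
)
knn.fit(X_train, y_train)
```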
## Naive Bayes
- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets
**Use cases**: Text classification, fast baseline, when features are independent, small training sets
**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities
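An illustrative text-classification sketch with MultinomialNB on bag-of-words counts; `train_texts`/`train_labels` are hypothetical lists of documents and labels:
```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Word counts feed naturally into MultinomialNB
text_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
text_clf.fit(train_texts, train_labels)  # assumed lists of documents and labels
print(text_clf.predict(["an example document to classify"]))
```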
## Linear/Quadratic Discriminant Analysis
- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
## Gaussian Processes
- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level
**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
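A small sketch of GP regression with an RBF-plus-noise kernel, returning per-point uncertainty (data names assumed):
```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=42)
gpr.fit(X_train, y_train)
# return_std=True yields a per-point uncertainty estimate alongside the mean
y_mean, y_std = gpr.predict(X_test, return_std=True)
```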
## Stochastic Gradient Descent
- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD
**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
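A hedged sketch of out-of-core learning with `partial_fit`; the `batches` iterable of data chunks is assumed:
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4)
classes = np.unique(y_train)  # every class must be declared on the first call
for X_batch, y_batch in batches:  # `batches`: an assumed iterable of (X, y) chunks
    sgd.partial_fit(X_batch, y_batch, classes=classes)
```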
## Semi-Supervised Learning
- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)
**Use cases**: When labeled data is scarce but unlabeled data is abundant
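A minimal self-training sketch; unlabeled samples are encoded as -1 (`unlabeled_idx` is an assumed index array):
```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

# Mark unlabeled samples with -1 in the target vector
y_partial = np.copy(y_train)
y_partial[unlabeled_idx] = -1

self_training = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=200, random_state=42),
    threshold=0.9  # only pseudo-label confident predictions
)
self_training.fit(X_train, y_partial)
```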
## Feature Selection
- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection
**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
## Probability Calibration
- **CalibratedClassifierCV**: Calibrate classifier probabilities
**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
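A short sketch calibrating a LinearSVC, which has no native `predict_proba` (data names assumed):
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Wrap the SVM to obtain calibrated class probabilities via Platt scaling
calibrated = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
```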
## Multi-Output Methods
- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant
**Use cases**: Predicting multiple related targets simultaneously
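A minimal sketch wrapping one classifier per target column; `Y_train` is assumed to be a 2D array with one column per target:
```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# One logistic regression is fit per target column
multi = MultiOutputClassifier(LogisticRegression(max_iter=1000))
multi.fit(X_train, Y_train)      # Y_train: shape (n_samples, n_targets), assumed
Y_pred = multi.predict(X_test)   # shape (n_samples, n_targets)
```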
## Specialized Regression
- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
## Algorithm Selection Guidelines
**Start with**:
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
2. **RandomForestClassifier/Regressor** for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed
**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes


@@ -1,728 +0,0 @@
# Unsupervised Learning in scikit-learn
## Overview
Unsupervised learning discovers patterns in data without labeled targets. Main tasks include clustering (grouping similar samples), dimensionality reduction (reducing feature count), and anomaly detection (finding outliers).
## Clustering Algorithms
### K-Means
Groups data into k clusters by minimizing within-cluster variance.
**Algorithm**:
1. Initialize k centroids (k-means++ initialization recommended)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence
```python
from sklearn.cluster import KMeans
kmeans = KMeans(
n_clusters=3,
init='k-means++', # Smart initialization
n_init=10, # Number of times to run with different seeds
max_iter=300,
random_state=42
)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```
**Use cases**:
- Customer segmentation
- Image compression
- Data preprocessing (clustering as features)
**Strengths**:
- Fast and scalable
- Simple to understand
- Works well with spherical clusters
**Limitations**:
- Assumes spherical clusters of similar size
- Sensitive to initialization (mitigated by k-means++)
- Must specify k beforehand
- Sensitive to outliers
**Choosing k**: Use elbow method, silhouette score, or domain knowledge
**Variants**:
- **MiniBatchKMeans**: Faster for large datasets, uses mini-batches
- **KMeans with n_init='auto'**: Adaptive number of initializations
### DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Identifies clusters as dense regions separated by sparse areas.
```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(
eps=0.5, # Maximum distance between neighbors
min_samples=5, # Minimum points to form dense region
metric='euclidean'
)
labels = dbscan.fit_predict(X)
# -1 indicates noise/outliers
```
**Use cases**:
- Arbitrary cluster shapes
- Outlier detection
- When cluster count is unknown
- Geographic/spatial data
**Strengths**:
- Discovers arbitrary-shaped clusters
- Automatically detects outliers
- Doesn't require specifying number of clusters
- Robust to outliers
**Limitations**:
- Struggles with varying densities
- Sensitive to eps and min_samples parameters
- Not deterministic (border points may vary)
**Parameter tuning**:
- `eps`: Plot k-distance graph, look for elbow
- `min_samples`: Rule of thumb: 2 * dimensions
### HDBSCAN
Hierarchical DBSCAN that handles variable cluster densities.
```python
from sklearn.cluster import HDBSCAN
hdbscan = HDBSCAN(
min_cluster_size=5,
min_samples=None, # Defaults to min_cluster_size
metric='euclidean'
)
labels = hdbscan.fit_predict(X)
```
**Advantages over DBSCAN**:
- Handles variable density clusters
- More robust parameter selection
- Provides cluster membership probabilities
- Hierarchical structure
**Use cases**: When DBSCAN struggles with varying densities
### Hierarchical Clustering
Builds nested cluster hierarchies using agglomerative (bottom-up) approach.
```python
from sklearn.cluster import AgglomerativeClustering
agg_clust = AgglomerativeClustering(
n_clusters=3,
linkage='ward', # 'ward', 'complete', 'average', 'single'
metric='euclidean'
)
labels = agg_clust.fit_predict(X)
# Visualize with dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage as scipy_linkage
import matplotlib.pyplot as plt
linkage_matrix = scipy_linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
```
**Linkage methods**:
- `ward`: Minimizes variance (only with Euclidean) - **most common**
- `complete`: Maximum distance between clusters
- `average`: Average distance between clusters
- `single`: Minimum distance between clusters
**Use cases**:
- When hierarchical structure is meaningful
- Taxonomy/phylogenetic trees
- When visualization is important (dendrograms)
**Strengths**:
- No need to specify k initially (cut dendrogram at desired level)
- Produces hierarchy of clusters
- Deterministic
**Limitations**:
- Computationally expensive (O(n²) to O(n³))
- Not suitable for large datasets
- Cannot undo previous merges
### Spectral Clustering
Performs dimensionality reduction using affinity matrix before clustering.
```python
from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(
n_clusters=3,
affinity='rbf', # 'rbf', 'nearest_neighbors', 'precomputed'
gamma=1.0,
n_neighbors=10,
random_state=42
)
labels = spectral.fit_predict(X)
```
**Use cases**:
- Non-convex clusters
- Image segmentation
- Graph clustering
- When similarity matrix is available
**Strengths**:
- Handles non-convex clusters
- Works with similarity matrices
- Often better than k-means for complex shapes
**Limitations**:
- Computationally expensive
- Requires specifying number of clusters
- Memory intensive
### Mean Shift
Discovers clusters through iterative centroid updates based on density.
```python
from sklearn.cluster import MeanShift, estimate_bandwidth
# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)
cluster_centers = mean_shift.cluster_centers_
```
**Use cases**:
- When cluster count is unknown
- Computer vision applications
- Object tracking
**Strengths**:
- Automatically determines number of clusters
- Handles arbitrary shapes
- No assumptions about cluster shape
**Limitations**:
- Computationally expensive
- Very sensitive to bandwidth parameter
- Doesn't scale well
### Affinity Propagation
Uses message-passing between samples to identify exemplars.
```python
from sklearn.cluster import AffinityPropagation
affinity_prop = AffinityPropagation(
damping=0.5, # Damping factor (0.5-1.0)
preference=None, # Self-preference (controls number of clusters)
random_state=42
)
labels = affinity_prop.fit_predict(X)
exemplars = affinity_prop.cluster_centers_indices_
```
**Use cases**:
- When number of clusters is unknown
- When exemplars (representative samples) are needed
**Strengths**:
- Automatically determines number of clusters
- Identifies exemplar samples
- No initialization required
**Limitations**:
- Very slow: O(n²t) where t is iterations
- Not suitable for large datasets
- Memory intensive
### Gaussian Mixture Models (GMM)
Probabilistic model assuming data comes from mixture of Gaussian distributions.
```python
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(
n_components=3,
covariance_type='full', # 'full', 'tied', 'diag', 'spherical'
random_state=42
)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X) # Soft clustering
```
**Covariance types**:
- `full`: Each component has its own covariance matrix
- `tied`: All components share same covariance
- `diag`: Diagonal covariance (independent features)
- `spherical`: Spherical covariance (isotropic)
**Use cases**:
- When soft clustering is needed (probabilities)
- When clusters have different shapes/sizes
- Generative modeling
- Density estimation
**Strengths**:
- Provides probabilities (soft clustering)
- Can handle elliptical clusters
- Generative model (can sample new data)
- Model selection with BIC/AIC
**Limitations**:
- Assumes Gaussian distributions
- Sensitive to initialization
- Can converge to local optima
**Model selection**:
```python
from sklearn.mixture import GaussianMixture
import numpy as np
n_components_range = range(2, 10)
bic_scores = []
for n in n_components_range:
gmm = GaussianMixture(n_components=n, random_state=42)
gmm.fit(X)
bic_scores.append(gmm.bic(X))
optimal_n = n_components_range[np.argmin(bic_scores)]
```
### BIRCH
Builds Clustering Feature Tree for memory-efficient processing of large datasets.
```python
from sklearn.cluster import Birch
birch = Birch(
n_clusters=3,
threshold=0.5,
branching_factor=50
)
labels = birch.fit_predict(X)
```
**Use cases**:
- Very large datasets
- Streaming data
- Memory constraints
**Strengths**:
- Memory efficient
- Single pass over data
- Incremental learning
## Dimensionality Reduction
### Principal Component Analysis (PCA)
Finds orthogonal components that explain maximum variance.
```python
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Specify number of components
pca = PCA(n_components=2, random_state=42)
X_transformed = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
# Or specify variance to retain
pca = PCA(n_components=0.95) # Keep 95% of variance
X_transformed = pca.fit_transform(X)
print(f"Components needed: {pca.n_components_}")
# Visualize explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```
**Use cases**:
- Visualization (reduce to 2-3 dimensions)
- Remove multicollinearity
- Noise reduction
- Speed up training
- Feature extraction
**Strengths**:
- Fast and efficient
- Reduces multicollinearity
- Works well for linear relationships
- Interpretable components
**Limitations**:
- Only linear transformations
- Sensitive to scaling (always standardize first!)
- Components may be hard to interpret
**Variants**:
- **IncrementalPCA**: For datasets that don't fit in memory
- **KernelPCA**: Non-linear dimensionality reduction
- **SparsePCA**: Sparse loadings for interpretability
### t-SNE
t-Distributed Stochastic Neighbor Embedding for visualization.
```python
from sklearn.manifold import TSNE
tsne = TSNE(
n_components=2,
perplexity=30, # Balance local vs global structure (5-50)
learning_rate='auto',
n_iter=1000,  # renamed to max_iter in newer scikit-learn releases
random_state=42
)
X_embedded = tsne.fit_transform(X)
# Visualize
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.show()
```
**Use cases**:
- Visualization only (do not use for preprocessing!)
- Exploring high-dimensional data
- Finding clusters visually
**Important notes**:
- **Only for visualization**, not for preprocessing
- Each run produces different results (use random_state for reproducibility)
- Slow for large datasets
- Cannot transform new data (no transform() method)
**Parameter tuning**:
- `perplexity`: 5-50, larger for larger datasets
- Lower perplexity = focus on local structure
- Higher perplexity = focus on global structure
### UMAP
Uniform Manifold Approximation and Projection (requires umap-learn package).
**Advantages over t-SNE**:
- Preserves global structure better
- Faster
- Can transform new data
- Can be used for preprocessing (not just visualization)
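A hedged sketch using the external `umap-learn` package (not part of scikit-learn; install separately, e.g. `uv pip install umap-learn`); `X_new` is an assumed batch of new samples:
```python
import umap  # provided by the umap-learn package

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X)
# Unlike t-SNE, the fitted reducer can project new data
X_new_umap = reducer.transform(X_new)
```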
### Truncated SVD (LSA)
Similar to PCA but works with sparse matrices (e.g., TF-IDF).
```python
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
```
**Use cases**:
- Text data (after TF-IDF)
- Sparse matrices
- Latent Semantic Analysis (LSA)
### Non-negative Matrix Factorization (NMF)
Factorizes data into non-negative components.
```python
from sklearn.decomposition import NMF
nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X) # Document-topic matrix
H = nmf.components_ # Topic-word matrix
```
**Use cases**:
- Topic modeling
- Audio source separation
- Image processing
- When non-negativity is important (e.g., counts)
**Strengths**:
- Interpretable components (additive, non-negative)
- Sparse representations
### Independent Component Analysis (ICA)
Separates multivariate signal into independent components.
```python
from sklearn.decomposition import FastICA
ica = FastICA(n_components=10, random_state=42)
X_independent = ica.fit_transform(X)
```
**Use cases**:
- Blind source separation
- Signal processing
- Feature extraction when independence is expected
### Factor Analysis
Models observed variables as linear combinations of latent factors plus noise.
```python
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=5, random_state=42)
X_factors = fa.fit_transform(X)
```
**Use cases**:
- When noise is heteroscedastic
- Latent variable modeling
- Psychology/social science research
**Difference from PCA**: Models noise explicitly, assumes features have independent noise
## Anomaly Detection
### One-Class SVM
Learns boundary around normal data.
```python
from sklearn.svm import OneClassSVM
oc_svm = OneClassSVM(
nu=0.1, # Proportion of outliers expected
kernel='rbf',
gamma='auto'
)
oc_svm.fit(X_train)
predictions = oc_svm.predict(X_test) # 1 for inliers, -1 for outliers
```
**Use cases**:
- Novelty detection
- When only normal data is available for training
### Isolation Forest
Isolates outliers using random forests.
```python
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
contamination=0.1, # Expected proportion of outliers
random_state=42
)
predictions = iso_forest.fit_predict(X) # 1 for inliers, -1 for outliers
scores = iso_forest.score_samples(X) # Anomaly scores
```
**Use cases**:
- General anomaly detection
- Works well with high-dimensional data
- Fast and scalable
**Strengths**:
- Fast
- Effective in high dimensions
- Low memory requirements
### Local Outlier Factor (LOF)
Detects outliers based on local density deviation.
```python
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.1
)
predictions = lof.fit_predict(X) # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_ # Anomaly scores (negative)
```
**Use cases**:
- Finding local outliers
- When global methods fail
## Clustering Evaluation
### With Ground Truth Labels
When true labels are available (for validation):
**Adjusted Rand Index (ARI)**:
```python
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_true, y_pred)
# Range: [-1, 1], 1 = perfect, 0 = random
```
**Normalized Mutual Information (NMI)**:
```python
from sklearn.metrics import normalized_mutual_info_score
nmi = normalized_mutual_info_score(y_true, y_pred)
# Range: [0, 1], 1 = perfect
```
**V-Measure**:
```python
from sklearn.metrics import v_measure_score
v = v_measure_score(y_true, y_pred)
# Range: [0, 1], harmonic mean of homogeneity and completeness
```
### Without Ground Truth Labels
When true labels are unavailable (unsupervised evaluation):
**Silhouette Score**:
Measures how similar objects are to their own cluster vs other clusters.
```python
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
score = silhouette_score(X, labels)
# Range: [-1, 1], higher is better
# >0.7: Strong structure
# 0.5-0.7: Reasonable structure
# 0.25-0.5: Weak structure
# <0.25: No substantial structure
# Per-sample scores for detailed analysis
sample_scores = silhouette_samples(X, labels)
# Visualize silhouette plot (one horizontal band per cluster)
n_clusters = len(set(labels))
y_lower = 0
for i in range(n_clusters):
    cluster_scores = sample_scores[labels == i]
    cluster_scores.sort()
    plt.barh(range(y_lower, y_lower + len(cluster_scores)), cluster_scores)
    y_lower += len(cluster_scores)
plt.axvline(x=score, color='red', linestyle='--')
plt.show()
```
**Davies-Bouldin Index**:
```python
from sklearn.metrics import davies_bouldin_score
db = davies_bouldin_score(X, labels)
# Lower is better, 0 = perfect
```
**Calinski-Harabasz Index** (Variance Ratio Criterion):
```python
from sklearn.metrics import calinski_harabasz_score
ch = calinski_harabasz_score(X, labels)
# Higher is better
```
**Inertia** (K-Means specific):
```python
inertia = kmeans.inertia_
# Sum of squared distances to nearest cluster center
# Use for elbow method
```
### Elbow Method (K-Means)
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
inertias = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia starts decreasing more slowly
```
## Best Practices
### Clustering Algorithm Selection
**Use K-Means when**:
- Clusters are spherical and similar size
- Speed is important
- Data is not too high-dimensional
**Use DBSCAN when**:
- Arbitrary cluster shapes
- Number of clusters unknown
- Outlier detection needed
**Use Hierarchical when**:
- Hierarchy is meaningful
- Small to medium datasets
- Visualization is important
**Use GMM when**:
- Soft clustering needed
- Clusters have different shapes/sizes
- Probabilistic interpretation needed
**Use Spectral Clustering when**:
- Non-convex clusters
- Have similarity matrix
- Moderate dataset size
### Preprocessing for Clustering
1. **Always scale features**: Use StandardScaler or MinMaxScaler
2. **Handle outliers**: Remove or use robust algorithms (DBSCAN, HDBSCAN)
3. **Reduce dimensionality if needed**: PCA for speed, careful with interpretation
4. **Check for categorical variables**: Encode appropriately or use specialized algorithms
### Dimensionality Reduction Guidelines
**For preprocessing/feature extraction**:
- PCA (linear relationships)
- TruncatedSVD (sparse data)
- NMF (non-negative data)
**For visualization only**:
- t-SNE (preserves local structure)
- UMAP (preserves both local and global structure)
**Always**:
- Standardize features before PCA
- Use appropriate n_components (elbow plot, explained variance)
- Don't use t-SNE for anything except visualization
### Common Pitfalls
1. **Not scaling data**: Most algorithms sensitive to scale
2. **Using t-SNE for preprocessing**: Only for visualization!
3. **Overfitting cluster count**: Too many clusters = overfitting noise
4. **Ignoring outliers**: Can severely affect centroid-based methods
5. **Wrong metric**: Euclidean assumes all features equally important
6. **Not validating results**: Always check with multiple metrics and domain knowledge
7. **PCA without standardization**: Components dominated by high-variance features


@@ -1,219 +0,0 @@
#!/usr/bin/env python3
"""
Complete classification pipeline with preprocessing, training, evaluation, and hyperparameter tuning.
Demonstrates best practices for scikit-learn workflows.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import joblib
def create_preprocessing_pipeline(numeric_features, categorical_features):
"""
Create preprocessing pipeline for mixed data types.
Args:
numeric_features: List of numeric column names
categorical_features: List of categorical column names
Returns:
ColumnTransformer with appropriate preprocessing for each data type
"""
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
return preprocessor
def create_full_pipeline(preprocessor, classifier=None):
"""
Create complete ML pipeline with preprocessing and classification.
Args:
preprocessor: Preprocessing ColumnTransformer
classifier: Classifier instance (default: RandomForestClassifier)
Returns:
Complete Pipeline
"""
if classifier is None:
classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', classifier)
])
return pipeline
def evaluate_model(pipeline, X_train, y_train, X_test, y_test, cv=5):
"""
Evaluate model using cross-validation and test set.
Args:
pipeline: Trained pipeline
X_train, y_train: Training data
X_test, y_test: Test data
cv: Number of cross-validation folds
Returns:
Dictionary with evaluation results
"""
# Cross-validation on training set
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')
# Test set evaluation
y_pred = pipeline.predict(X_test)
test_score = pipeline.score(X_test, y_test)
# Get probabilities if available
try:
y_proba = pipeline.predict_proba(X_test)
if len(np.unique(y_test)) == 2:
# Binary classification
auc = roc_auc_score(y_test, y_proba[:, 1])
else:
# Multiclass
auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
except Exception:
auc = None
results = {
'cv_mean': cv_scores.mean(),
'cv_std': cv_scores.std(),
'test_score': test_score,
'auc': auc,
'classification_report': classification_report(y_test, y_pred),
'confusion_matrix': confusion_matrix(y_test, y_pred)
}
return results
def tune_hyperparameters(pipeline, X_train, y_train, param_grid, cv=5):
"""
Perform hyperparameter tuning using GridSearchCV.
Args:
pipeline: Pipeline to tune
X_train, y_train: Training data
param_grid: Dictionary of parameters to search
cv: Number of cross-validation folds
Returns:
GridSearchCV object with best model
"""
grid_search = GridSearchCV(
pipeline,
param_grid,
cv=cv,
scoring='f1_weighted',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
return grid_search
def main():
"""
Example usage of the classification pipeline.
"""
# Load your data here
# X, y = load_data()
# Example with synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=15,
n_redundant=5,
random_state=42
)
# Convert to DataFrame for demonstration
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
# Split features into numeric and categorical (all numeric in this example)
numeric_features = feature_names
categorical_features = []
# Split data (use stratify for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Create preprocessing pipeline
preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)
# Create full pipeline
pipeline = create_full_pipeline(preprocessor)
# Train model
print("Training model...")
pipeline.fit(X_train, y_train)
# Evaluate model
print("\nEvaluating model...")
results = evaluate_model(pipeline, X_train, y_train, X_test, y_test)
print(f"CV Accuracy: {results['cv_mean']:.3f} (+/- {results['cv_std']:.3f})")
print(f"Test Accuracy: {results['test_score']:.3f}")
if results['auc'] is not None:
print(f"ROC-AUC: {results['auc']:.3f}")
print("\nClassification Report:")
print(results['classification_report'])
# Hyperparameter tuning (optional)
print("\nTuning hyperparameters...")
param_grid = {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [10, 20, None],
'classifier__min_samples_split': [2, 5]
}
grid_search = tune_hyperparameters(pipeline, X_train, y_train, param_grid)
# Evaluate best model
print("\nEvaluating tuned model...")
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Save model
print("\nSaving model...")
joblib.dump(best_pipeline, 'best_model.pkl')
print("Model saved as 'best_model.pkl'")
if __name__ == "__main__":
main()


@@ -1,291 +0,0 @@
#!/usr/bin/env python3
"""
Clustering analysis script with multiple algorithms and evaluation.
Demonstrates k-means, DBSCAN, and hierarchical clustering with visualization.
"""
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
def scale_data(X):
"""
Scale features using StandardScaler.
ALWAYS scale data before clustering!
Args:
X: Feature matrix
Returns:
Scaled feature matrix and fitted scaler
"""
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
return X_scaled, scaler
def find_optimal_k(X_scaled, k_range=range(2, 11)):
"""
Find optimal number of clusters using elbow method and silhouette score.
Args:
X_scaled: Scaled feature matrix
k_range: Range of k values to try
Returns:
Dictionary with inertias and silhouette scores
"""
inertias = []
silhouette_scores = []
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
inertias.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X_scaled, labels))
return {
'k_values': list(k_range),
'inertias': inertias,
'silhouette_scores': silhouette_scores
}
def plot_elbow_silhouette(results):
"""
Plot elbow method and silhouette scores.
Args:
results: Dictionary from find_optimal_k
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Elbow plot
ax1.plot(results['k_values'], results['inertias'], 'bo-')
ax1.set_xlabel('Number of clusters (k)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True, alpha=0.3)
# Silhouette plot
ax2.plot(results['k_values'], results['silhouette_scores'], 'ro-')
ax2.set_xlabel('Number of clusters (k)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Score vs k')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('elbow_silhouette.png', dpi=300, bbox_inches='tight')
print("Saved elbow and silhouette plots to 'elbow_silhouette.png'")
plt.close()
def evaluate_clustering(X_scaled, labels, algorithm_name):
"""
Evaluate clustering using multiple metrics.
Args:
X_scaled: Scaled feature matrix
labels: Cluster labels
algorithm_name: Name of clustering algorithm
Returns:
Dictionary with evaluation metrics
"""
# Filter out noise points for DBSCAN (-1 labels)
mask = labels != -1
X_filtered = X_scaled[mask]
labels_filtered = labels[mask]
n_clusters = len(set(labels_filtered))
n_noise = list(labels).count(-1)
results = {
'algorithm': algorithm_name,
'n_clusters': n_clusters,
'n_noise': n_noise
}
# Calculate metrics if we have valid clusters
if n_clusters > 1:
results['silhouette'] = silhouette_score(X_filtered, labels_filtered)
results['davies_bouldin'] = davies_bouldin_score(X_filtered, labels_filtered)
results['calinski_harabasz'] = calinski_harabasz_score(X_filtered, labels_filtered)
else:
results['silhouette'] = None
results['davies_bouldin'] = None
results['calinski_harabasz'] = None
return results
def perform_kmeans(X_scaled, n_clusters=3):
"""
Perform k-means clustering.
Args:
X_scaled: Scaled feature matrix
n_clusters: Number of clusters
Returns:
Fitted KMeans model and labels
"""
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
return kmeans, labels
def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
"""
Perform DBSCAN clustering.
Args:
X_scaled: Scaled feature matrix
eps: Maximum distance between neighbors
min_samples: Minimum points to form dense region
Returns:
Fitted DBSCAN model and labels
"""
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X_scaled)
return dbscan, labels
def perform_hierarchical(X_scaled, n_clusters=3, linkage='ward'):
"""
Perform hierarchical clustering.
Args:
X_scaled: Scaled feature matrix
n_clusters: Number of clusters
linkage: Linkage criterion ('ward', 'complete', 'average', 'single')
Returns:
Fitted AgglomerativeClustering model and labels
"""
hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
labels = hierarchical.fit_predict(X_scaled)
return hierarchical, labels
def visualize_clusters_2d(X_scaled, labels, algorithm_name, method='pca'):
"""
Visualize clusters in 2D using PCA or t-SNE.
Args:
X_scaled: Scaled feature matrix
labels: Cluster labels
algorithm_name: Name of algorithm for title
method: 'pca' or 'tsne'
"""
# Reduce to 2D
if method == 'pca':
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)
variance = pca.explained_variance_ratio_
xlabel = f'PC1 ({variance[0]:.1%} variance)'
ylabel = f'PC2 ({variance[1]:.1%} variance)'
else:
from sklearn.manifold import TSNE
# Use PCA first to speed up t-SNE
pca = PCA(n_components=min(50, X_scaled.shape[1]), random_state=42)
X_pca = pca.fit_transform(X_scaled)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_2d = tsne.fit_transform(X_pca)
xlabel = 't-SNE 1'
ylabel = 't-SNE 2'
# Plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6, s=50)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(f'{algorithm_name} Clustering ({method.upper()})')
plt.grid(True, alpha=0.3)
filename = f'{algorithm_name.lower().replace(" ", "_")}_{method}.png'
plt.savefig(filename, dpi=300, bbox_inches='tight')
print(f"Saved visualization to '{filename}'")
plt.close()
def main():
"""
Example clustering analysis workflow.
"""
# Load your data here
# X = load_data()
# Example with synthetic data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(
n_samples=500,
n_features=10,
centers=4,
cluster_std=1.0,
random_state=42
)
print(f"Dataset shape: {X.shape}")
# Scale data (ALWAYS scale for clustering!)
print("\nScaling data...")
X_scaled, scaler = scale_data(X)
# Find optimal k
print("\nFinding optimal number of clusters...")
results = find_optimal_k(X_scaled)
plot_elbow_silhouette(results)
# Based on elbow/silhouette, choose optimal k
optimal_k = 4 # Adjust based on plots
# Perform k-means
print(f"\nPerforming k-means with k={optimal_k}...")
kmeans, kmeans_labels = perform_kmeans(X_scaled, n_clusters=optimal_k)
kmeans_results = evaluate_clustering(X_scaled, kmeans_labels, 'K-Means')
# Perform DBSCAN
print("\nPerforming DBSCAN...")
dbscan, dbscan_labels = perform_dbscan(X_scaled, eps=0.5, min_samples=5)
dbscan_results = evaluate_clustering(X_scaled, dbscan_labels, 'DBSCAN')
# Perform hierarchical clustering
print("\nPerforming hierarchical clustering...")
hierarchical, hier_labels = perform_hierarchical(X_scaled, n_clusters=optimal_k)
hier_results = evaluate_clustering(X_scaled, hier_labels, 'Hierarchical')
# Print results
print("\n" + "="*60)
print("CLUSTERING RESULTS")
print("="*60)
for results in [kmeans_results, dbscan_results, hier_results]:
print(f"\n{results['algorithm']}:")
print(f" Clusters: {results['n_clusters']}")
if results['n_noise'] > 0:
print(f" Noise points: {results['n_noise']}")
if results['silhouette'] is not None:
print(f" Silhouette Score: {results['silhouette']:.3f}")
print(f" Davies-Bouldin Index: {results['davies_bouldin']:.3f} (lower is better)")
print(f" Calinski-Harabasz Index: {results['calinski_harabasz']:.1f} (higher is better)")
# Visualize clusters
print("\nCreating visualizations...")
visualize_clusters_2d(X_scaled, kmeans_labels, 'K-Means', method='pca')
visualize_clusters_2d(X_scaled, dbscan_labels, 'DBSCAN', method='pca')
visualize_clusters_2d(X_scaled, hier_labels, 'Hierarchical', method='pca')
print("\nClustering analysis complete!")
if __name__ == "__main__":
main()


@@ -1,349 +0,0 @@
---
name: transformers
description: Work with state-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks using HuggingFace Transformers. This skill should be used when fine-tuning pre-trained models, performing inference with pipelines, generating text, training sequence models, or working with BERT, GPT, T5, ViT, and other transformer architectures. Covers model loading, tokenization, training with Trainer API, text generation strategies, and task-specific patterns for classification, NER, QA, summarization, translation, and image tasks. (plugin:scientific-packages@claude-scientific-skills)
---
# Transformers
## Overview
The Transformers library provides state-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Apply this skill for quick inference through pipelines, comprehensive training via the Trainer API, and flexible text generation with various decoding strategies.
## Core Capabilities
### 1. Quick Inference with Pipelines
For rapid inference without complex setup, use the `pipeline()` API. Pipelines abstract away tokenization, model invocation, and post-processing.
```python
from transformers import pipeline
# Text classification
classifier = pipeline("text-classification")
result = classifier("This product is amazing!")
# Named entity recognition
ner = pipeline("token-classification")
entities = ner("Sarah works at Microsoft in Seattle")
# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")
# Text generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Once upon a time", max_length=50)
# Image classification
image_classifier = pipeline("image-classification")
predictions = image_classifier("image.jpg")
```
**When to use pipelines:**
- Quick prototyping and testing
- Simple inference tasks without custom logic
- Demonstrations and examples
- Production inference for standard tasks
**Available pipeline tasks:**
- **NLP**: text-classification, token-classification, question-answering, summarization, translation, text-generation, fill-mask, zero-shot-classification
- **Vision**: image-classification, object-detection, image-segmentation, depth-estimation, zero-shot-image-classification
- **Audio**: automatic-speech-recognition, audio-classification, text-to-audio
- **Multimodal**: image-to-text, visual-question-answering, image-text-to-text
For comprehensive pipeline documentation, see `references/pipelines.md`.
### 2. Model Training and Fine-Tuning
Use the Trainer API for comprehensive model training with support for distributed training, mixed precision, and advanced optimization.
**Basic training workflow:**
```python
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset
# 1. Load and tokenize data
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
# 3. Configure training
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 4. Create trainer and train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
trainer.train()
```
**Key training features:**
- Mixed precision training (fp16/bf16)
- Distributed training (multi-GPU, multi-node)
- Gradient accumulation
- Learning rate scheduling with warmup
- Checkpoint management
- Hyperparameter search
- Push to Hugging Face Hub
For detailed training documentation, see `references/training.md`.
### 3. Text Generation
Generate text using various decoding strategies including greedy decoding, beam search, sampling, and more.
**Generation strategies:**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Greedy decoding (deterministic)
outputs = model.generate(**inputs, max_new_tokens=50)
# Beam search (explores multiple hypotheses)
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
# Sampling (creative, diverse)
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50
)
```
**Generation parameters:**
- `temperature`: Controls randomness (0.1-2.0)
- `top_k`: Sample from top-k tokens
- `top_p`: Nucleus sampling threshold
- `num_beams`: Number of beams for beam search
- `repetition_penalty`: Discourage repetition
- `no_repeat_ngram_size`: Prevent repeating n-grams
For comprehensive generation documentation, see `references/generation_strategies.md`.
### 4. Task-Specific Patterns
Common task patterns with appropriate model classes:
**Text Classification:**
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
id2label={0: "negative", 1: "neutral", 2: "positive"}
)
```
**Named Entity Recognition (Token Classification):**
```python
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=9 # Number of entity types
)
```
**Question Answering:**
```python
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```
**Summarization and Translation (Seq2Seq):**
```python
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```
**Image Classification:**
```python
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=num_classes
)
```
For detailed task-specific workflows including data preprocessing, training, and evaluation, see `references/task_patterns.md`.
## Auto Classes
Use Auto classes for automatic architecture selection based on model checkpoints:
```python
from transformers import (
AutoTokenizer, # Tokenization
AutoModel, # Base model (hidden states)
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForCausalLM, # GPT-style
AutoModelForMaskedLM, # BERT-style
AutoModelForSeq2SeqLM, # T5, BART
AutoProcessor, # For multimodal models
AutoImageProcessor, # For vision models
)
# Load any model by name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
```
For comprehensive API documentation, see `references/api_reference.md`.
## Model Loading and Optimization
**Device placement:**
```python
model = AutoModel.from_pretrained("bert-base-uncased", device_map="auto")
```
**Mixed precision:**
```python
model = AutoModel.from_pretrained(
"model-name",
torch_dtype=torch.float16 # or torch.bfloat16
)
```
**Quantization:**
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```
## Common Workflows
### Quick Inference Workflow
1. Choose appropriate pipeline for task
2. Load pipeline with optional model specification
3. Pass inputs and get results
4. For batch processing, pass list of inputs
**See:** `scripts/quick_inference.py` for comprehensive pipeline examples
### Training Workflow
1. Load and preprocess dataset using 🤗 Datasets
2. Tokenize data with appropriate tokenizer
3. Load pre-trained model for specific task
4. Configure TrainingArguments
5. Create Trainer with model, data, and compute_metrics
6. Train with `trainer.train()`
7. Evaluate with `trainer.evaluate()`
8. Save model and optionally push to Hub
**See:** `scripts/fine_tune_classifier.py` for complete training example
### Text Generation Workflow
1. Load causal or seq2seq language model
2. Load tokenizer and tokenize prompt
3. Choose generation strategy (greedy, beam search, sampling)
4. Configure generation parameters
5. Generate with `model.generate()`
6. Decode output tokens to text
**See:** `scripts/generate_text.py` for generation strategy examples
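A minimal sketch of this workflow (gpt2 stands in for any causal language model):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Steps 1-2: load model and tokenizer, tokenize the prompt
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The experiment showed that", return_tensors="pt")

# Steps 3-5: pick a strategy and generate (sampling here; do_sample=False gives greedy decoding)
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)

# Step 6: decode tokens back to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```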
## Best Practices
1. **Use Auto classes** for flexibility across different model architectures
2. **Batch processing** for efficiency - process multiple inputs at once
3. **Device management** - use `device_map="auto"` for automatic placement
4. **Memory optimization** - enable fp16/bf16 or quantization for large models
5. **Checkpoint management** - save checkpoints regularly and load best model
6. **Pipeline for quick tasks** - use pipelines for standard inference tasks
7. **Custom metrics** - define compute_metrics for task-specific evaluation
8. **Gradient accumulation** - use for large effective batch sizes on limited memory
9. **Learning rate warmup** - typically 5-10% of total training steps
10. **Hub integration** - push trained models to Hub for sharing and versioning
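A sketch showing several of these practices together (the checkpoint and hyperparameter values are illustrative, not recommendations):
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Practices 3-4: automatic device placement and reduced precision
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                 # stand-in checkpoint; any causal LM works
    device_map="auto",
    torch_dtype=torch.float16,
)

# Practices 5, 8, 9: checkpoint management, gradient accumulation, warmup
args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size of 32 per device
    warmup_ratio=0.06,              # roughly 5-10% of total steps
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
)
```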
## Resources
### scripts/
Executable Python scripts demonstrating common Transformers workflows:
- `quick_inference.py` - Pipeline examples for NLP, vision, audio, and multimodal tasks
- `fine_tune_classifier.py` - Complete fine-tuning workflow with Trainer API
- `generate_text.py` - Text generation with various decoding strategies
Run scripts directly to see examples in action:
```bash
python scripts/quick_inference.py
python scripts/fine_tune_classifier.py
python scripts/generate_text.py
```
### references/
Comprehensive reference documentation loaded into context as needed:
- `api_reference.md` - Core classes and APIs (Auto classes, Trainer, GenerationConfig, etc.)
- `pipelines.md` - All available pipelines organized by modality with examples
- `training.md` - Training patterns, TrainingArguments, distributed training, callbacks
- `generation_strategies.md` - Text generation methods, decoding strategies, parameters
- `task_patterns.md` - Complete workflows for common tasks (classification, NER, QA, summarization, etc.)
When working on specific tasks or features, load the relevant reference file for detailed guidance.
## Additional Information
- **Official Documentation**: https://huggingface.co/docs/transformers/index
- **Model Hub**: https://huggingface.co/models (1M+ pre-trained models)
- **Datasets Hub**: https://huggingface.co/datasets
- **Installation**: `pip install transformers datasets evaluate accelerate`
- **GPU Support**: Requires PyTorch or TensorFlow with CUDA
- **Framework Support**: PyTorch (primary), TensorFlow, JAX/Flax


@@ -1,485 +0,0 @@
# Transformers API Reference
This reference covers the core classes and APIs in the Transformers library.
## Core Auto Classes
Auto classes provide a convenient way to automatically select the appropriate architecture based on model name or checkpoint.
### AutoTokenizer
```python
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize single text
encoded = tokenizer("Hello, how are you?")
# Returns: {'input_ids': [...], 'attention_mask': [...]}
# Tokenize with options
encoded = tokenizer(
"Hello, how are you?",
padding="max_length",
truncation=True,
max_length=512,
return_tensors="pt" # "pt" for PyTorch, "tf" for TensorFlow
)
# Tokenize pairs (for classification, QA, etc.)
encoded = tokenizer(
"Question or sentence A",
"Context or sentence B",
padding=True,
truncation=True
)
# Batch tokenization
texts = ["Text 1", "Text 2", "Text 3"]
encoded = tokenizer(texts, padding=True, truncation=True)
# Decode tokens back to text
text = tokenizer.decode(token_ids, skip_special_tokens=True)
# Batch decode
texts = tokenizer.batch_decode(batch_token_ids, skip_special_tokens=True)
```
**Key Parameters:**
- `padding`: "max_length", "longest", or True (pad to max in batch)
- `truncation`: True or strategy ("longest_first", "only_first", "only_second")
- `max_length`: Maximum sequence length
- `return_tensors`: "pt" (PyTorch), "tf" (TensorFlow), "np" (NumPy)
- `return_attention_mask`: Return attention masks (default True)
- `return_token_type_ids`: Return token type IDs for pairs (default True)
- `add_special_tokens`: Add special tokens like [CLS], [SEP] (default True)
**Special Properties:**
- `tokenizer.vocab_size`: Size of vocabulary
- `tokenizer.pad_token_id`: ID of padding token
- `tokenizer.eos_token_id`: ID of end-of-sequence token
- `tokenizer.bos_token_id`: ID of beginning-of-sequence token
- `tokenizer.unk_token_id`: ID of unknown token
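For example (values shown are those of bert-base-uncased):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)    # 30522
print(tokenizer.pad_token_id)  # 0   ([PAD])
print(tokenizer.unk_token_id)  # 100 ([UNK])
print(tokenizer.eos_token_id)  # None - BERT defines no end-of-sequence token
```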
### AutoModel
Base model class that outputs hidden states.
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
# Forward pass
outputs = model(**inputs)
# Access hidden states
last_hidden_state = outputs.last_hidden_state # [batch_size, seq_length, hidden_size]
pooler_output = outputs.pooler_output # [batch_size, hidden_size]
# Get all hidden states
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
outputs = model(**inputs)
all_hidden_states = outputs.hidden_states # Tuple of tensors
```
### Task-Specific Auto Classes
```python
from transformers import (
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForCausalLM,
AutoModelForMaskedLM,
AutoModelForSeq2SeqLM,
AutoModelForImageClassification,
AutoModelForObjectDetection,
AutoModelForVision2Seq,
)
# Sequence classification (sentiment, topic, etc.)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
id2label={0: "negative", 1: "neutral", 2: "positive"},
label2id={"negative": 0, "neutral": 1, "positive": 2}
)
# Token classification (NER, POS tagging)
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=9 # Number of entity types
)
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5, BART)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
# Image classification
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
# Object detection
model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50")
# Vision-to-text (image captioning, VQA)
model = AutoModelForVision2Seq.from_pretrained("microsoft/git-base")
```
### AutoProcessor
For multimodal models that need both text and image processing.
```python
from transformers import AutoProcessor
# For vision-language models
processor = AutoProcessor.from_pretrained("microsoft/git-base")
# Process image and text
from PIL import Image
image = Image.open("image.jpg")
inputs = processor(images=image, text="caption", return_tensors="pt")
# For audio models
processor = AutoProcessor.from_pretrained("openai/whisper-base")
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
```
### AutoImageProcessor
For vision-only models.
```python
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
# Process single image
from PIL import Image
image = Image.open("image.jpg")
inputs = processor(image, return_tensors="pt")
# Batch processing
images = [Image.open(f"image{i}.jpg") for i in range(10)]
inputs = processor(images, return_tensors="pt")
```
## Model Loading Options
### from_pretrained Parameters
```python
model = AutoModel.from_pretrained(
"model-name",
# Device and precision
device_map="auto", # Automatic device placement
torch_dtype=torch.float16, # Use fp16
low_cpu_mem_usage=True, # Reduce CPU memory during loading
# Quantization
    load_in_8bit=True,   # 8-bit quantization (set either 8-bit or 4-bit, not both)
    load_in_4bit=True,   # 4-bit quantization
# Model configuration
num_labels=3, # For classification
id2label={...}, # Label mapping
label2id={...},
# Outputs
output_hidden_states=True,
output_attentions=True,
# Trust remote code
trust_remote_code=True, # For custom models
# Caching
cache_dir="./cache",
force_download=False,
resume_download=True,
)
```
### Quantization with BitsAndBytes
```python
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
# 4-bit quantization
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```
## Training Components
### TrainingArguments
See `training.md` for comprehensive coverage. Key parameters:
```python
from transformers import TrainingArguments
args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
fp16=True,
logging_steps=100,
save_total_limit=2,
)
```
### Trainer
```python
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=data_collator,
callbacks=[callback1, callback2],
)
# Train
trainer.train()
# Resume from checkpoint
trainer.train(resume_from_checkpoint=True)
# Evaluate
metrics = trainer.evaluate()
# Predict
predictions = trainer.predict(test_dataset)
# Hyperparameter search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
)
# Save model
trainer.save_model("./final_model")
# Push to Hub
trainer.push_to_hub(commit_message="Training complete")
```
### Data Collators
```python
from transformers import (
DataCollatorWithPadding,
DataCollatorForTokenClassification,
DataCollatorForSeq2Seq,
DataCollatorForLanguageModeling,
DefaultDataCollator,
)
# For classification/regression with dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# For token classification (NER)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# For seq2seq tasks
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# For language modeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True, # True for masked LM, False for causal LM
mlm_probability=0.15
)
# Default (no special handling)
data_collator = DefaultDataCollator()
```
## Generation Components
### GenerationConfig
See `generation_strategies.md` for comprehensive coverage.
```python
from transformers import GenerationConfig
config = GenerationConfig(
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
num_beams=5,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
# Use with model
outputs = model.generate(**inputs, generation_config=config)
```
### generate() Method
```python
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
num_return_sequences=3,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
```
## Pipeline API
See `pipelines.md` for comprehensive coverage.
```python
from transformers import pipeline
# Basic usage
pipe = pipeline("task-name", model="model-name", device=0)
results = pipe(inputs)
# With custom model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
## Configuration Classes
### Model Configuration
```python
from transformers import AutoConfig
# Load configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
# Access configuration
print(config.hidden_size)
print(config.num_attention_heads)
print(config.num_hidden_layers)
# Modify configuration
config.num_labels = 5
config.output_hidden_states = True
# Create model with config
model = AutoModel.from_config(config)
# Save configuration
config.save_pretrained("./config")
```
## Utilities
### Hub Utilities
```python
from huggingface_hub import login, snapshot_download
# Login
login(token="hf_...")
# Download model
snapshot_download(repo_id="model-name", cache_dir="./cache")
# Push to Hub
model.push_to_hub("username/model-name", commit_message="Initial commit")
tokenizer.push_to_hub("username/model-name")
```
### Evaluation Metrics
```python
import evaluate
# Load metric
metric = evaluate.load("accuracy")
# Compute metric
results = metric.compute(predictions=predictions, references=labels)
# Common metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
```
## Model Outputs
All models return dataclass objects with named attributes:
```python
# Sequence classification output
outputs = model(**inputs)
logits = outputs.logits # [batch_size, num_labels]
loss = outputs.loss # If labels provided
# Causal LM output
outputs = model(**inputs)
logits = outputs.logits # [batch_size, seq_length, vocab_size]
past_key_values = outputs.past_key_values # KV cache
# Seq2Seq output
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
encoder_last_hidden_state = outputs.encoder_last_hidden_state
# Convert to a plain tuple or dict
outputs_tuple = outputs.to_tuple()
outputs_dict = dict(outputs)
```
## Best Practices
1. **Use Auto classes**: AutoModel, AutoTokenizer for flexibility
2. **Device management**: Use `device_map="auto"` for multi-GPU
3. **Memory optimization**: Use `torch_dtype=torch.float16` and quantization
4. **Caching**: Set `cache_dir` to avoid re-downloading
5. **Batch processing**: Process multiple inputs at once for efficiency
6. **Trust remote code**: Only set `trust_remote_code=True` for trusted sources


@@ -1,373 +0,0 @@
# Text Generation Strategies
Transformers provides flexible text generation capabilities through the `generate()` method, supporting multiple decoding strategies and configuration options.
## Basic Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0])
```
## Decoding Strategies
### 1. Greedy Decoding
Selects the token with highest probability at each step. Deterministic but can be repetitive.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False,
num_beams=1 # Greedy is default when num_beams=1 and do_sample=False
)
```
### 2. Beam Search
Explores multiple hypotheses simultaneously, keeping top-k candidates at each step.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5, # Number of beams
early_stopping=True, # Stop when all beams reach EOS
no_repeat_ngram_size=2, # Prevent repeating n-grams
)
```
**Key parameters:**
- `num_beams`: Number of beams (higher = more thorough but slower)
- `early_stopping`: Stop when all beams finish (True/False)
- `length_penalty`: Exponential penalty for length (>1.0 favors longer sequences)
- `no_repeat_ngram_size`: Prevent repeating n-grams
### 3. Sampling (Multinomial)
Samples from probability distribution, introducing randomness and diversity.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7, # Controls randomness (lower = more focused)
top_k=50, # Consider only top-k tokens
top_p=0.9, # Nucleus sampling (cumulative probability threshold)
)
```
**Key parameters:**
- `temperature`: Scales logits before softmax (0.1-2.0 typical range)
- Lower (0.1-0.7): More focused, deterministic
- Higher (0.8-1.5): More creative, random
- `top_k`: Sample from top-k tokens only
- `top_p`: Nucleus sampling - sample from smallest set with cumulative probability > p
### 4. Beam Search with Sampling
Combines beam search with sampling for diverse but coherent outputs.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=True,
temperature=0.8,
top_k=50,
)
```
### 5. Contrastive Search
Balances coherence and diversity using contrastive objective.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6, # Contrastive penalty
top_k=4, # Consider top-k candidates
)
```
### 6. Assisted Decoding
Uses a smaller "assistant" model to speed up generation with a larger model.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
assistant_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
**inputs,
assistant_model=assistant_model,
max_new_tokens=50,
)
```
## GenerationConfig
Configure generation parameters with `GenerationConfig` for reusability.
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
# Use with model
outputs = model.generate(**inputs, generation_config=generation_config)
# Save and load
generation_config.save_pretrained("./config")
loaded_config = GenerationConfig.from_pretrained("./config")
```
## Key Parameters Reference
### Output Length Control
- `max_length`: Maximum total tokens (input + output)
- `max_new_tokens`: Maximum new tokens to generate (recommended over max_length)
- `min_length`: Minimum total tokens
- `min_new_tokens`: Minimum new tokens to generate
### Sampling Parameters
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
- `top_k`: Top-k sampling (1-100, typically 50)
- `top_p`: Nucleus sampling (0.0-1.0, typically 0.9)
- `do_sample`: Enable sampling (True/False)
### Beam Search Parameters
- `num_beams`: Number of beams (1-20, typically 5)
- `early_stopping`: Stop when beams finish (True/False)
- `length_penalty`: Length penalty (>1.0 favors longer, <1.0 favors shorter)
- `num_beam_groups`: Diverse beam search groups
- `diversity_penalty`: Penalty for similar beams
### Repetition Control
- `repetition_penalty`: Penalty for repeating tokens (1.0-2.0, default 1.0)
- `no_repeat_ngram_size`: Prevent repeating n-grams (2-5 typical)
- `encoder_repetition_penalty`: Penalty for repeating encoder tokens
### Special Tokens
- `bos_token_id`: Beginning of sequence token
- `eos_token_id`: End of sequence token (or list of tokens)
- `pad_token_id`: Padding token
- `forced_bos_token_id`: Force specific token at beginning
- `forced_eos_token_id`: Force specific token at end
### Multiple Sequences
- `num_return_sequences`: Number of sequences to return
- `num_beam_groups`: Number of diverse beam groups
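A sketch combining parameters from several of the groups above (gpt2 and the specific values are illustrative):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("In recent years,", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=80,                    # length control
    do_sample=True,                       # sampling
    temperature=0.8,
    top_p=0.92,
    repetition_penalty=1.15,              # repetition control
    no_repeat_ngram_size=3,
    num_return_sequences=2,               # multiple sequences
    pad_token_id=tokenizer.eos_token_id,  # special tokens (GPT-2 has no pad token)
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```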
## Advanced Generation Techniques
### Constrained Generation
Force generation to include specific tokens or follow patterns.
```python
from transformers import PhrasalConstraint
constraints = [
PhrasalConstraint(tokenizer("New York", add_special_tokens=False).input_ids)
]
outputs = model.generate(
**inputs,
constraints=constraints,
num_beams=5,
)
```
### Streaming Generation
Generate tokens one at a time for real-time display.
```python
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
max_new_tokens=100,
streamer=streamer,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)
thread.join()
```
### Logit Processors
Customize token selection with custom logit processors.
```python
from transformers import LogitsProcessor, LogitsProcessorList
class CustomLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids, scores):
# Modify scores here
return scores
logits_processor = LogitsProcessorList([CustomLogitsProcessor()])
outputs = model.generate(
**inputs,
logits_processor=logits_processor,
)
```
### Stopping Criteria
Define custom stopping conditions.
```python
from transformers import StoppingCriteria, StoppingCriteriaList
class CustomStoppingCriteria(StoppingCriteria):
def __call__(self, input_ids, scores, **kwargs):
# Return True to stop generation
return False
stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria()])
outputs = model.generate(
**inputs,
stopping_criteria=stopping_criteria,
)
```
## Best Practices
### For Creative Tasks (Stories, Dialogue)
```python
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
```
### For Factual Tasks (Summaries, QA)
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=4,
early_stopping=True,
no_repeat_ngram_size=2,
length_penalty=1.0,
)
```
### For Chat/Instruction Following
```python
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
```
## Vision-Language Model Generation
For models like LLaVA, BLIP-2, etc.:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("image.jpg")
inputs = processor(text="Describe this image", images=image, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
)
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
```
## Performance Optimization
### Use KV Cache
```python
# KV cache is enabled by default
outputs = model.generate(**inputs, use_cache=True)
```
### Mixed Precision
```python
import torch
with torch.cuda.amp.autocast():
outputs = model.generate(**inputs, max_new_tokens=100)
```
### Batch Generation
```python
texts = ["Prompt 1", "Prompt 2", "Prompt 3"]
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs define no pad token
tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
inputs = tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantization
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```


@@ -1,234 +0,0 @@
# Transformers Pipelines
Pipelines provide a simple and optimized interface for inference across many machine learning tasks. They abstract away the complexity of tokenization, model invocation, and post-processing.
## Usage Pattern
```python
from transformers import pipeline
# Basic usage
classifier = pipeline("text-classification")
result = classifier("This movie was amazing!")
# With specific model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie was amazing!")
```
## Natural Language Processing Pipelines
### Text Classification
```python
classifier = pipeline("text-classification")
classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```
### Zero-Shot Classification
```python
classifier = pipeline("zero-shot-classification")
classifier("This is about climate change", candidate_labels=["politics", "science", "sports"])
```
### Token Classification (NER)
```python
ner = pipeline("token-classification")
ner("My name is Sarah and I work at Microsoft in Seattle")
```
### Question Answering
```python
qa = pipeline("question-answering")
qa(question="What is the capital?", context="The capital of France is Paris.")
```
### Text Generation
```python
generator = pipeline("text-generation")
generator("Once upon a time", max_length=50)
```
### Text2Text Generation
```python
generator = pipeline("text2text-generation", model="t5-base")
generator("translate English to French: Hello")
```
### Summarization
```python
summarizer = pipeline("summarization")
summarizer("Long article text here...", max_length=130, min_length=30)
```
### Translation
```python
translator = pipeline("translation_en_to_fr")
translator("Hello, how are you?")
```
### Fill Mask
```python
unmasker = pipeline("fill-mask")
unmasker("Paris is the [MASK] of France.")
```
### Feature Extraction
```python
extractor = pipeline("feature-extraction")
embeddings = extractor("This is a sentence")
```
### Document Question Answering
```python
doc_qa = pipeline("document-question-answering")
doc_qa(image="document.png", question="What is the invoice number?")
```
### Table Question Answering
```python
table_qa = pipeline("table-question-answering")
table_qa(table=data, query="How many employees?")
```
## Computer Vision Pipelines
### Image Classification
```python
classifier = pipeline("image-classification")
classifier("cat.jpg")
```
### Zero-Shot Image Classification
```python
classifier = pipeline("zero-shot-image-classification")
classifier("cat.jpg", candidate_labels=["cat", "dog", "bird"])
```
### Object Detection
```python
detector = pipeline("object-detection")
detector("street.jpg")
```
### Image Segmentation
```python
segmenter = pipeline("image-segmentation")
segmenter("image.jpg")
```
### Image-to-Image
```python
img2img = pipeline("image-to-image", model="caidas/swin2SR-classical-sr-x2-64")
img2img("input.jpg")
```
### Depth Estimation
```python
depth = pipeline("depth-estimation")
depth("image.jpg")
```
### Video Classification
```python
classifier = pipeline("video-classification")
classifier("video.mp4")
```
### Keypoint Matching
```python
matcher = pipeline("keypoint-matching")
matcher(image1="img1.jpg", image2="img2.jpg")
```
## Audio Pipelines
### Automatic Speech Recognition
```python
asr = pipeline("automatic-speech-recognition")
asr("audio.wav")
```
### Audio Classification
```python
classifier = pipeline("audio-classification")
classifier("audio.wav")
```
### Zero-Shot Audio Classification
```python
classifier = pipeline("zero-shot-audio-classification")
classifier("audio.wav", candidate_labels=["speech", "music", "noise"])
```
### Text-to-Audio/Text-to-Speech
```python
synthesizer = pipeline("text-to-audio")
audio = synthesizer("Hello, how are you today?")
```
## Multimodal Pipelines
### Image-to-Text (Image Captioning)
```python
captioner = pipeline("image-to-text")
captioner("image.jpg")
```
### Visual Question Answering
```python
vqa = pipeline("visual-question-answering")
vqa(image="image.jpg", question="What color is the car?")
```
### Image-Text-to-Text (VLMs)
```python
vlm = pipeline("image-text-to-text")
vlm(images="image.jpg", text="Describe this image in detail")
```
### Zero-Shot Object Detection
```python
detector = pipeline("zero-shot-object-detection")
detector("image.jpg", candidate_labels=["car", "person", "tree"])
```
## Pipeline Configuration
### Common Parameters
- `model`: Specify model identifier or path
- `device`: Set device (0 for GPU, -1 for CPU, or "cuda:0")
- `batch_size`: Process multiple inputs at once
- `torch_dtype`: Set precision (torch.float16, torch.bfloat16)
```python
# GPU with half precision
pipe = pipeline("text-generation", model="gpt2", device=0, torch_dtype=torch.float16)
# Batch processing
pipe(["text 1", "text 2", "text 3"], batch_size=8)
```
### Task-Specific Parameters
Each pipeline accepts task-specific parameters in the call:
```python
# Text generation
generator("prompt", max_length=100, temperature=0.7, top_p=0.9, num_return_sequences=3)
# Summarization
summarizer("text", max_length=130, min_length=30, do_sample=False)
# Translation
translator("text", max_length=512, num_beams=4)
```
## Best Practices
1. **Reuse pipelines**: Create once, use multiple times for efficiency
2. **Batch processing**: Use batches for multiple inputs to maximize throughput
3. **GPU acceleration**: Set `device=0` for GPU when available
4. **Model selection**: Choose task-specific models for best results
5. **Memory management**: Use `torch_dtype=torch.float16` for large models


@@ -1,599 +0,0 @@
# Common Task Patterns
This document provides common patterns and workflows for typical tasks using Transformers.
## Text Classification
### Binary or Multi-class Classification
```python
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("imdb")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Load model
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2,
id2label=id2label,
label2id=label2id
)
# Metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
num_train_epochs=3,
weight_decay=0.01,
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
compute_metrics=compute_metrics,
)
trainer.train()
# Inference
text = "This movie was fantastic!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
print(id2label[predictions.item()])
```
## Named Entity Recognition (Token Classification)
```python
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
DataCollatorForTokenClassification
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("conll2003")
# Tokenize (align labels with tokenized words)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(
examples["tokens"],
truncation=True,
is_split_into_words=True
)
labels = []
for i, label in enumerate(examples["ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i)
label_ids = []
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label[word_idx])
else:
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
# Model
label_list = dataset["train"].features["ner_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=len(label_list)
)
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)
# Metrics
metric = evaluate.load("seqeval")
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
return metric.compute(predictions=true_predictions, references=true_labels)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
```
## Question Answering
```python
from transformers import (
AutoTokenizer,
AutoModelForQuestionAnswering,
TrainingArguments,
Trainer,
DefaultDataCollator
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("squad")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def preprocess_function(examples):
questions = [q.strip() for q in examples["question"]]
inputs = tokenizer(
questions,
examples["context"],
max_length=384,
truncation="only_second",
return_offsets_mapping=True,
padding="max_length",
)
offset_mapping = inputs.pop("offset_mapping")
answers = examples["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(offset_mapping):
answer = answers[i]
start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])
# Find start and end token positions
sequence_ids = inputs.sequence_ids(i)
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
start_positions.append(0)
end_positions.append(0)
else:
idx = context_start
while idx <= context_end and offset[idx][0] <= start_char:
idx += 1
start_positions.append(idx - 1)
idx = context_end
while idx >= context_start and offset[idx][1] >= end_char:
idx -= 1
end_positions.append(idx + 1)
inputs["start_positions"] = start_positions
inputs["end_positions"] = end_positions
return inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
# Model
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=DefaultDataCollator(),
)
trainer.train()
# Inference
question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
start_pos = outputs.start_logits.argmax()
end_pos = outputs.end_logits.argmax()
answer_tokens = inputs.input_ids[0][start_pos:end_pos+1]
answer = tokenizer.decode(answer_tokens)
```
## Text Summarization
```python
from transformers import (
AutoTokenizer,
AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("t5-small")
def preprocess_function(examples):
inputs = ["summarize: " + doc for doc in examples["article"]]
model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
labels = tokenizer(
text_target=examples["highlights"],
max_length=128,
truncation=True
)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
# Metrics
rouge = evaluate.load("rouge")
def compute_metrics(eval_pred):
predictions, labels = eval_pred
decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
result = rouge.compute(
predictions=decoded_preds,
references=decoded_labels,
use_stemmer=True
)
return {k: round(v, 4) for k, v in result.items()}
# Train
training_args = Seq2SeqTrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
# Inference
text = "Long article text..."
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Translation
```python
from transformers import (
AutoTokenizer,
AutoModelForSeq2SeqLM,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wmt16", "de-en")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("t5-small")
def preprocess_function(examples):
inputs = [f"translate German to English: {de}" for de in examples["de"]]
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
labels = tokenizer(
text_target=examples["en"],
max_length=128,
truncation=True
)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Model and training (similar to summarization)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Inference
text = "Guten Tag, wie geht es Ihnen?"
inputs = tokenizer(f"translate German to English: {text}", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Causal Language Modeling (Training from Scratch or Fine-tuning)
```python
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Group texts into chunks
block_size = 128
def group_texts(examples):
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
total_length = (total_length // block_size) * block_size
result = {
k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
# Model
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_datasets["train"],
eval_dataset=lm_datasets["validation"],
data_collator=data_collator,
)
trainer.train()
```
## Image Classification
```python
from transformers import (
AutoImageProcessor,
AutoModelForImageClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset
from torchvision.transforms import Compose, Resize, ToTensor, Normalize
import numpy as np
import evaluate
# Load dataset
dataset = load_dataset("food101", split="train[:5000]")
# Prepare image transforms
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = image_processor.size["height"]
transforms = Compose([
Resize((size, size)),
ToTensor(),
normalize,
])
def preprocess_function(examples):
examples["pixel_values"] = [transforms(img.convert("RGB")) for img in examples["image"]]
return examples
dataset = dataset.with_transform(preprocess_function)
# Model
model = AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=len(dataset["train"].features["label"].names),
ignore_mismatched_sizes=True
)
# Metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
predictions = np.argmax(eval_pred.predictions, axis=1)
return metric.compute(predictions=predictions, references=eval_pred.label_ids)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=5e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
compute_metrics=compute_metrics,
)
trainer.train()
```
## Vision-Language Tasks (Image Captioning)
```python
from transformers import (
AutoProcessor,
AutoModelForVision2Seq,
TrainingArguments,
Trainer
)
from datasets import load_dataset
from PIL import Image
# Load dataset
dataset = load_dataset("ybelkada/football-dataset")
# Processor
processor = AutoProcessor.from_pretrained("microsoft/git-base")
def preprocess_function(examples):
images = [Image.open(img).convert("RGB") for img in examples["image"]]
texts = examples["caption"]
inputs = processor(images=images, text=texts, padding="max_length", truncation=True)
inputs["labels"] = inputs["input_ids"]
return inputs
dataset = dataset.map(preprocess_function, batched=True)
# Model
model = AutoModelForVision2Seq.from_pretrained("microsoft/git-base")
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=5e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
)
trainer.train()
# Inference
image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
```
## Best Practices Summary
1. **Use appropriate Auto* classes**: AutoTokenizer, AutoModel, etc. for model loading
2. **Proper preprocessing**: Tokenize, align labels, handle special cases
3. **Data collators**: Use appropriate collators for dynamic padding
4. **Metrics**: Load and compute relevant metrics for evaluation
5. **Training arguments**: Configure properly for task and hardware
6. **Inference**: Use pipeline() for quick inference, or manual tokenization for custom needs


@@ -1,328 +0,0 @@
# Training with Transformers
Transformers provides comprehensive training capabilities through the `Trainer` API, supporting distributed training, mixed precision, and advanced optimization techniques.
## Basic Training Workflow
```python
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# 1. Load and preprocess data
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
# 3. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 4. Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
# 5. Train
trainer.train()
# 6. Evaluate
trainer.evaluate()
# 7. Save model
trainer.save_model("./final_model")
```
## TrainingArguments Configuration
### Essential Parameters
**Output and Logging:**
- `output_dir`: Directory for checkpoints and outputs (required)
- `logging_dir`: TensorBoard log directory (default: `{output_dir}/runs`)
- `logging_steps`: Log every N steps (default: 500)
- `logging_strategy`: "steps" or "epoch"
**Training Duration:**
- `num_train_epochs`: Number of epochs (default: 3.0)
- `max_steps`: Max training steps (overrides num_train_epochs if set)
**Batch Size and Gradient Accumulation:**
- `per_device_train_batch_size`: Batch size per device (default: 8)
- `per_device_eval_batch_size`: Eval batch size per device (default: 8)
- `gradient_accumulation_steps`: Accumulate gradients over N steps (default: 1)
- Effective batch size = `per_device_train_batch_size * gradient_accumulation_steps * num_gpus`
**Learning Rate:**
- `learning_rate`: Peak learning rate (default: 5e-5)
- `lr_scheduler_type`: Scheduler type ("linear", "cosine", "constant", etc.)
- `warmup_steps`: Warmup steps (default: 0)
- `warmup_ratio`: Warmup as fraction of total steps
**Evaluation:**
- `eval_strategy`: "no", "steps", or "epoch" (default: "no")
- `eval_steps`: Evaluate every N steps (if eval_strategy="steps")
- `eval_delay`: Delay evaluation until N steps
**Checkpointing:**
- `save_strategy`: "no", "steps", or "epoch" (default: "steps")
- `save_steps`: Save checkpoint every N steps (default: 500)
- `save_total_limit`: Keep only N most recent checkpoints
- `load_best_model_at_end`: Load best checkpoint at end (default: False)
- `metric_for_best_model`: Metric to determine best model
**Optimization:**
- `optim`: Optimizer ("adamw_torch", "adamw_hf", "sgd", etc.)
- `weight_decay`: Weight decay coefficient (default: 0.0)
- `adam_beta1`, `adam_beta2`: Adam optimizer betas
- `adam_epsilon`: Epsilon for Adam (default: 1e-8)
- `max_grad_norm`: Max gradient norm for clipping (default: 1.0)
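A configuration sketch pulling the groups above together (values are illustrative, not recommendations):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    # Output and logging
    output_dir="./results",
    logging_steps=50,
    # Training duration
    num_train_epochs=3,
    # Batch size and gradient accumulation (effective batch = 8 * 4 per device)
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    # Learning rate schedule
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    # Evaluation and checkpointing
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    # Optimization
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```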
### Mixed Precision Training
```python
training_args = TrainingArguments(
output_dir="./results",
    fp16=True,               # fp16 on NVIDIA GPUs
    fp16_opt_level="O1",     # O0-O3 (only used with the Apex backend)
    # or, on Ampere+ GPUs (enable one of fp16/bf16, not both):
    # bf16=True,             # usually more numerically stable than fp16
)
```
### Distributed Training
**DataParallel (single-node multi-GPU):**
```python
# Automatic with multiple GPUs
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=16, # Per GPU
)
# Run: python script.py
```
**DistributedDataParallel (multi-node or multi-GPU):**
```bash
# Single node, multiple GPUs
torchrun --nproc_per_node=4 script.py
# Or use accelerate
accelerate config
accelerate launch script.py
```
**DeepSpeed Integration:**
```python
training_args = TrainingArguments(
output_dir="./results",
deepspeed="ds_config.json", # DeepSpeed config file
)
```
### Advanced Features
**Gradient Checkpointing (reduce memory):**
```python
training_args = TrainingArguments(
output_dir="./results",
gradient_checkpointing=True,
)
```
**Compilation with torch.compile:**
```python
training_args = TrainingArguments(
output_dir="./results",
torch_compile=True,
torch_compile_backend="inductor", # or "cudagraphs"
)
```
**Push to Hub:**
```python
training_args = TrainingArguments(
output_dir="./results",
push_to_hub=True,
hub_model_id="username/model-name",
hub_strategy="every_save", # or "end"
)
```
## Custom Training Components
### Custom Metrics
```python
import evaluate
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
)
```
### Custom Loss Function
```python
class CustomTrainer(Trainer):
def compute_loss(self, model, inputs, return_outputs=False):
labels = inputs.pop("labels")
outputs = model(**inputs)
logits = outputs.logits
# Custom loss calculation
loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
return (loss, outputs) if return_outputs else loss
```
### Data Collator
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator,
)
```
### Callbacks
```python
from transformers import TrainerCallback
class CustomCallback(TrainerCallback):
def on_epoch_end(self, args, state, control, **kwargs):
print(f"Epoch {state.epoch} completed!")
return control
trainer = Trainer(
model=model,
args=training_args,
callbacks=[CustomCallback],
)
```
## Hyperparameter Search
```python
def model_init():
return AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Optuna-based search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
hp_space=lambda trial: {
"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
"num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
}
)
```
## Training Best Practices
1. **Start with small learning rates**: 2e-5 to 5e-5 for fine-tuning
2. **Use warmup**: 5-10% of total steps for learning rate warmup
3. **Monitor training**: Use eval_strategy="epoch" or "steps" to track progress
4. **Save checkpoints**: Set save_strategy and save_total_limit
5. **Use mixed precision**: Enable fp16 or bf16 for faster training
6. **Gradient accumulation**: For large effective batch sizes on limited memory
7. **Load best model**: Set load_best_model_at_end=True to avoid overfitting
8. **Push to Hub**: Enable push_to_hub for easy model sharing and versioning
## Common Training Patterns
### Classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_classes,
id2label=id2label,
label2id=label2id
)
```
### Question Answering
```python
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```
### Token Classification (NER)
```python
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_tags,
id2label=id2label,
label2id=label2id
)
```
### Sequence-to-Sequence
```python
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```
### Causal Language Modeling
```python
model = AutoModelForCausalLM.from_pretrained("gpt2")
```
### Masked Language Modeling
```python
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
```


@@ -1,241 +0,0 @@
#!/usr/bin/env python3
"""
Fine-tune a transformer model for text classification.
This script demonstrates the complete workflow for fine-tuning a pre-trained
model on a classification task using the Trainer API.
"""
import numpy as np
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer,
DataCollatorWithPadding,
)
import evaluate
def load_and_prepare_data(dataset_name="imdb", model_name="distilbert-base-uncased", max_samples=None):
"""
Load dataset and tokenize.
Args:
dataset_name: Name of the dataset to load
model_name: Name of the model/tokenizer to use
max_samples: Limit number of samples (for quick testing)
Returns:
tokenized_datasets, tokenizer
"""
print(f"Loading dataset: {dataset_name}")
dataset = load_dataset(dataset_name)
# Optionally limit samples for quick testing
if max_samples:
dataset["train"] = dataset["train"].select(range(max_samples))
dataset["test"] = dataset["test"].select(range(min(max_samples, len(dataset["test"]))))
print(f"Loading tokenizer: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
print("Tokenizing dataset...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
return tokenized_datasets, tokenizer
def create_model(model_name, num_labels, id2label, label2id):
"""
Create classification model.
Args:
model_name: Name of the pre-trained model
num_labels: Number of classification labels
id2label: Dictionary mapping label IDs to names
label2id: Dictionary mapping label names to IDs
Returns:
model
"""
print(f"Loading model: {model_name}")
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
return model
def define_compute_metrics(metric_name="accuracy"):
"""
Define function to compute metrics during evaluation.
Args:
metric_name: Name of the metric to use
Returns:
compute_metrics function
"""
metric = evaluate.load(metric_name)
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
return compute_metrics
def train_model(model, tokenizer, train_dataset, eval_dataset, output_dir="./results"):
"""
Train the model.
Args:
model: The model to train
tokenizer: The tokenizer
train_dataset: Training dataset
eval_dataset: Evaluation dataset
output_dir: Directory for checkpoints and logs
Returns:
trained model, trainer
"""
# Define training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
logging_dir=f"{output_dir}/logs",
logging_steps=100,
save_total_limit=2,
fp16=False, # Set to True if using GPU with fp16 support
)
# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
compute_metrics=define_compute_metrics("accuracy"),
)
# Train
print("\nStarting training...")
trainer.train()
# Evaluate
print("\nEvaluating model...")
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")
return model, trainer
def test_inference(model, tokenizer, id2label):
"""
Test the trained model with sample texts.
Args:
model: Trained model
tokenizer: Tokenizer
id2label: Dictionary mapping label IDs to names
"""
print("\n=== Testing Inference ===")
test_texts = [
"This movie was absolutely fantastic! I loved every minute of it.",
"Terrible film. Waste of time and money.",
"It was okay, nothing special but not bad either."
]
for text in test_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}  # move tensors to the model's device
        outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
predicted_label = id2label[predictions.item()]
confidence = outputs.logits.softmax(-1).max().item()
print(f"\nText: {text}")
print(f"Prediction: {predicted_label} (confidence: {confidence:.3f})")
def main():
"""Main training pipeline."""
# Configuration
DATASET_NAME = "imdb"
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "./results"
MAX_SAMPLES = None # Set to a small number (e.g., 1000) for quick testing
# Label mapping
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
num_labels = len(id2label)
print("=" * 60)
print("Fine-Tuning Text Classification Model")
print("=" * 60)
# Load and prepare data
tokenized_datasets, tokenizer = load_and_prepare_data(
dataset_name=DATASET_NAME,
model_name=MODEL_NAME,
max_samples=MAX_SAMPLES
)
# Create model
model = create_model(
model_name=MODEL_NAME,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
# Train model
model, trainer = train_model(
model=model,
tokenizer=tokenizer,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
output_dir=OUTPUT_DIR
)
# Save final model
print(f"\nSaving model to {OUTPUT_DIR}/final_model")
trainer.save_model(f"{OUTPUT_DIR}/final_model")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final_model")
# Test inference
test_inference(model, tokenizer, id2label)
print("\n" + "=" * 60)
print("Training completed successfully!")
print("=" * 60)
if __name__ == "__main__":
main()


@@ -1,189 +0,0 @@
#!/usr/bin/env python3
"""
Text generation with different decoding strategies.
This script demonstrates various text generation approaches using
different sampling and decoding strategies.
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
def load_model_and_tokenizer(model_name="gpt2"):
"""
Load model and tokenizer.
Args:
model_name: Name of the model to load
Returns:
model, tokenizer
"""
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Set pad token if not already set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
return model, tokenizer
def generate_with_greedy(model, tokenizer, prompt, max_new_tokens=50):
"""Greedy decoding - always picks highest probability token."""
print("\n=== Greedy Decoding ===")
print(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
num_beams=1,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_with_beam_search(model, tokenizer, prompt, max_new_tokens=50, num_beams=5):
"""Beam search - explores multiple hypotheses."""
print("\n=== Beam Search ===")
print(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
num_beams=num_beams,
early_stopping=True,
no_repeat_ngram_size=2,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_with_sampling(model, tokenizer, prompt, max_new_tokens=50,
temperature=0.7, top_k=50, top_p=0.9):
"""Sampling with temperature, top-k, and nucleus (top-p) sampling."""
print("\n=== Sampling (Temperature + Top-K + Top-P) ===")
print(f"Prompt: {prompt}")
print(f"Parameters: temperature={temperature}, top_k={top_k}, top_p={top_p}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temperature,
top_k=top_k,
top_p=top_p,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_multiple_sequences(model, tokenizer, prompt, max_new_tokens=50,
num_return_sequences=3):
"""Generate multiple diverse sequences."""
print("\n=== Multiple Sequences (with Sampling) ===")
print(f"Prompt: {prompt}")
print(f"Generating {num_return_sequences} sequences...")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.8,
top_p=0.95,
num_return_sequences=num_return_sequences,
pad_token_id=tokenizer.pad_token_id
)
for i, output in enumerate(outputs):
generated_text = tokenizer.decode(output, skip_special_tokens=True)
print(f"\nSequence {i+1}: {generated_text}")
print()
def generate_with_config(model, tokenizer, prompt):
"""Use GenerationConfig for reusable configuration."""
print("\n=== Using GenerationConfig ===")
print(f"Prompt: {prompt}")
# Create a generation config
generation_config = GenerationConfig(
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
pad_token_id=tokenizer.pad_token_id
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def compare_temperatures(model, tokenizer, prompt, max_new_tokens=50):
"""Compare different temperature settings."""
print("\n=== Temperature Comparison ===")
print(f"Prompt: {prompt}\n")
temperatures = [0.3, 0.7, 1.0, 1.5]
for temp in temperatures:
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temp,
top_p=0.9,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Temperature {temp}: {generated_text}\n")
def main():
"""Run all generation examples."""
print("=" * 70)
print("Text Generation Examples")
print("=" * 70)
# Load model and tokenizer
model, tokenizer = load_model_and_tokenizer("gpt2")
# Example prompts
story_prompt = "Once upon a time in a distant galaxy"
factual_prompt = "The three branches of the US government are"
# Demonstrate different strategies
generate_with_greedy(model, tokenizer, story_prompt)
generate_with_beam_search(model, tokenizer, factual_prompt)
generate_with_sampling(model, tokenizer, story_prompt)
generate_multiple_sequences(model, tokenizer, story_prompt, num_return_sequences=3)
generate_with_config(model, tokenizer, story_prompt)
compare_temperatures(model, tokenizer, story_prompt)
print("=" * 70)
print("All generation examples completed!")
print("=" * 70)
if __name__ == "__main__":
main()


@@ -1,133 +0,0 @@
#!/usr/bin/env python3
"""
Quick inference using Transformers pipelines.
This script demonstrates how to quickly use pre-trained models for inference
across various tasks using the pipeline API.
"""
from transformers import pipeline
def text_classification_example():
"""Sentiment analysis example."""
print("=== Text Classification ===")
classifier = pipeline("text-classification")
result = classifier("I love using Transformers! It makes NLP so easy.")
print(f"Result: {result}\n")
def named_entity_recognition_example():
"""Named Entity Recognition example."""
print("=== Named Entity Recognition ===")
ner = pipeline("token-classification", aggregation_strategy="simple")
text = "My name is Sarah and I work at Microsoft in Seattle"
entities = ner(text)
for entity in entities:
print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")
print()
def question_answering_example():
"""Question Answering example."""
print("=== Question Answering ===")
qa = pipeline("question-answering")
context = "Paris is the capital and most populous city of France. It is located in northern France."
question = "What is the capital of France?"
answer = qa(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})\n")
def text_generation_example():
"""Text generation example."""
print("=== Text Generation ===")
generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a land far away"
generated = generator(prompt, max_length=50, num_return_sequences=1)
print(f"Prompt: {prompt}")
print(f"Generated: {generated[0]['generated_text']}\n")
def summarization_example():
"""Text summarization example."""
print("=== Summarization ===")
summarizer = pipeline("summarization")
article = """
The Transformers library provides thousands of pretrained models to perform tasks
on texts such as classification, information extraction, question answering,
summarization, translation, text generation, etc in over 100 languages. Its aim
is to make cutting-edge NLP easier to use for everyone. The library provides APIs
to quickly download and use pretrained models on a given text, fine-tune them on
your own datasets then share them with the community on the model hub.
"""
summary = summarizer(article, max_length=50, min_length=25, do_sample=False)
print(f"Summary: {summary[0]['summary_text']}\n")
def translation_example():
"""Translation example."""
print("=== Translation ===")
translator = pipeline("translation_en_to_fr")
text = "Hello, how are you today?"
translation = translator(text)
print(f"English: {text}")
print(f"French: {translation[0]['translation_text']}\n")
def zero_shot_classification_example():
"""Zero-shot classification example."""
print("=== Zero-Shot Classification ===")
classifier = pipeline("zero-shot-classification")
text = "This is a breaking news story about a major earthquake."
candidate_labels = ["politics", "sports", "science", "breaking news"]
result = classifier(text, candidate_labels)
print(f"Text: {text}")
print("Predictions:")
for label, score in zip(result['labels'], result['scores']):
print(f" {label}: {score:.3f}")
print()
def image_classification_example():
"""Image classification example (requires PIL)."""
print("=== Image Classification ===")
try:
from PIL import Image
import requests
classifier = pipeline("image-classification")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
predictions = classifier(image)
print("Top predictions:")
for pred in predictions[:3]:
print(f" {pred['label']}: {pred['score']:.3f}")
print()
except ImportError:
print("PIL not installed. Skipping image classification example.\n")
def main():
"""Run all examples."""
print("Transformers Quick Inference Examples")
print("=" * 50 + "\n")
# Text tasks
text_classification_example()
named_entity_recognition_example()
question_answering_example()
text_generation_example()
summarization_example()
translation_example()
zero_shot_classification_example()
# Vision task (optional)
image_classification_example()
print("=" * 50)
print("All examples completed!")
if __name__ == "__main__":
main()


@@ -0,0 +1,114 @@
---
name: adaptyv
description: Cloud laboratory platform for automated protein testing and validation. Use when designing proteins and needing experimental validation including binding assays, expression testing, thermostability measurements, enzyme activity assays, or protein sequence optimization. Also use for submitting experiments via API, tracking experiment status, downloading results, optimizing protein sequences for better expression using computational tools (NetSolP, SoluProt, SolubleMPNN, ESM), or managing protein design workflows with wet-lab validation.
---
# Adaptyv
Adaptyv is a cloud laboratory platform that provides automated protein testing and validation services. Submit protein sequences via API or web interface and receive experimental results in approximately 21 days.
## Quick Start
### Authentication Setup
Adaptyv requires API authentication. Set up your credentials:
1. Contact support@adaptyvbio.com to request API access (platform is in alpha/beta)
2. Receive your API access token
3. Set environment variable:
```bash
export ADAPTYV_API_KEY="your_api_key_here"
```
Or create a `.env` file:
```
ADAPTYV_API_KEY=your_api_key_here
```
### Installation
Install the required package using uv:
```bash
uv pip install requests python-dotenv
```
### Basic Usage
Submit protein sequences for testing:
```python
import os
import requests
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("ADAPTYV_API_KEY")
base_url = "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
# Submit experiment
response = requests.post(
f"{base_url}/experiments",
headers=headers,
json={
"sequences": ">protein1\nMKVLWALLGLLGAA...",
"experiment_type": "binding",
"webhook_url": "https://your-webhook.com/callback"
}
)
experiment_id = response.json()["experiment_id"]
```
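To check progress after submission, the same headers can be reused with the status endpoint (a minimal sketch; see `reference/api_reference.md` for the full endpoint list and response fields):
```python
# Check experiment status
status = requests.get(
    f"{base_url}/experiments/{experiment_id}",
    headers=headers
).json()
print(status["status"])  # submitted | processing | completed | failed
```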
## Available Experiment Types
Adaptyv supports multiple assay types:
- **Binding assays** - Test protein-target interactions using biolayer interferometry
- **Expression testing** - Measure protein expression levels
- **Thermostability** - Characterize protein thermal stability
- **Enzyme activity** - Assess enzymatic function
See `reference/experiments.md` for detailed information on each experiment type and workflows.
## Protein Sequence Optimization
Before submitting sequences, optimize them for better expression and stability:
**Common issues to address:**
- Unpaired cysteines that create unwanted disulfides
- Excessive hydrophobic regions causing aggregation
- Poor solubility predictions
**Recommended tools:**
- NetSolP / SoluProt - Initial solubility filtering
- SolubleMPNN - Sequence redesign for improved solubility
- ESM - Sequence likelihood scoring
- ipTM - Interface stability assessment
- pSAE - Hydrophobic exposure quantification
See `reference/protein_optimization.md` for detailed optimization workflows and tool usage.
## API Reference
For complete API documentation including all endpoints, request/response formats, and authentication details, see `reference/api_reference.md`.
## Examples
For concrete code examples covering common use cases (experiment submission, status tracking, result retrieval, batch processing), see `reference/examples.md`.
## Important Notes
- Platform is currently in alpha/beta phase with features subject to change
- Not all platform features are available via API yet
- Results typically delivered in ~21 days
- Contact support@adaptyvbio.com for access requests or questions
- Suitable for high-throughput AI-driven protein design workflows


@@ -0,0 +1,308 @@
# Adaptyv API Reference
## Base URL
```
https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws
```
## Authentication
All API requests require bearer token authentication in the request header:
```
Authorization: Bearer YOUR_API_KEY
```
To obtain API access:
1. Contact support@adaptyvbio.com
2. Request API access during alpha/beta period
3. Receive your personal access token
Store your API key securely:
- Use environment variables: `ADAPTYV_API_KEY`
- Never commit API keys to version control
- Use `.env` files with `.gitignore` for local development
## Endpoints
### Experiments
#### Create Experiment
Submit protein sequences for experimental testing.
**Endpoint:** `POST /experiments`
**Request Body:**
```json
{
"sequences": ">protein1\nMKVLWALLGLLGAA...\n>protein2\nMATGVLWALLG...",
"experiment_type": "binding|expression|thermostability|enzyme_activity",
"target_id": "optional_target_identifier",
"webhook_url": "https://your-webhook.com/callback",
"metadata": {
"project": "optional_project_name",
"notes": "optional_notes"
}
}
```
**Sequence Format:**
- FASTA format with headers
- Multiple sequences supported
- Standard amino acid codes
**Response:**
```json
{
"experiment_id": "exp_abc123xyz",
"status": "submitted",
"created_at": "2025-11-24T10:00:00Z",
"estimated_completion": "2025-12-15T10:00:00Z"
}
```
#### Get Experiment Status
Check the current status of an experiment.
**Endpoint:** `GET /experiments/{experiment_id}`
**Response:**
```json
{
"experiment_id": "exp_abc123xyz",
"status": "submitted|processing|completed|failed",
"created_at": "2025-11-24T10:00:00Z",
"updated_at": "2025-11-25T14:30:00Z",
"progress": {
"stage": "sequencing|expression|assay|analysis",
"percentage": 45
}
}
```
**Status Values:**
- `submitted` - Experiment received and queued
- `processing` - Active testing in progress
- `completed` - Results available for download
- `failed` - Experiment encountered an error
#### List Experiments
Retrieve all experiments for your organization.
**Endpoint:** `GET /experiments`
**Query Parameters:**
- `status` - Filter by status (optional)
- `limit` - Number of results per page (default: 50)
- `offset` - Pagination offset (default: 0)
**Response:**
```json
{
"experiments": [
{
"experiment_id": "exp_abc123xyz",
"status": "completed",
"experiment_type": "binding",
"created_at": "2025-11-24T10:00:00Z"
}
],
"total": 150,
"limit": 50,
"offset": 0
}
```
### Results
#### Get Experiment Results
Download results from a completed experiment.
**Endpoint:** `GET /experiments/{experiment_id}/results`
**Response:**
```json
{
"experiment_id": "exp_abc123xyz",
"results": [
{
"sequence_id": "protein1",
"measurements": {
"kd": 1.2e-9,
"kon": 1.5e5,
"koff": 1.8e-4
},
"quality_metrics": {
"confidence": "high",
"r_squared": 0.98
}
}
],
"download_urls": {
"raw_data": "https://...",
"analysis_package": "https://...",
"report": "https://..."
}
}
```
### Targets
#### Search Target Catalog
Search the ACROBiosystems antigen catalog.
**Endpoint:** `GET /targets`
**Query Parameters:**
- `search` - Search term (protein name, UniProt ID, etc.)
- `species` - Filter by species
- `category` - Filter by category
**Response:**
```json
{
"targets": [
{
"target_id": "tgt_12345",
"name": "Human PD-L1",
"species": "Homo sapiens",
"uniprot_id": "Q9NZQ7",
"availability": "in_stock|custom_order",
"price_usd": 450
}
]
}
```
#### Request Custom Target
Request an antigen not in the standard catalog.
**Endpoint:** `POST /targets/request`
**Request Body:**
```json
{
"target_name": "Custom target name",
"uniprot_id": "optional_uniprot_id",
"species": "species_name",
"notes": "Additional requirements"
}
```
### Organization
#### Get Credits Balance
Check your organization's credit balance and usage.
**Endpoint:** `GET /organization/credits`
**Response:**
```json
{
"balance": 10000,
"currency": "USD",
"usage_this_month": 2500,
"experiments_remaining": 22
}
```
## Webhooks
Configure webhook URLs to receive notifications when experiments complete.
**Webhook Payload:**
```json
{
"event": "experiment.completed",
"experiment_id": "exp_abc123xyz",
"status": "completed",
"timestamp": "2025-12-15T10:00:00Z",
"results_url": "/experiments/exp_abc123xyz/results"
}
```
**Webhook Events:**
- `experiment.submitted` - Experiment received
- `experiment.started` - Processing began
- `experiment.completed` - Results available
- `experiment.failed` - Error occurred
**Security:**
- Verify webhook signatures (details provided during onboarding)
- Use HTTPS endpoints only
- Respond with 200 OK to acknowledge receipt
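A minimal receiver sketch (assuming a Flask app; the endpoint path is illustrative, and the signature-verification step is left as a placeholder because the exact scheme is provided during onboarding):
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/adaptyv-webhook", methods=["POST"])
def adaptyv_webhook():
    payload = request.get_json(force=True)
    # TODO: verify the webhook signature using the scheme provided during onboarding
    if payload.get("event") == "experiment.completed":
        experiment_id = payload["experiment_id"]
        print(f"Experiment {experiment_id} completed; results at {payload['results_url']}")
        # e.g. enqueue a job that calls GET /experiments/{experiment_id}/results
    # Always acknowledge receipt with 200 OK
    return jsonify({"status": "received"}), 200
```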
## Error Handling
**Error Response Format:**
```json
{
"error": {
"code": "invalid_sequence",
"message": "Sequence contains invalid amino acid codes",
"details": {
"sequence_id": "protein1",
"position": 45,
"character": "X"
}
}
}
```
**Common Error Codes:**
- `authentication_failed` - Invalid or missing API key
- `invalid_sequence` - Malformed FASTA or invalid amino acids
- `insufficient_credits` - Not enough credits for experiment
- `target_not_found` - Specified target ID doesn't exist
- `rate_limit_exceeded` - Too many requests
- `experiment_not_found` - Invalid experiment ID
- `internal_error` - Server-side error
## Rate Limits
- 100 requests per minute per API key
- 1000 experiments per day per organization
- Batch submissions encouraged for large-scale testing
When rate limited, response includes:
```
HTTP 429 Too Many Requests
Retry-After: 60
```
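A small helper that honors the `Retry-After` header (a sketch; it assumes the `HEADERS` and `BASE_URL` conventions used in `reference/examples.md`):
```python
import time
import requests

def get_with_rate_limit(url, headers, max_attempts=5):
    """GET that waits for the Retry-After interval when a 429 is returned."""
    for _ in range(max_attempts):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Respect the server-suggested wait; default to 60 s if the header is missing
        wait = int(response.headers.get("Retry-After", 60))
        print(f"Rate limited; retrying in {wait}s")
        time.sleep(wait)
    raise RuntimeError("Still rate limited after several attempts")
```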
## Best Practices
1. **Use webhooks** for long-running experiments instead of polling
2. **Batch sequences** when submitting multiple variants
3. **Cache results** to avoid redundant API calls
4. **Implement retry logic** with exponential backoff
5. **Monitor credits** to avoid experiment failures
6. **Validate sequences** locally before submission
7. **Use descriptive metadata** for better experiment tracking
## API Versioning
The API is currently in alpha/beta. Breaking changes may occur but will be:
- Announced via email to registered users
- Documented in the changelog
- Supported with migration guides
Current version is reflected in response headers:
```
X-API-Version: alpha-2025-11
```
## Support
For API issues or questions:
- Email: support@adaptyvbio.com
- Documentation updates: https://docs.adaptyvbio.com
- Report bugs with experiment IDs and request details


@@ -0,0 +1,913 @@
# Code Examples
## Setup and Authentication
### Basic Setup
```python
import os
import requests
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Configuration
API_KEY = os.getenv("ADAPTYV_API_KEY")
BASE_URL = "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws"
# Standard headers
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
def check_api_connection():
"""Verify API connection and credentials"""
try:
response = requests.get(f"{BASE_URL}/organization/credits", headers=HEADERS)
response.raise_for_status()
print("✓ API connection successful")
print(f" Credits remaining: {response.json()['balance']}")
return True
except requests.exceptions.HTTPError as e:
print(f"✗ API authentication failed: {e}")
return False
```
### Environment Setup
Create a `.env` file:
```bash
ADAPTYV_API_KEY=your_api_key_here
```
Install dependencies:
```bash
uv pip install requests python-dotenv
```
## Experiment Submission
### Submit Single Sequence
```python
def submit_single_experiment(sequence, experiment_type="binding", target_id=None):
"""
Submit a single protein sequence for testing
Args:
sequence: Amino acid sequence string
experiment_type: Type of experiment (binding, expression, thermostability, enzyme_activity)
target_id: Optional target identifier for binding assays
Returns:
Experiment ID and status
"""
# Format as FASTA
fasta_content = f">protein_sequence\n{sequence}\n"
payload = {
"sequences": fasta_content,
"experiment_type": experiment_type
}
if target_id:
payload["target_id"] = target_id
response = requests.post(
f"{BASE_URL}/experiments",
headers=HEADERS,
json=payload
)
response.raise_for_status()
result = response.json()
print(f"✓ Experiment submitted")
print(f" Experiment ID: {result['experiment_id']}")
print(f" Status: {result['status']}")
print(f" Estimated completion: {result['estimated_completion']}")
return result
# Example usage
sequence = "MKVLWAALLGLLGAAAAFPAVTSAVKPYKAAVSAAVSKPYKAAVSAAVSKPYK"
experiment = submit_single_experiment(sequence, experiment_type="expression")
```
### Submit Multiple Sequences (Batch)
```python
def submit_batch_experiment(sequences_dict, experiment_type="binding", metadata=None):
"""
Submit multiple protein sequences in a single batch
Args:
sequences_dict: Dictionary of {name: sequence}
experiment_type: Type of experiment
metadata: Optional dictionary of additional information
Returns:
Experiment details
"""
# Format all sequences as FASTA
fasta_content = ""
for name, sequence in sequences_dict.items():
fasta_content += f">{name}\n{sequence}\n"
payload = {
"sequences": fasta_content,
"experiment_type": experiment_type
}
if metadata:
payload["metadata"] = metadata
response = requests.post(
f"{BASE_URL}/experiments",
headers=HEADERS,
json=payload
)
response.raise_for_status()
result = response.json()
print(f"✓ Batch experiment submitted")
print(f" Experiment ID: {result['experiment_id']}")
print(f" Sequences: {len(sequences_dict)}")
print(f" Status: {result['status']}")
return result
# Example usage
sequences = {
"variant_1": "MKVLWAALLGLLGAAA...",
"variant_2": "MKVLSAALLGLLGAAA...",
"variant_3": "MKVLAAALLGLLGAAA...",
"wildtype": "MKVLWAALLGLLGAAA..."
}
metadata = {
"project": "antibody_optimization",
"round": 3,
"notes": "Testing solubility-optimized variants"
}
experiment = submit_batch_experiment(sequences, "expression", metadata)
```
### Submit with Webhook Notification
```python
def submit_with_webhook(sequences_dict, experiment_type, webhook_url):
"""
Submit experiment with webhook for completion notification
Args:
sequences_dict: Dictionary of {name: sequence}
experiment_type: Type of experiment
webhook_url: URL to receive notification when complete
"""
fasta_content = ""
for name, sequence in sequences_dict.items():
fasta_content += f">{name}\n{sequence}\n"
payload = {
"sequences": fasta_content,
"experiment_type": experiment_type,
"webhook_url": webhook_url
}
response = requests.post(
f"{BASE_URL}/experiments",
headers=HEADERS,
json=payload
)
response.raise_for_status()
result = response.json()
print(f"✓ Experiment submitted with webhook")
print(f" Experiment ID: {result['experiment_id']}")
print(f" Webhook: {webhook_url}")
return result
# Example
webhook_url = "https://your-server.com/adaptyv-webhook"
experiment = submit_with_webhook(sequences, "binding", webhook_url)
```
## Tracking Experiments
### Check Experiment Status
```python
def check_experiment_status(experiment_id):
"""
Get current status of an experiment
Args:
experiment_id: Experiment identifier
Returns:
Status information
"""
response = requests.get(
f"{BASE_URL}/experiments/{experiment_id}",
headers=HEADERS
)
response.raise_for_status()
status = response.json()
print(f"Experiment: {experiment_id}")
print(f" Status: {status['status']}")
print(f" Created: {status['created_at']}")
print(f" Updated: {status['updated_at']}")
if 'progress' in status:
print(f" Progress: {status['progress']['percentage']}%")
print(f" Current stage: {status['progress']['stage']}")
return status
# Example
status = check_experiment_status("exp_abc123xyz")
```
### List All Experiments
```python
def list_experiments(status_filter=None, limit=50):
"""
List experiments with optional status filtering
Args:
status_filter: Filter by status (submitted, processing, completed, failed)
limit: Maximum number of results
Returns:
List of experiments
"""
params = {"limit": limit}
if status_filter:
params["status"] = status_filter
response = requests.get(
f"{BASE_URL}/experiments",
headers=HEADERS,
params=params
)
response.raise_for_status()
result = response.json()
print(f"Found {result['total']} experiments")
for exp in result['experiments']:
print(f" {exp['experiment_id']}: {exp['status']} ({exp['experiment_type']})")
return result['experiments']
# Example - list all completed experiments
completed_experiments = list_experiments(status_filter="completed")
```
### Poll Until Complete
```python
import time
def wait_for_completion(experiment_id, check_interval=3600):
"""
Poll experiment status until completion
Args:
experiment_id: Experiment identifier
check_interval: Seconds between status checks (default: 1 hour)
Returns:
Final status
"""
print(f"Monitoring experiment {experiment_id}...")
while True:
status = check_experiment_status(experiment_id)
if status['status'] == 'completed':
print("✓ Experiment completed!")
return status
elif status['status'] == 'failed':
print("✗ Experiment failed")
return status
print(f" Status: {status['status']} - checking again in {check_interval}s")
time.sleep(check_interval)
# Example (not recommended - use webhooks instead!)
# status = wait_for_completion("exp_abc123xyz", check_interval=3600)
```
## Retrieving Results
### Download Experiment Results
```python
import json
def download_results(experiment_id, output_dir="results"):
"""
Download and parse experiment results
Args:
experiment_id: Experiment identifier
output_dir: Directory to save results
Returns:
Parsed results data
"""
# Get results
response = requests.get(
f"{BASE_URL}/experiments/{experiment_id}/results",
headers=HEADERS
)
response.raise_for_status()
results = response.json()
# Save results JSON
os.makedirs(output_dir, exist_ok=True)
output_file = f"{output_dir}/{experiment_id}_results.json"
with open(output_file, 'w') as f:
json.dump(results, f, indent=2)
print(f"✓ Results downloaded: {output_file}")
print(f" Sequences tested: {len(results['results'])}")
# Download raw data if available
if 'download_urls' in results:
for data_type, url in results['download_urls'].items():
print(f" {data_type} available at: {url}")
return results
# Example
results = download_results("exp_abc123xyz")
```
### Parse Binding Results
```python
import pandas as pd
def parse_binding_results(results):
"""
Parse binding assay results into DataFrame
Args:
results: Results dictionary from API
Returns:
pandas DataFrame with organized results
"""
data = []
for result in results['results']:
row = {
'sequence_id': result['sequence_id'],
'kd': result['measurements']['kd'],
'kd_error': result['measurements']['kd_error'],
'kon': result['measurements']['kon'],
'koff': result['measurements']['koff'],
'confidence': result['quality_metrics']['confidence'],
'r_squared': result['quality_metrics']['r_squared']
}
data.append(row)
df = pd.DataFrame(data)
# Sort by affinity (lower KD = stronger binding)
df = df.sort_values('kd')
print("Top 5 binders:")
print(df.head())
return df
# Example
experiment_id = "exp_abc123xyz"
results = download_results(experiment_id)
binding_df = parse_binding_results(results)
# Export to CSV
binding_df.to_csv(f"{experiment_id}_binding_results.csv", index=False)
```
### Parse Expression Results
```python
def parse_expression_results(results):
"""
Parse expression testing results into DataFrame
Args:
results: Results dictionary from API
Returns:
pandas DataFrame with organized results
"""
data = []
for result in results['results']:
row = {
'sequence_id': result['sequence_id'],
'yield_mg_per_l': result['measurements']['total_yield_mg_per_l'],
'soluble_fraction': result['measurements']['soluble_fraction_percent'],
'purity': result['measurements']['purity_percent'],
'percentile': result['ranking']['percentile']
}
data.append(row)
df = pd.DataFrame(data)
# Sort by yield
df = df.sort_values('yield_mg_per_l', ascending=False)
print(f"Mean yield: {df['yield_mg_per_l'].mean():.2f} mg/L")
print(f"Top performer: {df.iloc[0]['sequence_id']} ({df.iloc[0]['yield_mg_per_l']:.2f} mg/L)")
return df
# Example
results = download_results("exp_expression123")
expression_df = parse_expression_results(results)
```
## Target Catalog
### Search for Targets
```python
def search_targets(query, species=None, category=None):
"""
Search the antigen catalog
Args:
query: Search term (protein name, UniProt ID, etc.)
species: Optional species filter
category: Optional category filter
Returns:
List of matching targets
"""
params = {"search": query}
if species:
params["species"] = species
if category:
params["category"] = category
response = requests.get(
f"{BASE_URL}/targets",
headers=HEADERS,
params=params
)
response.raise_for_status()
targets = response.json()['targets']
print(f"Found {len(targets)} targets matching '{query}':")
for target in targets:
print(f" {target['target_id']}: {target['name']}")
print(f" Species: {target['species']}")
print(f" Availability: {target['availability']}")
print(f" Price: ${target['price_usd']}")
return targets
# Example
targets = search_targets("PD-L1", species="Homo sapiens")
```
### Request Custom Target
```python
def request_custom_target(target_name, uniprot_id=None, species=None, notes=None):
"""
Request a custom antigen not in the standard catalog
Args:
target_name: Name of the target protein
uniprot_id: Optional UniProt identifier
species: Species name
notes: Additional requirements or notes
Returns:
Request confirmation
"""
payload = {
"target_name": target_name,
"species": species
}
if uniprot_id:
payload["uniprot_id"] = uniprot_id
if notes:
payload["notes"] = notes
response = requests.post(
f"{BASE_URL}/targets/request",
headers=HEADERS,
json=payload
)
response.raise_for_status()
result = response.json()
print(f"✓ Custom target request submitted")
print(f" Request ID: {result['request_id']}")
print(f" Status: {result['status']}")
return result
# Example
request = request_custom_target(
target_name="Novel receptor XYZ",
uniprot_id="P12345",
species="Mus musculus",
notes="Need high purity for structural studies"
)
```
## Complete Workflows
### End-to-End Binding Assay
```python
def complete_binding_workflow(sequences_dict, target_id, project_name):
"""
Complete workflow: submit sequences, track, and retrieve binding results
Args:
sequences_dict: Dictionary of {name: sequence}
target_id: Target identifier from catalog
project_name: Project name for metadata
Returns:
DataFrame with binding results
"""
print("=== Starting Binding Assay Workflow ===")
# Step 1: Submit experiment
print("\n1. Submitting experiment...")
metadata = {
"project": project_name,
"target": target_id
}
experiment = submit_batch_experiment(
sequences_dict,
experiment_type="binding",
metadata=metadata
)
experiment_id = experiment['experiment_id']
# Step 2: Save experiment info
print("\n2. Saving experiment details...")
with open(f"{experiment_id}_info.json", 'w') as f:
json.dump(experiment, f, indent=2)
print(f"✓ Experiment {experiment_id} submitted")
print(" Results will be available in ~21 days")
print(" Use webhook or poll status for updates")
# Note: In practice, wait for completion before this step
# print("\n3. Waiting for completion...")
# status = wait_for_completion(experiment_id)
# print("\n4. Downloading results...")
# results = download_results(experiment_id)
# print("\n5. Parsing results...")
# df = parse_binding_results(results)
# return df
return experiment_id
# Example
antibody_variants = {
"variant_1": "EVQLVESGGGLVQPGG...",
"variant_2": "EVQLVESGGGLVQPGS...",
"variant_3": "EVQLVESGGGLVQPGA...",
"wildtype": "EVQLVESGGGLVQPGG..."
}
experiment_id = complete_binding_workflow(
antibody_variants,
target_id="tgt_pdl1_human",
project_name="antibody_affinity_maturation"
)
```
### Optimization + Testing Pipeline
```python
# Combine computational optimization with experimental testing
def optimization_and_testing_pipeline(initial_sequences, experiment_type="expression"):
"""
Complete pipeline: optimize sequences computationally, then submit for testing
Args:
initial_sequences: Dictionary of {name: sequence}
experiment_type: Type of experiment
Returns:
Experiment ID for tracking
"""
print("=== Optimization and Testing Pipeline ===")
# Step 1: Computational optimization
print("\n1. Computational optimization...")
from protein_optimization import complete_optimization_pipeline
optimized = complete_optimization_pipeline(initial_sequences)
print(f"✓ Optimization complete")
print(f" Started with: {len(initial_sequences)} sequences")
print(f" Optimized to: {len(optimized)} sequences")
# Step 2: Select top candidates
print("\n2. Selecting top candidates for testing...")
top_candidates = optimized[:50] # Top 50
sequences_to_test = {
seq_data['name']: seq_data['sequence']
for seq_data in top_candidates
}
# Step 3: Submit for experimental validation
print("\n3. Submitting to Adaptyv...")
metadata = {
"optimization_method": "computational_pipeline",
"initial_library_size": len(initial_sequences),
"computational_scores": [s['combined'] for s in top_candidates]
}
experiment = submit_batch_experiment(
sequences_to_test,
experiment_type=experiment_type,
metadata=metadata
)
print(f"✓ Pipeline complete")
print(f" Experiment ID: {experiment['experiment_id']}")
return experiment['experiment_id']
# Example (generate_random_sequence is a placeholder for your own sequence-generation code)
initial_library = {
f"variant_{i}": generate_random_sequence()
for i in range(1000)
}
experiment_id = optimization_and_testing_pipeline(
initial_library,
experiment_type="expression"
)
```
### Batch Result Analysis
```python
def analyze_multiple_experiments(experiment_ids):
"""
Download and analyze results from multiple experiments
Args:
experiment_ids: List of experiment identifiers
Returns:
Combined DataFrame with all results
"""
all_results = []
for exp_id in experiment_ids:
print(f"Processing {exp_id}...")
# Download results
results = download_results(exp_id, output_dir=f"results/{exp_id}")
# Parse based on experiment type
exp_type = results.get('experiment_type', 'unknown')
if exp_type == 'binding':
df = parse_binding_results(results)
df['experiment_id'] = exp_id
all_results.append(df)
elif exp_type == 'expression':
df = parse_expression_results(results)
df['experiment_id'] = exp_id
all_results.append(df)
# Combine all results
combined_df = pd.concat(all_results, ignore_index=True)
print(f"\n✓ Analysis complete")
print(f" Total experiments: {len(experiment_ids)}")
print(f" Total sequences: {len(combined_df)}")
return combined_df
# Example
experiment_ids = [
"exp_round1_abc",
"exp_round2_def",
"exp_round3_ghi"
]
all_data = analyze_multiple_experiments(experiment_ids)
all_data.to_csv("combined_results.csv", index=False)
```
## Error Handling
### Robust API Wrapper
```python
import time
from requests.exceptions import RequestException, HTTPError
def api_request_with_retry(method, url, max_retries=3, backoff_factor=2, **kwargs):
"""
Make API request with retry logic and error handling
Args:
method: HTTP method (GET, POST, etc.)
url: Request URL
max_retries: Maximum number of retry attempts
backoff_factor: Exponential backoff multiplier
**kwargs: Additional arguments for requests
Returns:
Response object
Raises:
RequestException: If all retries fail
"""
for attempt in range(max_retries):
try:
response = requests.request(method, url, **kwargs)
response.raise_for_status()
return response
except HTTPError as e:
if e.response.status_code == 429: # Rate limit
wait_time = backoff_factor ** attempt
print(f"Rate limited. Waiting {wait_time}s...")
time.sleep(wait_time)
continue
elif e.response.status_code >= 500: # Server error
if attempt < max_retries - 1:
wait_time = backoff_factor ** attempt
print(f"Server error. Retrying in {wait_time}s...")
time.sleep(wait_time)
continue
else:
raise
else: # Client error (4xx) - don't retry
error_data = e.response.json() if e.response.content else {}
print(f"API Error: {error_data.get('error', {}).get('message', str(e))}")
raise
except RequestException as e:
if attempt < max_retries - 1:
wait_time = backoff_factor ** attempt
print(f"Request failed. Retrying in {wait_time}s...")
time.sleep(wait_time)
continue
else:
raise
raise RequestException(f"Failed after {max_retries} attempts")
# Example usage
response = api_request_with_retry(
"POST",
f"{BASE_URL}/experiments",
headers=HEADERS,
json={"sequences": fasta_content, "experiment_type": "binding"}
)
```
## Utility Functions
### Validate FASTA Format
```python
def validate_fasta(fasta_string):
"""
Validate FASTA format and sequences
Args:
fasta_string: FASTA-formatted string
Returns:
Tuple of (is_valid, error_message)
"""
lines = fasta_string.strip().split('\n')
if not lines:
return False, "Empty FASTA content"
if not lines[0].startswith('>'):
return False, "FASTA must start with header line (>)"
valid_amino_acids = set("ACDEFGHIKLMNPQRSTVWY")
current_header = None
for i, line in enumerate(lines):
if line.startswith('>'):
if not line[1:].strip():
return False, f"Line {i+1}: Empty header"
current_header = line[1:].strip()
else:
if current_header is None:
return False, f"Line {i+1}: Sequence before header"
sequence = line.strip().upper()
invalid = set(sequence) - valid_amino_acids
if invalid:
return False, f"Line {i+1}: Invalid amino acids: {invalid}"
return True, None
# Example
fasta = ">protein1\nMKVLWAALLG\n>protein2\nMATGVLWALG"
is_valid, error = validate_fasta(fasta)
if is_valid:
print("✓ FASTA format valid")
else:
print(f"✗ FASTA validation failed: {error}")
```
### Format Sequences to FASTA
```python
def sequences_to_fasta(sequences_dict):
"""
Convert dictionary of sequences to FASTA format
Args:
sequences_dict: Dictionary of {name: sequence}
Returns:
FASTA-formatted string
"""
fasta_content = ""
for name, sequence in sequences_dict.items():
# Clean sequence (remove whitespace, ensure uppercase)
clean_seq = ''.join(sequence.split()).upper()
# Validate
is_valid, error = validate_fasta(f">{name}\n{clean_seq}")
if not is_valid:
raise ValueError(f"Invalid sequence '{name}': {error}")
fasta_content += f">{name}\n{clean_seq}\n"
return fasta_content
# Example
sequences = {
"var1": "MKVLWAALLG",
"var2": "MATGVLWALG"
}
fasta = sequences_to_fasta(sequences)
print(fasta)
```


@@ -0,0 +1,360 @@
# Experiment Types and Workflows
## Overview
Adaptyv provides multiple experimental assay types for comprehensive protein characterization. Each experiment type has specific applications, workflows, and data outputs.
## Binding Assays
### Description
Measure protein-target interactions using biolayer interferometry (BLI), a label-free technique that monitors biomolecular binding in real-time.
### Use Cases
- Antibody-antigen binding characterization
- Receptor-ligand interaction analysis
- Protein-protein interaction studies
- Affinity maturation screening
- Epitope binning experiments
### Technology: Biolayer Interferometry (BLI)
BLI measures the interference pattern of reflected light from two surfaces:
- **Reference layer** - Biosensor tip surface
- **Biological layer** - Accumulated bound molecules
As molecules bind, the optical thickness increases, causing a wavelength shift proportional to binding.
**Advantages:**
- Label-free detection
- Real-time kinetics
- High-throughput compatible
- Works in crude samples
- Minimal sample consumption
### Measured Parameters
**Kinetic constants:**
- **KD** - Equilibrium dissociation constant (binding affinity)
- **kon** - Association rate constant (binding speed)
- **koff** - Dissociation rate constant (unbinding speed)
**Typical ranges:**
- Strong binders: KD < 1 nM
- Moderate binders: KD = 1-100 nM
- Weak binders: KD > 100 nM
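The equilibrium dissociation constant relates to the rate constants as KD = koff / kon, so reported kinetics can be sanity-checked and bucketed directly. A small sketch using the example values from the results format:
```python
def classify_binder(kon, koff):
    """Derive KD (M) from kinetic constants and bucket it by affinity."""
    kd = koff / kon
    kd_nm = kd * 1e9
    if kd_nm < 1:
        label = "strong"
    elif kd_nm <= 100:
        label = "moderate"
    else:
        label = "weak"
    return kd, label

kd, label = classify_binder(kon=1.5e5, koff=1.8e-4)
print(f"KD = {kd:.1e} M ({label} binder)")  # KD = 1.2e-09 M (moderate binder)
```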
### Workflow
1. **Sequence submission** - Provide protein sequences in FASTA format
2. **Expression** - Proteins expressed in appropriate host system
3. **Purification** - Automated purification protocols
4. **BLI assay** - Real-time binding measurements against specified targets
5. **Analysis** - Kinetic curve fitting and quality assessment
6. **Results delivery** - Binding parameters with confidence metrics
### Sample Requirements
- Protein sequence (standard amino acid codes)
- Target specification (from catalog or custom request)
- Buffer conditions (standard or custom)
- Expected concentration range (optional, improves assay design)
### Results Format
```json
{
"sequence_id": "antibody_variant_1",
"target": "Human PD-L1",
"measurements": {
"kd": 2.5e-9,
"kd_error": 0.3e-9,
"kon": 1.8e5,
"kon_error": 0.2e5,
"koff": 4.5e-4,
"koff_error": 0.5e-4
},
"quality_metrics": {
"confidence": "high|medium|low",
"r_squared": 0.97,
"chi_squared": 0.02,
"flags": []
},
"raw_data_url": "https://..."
}
```
## Expression Testing
### Description
Quantify protein expression levels in various host systems to assess producibility and optimize sequences for manufacturing.
### Use Cases
- Screening variants for high expression
- Optimizing codon usage
- Identifying expression bottlenecks
- Selecting candidates for scale-up
- Comparing expression systems
### Host Systems
Available expression platforms:
- **E. coli** - Rapid, cost-effective, prokaryotic system
- **Mammalian cells** - Native post-translational modifications
- **Yeast** - Eukaryotic system with simpler growth requirements
- **Insect cells** - Alternative eukaryotic platform
### Measured Parameters
- **Total protein yield** (mg/L culture)
- **Soluble fraction** (percentage)
- **Purity** (after initial purification)
- **Expression time course** (optional)
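A convenient single number for ranking variants is the soluble yield (total yield times soluble fraction); a small sketch using the fields from the results format shown below:
```python
def soluble_yield(measurements):
    """Soluble yield in mg/L = total yield x soluble fraction."""
    return (measurements["total_yield_mg_per_l"]
            * measurements["soluble_fraction_percent"] / 100)

example = {"total_yield_mg_per_l": 25.5, "soluble_fraction_percent": 78, "purity_percent": 92}
print(f"Soluble yield: {soluble_yield(example):.1f} mg/L")  # ~19.9 mg/L
```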
### Workflow
1. **Sequence submission** - Provide protein sequences
2. **Construct generation** - Cloning into expression vectors
3. **Expression** - Culture in specified host system
4. **Quantification** - Protein measurement via multiple methods
5. **Analysis** - Expression level comparison and ranking
6. **Results delivery** - Yield data and recommendations
### Results Format
```json
{
"sequence_id": "variant_1",
"host_system": "E. coli",
"measurements": {
"total_yield_mg_per_l": 25.5,
"soluble_fraction_percent": 78,
"purity_percent": 92
},
"ranking": {
"percentile": 85,
"notes": "High expression, good solubility"
}
}
```
## Thermostability Testing
### Description
Measure protein thermal stability to assess structural integrity, predict shelf-life, and identify stabilizing mutations.
### Use Cases
- Selecting thermally stable variants
- Formulation development
- Shelf-life prediction
- Stability-driven protein engineering
- Quality control screening
### Measurement Techniques
**Differential Scanning Fluorimetry (DSF):**
- Monitors protein unfolding via fluorescent dye binding
- Determines melting temperature (Tm)
- High-throughput capable
**Circular Dichroism (CD):**
- Secondary structure analysis
- Thermal unfolding curves
- Reversibility assessment
### Measured Parameters
- **Tm** - Melting temperature (midpoint of unfolding)
- **ΔH** - Enthalpy of unfolding
- **Aggregation temperature** (Tagg)
- **Reversibility** - Refolding after heating
### Workflow
1. **Sequence submission** - Provide protein sequences
2. **Expression and purification** - Standard protocols
3. **Thermostability assay** - Temperature gradient analysis
4. **Data analysis** - Curve fitting and parameter extraction
5. **Results delivery** - Stability metrics with ranking
### Results Format
```json
{
"sequence_id": "variant_1",
"measurements": {
"tm_celsius": 68.5,
"tm_error": 0.5,
"tagg_celsius": 72.0,
"reversibility_percent": 85
},
"quality_metrics": {
"curve_quality": "excellent",
"cooperativity": "two-state"
}
}
```
## Enzyme Activity Assays
### Description
Measure enzymatic function including substrate turnover, catalytic efficiency, and inhibitor sensitivity.
### Use Cases
- Screening enzyme variants for improved activity
- Substrate specificity profiling
- Inhibitor testing
- pH and temperature optimization
- Mechanistic studies
### Assay Types
**Continuous assays:**
- Chromogenic substrates
- Fluorogenic substrates
- Real-time monitoring
**Endpoint assays:**
- HPLC quantification
- Mass spectrometry
- Colorimetric detection
### Measured Parameters
**Kinetic parameters:**
- **kcat** - Turnover number (catalytic rate constant)
- **KM** - Michaelis constant (substrate affinity)
- **kcat/KM** - Catalytic efficiency
- **IC50** - Inhibitor concentration for 50% inhibition
**Activity metrics:**
- Specific activity (units/mg protein)
- Relative activity vs. reference
- Substrate specificity profile
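Michaelis-Menten kinetics tie these parameters together: the initial rate is v = kcat·[E]·[S] / (KM + [S]), and kcat/KM summarizes efficiency at low substrate concentration. A small sketch using the example values from the results format below:
```python
def michaelis_menten_rate(kcat, km_um, enzyme_um, substrate_um):
    """Initial rate (uM/s) from Michaelis-Menten kinetics."""
    return kcat * enzyme_um * substrate_um / (km_um + substrate_um)

kcat, km = 125, 45  # s^-1 and uM, matching the example results below
print(f"kcat/KM = {kcat / km:.1f} uM^-1 s^-1")  # ~2.8
print(f"v at [E]=0.01 uM, [S]=45 uM: {michaelis_menten_rate(kcat, km, 0.01, 45)} uM/s")
```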
### Workflow
1. **Sequence submission** - Provide enzyme sequences
2. **Expression and purification** - Optimized for activity retention
3. **Activity assay** - Substrate turnover measurements
4. **Kinetic analysis** - Michaelis-Menten fitting
5. **Results delivery** - Kinetic parameters and rankings
### Results Format
```json
{
"sequence_id": "enzyme_variant_1",
"substrate": "substrate_name",
"measurements": {
"kcat_per_second": 125,
"km_micromolar": 45,
"kcat_km": 2.8,
"specific_activity": 180
},
"quality_metrics": {
"confidence": "high",
"r_squared": 0.99
},
"ranking": {
"relative_activity": 1.8,
"improvement_vs_wildtype": "80%"
}
}
```
## Experiment Design Best Practices
### Sequence Submission
1. **Use clear identifiers** - Name sequences descriptively
2. **Include controls** - Submit wild-type or reference sequences
3. **Batch similar variants** - Group related sequences in single submission
4. **Validate sequences** - Check for errors before submission
### Sample Size
- **Pilot studies** - 5-10 sequences to test feasibility
- **Library screening** - 50-500 sequences for variant exploration
- **Focused optimization** - 10-50 sequences for fine-tuning
- **Large-scale campaigns** - 500+ sequences for ML-driven design
### Quality Control
Adaptyv includes automated QC steps:
- Expression verification before assay
- Replicate measurements for reliability
- Positive/negative controls in each batch
- Statistical validation of results
### Timeline Expectations
**Standard turnaround:** ~21 days from submission to results
**Timeline breakdown:**
- Construct generation: 3-5 days
- Expression: 5-7 days
- Purification: 2-3 days
- Assay execution: 3-5 days
- Analysis and QC: 2-3 days
**Factors affecting timeline:**
- Custom targets (add 1-2 weeks)
- Novel assay development (add 2-4 weeks)
- Large batch sizes (may add 1 week)
### Cost Optimization
1. **Batch submissions** - Lower per-sequence cost
2. **Standard targets** - Catalog antigens are faster/cheaper
3. **Standard conditions** - Custom buffers add cost
4. **Computational pre-filtering** - Submit only promising candidates
## Combining Experiment Types
For comprehensive protein characterization, combine multiple assays:
**Therapeutic antibody development:**
1. Binding assay → Identify high-affinity binders
2. Expression testing → Select manufacturable candidates
3. Thermostability → Ensure formulation stability
**Enzyme engineering:**
1. Activity assay → Screen for improved catalysis
2. Expression testing → Ensure producibility
3. Thermostability → Validate industrial robustness
**Sequential vs. Parallel:**
- **Sequential** - Use results from early assays to filter candidates
- **Parallel** - Run all assays simultaneously for faster results
## Data Integration
Results integrate with computational workflows:
1. **Download raw data** via API
2. **Parse results** into standardized format
3. **Feed into ML models** for next-round design
4. **Track experiments** with metadata tags
5. **Visualize trends** across design iterations
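A minimal sketch of steps 1-3, reusing `download_results` and `parse_binding_results` from `reference/examples.md` (the `design_round` column is illustrative):
```python
import pandas as pd

# 1-2. Download and parse results from several design rounds
frames = []
for round_num, exp_id in enumerate(["exp_round1_abc", "exp_round2_def"], start=1):
    results = download_results(exp_id)      # helper from examples.md
    df = parse_binding_results(results)      # helper from examples.md
    df["design_round"] = round_num
    frames.append(df)
training_table = pd.concat(frames, ignore_index=True)

# 3. Hand the table to the model of choice for next-round design,
#    e.g. regress affinity (kd) against sequence features computed upstream
training_table.to_csv("binding_training_data.csv", index=False)
```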
## Support and Troubleshooting
**Common issues:**
- Low expression → Consider sequence optimization (see protein_optimization.md)
- Poor binding → Verify target specification and expected range
- Variable results → Check sequence quality and controls
- Incomplete data → Contact support with experiment ID
**Getting help:**
- Email: support@adaptyvbio.com
- Include experiment ID and specific question
- Provide context (design goals, expected results)
- Response time: <24 hours for active experiments


@@ -0,0 +1,637 @@
# Protein Sequence Optimization
## Overview
Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.
## Common Protein Expression Problems
### 1. Unpaired Cysteines
**Problem:**
- Unpaired cysteines form unwanted disulfide bonds
- Leads to aggregation and misfolding
- Reduces expression yield and stability
**Solution:**
- Remove unpaired cysteines unless functionally necessary
- Pair cysteines appropriately for structural disulfides
- Replace with serine or alanine in non-critical positions
**Example:**
```python
# Check for an odd (unpaired) number of cysteines in an amino acid string
def check_cysteines(sequence):
cys_count = sequence.count('C')
if cys_count % 2 != 0:
print(f"Warning: Odd number of cysteines ({cys_count})")
return cys_count
```
### 2. Excessive Hydrophobicity
**Problem:**
- Long hydrophobic patches promote aggregation
- Exposed hydrophobic residues drive protein clumping
- Poor solubility in aqueous buffers
**Solution:**
- Maintain balanced hydropathy profiles
- Use short, flexible linkers between domains
- Reduce surface-exposed hydrophobic residues
**Metrics:**
- Kyte-Doolittle hydropathy plots
- GRAVY score (Grand Average of Hydropathy)
- pSAE (percent Solvent-Accessible hydrophobic residues)
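Both the overall GRAVY score and a windowed Kyte-Doolittle profile are available in Biopython; a small sketch (the 0.5 threshold for flagging hydrophobic windows is a rough heuristic, not an Adaptyv recommendation):
```python
# uv pip install biopython
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.SeqUtils import ProtParamData

def hydropathy_report(sequence, window=9):
    analysis = ProteinAnalysis(sequence)
    gravy = analysis.gravy()  # Grand Average of Hydropathy
    # Sliding-window Kyte-Doolittle hydropathy profile
    profile = analysis.protein_scale(ProtParamData.kd, window=window)
    hydrophobic_windows = sum(1 for value in profile if value > 0.5)
    return gravy, hydrophobic_windows

gravy, n_windows = hydropathy_report("MKVLWAALLGLLGAAA" * 3)
print(f"GRAVY: {gravy:.2f}, hydrophobic windows: {n_windows}")
```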
### 3. Low Solubility
**Problem:**
- Proteins precipitate during expression or purification
- Inclusion body formation
- Difficult downstream processing
**Solution:**
- Use solubility prediction tools for pre-screening
- Apply sequence optimization algorithms
- Add solubilizing tags if needed
## Computational Tools for Optimization
### NetSolP - Initial Solubility Screening
**Purpose:** Fast solubility prediction for filtering sequences.
**Method:** Machine learning model trained on E. coli expression data.
**Usage:**
```python
# Install: uv pip install requests
import requests
def predict_solubility_netsolp(sequence):
"""Predict protein solubility using NetSolP web service"""
url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"
data = {
"sequence": sequence,
"format": "fasta"
}
response = requests.post(url, data=data)
return response.json()
# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")
```
**Interpretation:**
- Score > 0.5: Likely soluble
- Score < 0.5: Likely insoluble
- Use for initial filtering before more expensive predictions
**When to use:**
- First-pass filtering of large libraries
- Quick validation of designed sequences
- Prioritizing sequences for experimental testing
### SoluProt - Comprehensive Solubility Prediction
**Purpose:** Advanced solubility prediction with higher accuracy.
**Method:** Deep learning model incorporating sequence and structural features.
**Usage:**
```python
# Install: uv pip install soluprot
from soluprot import predict_solubility
def screen_variants_soluprot(sequences):
"""Screen multiple sequences for solubility"""
results = []
for name, seq in sequences.items():
score = predict_solubility(seq)
results.append({
'name': name,
'sequence': seq,
'solubility_score': score,
'predicted_soluble': score > 0.6
})
return results
# Example
sequences = {
'variant_1': 'MKVLW...',
'variant_2': 'MATGV...'
}
results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]
```
**Interpretation:**
- Score > 0.6: High solubility confidence
- Score 0.4-0.6: Uncertain, may need optimization
- Score < 0.4: Likely problematic
**When to use:**
- After initial NetSolP filtering
- When higher prediction accuracy is needed
- Before committing to expensive synthesis/testing
### SolubleMPNN - Sequence Redesign
**Purpose:** Redesign protein sequences to improve solubility while maintaining function.
**Method:** Graph neural network that suggests mutations to increase solubility.
**Usage:**
```python
# Install: uv pip install soluble-mpnn
from soluble_mpnn import optimize_sequence
def optimize_for_solubility(sequence, structure_pdb=None):
"""
Redesign sequence for improved solubility
Args:
sequence: Original amino acid sequence
structure_pdb: Optional PDB file for structure-aware design
Returns:
Optimized sequence variants ranked by predicted solubility
"""
variants = optimize_sequence(
sequence=sequence,
structure=structure_pdb,
num_variants=10,
temperature=0.1 # Lower = more conservative mutations
)
return variants
# Example
original_seq = "MKVLWAALLGLLGAAA..."
optimized_variants = optimize_for_solubility(original_seq)
for i, variant in enumerate(optimized_variants):
print(f"Variant {i+1}:")
print(f" Sequence: {variant['sequence']}")
print(f" Solubility score: {variant['solubility_score']}")
print(f" Mutations: {variant['mutations']}")
```
**Design strategy:**
- **Conservative** (temperature=0.1): Minimal changes, safer
- **Moderate** (temperature=0.3): Balance between change and safety
- **Aggressive** (temperature=0.5): More mutations, higher risk
**When to use:**
- Primary tool for sequence optimization
- Default starting point for improving problematic sequences
- Generating diverse soluble variants
**Best practices:**
- Generate 10-50 variants per sequence
- Use structure information when available (improves accuracy)
- Validate key functional residues are preserved
- Test multiple temperature settings
### ESM (Evolutionary Scale Modeling) - Sequence Likelihood
**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.
**Method:** Protein language model trained on millions of natural sequences.
**Usage:**
```python
# Install: uv pip install fair-esm
import torch
from esm import pretrained
def score_sequence_esm(sequence):
"""
Calculate ESM likelihood score for sequence
Higher scores indicate more natural/stable sequences
"""
model, alphabet = pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
data = [("protein", sequence)]
_, _, batch_tokens = batch_converter(data)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33])
token_logprobs = results["logits"].log_softmax(dim=-1)
# Score = mean log-probability of the observed residues (higher = more natural/stable)
observed_logprobs = token_logprobs.gather(-1, batch_tokens.unsqueeze(-1)).squeeze(-1)
sequence_score = observed_logprobs.mean().item()
return sequence_score
# Example - Compare variants
sequences = {
'original': 'MKVLW...',
'optimized_1': 'MKVLS...',
'optimized_2': 'MKVLA...'
}
for name, seq in sequences.items():
score = score_sequence_esm(seq)
print(f"{name}: ESM score = {score:.3f}")
```
**Interpretation:**
- Higher scores → More "natural" sequence
- Use to avoid unlikely mutations
- Balance with functional requirements
**When to use:**
- Filtering synthetic designs
- Comparing SolubleMPNN variants
- Ensuring sequences aren't too artificial
- Avoiding expression bottlenecks
**Integration with design:**
```python
import math

def rank_variants_by_esm(variants):
    """Rank protein variants by ESM likelihood"""
    scored = []
    for v in variants:
        esm_score = score_sequence_esm(v['sequence'])
        v['esm_score'] = esm_score
        # Convert the (negative) mean log-probability to a per-residue probability
        # so it can be multiplied with the positive solubility score
        v['esm_prob'] = math.exp(esm_score)
        scored.append(v)
    # Sort by combined solubility and ESM per-residue probability
    scored.sort(
        key=lambda x: x['solubility_score'] * x['esm_prob'],
        reverse=True
    )
    return scored
```
### ipTM - Interface Stability (AlphaFold-Multimer)
**Purpose:** Assess protein-protein interface stability and binding confidence.
**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.
**Usage:**
```python
# Requires AlphaFold-Multimer installation
# Or use ColabFold for easier access
def predict_interface_stability(protein_a_seq, protein_b_seq):
"""
Predict interface stability using AlphaFold-Multimer
Returns ipTM score: higher = more stable interface
"""
# run_alphafold_multimer is a stand-in for whatever AlphaFold-Multimer/ColabFold
# wrapper is available in your environment; adapt the call to its actual API
from colabfold import run_alphafold_multimer
sequences = {
'chainA': protein_a_seq,
'chainB': protein_b_seq
}
result = run_alphafold_multimer(sequences)
return {
'ipTM': result['iptm'],
'pTM': result['ptm'],
'pLDDT': result['plddt']
}
# Example for antibody-antigen binding
antibody_seq = "EVQLVESGGGLVQPGG..."
antigen_seq = "MKVLWAALLGLLGAAA..."
stability = predict_interface_stability(antibody_seq, antigen_seq)
print(f"Interface pTM: {stability['ipTM']:.3f}")
# Interpretation
if stability['ipTM'] > 0.7:
print("High confidence interface")
elif stability['ipTM'] > 0.5:
print("Moderate confidence interface")
else:
print("Low confidence interface - may need redesign")
```
**Interpretation:**
- ipTM > 0.7: Strong predicted interface
- ipTM 0.5-0.7: Moderate interface confidence
- ipTM < 0.5: Weak interface, consider redesign
**When to use:**
- Antibody-antigen design
- Protein-protein interaction engineering
- Validating binding interfaces
- Comparing interface variants
### pSAE - Solvent-Accessible Hydrophobic Residues
**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.
**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.
**Usage:**
```python
# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython
from Bio.PDB import PDBParser, DSSP
import numpy as np
def calculate_psae(pdb_file):
"""
Calculate percent Solvent-Accessible hydrophobic residues (pSAE)
Lower pSAE = better solubility
"""
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', pdb_file)
# Run DSSP to get solvent accessibility
model = structure[0]
dssp = DSSP(model, pdb_file, acc_array='Wilke')
    # DSSP reports one-letter residue codes and relative accessibility (0-1)
    hydrophobic = {'A', 'V', 'I', 'L', 'M', 'F', 'W', 'P'}
    total_acc = 0.0
    hydrophobic_acc = 0.0
    for key in dssp.keys():
        res_name = dssp[key][1]            # one-letter amino acid code
        rel_accessibility = dssp[key][3]   # relative solvent accessibility
        if not isinstance(rel_accessibility, float):
            continue  # skip residues DSSP could not assign
        total_acc += rel_accessibility
        if res_name in hydrophobic:
            hydrophobic_acc += rel_accessibility
    # Relative accessibility is used here as a proxy for the hydrophobic SASA fraction
    psae = (hydrophobic_acc / total_acc) * 100
    return psae
# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")
# Interpretation
if psae_score < 25:
print("Good solubility expected")
elif psae_score < 35:
print("Moderate solubility")
else:
print("High aggregation risk")
```
**Interpretation:**
- pSAE < 25%: Low aggregation risk
- pSAE 25-35%: Moderate risk
- pSAE > 35%: High aggregation risk
**When to use:**
- Analyzing designed structures
- Post-AlphaFold validation
- Identifying aggregation hotspots
- Guiding surface mutations
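To go beyond a single global number, a small sketch (same Biopython/DSSP assumptions as `calculate_psae`, including an installed `mkdssp` executable) that lists exposed hydrophobic positions and can guide surface mutations:
```python
from Bio.PDB import PDBParser, DSSP
def exposed_hydrophobic_residues(pdb_file, threshold=0.5):
    """List hydrophobic residues with high relative solvent accessibility."""
    parser = PDBParser(QUIET=True)
    model = parser.get_structure('protein', pdb_file)[0]
    dssp = DSSP(model, pdb_file, acc_array='Wilke')
    hydrophobic = {'A', 'V', 'I', 'L', 'M', 'F', 'W', 'P'}
    hotspots = []
    for key in dssp.keys():
        aa = dssp[key][1]          # one-letter amino acid code
        rel_acc = dssp[key][3]     # relative accessibility (0-1)
        if isinstance(rel_acc, float) and aa in hydrophobic and rel_acc > threshold:
            chain_id, res_id = key
            hotspots.append((chain_id, res_id[1], aa, round(rel_acc, 2)))
    return hotspots
# Candidate positions for mutation to polar/charged alternatives
print(exposed_hydrophobic_residues("protein_structure.pdb"))
```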
## Recommended Optimization Workflow
### Step 1: Initial Screening (Fast)
```python
def initial_screening(sequences):
"""
Quick first-pass filtering using NetSolP
Filters out obviously problematic sequences
"""
passed = []
for name, seq in sequences.items():
netsolp_score = predict_solubility_netsolp(seq)
if netsolp_score > 0.5:
passed.append((name, seq))
return passed
```
### Step 2: Detailed Assessment (Moderate)
```python
def detailed_assessment(filtered_sequences):
"""
More thorough analysis with SoluProt and ESM
Ranks sequences by multiple criteria
"""
results = []
for name, seq in filtered_sequences:
soluprot_score = predict_solubility(seq)
esm_score = score_sequence_esm(seq)
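        # Note: ESM mean log-probabilities are typically negative; consider
        # rescaling them before mixing with 0-1 solubility probabilities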
combined_score = soluprot_score * 0.7 + esm_score * 0.3
results.append({
'name': name,
'sequence': seq,
'soluprot': soluprot_score,
'esm': esm_score,
'combined': combined_score
})
results.sort(key=lambda x: x['combined'], reverse=True)
return results
```
### Step 3: Sequence Optimization (If needed)
```python
def optimize_problematic_sequences(sequences_needing_optimization):
"""
Use SolubleMPNN to redesign problematic sequences
Returns improved variants
"""
optimized = []
for name, seq in sequences_needing_optimization:
# Generate multiple variants
variants = optimize_sequence(
sequence=seq,
num_variants=10,
temperature=0.2
)
# Score variants with ESM
for variant in variants:
variant['esm_score'] = score_sequence_esm(variant['sequence'])
# Keep best variants
variants.sort(
key=lambda x: x['solubility_score'] * x['esm_score'],
reverse=True
)
optimized.extend(variants[:3]) # Top 3 variants per sequence
return optimized
```
### Step 4: Structure-Based Validation (For critical sequences)
```python
def structure_validation(top_candidates):
"""
Predict structures and calculate pSAE for top candidates
Final validation before experimental testing
"""
validated = []
for candidate in top_candidates:
# Predict structure with AlphaFold
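        # ('predict_structure_alphafold' is a placeholder for any structure
        # prediction step that returns a path to a PDB file)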
structure_pdb = predict_structure_alphafold(candidate['sequence'])
# Calculate pSAE
psae = calculate_psae(structure_pdb)
candidate['psae'] = psae
candidate['pass_structure_check'] = psae < 30
validated.append(candidate)
return validated
```
### Complete Workflow Example
```python
def complete_optimization_pipeline(initial_sequences):
"""
End-to-end optimization pipeline
Input: Dictionary of {name: sequence}
Output: Ranked list of optimized, validated sequences
"""
print("Step 1: Initial screening with NetSolP...")
filtered = initial_screening(initial_sequences)
print(f" Passed: {len(filtered)}/{len(initial_sequences)}")
print("Step 2: Detailed assessment with SoluProt and ESM...")
assessed = detailed_assessment(filtered)
# Split into good and needs-optimization
good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]
print(f" Good sequences: {len(good_sequences)}")
print(f" Need optimization: {len(needs_optimization)}")
if needs_optimization:
print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
        optimized = optimize_problematic_sequences(
            [(s['name'], s['sequence']) for s in needs_optimization]
        )
        # Give optimized variants the fields expected by the final ranking
        for i, v in enumerate(optimized):
            v.setdefault('name', f"optimized_{i}")
            v['combined'] = v['solubility_score'] * v['esm_score']
        all_sequences = good_sequences + optimized
else:
all_sequences = good_sequences
print("Step 4: Structure-based validation for top candidates...")
top_20 = all_sequences[:20]
final_validated = structure_validation(top_20)
# Final ranking
final_validated.sort(
key=lambda x: (
x['pass_structure_check'],
x['combined'],
-x['psae']
),
reverse=True
)
return final_validated
# Usage
initial_library = {
'variant_1': 'MKVLWAALLGLLGAAA...',
'variant_2': 'MATGVLWAALLGLLGA...',
# ... more sequences
}
optimized_library = complete_optimization_pipeline(initial_library)
# Submit top sequences to Adaptyv
top_sequences_for_testing = optimized_library[:50]
```
## Best Practices Summary
1. **Always pre-screen** before experimental testing
2. **Use NetSolP first** for fast filtering of large libraries
3. **Apply SolubleMPNN** as default optimization tool
4. **Validate with ESM** to avoid unnatural sequences
5. **Calculate pSAE** for structure-based validation
6. **Test multiple variants** per design to account for prediction uncertainty
7. **Keep controls** - include wild-type or known-good sequences
8. **Iterate** - use experimental results to refine predictions
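For point 7, a minimal sketch of carrying controls through the pipeline (sequence values are placeholders):
```python
controls = {
    'wild_type': 'MKVLWAALLGLLGAAA...',   # known-good reference sequence
}
design_library = {
    'variant_1': 'MKVLSAALLGLLGAAA...',
    'variant_2': 'MKVLAAALLGLLGAAA...',
}
library_with_controls = {**controls, **design_library}
optimized_library = complete_optimization_pipeline(library_with_controls)
```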
## Integration with Adaptyv
After computational optimization, submit sequences to Adaptyv:
```python
# After optimization pipeline
optimized_sequences = complete_optimization_pipeline(initial_library)
# Prepare FASTA format
fasta_content = ""
for seq_data in optimized_sequences[:50]: # Top 50
fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"
# Submit to Adaptyv
import os
import requests
api_key = os.environ.get("ADAPTYV_API_KEY")  # assumes the key is stored in an environment variable
response = requests.post(
"https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
headers={"Authorization": f"Bearer {api_key}"},
json={
"sequences": fasta_content,
"experiment_type": "expression",
"metadata": {
"optimization_method": "SolubleMPNN_ESM_pipeline",
"computational_scores": [s['combined'] for s in optimized_sequences[:50]]
}
}
)
```
## Troubleshooting
**Issue: All sequences score poorly on solubility predictions**
- Check if sequences contain unusual amino acids
- Verify FASTA format is correct
- Consider whether the protein family is inherently poorly soluble
- May need experimental validation despite predictions
**Issue: SolubleMPNN changes functionally important residues**
- Provide structure file to preserve spatial constraints
- Mask critical residues from mutation
- Lower temperature parameter for conservative changes
- Manually revert problematic mutations
**Issue: ESM scores are low after optimization**
- Optimization may be too aggressive
- Try lower temperature in SolubleMPNN
- Balance between solubility and naturalness
- Accept that some solubility gains may require mutations ESM scores as less natural
**Issue: Predictions don't match experimental results**
- Predictions are probabilistic, not deterministic
- Host system and conditions affect expression
- Some proteins may need experimental validation
- Use predictions as enrichment, not absolute filters

View File

@@ -0,0 +1,368 @@
---
name: aeon
description: This skill should be used for time series machine learning tasks including classification, regression, clustering, forecasting, anomaly detection, segmentation, and similarity search. Use when working with temporal data, sequential patterns, or time-indexed observations requiring specialized algorithms beyond standard ML approaches. Particularly suited for univariate and multivariate time series analysis with scikit-learn compatible APIs.
---
# Aeon Time Series Machine Learning
## Overview
Aeon is a scikit-learn compatible Python toolkit for time series machine learning. It provides state-of-the-art algorithms for classification, regression, clustering, forecasting, anomaly detection, segmentation, and similarity search.
## When to Use This Skill
Apply this skill when:
- Classifying or predicting from time series data
- Detecting anomalies or change points in temporal sequences
- Clustering similar time series patterns
- Forecasting future values
- Finding repeated patterns (motifs) or unusual subsequences (discords)
- Comparing time series with specialized distance metrics
- Extracting features from temporal data
## Installation
```bash
uv pip install aeon
```
## Core Capabilities
### 1. Time Series Classification
Categorize time series into predefined classes. See `references/classification.md` for complete algorithm catalog.
**Quick Start:**
```python
from aeon.classification.convolution_based import RocketClassifier
from aeon.datasets import load_classification
# Load data
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")
# Train classifier
clf = RocketClassifier(n_kernels=10000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```
**Algorithm Selection:**
- **Speed + Performance**: `MiniRocketClassifier`, `Arsenal`
- **Maximum Accuracy**: `HIVECOTEV2`, `InceptionTimeClassifier`
- **Interpretability**: `ShapeletTransformClassifier`, `Catch22Classifier`
- **Small Datasets**: `KNeighborsTimeSeriesClassifier` with DTW distance
### 2. Time Series Regression
Predict continuous values from time series. See `references/regression.md` for algorithms.
**Quick Start:**
```python
from aeon.regression.convolution_based import RocketRegressor
from aeon.datasets import load_regression
X_train, y_train = load_regression("Covid3Month", split="train")
X_test, y_test = load_regression("Covid3Month", split="test")
reg = RocketRegressor()
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
```
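To evaluate the predictions, standard scikit-learn regression metrics apply:
```python
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.4f}")
```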
### 3. Time Series Clustering
Group similar time series without labels. See `references/clustering.md` for methods.
**Quick Start:**
```python
from aeon.clustering import TimeSeriesKMeans
clusterer = TimeSeriesKMeans(
n_clusters=3,
distance="dtw",
averaging_method="ba"
)
labels = clusterer.fit_predict(X_train)
centers = clusterer.cluster_centers_
```
### 4. Forecasting
Predict future time series values. See `references/forecasting.md` for forecasters.
**Quick Start:**
```python
from aeon.forecasting.arima import ARIMA
forecaster = ARIMA(order=(1, 1, 1))
forecaster.fit(y_train)
y_pred = forecaster.predict(fh=[1, 2, 3, 4, 5])
```
### 5. Anomaly Detection
Identify unusual patterns or outliers. See `references/anomaly_detection.md` for detectors.
**Quick Start:**
```python
from aeon.anomaly_detection import STOMP
import numpy as np
detector = STOMP(window_size=50)
anomaly_scores = detector.fit_predict(y)
# Higher scores indicate anomalies
threshold = np.percentile(anomaly_scores, 95)
anomalies = anomaly_scores > threshold
```
### 6. Segmentation
Partition time series into regions with change points. See `references/segmentation.md`.
**Quick Start:**
```python
from aeon.segmentation import ClaSPSegmenter
segmenter = ClaSPSegmenter()
change_points = segmenter.fit_predict(y)
```
### 7. Similarity Search
Find similar patterns within or across time series. See `references/similarity_search.md`.
**Quick Start:**
```python
from aeon.similarity_search import StompMotif
# Find recurring patterns
motif_finder = StompMotif(window_size=50, k=3)
motifs = motif_finder.fit_predict(y)
```
## Feature Extraction and Transformations
Transform time series for feature engineering. See `references/transformations.md`.
**ROCKET Features:**
```python
from aeon.transformations.collection.convolution_based import RocketTransformer
rocket = RocketTransformer()
X_features = rocket.fit_transform(X_train)
# Use features with any sklearn classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_features, y_train)
```
**Statistical Features:**
```python
from aeon.transformations.collection.feature_based import Catch22
catch22 = Catch22()
X_features = catch22.fit_transform(X_train)
```
**Preprocessing:**
```python
from aeon.transformations.collection import MinMaxScaler, Normalizer
scaler = Normalizer() # Z-normalization
X_normalized = scaler.fit_transform(X_train)
```
## Distance Metrics
Specialized temporal distance measures. See `references/distances.md` for complete catalog.
**Usage:**
```python
from aeon.distances import dtw_distance, dtw_pairwise_distance
# Single distance
distance = dtw_distance(x, y, window=0.1)
# Pairwise distances
distance_matrix = dtw_pairwise_distance(X_train)
# Use with classifiers
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
clf = KNeighborsTimeSeriesClassifier(
n_neighbors=5,
distance="dtw",
distance_params={"window": 0.2}
)
```
**Available Distances:**
- **Elastic**: DTW, DDTW, WDTW, ERP, EDR, LCSS, TWE, MSM
- **Lock-step**: Euclidean, Manhattan, Minkowski
- **Shape-based**: Shape DTW, SBD
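A small sketch contrasting a lock-step and an elastic distance on a phase-shifted signal (function names follow aeon's `<name>_distance` convention):
```python
import numpy as np
from aeon.distances import euclidean_distance, dtw_distance
x = np.sin(np.linspace(0, 10, 100))
y = np.sin(np.linspace(0.5, 10.5, 100))  # same shape, shifted in phase
print(f"Euclidean: {euclidean_distance(x, y):.3f}")  # penalizes the shift
print(f"DTW:       {dtw_distance(x, y):.3f}")        # alignment absorbs most of it
```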
## Deep Learning Networks
Neural architectures for time series. See `references/networks.md`.
**Architectures:**
- Convolutional: `FCNClassifier`, `ResNetClassifier`, `InceptionTimeClassifier`
- Recurrent: `RecurrentNetwork`, `TCNNetwork`
- Autoencoders: `AEFCNClusterer`, `AEResNetClusterer`
**Usage:**
```python
from aeon.classification.deep_learning import InceptionTimeClassifier
clf = InceptionTimeClassifier(n_epochs=100, batch_size=32)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
## Datasets and Benchmarking
Load standard benchmarks and evaluate performance. See `references/datasets_benchmarking.md`.
**Load Datasets:**
```python
from aeon.datasets import load_classification, load_regression
# Classification
X_train, y_train = load_classification("ArrowHead", split="train")
# Regression
X_train, y_train = load_regression("Covid3Month", split="train")
```
**Benchmarking:**
```python
from aeon.benchmarking import get_estimator_results
# Compare with published results
published = get_estimator_results("ROCKET", "GunPoint")
```
## Common Workflows
### Classification Pipeline
```python
from aeon.transformations.collection import Normalizer
from aeon.classification.convolution_based import RocketClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('normalize', Normalizer()),
('classify', RocketClassifier())
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
```
### Feature Extraction + Traditional ML
```python
from aeon.transformations.collection import RocketTransformer
from sklearn.ensemble import GradientBoostingClassifier
# Extract features
rocket = RocketTransformer()
X_train_features = rocket.fit_transform(X_train)
X_test_features = rocket.transform(X_test)
# Train traditional ML
clf = GradientBoostingClassifier()
clf.fit(X_train_features, y_train)
predictions = clf.predict(X_test_features)
```
### Anomaly Detection with Visualization
```python
from aeon.anomaly_detection import STOMP
import matplotlib.pyplot as plt
import numpy as np
detector = STOMP(window_size=50)
scores = detector.fit_predict(y)
plt.figure(figsize=(15, 5))
plt.subplot(2, 1, 1)
plt.plot(y, label='Time Series')
plt.subplot(2, 1, 2)
plt.plot(scores, label='Anomaly Scores', color='red')
plt.axhline(np.percentile(scores, 95), color='k', linestyle='--')
plt.show()
```
## Best Practices
### Data Preparation
1. **Normalize**: Most algorithms benefit from z-normalization
```python
from aeon.transformations.collection import Normalizer
normalizer = Normalizer()
X_train = normalizer.fit_transform(X_train)
X_test = normalizer.transform(X_test)
```
2. **Handle Missing Values**: Impute before analysis
```python
from aeon.transformations.collection import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
```
3. **Check Data Format**: Aeon expects shape `(n_samples, n_channels, n_timepoints)`
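For example, a univariate collection stored as a 2D array can be reshaped into the expected 3D layout:
```python
import numpy as np
X_2d = np.random.randn(20, 150)   # (n_samples, n_timepoints)
X_3d = X_2d[:, np.newaxis, :]     # (n_samples, n_channels=1, n_timepoints)
print(X_3d.shape)                 # (20, 1, 150)
```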
### Model Selection
1. **Start Simple**: Begin with ROCKET variants before deep learning
2. **Use Validation**: Split training data for hyperparameter tuning
3. **Compare Baselines**: Test against simple methods (1-NN Euclidean, Naive)
4. **Consider Resources**: ROCKET for speed, deep learning if GPU available
### Algorithm Selection Guide
**For Fast Prototyping:**
- Classification: `MiniRocketClassifier`
- Regression: `MiniRocketRegressor`
- Clustering: `TimeSeriesKMeans` with Euclidean
**For Maximum Accuracy:**
- Classification: `HIVECOTEV2`, `InceptionTimeClassifier`
- Regression: `InceptionTimeRegressor`
- Forecasting: `ARIMA`, `TCNForecaster`
**For Interpretability:**
- Classification: `ShapeletTransformClassifier`, `Catch22Classifier`
- Features: `Catch22`, `TSFresh`
**For Small Datasets:**
- Distance-based: `KNeighborsTimeSeriesClassifier` with DTW
- Avoid: Deep learning (requires large data)
## Reference Documentation
Detailed information available in `references/`:
- `classification.md` - All classification algorithms
- `regression.md` - Regression methods
- `clustering.md` - Clustering algorithms
- `forecasting.md` - Forecasting approaches
- `anomaly_detection.md` - Anomaly detection methods
- `segmentation.md` - Segmentation algorithms
- `similarity_search.md` - Pattern matching and motif discovery
- `transformations.md` - Feature extraction and preprocessing
- `distances.md` - Time series distance metrics
- `networks.md` - Deep learning architectures
- `datasets_benchmarking.md` - Data loading and evaluation tools
## Additional Resources
- Documentation: https://www.aeon-toolkit.org/
- GitHub: https://github.com/aeon-toolkit/aeon
- Examples: https://www.aeon-toolkit.org/en/stable/examples.html
- API Reference: https://www.aeon-toolkit.org/en/stable/api_reference.html

View File

@@ -0,0 +1,154 @@
# Anomaly Detection
Aeon provides anomaly detection methods for identifying unusual patterns in time series at both series and collection levels.
## Collection Anomaly Detectors
Detect anomalous time series within a collection:
- `ClassificationAdapter` - Adapts classifiers for anomaly detection
- Train on normal data, flag outliers during prediction
- **Use when**: Have labeled normal data, want classification-based approach
- `OutlierDetectionAdapter` - Wraps sklearn outlier detectors
- Works with IsolationForest, LOF, OneClassSVM
- **Use when**: Want to use sklearn anomaly detectors on collections
## Series Anomaly Detectors
Detect anomalous points or subsequences within a single time series.
### Distance-Based Methods
Use similarity metrics to identify anomalies:
- `CBLOF` - Cluster-Based Local Outlier Factor
- Clusters data, identifies outliers based on cluster properties
- **Use when**: Anomalies form sparse clusters
- `KMeansAD` - K-means based anomaly detection
- Distance to nearest cluster center indicates anomaly
- **Use when**: Normal patterns cluster well
- `LeftSTAMPi` - Left STAMP incremental
- Matrix profile for online anomaly detection
- **Use when**: Streaming data, need online detection
- `STOMP` - Scalable Time series Ordered-search Matrix Profile
- Computes matrix profile for subsequence anomalies
- **Use when**: Discord discovery, motif detection
- `MERLIN` - Matrix profile-based method
- Efficient matrix profile computation
- **Use when**: Large time series, need scalability
- `LOF` - Local Outlier Factor adapted for time series
- Density-based outlier detection
- **Use when**: Anomalies in low-density regions
- `ROCKAD` - ROCKET-based semi-supervised detection
- Uses ROCKET features for anomaly identification
- **Use when**: Have some labeled data, want feature-based approach
### Distribution-Based Methods
Analyze statistical distributions:
- `COPOD` - Copula-Based Outlier Detection
- Models marginal and joint distributions
- **Use when**: Multi-dimensional time series, complex dependencies
- `DWT_MLEAD` - Discrete Wavelet Transform Multi-Level Anomaly Detection
- Decomposes series into frequency bands
- **Use when**: Anomalies at specific frequencies
### Isolation-Based Methods
Use isolation principles:
- `IsolationForest` - Random forest-based isolation
- Anomalies easier to isolate than normal points
- **Use when**: High-dimensional data, no assumptions about distribution
- `OneClassSVM` - Support vector machine for novelty detection
- Learns boundary around normal data
- **Use when**: Well-defined normal region, need robust boundary
- `STRAY` - Streaming Robust Anomaly Detection
- Robust to data distribution changes
- **Use when**: Streaming data, distribution shifts
### External Library Integration
- `PyODAdapter` - Bridges PyOD library to aeon
- Access 40+ PyOD anomaly detectors
- **Use when**: Need specific PyOD algorithm
## Quick Start
```python
from aeon.anomaly_detection import STOMP
import numpy as np
# Create time series with anomaly
y = np.concatenate([
np.sin(np.linspace(0, 10, 100)),
[5.0], # Anomaly spike
np.sin(np.linspace(10, 20, 100))
])
# Detect anomalies
detector = STOMP(window_size=10)
anomaly_scores = detector.fit_predict(y)
# Higher scores indicate more anomalous points
threshold = np.percentile(anomaly_scores, 95)
anomalies = anomaly_scores > threshold
```
## Point vs Subsequence Anomalies
- **Point anomalies**: Single unusual values
- Use: COPOD, DWT_MLEAD, IsolationForest
- **Subsequence anomalies** (discords): Unusual patterns
- Use: STOMP, LeftSTAMPi, MERLIN
- **Collective anomalies**: Groups of points forming unusual pattern
- Use: Matrix profile methods, clustering-based
## Evaluation Metrics
Specialized metrics for anomaly detection:
```python
from aeon.benchmarking.metrics.anomaly_detection import (
range_precision,
range_recall,
range_f_score,
roc_auc_score
)
# Range-based metrics account for window detection
precision = range_precision(y_true, y_pred, alpha=0.5)
recall = range_recall(y_true, y_pred, alpha=0.5)
f1 = range_f_score(y_true, y_pred, alpha=0.5)
```
## Algorithm Selection
- **Speed priority**: KMeansAD, IsolationForest
- **Accuracy priority**: STOMP, COPOD
- **Streaming data**: LeftSTAMPi, STRAY
- **Discord discovery**: STOMP, MERLIN
- **Multi-dimensional**: COPOD, PyODAdapter
- **Semi-supervised**: ROCKAD, OneClassSVM
- **No training data**: IsolationForest, STOMP
## Best Practices
1. **Normalize data**: Many methods sensitive to scale
2. **Choose window size**: For matrix profile methods, window size critical
3. **Set threshold**: Use percentile-based or domain-specific thresholds
4. **Validate results**: Visualize detections to verify meaningfulness
5. **Handle seasonality**: Detrend/deseasonalize before detection

View File

@@ -0,0 +1,144 @@
# Time Series Classification
Aeon provides 13 categories of time series classifiers with scikit-learn compatible APIs.
## Convolution-Based Classifiers
Apply random convolutional transformations for efficient feature extraction:
- `Arsenal` - Ensemble of ROCKET classifiers with varied kernels
- `HydraClassifier` - Multi-resolution convolution with dilation
- `RocketClassifier` - Random convolution kernels with ridge regression
- `MiniRocketClassifier` - Simplified ROCKET variant for speed
- `MultiRocketClassifier` - Combines multiple ROCKET variants
**Use when**: Need fast, scalable classification with strong performance across diverse datasets.
## Deep Learning Classifiers
Neural network architectures optimized for temporal sequences:
- `FCNClassifier` - Fully convolutional network
- `ResNetClassifier` - Residual networks with skip connections
- `InceptionTimeClassifier` - Multi-scale inception modules
- `TimeCNNClassifier` - Standard CNN for time series
- `MLPClassifier` - Multi-layer perceptron baseline
- `EncoderClassifier` - Generic encoder wrapper
- `DisjointCNNClassifier` - Shapelet-focused architecture
**Use when**: Large datasets available, need end-to-end learning, or complex temporal patterns.
## Dictionary-Based Classifiers
Transform time series into symbolic representations:
- `BOSSEnsemble` - Bag-of-SFA-Symbols with ensemble voting
- `TemporalDictionaryEnsemble` - Multiple dictionary methods combined
- `WEASEL` - Word ExtrAction for time SEries cLassification
- `MrSEQLClassifier` - Multiple symbolic sequence learning
**Use when**: Need interpretable models, sparse patterns, or symbolic reasoning.
## Distance-Based Classifiers
Leverage specialized time series distance metrics:
- `KNeighborsTimeSeriesClassifier` - k-NN with temporal distances (DTW, LCSS, ERP, etc.)
- `ElasticEnsemble` - Combines multiple elastic distance measures
- `ProximityForest` - Tree ensemble using distance-based splits
**Use when**: Small datasets, need similarity-based classification, or interpretable decisions.
## Feature-Based Classifiers
Extract statistical and signature features before classification:
- `Catch22Classifier` - 22 canonical time-series characteristics
- `TSFreshClassifier` - Automated feature extraction via tsfresh
- `SignatureClassifier` - Path signature transformations
- `SummaryClassifier` - Summary statistics extraction
- `FreshPRINCEClassifier` - Combines multiple feature extractors
**Use when**: Need interpretable features, domain expertise available, or feature engineering approach.
## Interval-Based Classifiers
Extract features from random or supervised intervals:
- `CanonicalIntervalForestClassifier` - Random interval features with decision trees
- `DrCIFClassifier` - Diverse Representation CIF with catch22 features
- `TimeSeriesForestClassifier` - Random intervals with summary statistics
- `RandomIntervalClassifier` - Simple interval-based approach
- `RandomIntervalSpectralEnsembleClassifier` - Spectral features from intervals
- `SupervisedTimeSeriesForest` - Supervised interval selection
**Use when**: Discriminative patterns occur in specific time windows.
## Shapelet-Based Classifiers
Identify discriminative subsequences (shapelets):
- `ShapeletTransformClassifier` - Discovers and uses discriminative shapelets
- `LearningShapeletClassifier` - Learns shapelets via gradient descent
- `SASTClassifier` - Scalable approximate shapelet transform
- `RDSTClassifier` - Random dilated shapelet transform
**Use when**: Need interpretable discriminative patterns or phase-invariant features.
## Hybrid Classifiers
Combine multiple classification paradigms:
- `HIVECOTEV1` - Hierarchical Vote Collective of Transformation-based Ensembles (version 1)
- `HIVECOTEV2` - Enhanced version with updated components
**Use when**: Maximum accuracy required, computational resources available.
## Early Classification
Make predictions before observing entire time series:
- `TEASER` - Two-tier Early and Accurate Series Classifier
- `ProbabilityThresholdEarlyClassifier` - Prediction when confidence exceeds threshold
**Use when**: Real-time decisions needed, or observations have cost.
## Ordinal Classification
Handle ordered class labels:
- `OrdinalTDE` - Temporal dictionary ensemble for ordinal outputs
**Use when**: Classes have natural ordering (e.g., severity levels).
## Composition Tools
Build custom pipelines and ensembles:
- `ClassifierPipeline` - Chain transformers with classifiers
- `WeightedEnsembleClassifier` - Weighted combination of classifiers
- `SklearnClassifierWrapper` - Adapt sklearn classifiers for time series
## Quick Start
```python
from aeon.classification.convolution_based import RocketClassifier
from aeon.datasets import load_classification
# Load data
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")
# Train and predict
clf = RocketClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```
## Algorithm Selection
- **Speed priority**: MiniRocketClassifier, Arsenal
- **Accuracy priority**: HIVECOTEV2, InceptionTimeClassifier
- **Interpretability**: ShapeletTransformClassifier, Catch22Classifier
- **Small data**: KNeighborsTimeSeriesClassifier, Distance-based methods
- **Large data**: Deep learning classifiers, ROCKET variants

View File

@@ -0,0 +1,123 @@
# Time Series Clustering
Aeon provides clustering algorithms adapted for temporal data with specialized distance metrics and averaging methods.
## Partitioning Algorithms
Standard k-means/k-medoids adapted for time series:
- `TimeSeriesKMeans` - K-means with temporal distance metrics (DTW, Euclidean, etc.)
- `TimeSeriesKMedoids` - Uses actual time series as cluster centers
- `TimeSeriesKShape` - Shape-based clustering algorithm
- `TimeSeriesKernelKMeans` - Kernel-based variant for nonlinear patterns
**Use when**: Known number of clusters, spherical cluster shapes expected.
## Large Dataset Methods
Efficient clustering for large collections:
- `TimeSeriesCLARA` - Clustering Large Applications with sampling
- `TimeSeriesCLARANS` - Randomized search variant of CLARA
**Use when**: Dataset too large for standard k-medoids, need scalability.
## Elastic Distance Clustering
Specialized for alignment-based similarity:
- `KASBA` - K-means with shift-invariant elastic averaging
- `ElasticSOM` - Self-organizing map using elastic distances
**Use when**: Time series have temporal shifts or warping.
## Spectral Methods
Graph-based clustering:
- `KSpectralCentroid` - Spectral clustering with centroid computation
**Use when**: Non-convex cluster shapes, need graph-based approach.
## Deep Learning Clustering
Neural network-based clustering with auto-encoders:
- `AEFCNClusterer` - Fully convolutional auto-encoder
- `AEResNetClusterer` - Residual network auto-encoder
- `AEDCNNClusterer` - Dilated CNN auto-encoder
- `AEDRNNClusterer` - Dilated RNN auto-encoder
- `AEBiGRUClusterer` - Bidirectional GRU auto-encoder
- `AEAttentionBiGRUClusterer` - Attention-enhanced BiGRU auto-encoder
**Use when**: Large datasets, need learned representations, or complex patterns.
## Feature-Based Clustering
Transform to feature space before clustering:
- `Catch22Clusterer` - Clusters on 22 canonical features
- `SummaryClusterer` - Uses summary statistics
- `TSFreshClusterer` - Automated tsfresh features
**Use when**: Raw time series not informative, need interpretable features.
## Composition
Build custom clustering pipelines:
- `ClustererPipeline` - Chain transformers with clusterers
## Averaging Methods
Compute cluster centers for time series:
- `mean_average` - Arithmetic mean
- `ba_average` - Barycentric averaging with DTW
- `kasba_average` - Shift-invariant averaging
- `shift_invariant_average` - General shift-invariant method
**Use when**: Need representative cluster centers for visualization or initialization.
## Quick Start
```python
from aeon.clustering import TimeSeriesKMeans
from aeon.datasets import load_classification
# Load data (using classification data for clustering)
X_train, _ = load_classification("GunPoint", split="train")
# Cluster time series
clusterer = TimeSeriesKMeans(
n_clusters=3,
distance="dtw", # Use DTW distance
averaging_method="ba" # Barycentric averaging
)
labels = clusterer.fit_predict(X_train)
centers = clusterer.cluster_centers_
```
## Algorithm Selection
- **Speed priority**: TimeSeriesKMeans with Euclidean distance
- **Temporal alignment**: KASBA, TimeSeriesKMeans with DTW
- **Large datasets**: TimeSeriesCLARA, TimeSeriesCLARANS
- **Complex patterns**: Deep learning clusterers
- **Interpretability**: Catch22Clusterer, SummaryClusterer
- **Non-convex clusters**: KSpectralCentroid
## Distance Metrics
Compatible distance metrics include:
- Euclidean, Manhattan, Minkowski (lock-step)
- DTW, DDTW, WDTW (elastic with alignment)
- ERP, EDR, LCSS (edit-based)
- MSM, TWE (specialized elastic)
## Evaluation
Use clustering metrics from sklearn or aeon benchmarking:
- Silhouette score
- Davies-Bouldin index
- Calinski-Harabasz index
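For example, a minimal sketch scoring a DTW-based clustering with sklearn's silhouette score on a precomputed distance matrix (reusing `X_train` and `labels` from the quick start above):
```python
from sklearn.metrics import silhouette_score
from aeon.distances import dtw_pairwise_distance
distance_matrix = dtw_pairwise_distance(X_train)
score = silhouette_score(distance_matrix, labels, metric="precomputed")
print(f"Silhouette: {score:.3f}")
```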

View File

@@ -0,0 +1,387 @@
# Datasets and Benchmarking
Aeon provides comprehensive tools for loading datasets and benchmarking time series algorithms.
## Dataset Loading
### Task-Specific Loaders
**Classification Datasets**:
```python
from aeon.datasets import load_classification
# Load train/test split
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")
# Load entire dataset
X, y = load_classification("GunPoint")
```
**Regression Datasets**:
```python
from aeon.datasets import load_regression
X_train, y_train = load_regression("Covid3Month", split="train")
X_test, y_test = load_regression("Covid3Month", split="test")
# Bulk download
from aeon.datasets import download_all_regression
download_all_regression() # Downloads Monash TSER archive
```
**Forecasting Datasets**:
```python
from aeon.datasets import load_forecasting
# Load from forecastingdata.org
y, X = load_forecasting("airline", return_X_y=True)
```
**Anomaly Detection Datasets**:
```python
from aeon.datasets import load_anomaly_detection
X, y = load_anomaly_detection("NAB_realKnownCause")
```
### File Format Loaders
**Load from .ts files**:
```python
from aeon.datasets import load_from_ts_file
X, y = load_from_ts_file("path/to/data.ts")
```
**Load from .tsf files**:
```python
from aeon.datasets import load_from_tsf_file
df, metadata = load_from_tsf_file("path/to/data.tsf")
```
**Load from ARFF files**:
```python
from aeon.datasets import load_from_arff_file
X, y = load_from_arff_file("path/to/data.arff")
```
**Load from TSV files**:
```python
from aeon.datasets import load_from_tsv_file
data = load_from_tsv_file("path/to/data.tsv")
```
**Load TimeEval CSV**:
```python
from aeon.datasets import load_from_timeeval_csv_file
X, y = load_from_timeeval_csv_file("path/to/timeeval.csv")
```
### Writing Datasets
**Write to .ts format**:
```python
from aeon.datasets import write_to_ts_file
write_to_ts_file(X, "output.ts", y=y, problem_name="MyDataset")
```
**Write to ARFF format**:
```python
from aeon.datasets import write_to_arff_file
write_to_arff_file(X, "output.arff", y=y)
```
## Built-in Datasets
Aeon includes several benchmark datasets for quick testing:
### Classification
- `ArrowHead` - Shape classification
- `GunPoint` - Gesture recognition
- `ItalyPowerDemand` - Energy demand
- `BasicMotions` - Motion classification
- And 100+ more from UCR/UEA archives
### Regression
- `Covid3Month` - COVID forecasting
- Various datasets from Monash TSER archive
### Segmentation
- Time series segmentation datasets
- Human activity data
- Sensor data collections
### Special Collections
- `RehabPile` - Rehabilitation data (classification & regression)
## Dataset Metadata
Get information about datasets:
```python
from aeon.datasets import get_dataset_meta_data
metadata = get_dataset_meta_data("GunPoint")
print(metadata)
# {'n_train': 50, 'n_test': 150, 'length': 150, 'n_classes': 2, ...}
```
## Benchmarking Tools
### Loading Published Results
Access pre-computed benchmark results:
```python
from aeon.benchmarking import get_estimator_results
# Get results for specific algorithm on dataset
results = get_estimator_results(
estimator_name="ROCKET",
dataset_name="GunPoint"
)
# List estimators with published results for a learning task
from aeon.benchmarking import get_available_estimators
estimators = get_available_estimators("classification")
```
### Resampling Strategies
Create reproducible train/test splits:
```python
from aeon.benchmarking import stratified_resample
# Stratified resampling maintaining class distribution
X_train, y_train, X_test, y_test = stratified_resample(
    X_train, y_train, X_test, y_test, random_state=42
)
```
### Performance Metrics
Specialized metrics for time series tasks:
**Anomaly Detection Metrics**:
```python
from aeon.benchmarking.metrics.anomaly_detection import (
range_precision,
range_recall,
range_f_score,
range_roc_auc_score
)
# Range-based metrics for window detection
precision = range_precision(y_true, y_pred, alpha=0.5)
recall = range_recall(y_true, y_pred, alpha=0.5)
f1 = range_f_score(y_true, y_pred, alpha=0.5)
auc = range_roc_auc_score(y_true, y_scores)
```
**Clustering Metrics**:
```python
from aeon.benchmarking.metrics.clustering import clustering_accuracy
# Clustering accuracy with label matching
accuracy = clustering_accuracy(y_true, y_pred)
```
**Segmentation Metrics**:
```python
from aeon.benchmarking.metrics.segmentation import (
count_error,
hausdorff_error
)
# Number of change points difference
count_err = count_error(y_true, y_pred)
# Maximum distance between predicted and true change points
hausdorff_err = hausdorff_error(y_true, y_pred)
```
### Statistical Testing
Post-hoc analysis for algorithm comparison:
```python
from aeon.benchmarking import (
nemenyi_test,
wilcoxon_test
)
# Nemenyi test for multiple algorithms
results = nemenyi_test(scores_matrix, alpha=0.05)
# Pairwise Wilcoxon signed-rank test
stat, p_value = wilcoxon_test(scores_alg1, scores_alg2)
```
## Benchmark Collections
### UCR/UEA Time Series Archives
Access to comprehensive benchmark repositories:
```python
# Classification: 112 univariate + 30 multivariate datasets
X_train, y_train = load_classification("Chinatown", split="train")
# Automatically downloads from timeseriesclassification.com
```
### Monash Forecasting Archive
```python
# Load forecasting datasets
y = load_forecasting("nn5_daily", return_X_y=False)
```
### Published Benchmark Results
Pre-computed results from major competitions:
- 2017 Univariate Bake-off
- 2021 Multivariate Classification
- 2023 Univariate Bake-off
## Workflow Example
Complete benchmarking workflow:
```python
from aeon.datasets import load_classification
from aeon.classification.convolution_based import RocketClassifier
from aeon.benchmarking import get_estimator_results
from sklearn.metrics import accuracy_score
import numpy as np
# Load dataset
dataset_name = "GunPoint"
X_train, y_train = load_classification(dataset_name, split="train")
X_test, y_test = load_classification(dataset_name, split="test")
# Train model
clf = RocketClassifier(n_kernels=10000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Compare with published results
published = get_estimator_results("ROCKET", dataset_name)
print(f"Published ROCKET accuracy: {published['accuracy']:.4f}")
```
## Best Practices
### 1. Use Standard Splits
For reproducibility, use provided train/test splits:
```python
# Good: Use standard splits
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")
# Avoid: Creating custom splits
X, y = load_classification("GunPoint")
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
### 2. Set Random Seeds
Ensure reproducibility:
```python
clf = RocketClassifier(random_state=42)
splits = stratified_resample(X_train, y_train, X_test, y_test, random_state=42)
```
### 3. Report Multiple Metrics
Don't rely on single metric:
```python
from sklearn.metrics import accuracy_score, f1_score, precision_score
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
```
### 4. Cross-Validation
For robust evaluation on small datasets:
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
clf, X_train, y_train,
cv=5,
scoring='accuracy'
)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
### 5. Compare Against Baselines
Always compare with simple baselines:
```python
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
# Simple baseline: 1-NN with Euclidean distance
baseline = KNeighborsTimeSeriesClassifier(n_neighbors=1, distance="euclidean")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print(f"Baseline: {baseline_acc:.4f}")
print(f"Your model: {accuracy:.4f}")
```
### 6. Statistical Significance
Test if improvements are statistically significant:
```python
from aeon.benchmarking import wilcoxon_test
# Run on multiple datasets
accuracies_alg1 = [0.85, 0.92, 0.78, 0.88]
accuracies_alg2 = [0.83, 0.90, 0.76, 0.86]
stat, p_value = wilcoxon_test(accuracies_alg1, accuracies_alg2)
if p_value < 0.05:
print("Difference is statistically significant")
```
## Dataset Discovery
Find datasets matching criteria:
```python
# List all available classification datasets
from aeon.datasets import get_available_datasets
datasets = get_available_datasets("classification")
print(f"Found {len(datasets)} classification datasets")
# Filter by properties
univariate_datasets = [
d for d in datasets
if get_dataset_meta_data(d)['n_channels'] == 1
]
```

View File

@@ -0,0 +1,256 @@
# Distance Metrics
Aeon provides specialized distance functions for measuring similarity between time series, compatible with both aeon and scikit-learn estimators.
## Distance Categories
### Elastic Distances
Allow flexible temporal alignment between series:
**Dynamic Time Warping Family:**
- `dtw` - Classic Dynamic Time Warping
- `ddtw` - Derivative DTW (compares derivatives)
- `wdtw` - Weighted DTW (penalizes warping by location)
- `wddtw` - Weighted Derivative DTW
- `shape_dtw` - Shape-based DTW
**Edit-Based:**
- `erp` - Edit distance with Real Penalty
- `edr` - Edit Distance on Real sequences
- `lcss` - Longest Common SubSequence
- `twe` - Time Warp Edit distance
**Specialized:**
- `msm` - Move-Split-Merge distance
- `adtw` - Amerced DTW
- `sbd` - Shape-Based Distance
**Use when**: Time series may have temporal shifts, speed variations, or phase differences.
### Lock-Step Distances
Compare time series point-by-point without alignment:
- `euclidean` - Euclidean distance (L2 norm)
- `manhattan` - Manhattan distance (L1 norm)
- `minkowski` - Generalized Minkowski distance (Lp norm)
- `squared` - Squared Euclidean distance
**Use when**: Series already aligned, need computational speed, or no temporal warping expected.
## Usage Patterns
### Computing Single Distance
```python
from aeon.distances import dtw_distance
# Distance between two time series
distance = dtw_distance(x, y)
# With window constraint (Sakoe-Chiba band)
distance = dtw_distance(x, y, window=0.1)
```
### Pairwise Distance Matrix
```python
from aeon.distances import dtw_pairwise_distance
# All pairwise distances in collection
X = [series1, series2, series3, series4]
distance_matrix = dtw_pairwise_distance(X)
# Cross-collection distances
distance_matrix = dtw_pairwise_distance(X_train, X_test)
```
### Cost Matrix and Alignment Path
```python
from aeon.distances import dtw_cost_matrix, dtw_alignment_path
# Get full cost matrix
cost_matrix = dtw_cost_matrix(x, y)
# Get optimal alignment path
path = dtw_alignment_path(x, y)
# Returns indices: [(0,0), (1,1), (2,1), (2,2), ...]
```
### Using with Estimators
```python
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
# Use DTW distance in classifier
clf = KNeighborsTimeSeriesClassifier(
n_neighbors=5,
distance="dtw",
distance_params={"window": 0.2}
)
clf.fit(X_train, y_train)
```
## Distance Parameters
### Window Constraints
Limit warping path deviation (improves speed and prevents pathological warping):
```python
# Sakoe-Chiba band: window as fraction of series length
dtw_distance(x, y, window=0.1) # Allow 10% deviation
# Itakura parallelogram: slopes constrain path
dtw_distance(x, y, itakura_max_slope=2.0)
```
### Normalization
Control whether to z-normalize series before distance computation:
```python
# Most elastic distances support normalization
distance = dtw_distance(x, y, normalize=True)
```
### Distance-Specific Parameters
```python
# ERP: penalty for gaps
distance = erp_distance(x, y, g=0.5)
# TWE: stiffness and penalty parameters
distance = twe_distance(x, y, nu=0.001, lmbda=1.0)
# LCSS: epsilon threshold for matching
distance = lcss_distance(x, y, epsilon=0.5)
```
## Algorithm Selection
### By Use Case:
**Temporal misalignment**: DTW, DDTW, WDTW
**Speed variations**: DTW with window constraint
**Shape similarity**: Shape DTW, SBD
**Edit operations**: ERP, EDR, LCSS
**Derivative matching**: DDTW
**Computational speed**: Euclidean, Manhattan
**Outlier robustness**: Manhattan, LCSS
### By Computational Cost:
**Fastest**: Euclidean (O(n))
**Fast**: Constrained DTW (O(nw) where w is window)
**Medium**: Full DTW (O(n²))
**Slower**: Complex elastic distances (ERP, TWE, MSM)
## Quick Reference Table
| Distance | Alignment | Speed | Robustness | Interpretability |
|----------|-----------|-------|------------|------------------|
| Euclidean | Lock-step | Very Fast | Low | High |
| DTW | Elastic | Medium | Medium | Medium |
| DDTW | Elastic | Medium | High | Medium |
| WDTW | Elastic | Medium | Medium | Medium |
| ERP | Edit-based | Slow | High | Low |
| LCSS | Edit-based | Slow | Very High | Low |
| Shape DTW | Elastic | Medium | Medium | High |
## Best Practices
### 1. Normalization
Most distances sensitive to scale; normalize when appropriate:
```python
from aeon.transformations.collection import Normalizer
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)
```
### 2. Window Constraints
For DTW variants, use window constraints for speed and better generalization:
```python
# Start with 10-20% window
distance = dtw_distance(x, y, window=0.1)
```
### 3. Series Length
- Equal-length required: Most lock-step distances
- Unequal-length supported: Elastic distances (DTW, ERP, etc.)
### 4. Multivariate Series
Most distances support multivariate time series:
```python
# x.shape = (n_channels, n_timepoints)
distance = dtw_distance(x_multivariate, y_multivariate)
```
### 5. Performance Optimization
- Use numba-compiled implementations (default in aeon)
- Consider lock-step distances if alignment not needed
- Use windowed DTW instead of full DTW
- Precompute distance matrices for repeated use
### 6. Choosing the Right Distance
```python
# Quick decision tree:
if series_aligned:
use_distance = "euclidean"
elif need_speed:
use_distance = "dtw" # with window constraint
elif temporal_shifts_expected:
use_distance = "dtw" or "shape_dtw"
elif outliers_present:
use_distance = "lcss" or "manhattan"
elif derivatives_matter:
use_distance = "ddtw" or "wddtw"
```
## Integration with scikit-learn
Aeon distances work with sklearn estimators:
```python
from sklearn.neighbors import KNeighborsClassifier
from aeon.distances import dtw_pairwise_distance
# Precompute distance matrix
X_train_distances = dtw_pairwise_distance(X_train)
# Use with sklearn
clf = KNeighborsClassifier(metric='precomputed')
clf.fit(X_train_distances, y_train)
```
## Available Distance Functions
Get list of all available distances:
```python
from aeon.distances import get_distance_function_names
print(get_distance_function_names())
# ['dtw', 'ddtw', 'wdtw', 'euclidean', 'erp', 'edr', ...]
```
Retrieve specific distance function:
```python
from aeon.distances import get_distance_function
distance_func = get_distance_function("dtw")
result = distance_func(x, y, window=0.1)
```

View File

@@ -0,0 +1,140 @@
# Time Series Forecasting
Aeon provides forecasting algorithms for predicting future time series values.
## Naive and Baseline Methods
Simple forecasting strategies for comparison:
- `NaiveForecaster` - Multiple strategies: last value, mean, seasonal naive
- Parameters: `strategy` ("last", "mean", "seasonal"), `sp` (seasonal period)
- **Use when**: Establishing baselines or simple patterns
## Statistical Models
Classical time series forecasting methods:
### ARIMA
- `ARIMA` - AutoRegressive Integrated Moving Average
- Parameters: `p` (AR order), `d` (differencing), `q` (MA order)
- **Use when**: Linear patterns, stationary or difference-stationary series
### Exponential Smoothing
- `ETS` - Error-Trend-Seasonal decomposition
- Parameters: `error`, `trend`, `seasonal` types
- **Use when**: Trend and seasonal patterns present
### Threshold Autoregressive
- `TAR` - Threshold Autoregressive model for regime switching
- `AutoTAR` - Automated threshold discovery
- **Use when**: Series exhibits different behaviors in different regimes
### Theta Method
- `Theta` - Classical Theta forecasting
- Parameters: `theta`, `weights` for decomposition
- **Use when**: Simple but effective baseline needed
### Time-Varying Parameter
- `TVP` - Time-varying parameter model with Kalman filtering
- **Use when**: Parameters change over time
## Deep Learning Forecasters
Neural networks for complex temporal patterns:
- `TCNForecaster` - Temporal Convolutional Network
- Dilated convolutions for large receptive fields
- **Use when**: Long sequences, need non-recurrent architecture
- `DeepARNetwork` - Probabilistic forecasting with RNNs
- Provides prediction intervals
- **Use when**: Need probabilistic forecasts, uncertainty quantification
## Regression-Based Forecasting
Apply regression to lagged features:
- `RegressionForecaster` - Wraps regressors for forecasting
- Parameters: `window_length`, `horizon`
- **Use when**: Want to use any regressor as forecaster
## Quick Start
```python
from aeon.forecasting.naive import NaiveForecaster
from aeon.forecasting.arima import ARIMA
import numpy as np
# Create time series
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Naive baseline
naive = NaiveForecaster(strategy="last")
naive.fit(y)
forecast_naive = naive.predict(fh=[1, 2, 3])
# ARIMA model
arima = ARIMA(order=(1, 1, 1))
arima.fit(y)
forecast_arima = arima.predict(fh=[1, 2, 3])
```
## Forecasting Horizon
The forecasting horizon (`fh`) specifies which future time points to predict:
```python
# Relative horizon (next 3 steps)
fh = [1, 2, 3]
# Absolute horizon (specific time indices)
from aeon.forecasting.base import ForecastingHorizon
fh = ForecastingHorizon([11, 12, 13], is_relative=False)
```
## Model Selection
- **Baseline**: NaiveForecaster with seasonal strategy
- **Linear patterns**: ARIMA
- **Trend + seasonality**: ETS
- **Regime changes**: TAR, AutoTAR
- **Complex patterns**: TCNForecaster
- **Probabilistic**: DeepARNetwork
- **Long sequences**: TCNForecaster
- **Short sequences**: ARIMA, ETS
## Evaluation Metrics
Use standard forecasting metrics:
```python
from aeon.performance_metrics.forecasting import (
mean_absolute_error,
mean_squared_error,
mean_absolute_percentage_error
)
# Calculate error
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
```
## Exogenous Variables
Many forecasters support exogenous features:
```python
# Train with exogenous variables
forecaster.fit(y, X=X_train)
# Predict requires future exogenous values
y_pred = forecaster.predict(fh=[1, 2, 3], X=X_test)
```
## Base Classes
- `BaseForecaster` - Abstract base for all forecasters
- `BaseDeepForecaster` - Base for deep learning forecasters
Extend these to implement custom forecasting algorithms.

View File

@@ -0,0 +1,289 @@
# Deep Learning Networks
Aeon provides neural network architectures specifically designed for time series tasks. These networks serve as building blocks for classification, regression, clustering, and forecasting.
## Core Network Architectures
### Convolutional Networks
**FCNNetwork** - Fully Convolutional Network
- Three convolutional blocks with batch normalization
- Global average pooling for dimensionality reduction
- **Use when**: Need simple yet effective CNN baseline
**ResNetNetwork** - Residual Network
- Residual blocks with skip connections
- Prevents vanishing gradients in deep networks
- **Use when**: Deep networks needed, training stability important
**InceptionNetwork** - Inception Modules
- Multi-scale feature extraction with parallel convolutions
- Different kernel sizes capture patterns at various scales
- **Use when**: Patterns exist at multiple temporal scales
**TimeCNNNetwork** - Standard CNN
- Basic convolutional architecture
- **Use when**: Simple CNN sufficient, interpretability valued
**DisjointCNNNetwork** - Separate Pathways
- Disjoint convolutional pathways
- **Use when**: Different feature extraction strategies needed
**DCNNNetwork** - Dilated CNN
- Dilated convolutions for large receptive fields
- **Use when**: Long-range dependencies without many layers
### Recurrent Networks
**RecurrentNetwork** - RNN/LSTM/GRU
- Configurable cell type (RNN, LSTM, GRU)
- Sequential modeling of temporal dependencies
- **Use when**: Sequential dependencies critical, variable-length series
### Temporal Convolutional Network
**TCNNetwork** - Temporal Convolutional Network
- Dilated causal convolutions
- Large receptive field without recurrence
- **Use when**: Long sequences, need parallelizable architecture
### Multi-Layer Perceptron
**MLPNetwork** - Basic Feedforward
- Simple fully-connected layers
- Flattens time series before processing
- **Use when**: Baseline needed, computational limits, or simple patterns
## Encoder-Based Architectures
Networks designed for representation learning and clustering.
### Autoencoder Variants
**EncoderNetwork** - Generic Encoder
- Flexible encoder structure
- **Use when**: Custom encoding needed
**AEFCNNetwork** - FCN-based Autoencoder
- Fully convolutional encoder-decoder
- **Use when**: Need convolutional representation learning
**AEResNetNetwork** - ResNet Autoencoder
- Residual blocks in encoder-decoder
- **Use when**: Deep autoencoding with skip connections
**AEDCNNNetwork** - Dilated CNN Autoencoder
- Dilated convolutions for compression
- **Use when**: Need large receptive field in autoencoder
**AEDRNNNetwork** - Dilated RNN Autoencoder
- Dilated recurrent connections
- **Use when**: Sequential patterns with long-range dependencies
**AEBiGRUNetwork** - Bidirectional GRU
- Bidirectional recurrent encoding
- **Use when**: Context from both directions helpful
**AEAttentionBiGRUNetwork** - Attention + BiGRU
- Attention mechanism on BiGRU outputs
- **Use when**: Need to focus on important time steps
## Specialized Architectures
**LITENetwork** - Lightweight Inception Time Ensemble
- Efficient inception-based architecture
- LITEMV variant for multivariate series
- **Use when**: Need efficiency with strong performance
**DeepARNetwork** - Probabilistic Forecasting
- Autoregressive RNN for forecasting
- Produces probabilistic predictions
- **Use when**: Need forecast uncertainty quantification
## Usage with Estimators
Networks are typically used within estimators, not directly:
```python
from aeon.classification.deep_learning import FCNClassifier
from aeon.regression.deep_learning import ResNetRegressor
from aeon.clustering.deep_learning import AEFCNClusterer
# Classification with FCN
clf = FCNClassifier(n_epochs=100, batch_size=16)
clf.fit(X_train, y_train)
# Regression with ResNet
reg = ResNetRegressor(n_epochs=100)
reg.fit(X_train, y_train)
# Clustering with autoencoder
clusterer = AEFCNClusterer(n_clusters=3, n_epochs=100)
labels = clusterer.fit_predict(X_train)
```
## Custom Network Configuration
Many networks accept configuration parameters:
```python
# Configure FCN layers
clf = FCNClassifier(
n_epochs=200,
batch_size=32,
kernel_size=[7, 5, 3], # Kernel sizes for each layer
n_filters=[128, 256, 128], # Filters per layer
learning_rate=0.001
)
```
## Base Classes
- `BaseDeepLearningNetwork` - Abstract base for all networks
- `BaseDeepRegressor` - Base for deep regression
- `BaseDeepClassifier` - Base for deep classification
- `BaseDeepForecaster` - Base for deep forecasting
Extend these to implement custom architectures.
## Training Considerations
### Hyperparameters
Key hyperparameters to tune:
- `n_epochs` - Training iterations (50-200 typical)
- `batch_size` - Samples per batch (16-64 typical)
- `learning_rate` - Step size (0.0001-0.01)
- Network-specific: layers, filters, kernel sizes
### Callbacks
Many networks support callbacks for training monitoring:
```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
clf = FCNClassifier(
n_epochs=200,
callbacks=[
EarlyStopping(patience=20, restore_best_weights=True),
ReduceLROnPlateau(patience=10, factor=0.5)
]
)
```
### GPU Acceleration
Deep learning networks benefit from GPU:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' # Use first GPU
# Networks automatically use GPU if available
clf = InceptionTimeClassifier(n_epochs=100)
clf.fit(X_train, y_train)
```
## Architecture Selection
### By Task:
**Classification**: InceptionNetwork, ResNetNetwork, FCNNetwork
**Regression**: InceptionNetwork, ResNetNetwork, TCNNetwork
**Forecasting**: TCNNetwork, DeepARNetwork, RecurrentNetwork
**Clustering**: AEFCNNetwork, AEResNetNetwork, AEAttentionBiGRUNetwork
### By Data Characteristics:
**Long sequences**: TCNNetwork, DCNNNetwork (dilated convolutions)
**Short sequences**: MLPNetwork, FCNNetwork
**Multivariate**: InceptionNetwork, FCNNetwork, LITENetwork
**Variable length**: RecurrentNetwork with masking
**Multi-scale patterns**: InceptionNetwork
### By Computational Resources:
**Limited compute**: MLPNetwork, LITENetwork
**Moderate compute**: FCNNetwork, TimeCNNNetwork
**High compute available**: InceptionNetwork, ResNetNetwork
**GPU available**: Any deep network (major speedup)
## Best Practices
### 1. Data Preparation
Normalize input data:
```python
from aeon.transformations.collection import Normalizer
normalizer = Normalizer()
X_train_norm = normalizer.fit_transform(X_train)
X_test_norm = normalizer.transform(X_test)
```
### 2. Training/Validation Split
Use validation set for early stopping:
```python
from sklearn.model_selection import train_test_split
X_train_fit, X_val, y_train_fit, y_val = train_test_split(
X_train, y_train, test_size=0.2, stratify=y_train
)
clf = FCNClassifier(n_epochs=200)
clf.fit(X_train_fit, y_train_fit, validation_data=(X_val, y_val))
```
### 3. Start Simple
Begin with simpler architectures before complex ones:
1. Try MLPNetwork or FCNNetwork first
2. If insufficient, try ResNetNetwork or InceptionNetwork
3. Consider ensembles if a single model is insufficient
### 4. Hyperparameter Tuning
Use grid search or random search:
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_epochs': [100, 200],
'batch_size': [16, 32],
'learning_rate': [0.001, 0.0001]
}
clf = FCNClassifier()
grid = GridSearchCV(clf, param_grid, cv=3)
grid.fit(X_train, y_train)
```
### 5. Regularization
Prevent overfitting:
- Use dropout (if network supports)
- Early stopping
- Data augmentation (if available)
- Reduce model complexity
### 6. Reproducibility
Set random seeds:
```python
import numpy as np
import random
import tensorflow as tf
seed = 42
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)
```

View File

@@ -0,0 +1,118 @@
# Time Series Regression
Aeon provides time series regressors across 9 categories for predicting continuous values from temporal sequences.
## Convolution-Based Regressors
Apply convolutional kernels for feature extraction:
- `HydraRegressor` - Multi-resolution dilated convolutions
- `RocketRegressor` - Random convolutional kernels
- `MiniRocketRegressor` - Simplified ROCKET for speed
- `MultiRocketRegressor` - Combined ROCKET variants
- `MultiRocketHydraRegressor` - Merges ROCKET and Hydra approaches
**Use when**: Need fast regression with strong baseline performance.
## Deep Learning Regressors
Neural architectures for end-to-end temporal regression:
- `FCNRegressor` - Fully convolutional network
- `ResNetRegressor` - Residual blocks with skip connections
- `InceptionTimeRegressor` - Multi-scale inception modules
- `TimeCNNRegressor` - Standard CNN architecture
- `RecurrentRegressor` - RNN/LSTM/GRU variants
- `MLPRegressor` - Multi-layer perceptron
- `EncoderRegressor` - Generic encoder wrapper
- `LITERegressor` - Lightweight inception time ensemble
- `DisjointCNNRegressor` - Specialized CNN architecture
**Use when**: Large datasets, complex patterns, or need feature learning.
## Distance-Based Regressors
k-nearest neighbors with temporal distance metrics:
- `KNeighborsTimeSeriesRegressor` - k-NN with DTW, LCSS, ERP, or other distances
**Use when**: Small datasets, local similarity patterns, or interpretable predictions.
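A minimal sketch of distance-based regression with DTW; the module path and the `distance="dtw"` parameter follow aeon's usual naming but should be checked against the `KNeighborsTimeSeriesRegressor` docs.
```python
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor

# k-NN regression under Dynamic Time Warping (parameter names assumed)
knn = KNeighborsTimeSeriesRegressor(n_neighbors=5, distance="dtw")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```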
## Feature-Based Regressors
Extract statistical features before regression:
- `Catch22Regressor` - 22 canonical time-series characteristics
- `FreshPRINCERegressor` - Pipeline combining multiple feature extractors
- `SummaryRegressor` - Summary statistics features
- `TSFreshRegressor` - Automated tsfresh feature extraction
**Use when**: Need interpretable features or domain-specific feature engineering.
## Hybrid Regressors
Combine multiple approaches:
- `RISTRegressor` - Randomized Interval-Shapelet Transformation
**Use when**: Benefit from combining interval and shapelet methods.
## Interval-Based Regressors
Extract features from time intervals:
- `CanonicalIntervalForestRegressor` - Random intervals with decision trees
- `DrCIFRegressor` - Diverse Representation CIF
- `TimeSeriesForestRegressor` - Random interval ensemble
- `RandomIntervalRegressor` - Simple interval-based approach
- `RandomIntervalSpectralEnsembleRegressor` - Spectral interval features
- `QUANTRegressor` - Quantile-based interval features
**Use when**: Predictive patterns occur in specific time windows.
## Shapelet-Based Regressors
Use discriminative subsequences for prediction:
- `RDSTRegressor` - Random Dilated Shapelet Transform
**Use when**: Need phase-invariant discriminative patterns.
## Composition Tools
Build custom regression pipelines:
- `RegressorPipeline` - Chain transformers with regressors
- `RegressorEnsemble` - Weighted ensemble with learnable weights
- `SklearnRegressorWrapper` - Adapt sklearn regressors for time series
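As a rough sketch of what these composition tools bundle into a single estimator, the same chaining can be written by hand: a collection transformer feeding a plain sklearn regressor. The `RocketTransformer` import path is taken from the transformations documentation; `RidgeCV` is an arbitrary downstream choice.
```python
from aeon.transformations.collection.convolution_based import RocketTransformer
from sklearn.linear_model import RidgeCV

# Manual transformer -> regressor chain (what RegressorPipeline automates)
rocket = RocketTransformer()
X_train_f = rocket.fit_transform(X_train)
X_test_f = rocket.transform(X_test)

reg = RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit(X_train_f, y_train)
predictions = reg.predict(X_test_f)
```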
## Utilities
- `DummyRegressor` - Baseline strategies (mean, median)
- `BaseRegressor` - Abstract base for custom regressors
- `BaseDeepRegressor` - Base for deep learning regressors
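Before reaching for heavier models, it can help to record a dummy baseline. A minimal sketch, assuming `DummyRegressor` is exported from `aeon.regression` with a `strategy` parameter:
```python
from aeon.regression import DummyRegressor  # export location assumed

# Mean-prediction baseline to sanity-check more complex regressors
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)
```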
## Quick Start
```python
from aeon.regression.convolution_based import RocketRegressor
from aeon.datasets import load_regression
# Load data
X_train, y_train = load_regression("Covid3Month", split="train")
X_test, y_test = load_regression("Covid3Month", split="test")
# Train and predict
reg = RocketRegressor()
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
```
## Algorithm Selection
- **Speed priority**: MiniRocketRegressor
- **Accuracy priority**: InceptionTimeRegressor, MultiRocketHydraRegressor
- **Interpretability**: Catch22Regressor, SummaryRegressor
- **Small data**: KNeighborsTimeSeriesRegressor
- **Large data**: Deep learning regressors, ROCKET variants
- **Interval patterns**: DrCIFRegressor, CanonicalIntervalForestRegressor

View File

@@ -0,0 +1,163 @@
# Time Series Segmentation
Aeon provides algorithms to partition time series into regions with distinct characteristics, identifying change points and boundaries.
## Segmentation Algorithms
### Binary Segmentation
- `BinSegmenter` - Recursive binary segmentation
- Iteratively splits series at most significant change points
- Parameters: `n_segments`, `cost_function`
- **Use when**: Known number of segments, hierarchical structure
### Classification-Based
- `ClaSPSegmenter` - Classification Score Profile
- Uses classification performance to identify boundaries
- Discovers segments where classification distinguishes neighbors
- **Use when**: Segments have different temporal patterns
### Fast Pattern-Based
- `FLUSSSegmenter` - Fast Low-cost Unipotent Semantic Segmentation
- Efficient semantic segmentation using arc crossings
- Based on matrix profile
- **Use when**: Large time series, need speed and pattern discovery
### Information Theory
- `InformationGainSegmenter` - Information gain maximization
- Finds boundaries maximizing information gain
- **Use when**: Statistical differences between segments
### Gaussian Modeling
- `GreedyGaussianSegmenter` - Greedy Gaussian approximation
- Models segments as Gaussian distributions
- Incrementally adds change points
- **Use when**: Segments follow Gaussian distributions
### Hierarchical Agglomerative
- `EAggloSegmenter` - Bottom-up merging approach
- Estimates change points via agglomeration
- **Use when**: Want hierarchical segmentation structure
### Hidden Markov Models
- `HMMSegmenter` - HMM with Viterbi decoding
- Probabilistic state-based segmentation
- **Use when**: Segments represent hidden states
### Dimensionality-Based
- `HidalgoSegmenter` - Heterogeneous Intrinsic Dimensionality Algorithm
- Detects changes in local dimensionality
- **Use when**: Dimensionality shifts between segments
### Baseline
- `RandomSegmenter` - Random change point generation
- **Use when**: Need null hypothesis baseline
## Quick Start
```python
from aeon.segmentation import ClaSPSegmenter
import numpy as np
# Create time series with regime changes
y = np.concatenate([
np.sin(np.linspace(0, 10, 100)), # Segment 1
np.cos(np.linspace(0, 10, 100)), # Segment 2
np.sin(2 * np.linspace(0, 10, 100)) # Segment 3
])
# Segment the series
segmenter = ClaSPSegmenter()
change_points = segmenter.fit_predict(y)
print(f"Detected change points: {change_points}")
```
## Output Format
Segmenters return change point indices:
```python
# change_points = [100, 200] # Boundaries between segments
# This divides series into: [0:100], [100:200], [200:end]
```
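If downstream code needs explicit segment slices rather than boundary indices, a small helper (plain Python, independent of aeon's API) can derive them:
```python
def change_points_to_segments(change_points, series_length):
    """Convert sorted change point indices into (start, end) index pairs."""
    boundaries = [0, *sorted(change_points), series_length]
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

# Example: [100, 200] on a series of length 300 -> [(0, 100), (100, 200), (200, 300)]
print(change_points_to_segments([100, 200], 300))
```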
## Algorithm Selection
- **Speed priority**: FLUSSSegmenter, BinSegmenter
- **Accuracy priority**: ClaSPSegmenter, HMMSegmenter
- **Known segment count**: BinSegmenter with n_segments parameter
- **Unknown segment count**: ClaSPSegmenter, InformationGainSegmenter
- **Pattern changes**: FLUSSSegmenter, ClaSPSegmenter
- **Statistical changes**: InformationGainSegmenter, GreedyGaussianSegmenter
- **State transitions**: HMMSegmenter
## Common Use Cases
### Regime Change Detection
Identify when time series behavior fundamentally changes:
```python
from aeon.segmentation import InformationGainSegmenter
segmenter = InformationGainSegmenter(k_max=3)  # Up to 3 change points
change_points = segmenter.fit_predict(stock_prices)
```
### Activity Segmentation
Segment sensor data into activities:
```python
from aeon.segmentation import ClaSPSegmenter
segmenter = ClaSPSegmenter()
boundaries = segmenter.fit_predict(accelerometer_data)
```
### Seasonal Boundary Detection
Find season transitions in time series:
```python
from aeon.segmentation import HMMSegmenter
segmenter = HMMSegmenter(n_states=4) # 4 seasons
segments = segmenter.fit_predict(temperature_data)
```
## Evaluation Metrics
Use segmentation quality metrics:
```python
from aeon.benchmarking.metrics.segmentation import (
count_error,
hausdorff_error
)
# Count error: difference in number of change points
count_err = count_error(y_true, y_pred)
# Hausdorff: maximum distance between predicted and true points
hausdorff_err = hausdorff_error(y_true, y_pred)
```
## Best Practices
1. **Normalize data**: Ensures change detection is not dominated by scale
2. **Choose appropriate metric**: Different algorithms optimize different criteria
3. **Validate segments**: Visualize to verify meaningful boundaries
4. **Handle noise**: Consider smoothing before segmentation (see the sketch after this list)
5. **Domain knowledge**: Use expected segment count if known
6. **Parameter tuning**: Adjust sensitivity parameters (thresholds, penalties)
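Following item 4 above, one way to reduce noise sensitivity is to smooth the series before segmenting. A minimal sketch reusing the `MovingAverage` transformer shown elsewhere in these docs; the window size and the default `ClaSPSegmenter` are illustrative choices.
```python
from aeon.transformations.series import MovingAverage
from aeon.segmentation import ClaSPSegmenter

# Smooth first, then segment
smoother = MovingAverage(window_size=5)
y_smoothed = smoother.fit_transform(y)

segmenter = ClaSPSegmenter()
change_points = segmenter.fit_predict(y_smoothed)
```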
## Visualization
```python
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.plot(y, label='Time Series')
for i, cp in enumerate(change_points):
    plt.axvline(cp, color='r', linestyle='--',
                label='Change Point' if i == 0 else None)  # label once to avoid duplicate legend entries
plt.legend()
plt.show()
```

View File

@@ -0,0 +1,187 @@
# Similarity Search
Aeon provides tools for finding similar patterns within and across time series, including subsequence search, motif discovery, and approximate nearest neighbors.
## Subsequence Nearest Neighbors (SNN)
Find most similar subsequences within a time series.
### MASS Algorithm
- `MassSNN` - Mueen's Algorithm for Similarity Search
- Fast normalized cross-correlation for similarity
- Computes distance profile efficiently
- **Use when**: Need exact nearest neighbor distances, large series
### STOMP-Based Motif Discovery
- `StompMotif` - Discovers recurring patterns (motifs)
- Finds top-k most similar subsequence pairs
- Based on matrix profile computation
- **Use when**: Want to discover repeated patterns
### Brute Force Baseline
- `DummySNN` - Exhaustive distance computation
- Computes all pairwise distances
- **Use when**: Small series, need exact baseline
## Collection-Level Search
Find similar time series across collections.
### Approximate Nearest Neighbors (ANN)
- `RandomProjectionIndexANN` - Locality-sensitive hashing
- Uses random projections with cosine similarity
- Builds index for fast approximate search
- **Use when**: Large collection, speed more important than exactness
## Quick Start: Motif Discovery
```python
from aeon.similarity_search import StompMotif
import numpy as np
# Create time series with repeated patterns
pattern = np.sin(np.linspace(0, 2*np.pi, 50))
y = np.concatenate([
pattern + np.random.normal(0, 0.1, 50),
np.random.normal(0, 1, 100),
pattern + np.random.normal(0, 0.1, 50),
np.random.normal(0, 1, 100)
])
# Find top-3 motifs
motif_finder = StompMotif(window_size=50, k=3)
motifs = motif_finder.fit_predict(y)
# motifs contains indices of motif occurrences
for i, (idx1, idx2) in enumerate(motifs):
print(f"Motif {i+1} at positions {idx1} and {idx2}")
```
## Quick Start: Subsequence Search
```python
from aeon.similarity_search import MassSNN
import numpy as np
# Time series to search within
y = np.sin(np.linspace(0, 20, 500))
# Query subsequence
query = np.sin(np.linspace(0, 2, 50))
# Find nearest subsequences
searcher = MassSNN()
distances = searcher.fit_transform(y, query)
# Find best match
best_match_idx = np.argmin(distances)
print(f"Best match at index {best_match_idx}")
```
## Quick Start: Approximate NN on Collections
```python
from aeon.similarity_search import RandomProjectionIndexANN
from aeon.datasets import load_classification
# Load time series collection
X_train, _ = load_classification("GunPoint", split="train")
# Build index
ann = RandomProjectionIndexANN(n_projections=8, n_bits=4)
ann.fit(X_train)
# Find approximate nearest neighbors
query = X_train[0]
neighbors, distances = ann.kneighbors(query, k=5)
```
## Matrix Profile
The matrix profile is a fundamental data structure for many similarity search tasks:
- **Distance Profile**: Distances from a query to all subsequences
- **Matrix Profile**: Minimum distance for each subsequence to any other
- **Motif**: Pair of subsequences with minimum distance
- **Discord**: Subsequence with maximum minimum distance (anomaly)
```python
import numpy as np
from aeon.similarity_search import StompMotif
# Compute matrix profile and find motifs/discords
mp = StompMotif(window_size=50)
mp.fit(y)
# Access matrix profile
profile = mp.matrix_profile_
profile_indices = mp.matrix_profile_index_
# Find discords (anomalies)
discord_idx = np.argmax(profile)
```
## Algorithm Selection
- **Exact subsequence search**: MassSNN
- **Motif discovery**: StompMotif
- **Anomaly detection**: Matrix profile (see anomaly_detection.md)
- **Fast approximate search**: RandomProjectionIndexANN
- **Small data**: DummySNN for exact results
## Use Cases
### Pattern Matching
Find where a pattern occurs in a long series:
```python
# Find heartbeat pattern in ECG data
searcher = MassSNN()
distances = searcher.fit_transform(ecg_data, heartbeat_pattern)
occurrences = np.where(distances < threshold)[0]
```
### Motif Discovery
Identify recurring patterns:
```python
# Find repeated behavioral patterns
motif_finder = StompMotif(window_size=100, k=5)
motifs = motif_finder.fit_predict(activity_data)
```
### Time Series Retrieval
Find similar time series in database:
```python
# Build searchable index
ann = RandomProjectionIndexANN()
ann.fit(time_series_database)
# Query for similar series
neighbors = ann.kneighbors(query_series, k=10)
```
## Best Practices
1. **Window size**: Critical parameter for subsequence methods
- Too small: Captures noise
- Too large: Misses fine-grained patterns
- Rule of thumb: 10-20% of series length
2. **Normalization**: Most methods assume z-normalized subsequences
- Handles amplitude variations
- Focus on shape similarity
3. **Distance metrics**: Different metrics for different needs
- Euclidean: Fast, shape-based
- DTW: Handles temporal warping
- Cosine: Scale-invariant
4. **Exclusion zone**: For motif discovery, exclude trivial matches
- Typically set to 0.5-1.0 × window_size
- Prevents finding overlapping occurrences
5. **Performance**:
- MASS is O(n log n) vs O(n²) brute force
- ANN trades accuracy for speed
- GPU acceleration available for some methods

View File

@@ -0,0 +1,246 @@
# Transformations
Aeon provides extensive transformation capabilities for preprocessing, feature extraction, and representation learning from time series data.
## Transformation Types
Aeon distinguishes between:
- **CollectionTransformers**: Transform multiple time series (collections)
- **SeriesTransformers**: Transform individual time series
## Collection Transformers
### Convolution-Based Feature Extraction
Fast, scalable feature generation using random kernels:
- `RocketTransformer` - Random convolutional kernels
- `MiniRocketTransformer` - Simplified ROCKET for speed
- `MultiRocketTransformer` - Enhanced ROCKET variant
- `HydraTransformer` - Multi-resolution dilated convolutions
- `MultiRocketHydraTransformer` - Combines ROCKET and Hydra
- `ROCKETGPU` - GPU-accelerated variant
**Use when**: Need fast, scalable features for any ML algorithm, strong baseline performance.
### Statistical Feature Extraction
Domain-agnostic features based on time series characteristics:
- `Catch22` - 22 canonical time-series characteristics
- `TSFresh` - Comprehensive automated feature extraction (100+ features)
- `TSFreshRelevant` - Feature extraction with relevance filtering
- `SevenNumberSummary` - Descriptive statistics (mean, std, quantiles)
**Use when**: Need interpretable features, domain-agnostic approach, or feeding traditional ML.
### Dictionary-Based Representations
Symbolic approximations for discrete representations:
- `SAX` - Symbolic Aggregate approXimation
- `PAA` - Piecewise Aggregate Approximation
- `SFA` - Symbolic Fourier Approximation
- `SFAFast` - Optimized SFA
- `SFAWhole` - SFA on entire series (no windowing)
- `BORF` - Bag-of-Receptive-Fields
**Use when**: Need discrete/symbolic representation, dimensionality reduction, interpretability.
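A minimal symbolic-representation sketch using SAX; the import path and the `n_segments`/`alphabet_size` parameter names are assumptions based on aeon's usual layout, so check the `SAX` docstring.
```python
from aeon.transformations.collection.dictionary_based import SAX  # path assumed

# Discretize each series into 8 segments over a 4-letter alphabet
sax = SAX(n_segments=8, alphabet_size=4)
X_symbolic = sax.fit_transform(X_train)
```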
### Shapelet-Based Features
Discriminative subsequence extraction:
- `RandomShapeletTransform` - Random discriminative shapelets
- `RandomDilatedShapeletTransform` - Dilated shapelets for multi-scale
- `SAST` - Scalable And Accurate Subsequence Transform
- `RSAST` - Randomized SAST
**Use when**: Need interpretable discriminative patterns, phase-invariant features.
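A minimal shapelet-feature sketch; the module path, the `n_shapelet_samples` parameter, and the need to pass labels to `fit_transform` are assumptions to verify against the `RandomShapeletTransform` docs.
```python
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform  # path assumed

# Supervised shapelet extraction: labels guide which subsequences are kept
st = RandomShapeletTransform(n_shapelet_samples=100, random_state=0)
X_shapelet_features = st.fit_transform(X_train, y_train)
```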
### Interval-Based Features
Statistical summaries from time intervals:
- `RandomIntervals` - Features from random intervals
- `SupervisedIntervals` - Supervised interval selection
- `QUANTTransformer` - Quantile-based interval features
**Use when**: Predictive patterns localized to specific windows.
### Preprocessing Transformations
Data preparation and normalization:
- `MinMaxScaler` - Scale to [0, 1] range
- `Normalizer` - Z-normalization (zero mean, unit variance)
- `Centerer` - Center to zero mean
- `SimpleImputer` - Fill missing values
- `DownsampleTransformer` - Reduce temporal resolution
- `Tabularizer` - Convert time series to tabular format
**Use when**: Need standardization, missing value handling, format conversion.
### Specialized Transformations
Advanced analysis methods:
- `MatrixProfile` - Computes distance profiles for pattern discovery
- `DWTTransformer` - Discrete Wavelet Transform
- `AutocorrelationFunctionTransformer` - ACF computation
- `Dobin` - Distance-based Outlier BasIs using Neighbors
- `SignatureTransformer` - Path signature methods
- `PLATransformer` - Piecewise Linear Approximation
### Class Imbalance Handling
- `ADASYN` - Adaptive Synthetic Sampling
- `SMOTE` - Synthetic Minority Over-sampling
- `OHIT` - Over-sampling with Highly Imbalanced Time series
**Use when**: Classification with imbalanced classes.
### Pipeline Composition
- `CollectionTransformerPipeline` - Chain multiple transformers
## Series Transformers
Transform individual time series (e.g., for preprocessing in forecasting).
### Statistical Analysis
- `AutoCorrelationSeriesTransformer` - Autocorrelation
- `StatsModelsACF` - ACF using statsmodels
- `StatsModelsPACF` - Partial autocorrelation
### Smoothing and Filtering
- `ExponentialSmoothing` - Exponentially weighted moving average
- `MovingAverage` - Simple or weighted moving average
- `SavitzkyGolayFilter` - Polynomial smoothing
- `GaussianFilter` - Gaussian kernel smoothing
- `BKFilter` - Baxter-King bandpass filter
- `DiscreteFourierApproximation` - Fourier-based filtering
**Use when**: Need noise reduction, trend extraction, or frequency filtering.
### Dimensionality Reduction
- `PCASeriesTransformer` - Principal component analysis
- `PLASeriesTransformer` - Piecewise Linear Approximation
### Transformations
- `BoxCoxTransformer` - Variance stabilization
- `LogTransformer` - Logarithmic scaling
- `ClaSPTransformer` - Classification Score Profile
### Pipeline Composition
- `SeriesTransformerPipeline` - Chain series transformers
## Quick Start: Feature Extraction
```python
from aeon.transformations.collection.convolution_based import RocketTransformer
from aeon.classification.sklearn import RotationForest
from aeon.datasets import load_classification
# Load data
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")
# Extract ROCKET features
rocket = RocketTransformer()
X_train_features = rocket.fit_transform(X_train)
X_test_features = rocket.transform(X_test)
# Use with any sklearn classifier
clf = RotationForest()
clf.fit(X_train_features, y_train)
accuracy = clf.score(X_test_features, y_test)
```
## Quick Start: Preprocessing Pipeline
```python
from aeon.transformations.collection import (
MinMaxScaler,
SimpleImputer,
CollectionTransformerPipeline
)
# Build preprocessing pipeline
pipeline = CollectionTransformerPipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', MinMaxScaler())
])
X_transformed = pipeline.fit_transform(X_train)
```
## Quick Start: Series Smoothing
```python
from aeon.transformations.series import MovingAverage
# Smooth individual time series
smoother = MovingAverage(window_size=5)
y_smoothed = smoother.fit_transform(y)
```
## Algorithm Selection
### For Feature Extraction:
- **Speed + Performance**: MiniRocketTransformer
- **Interpretability**: Catch22, TSFresh
- **Dimensionality reduction**: PAA, SAX, PCA
- **Discriminative patterns**: Shapelet transforms
- **Comprehensive features**: TSFresh (with longer runtime)
### For Preprocessing:
- **Normalization**: Normalizer, MinMaxScaler
- **Smoothing**: MovingAverage, SavitzkyGolayFilter
- **Missing values**: SimpleImputer
- **Frequency analysis**: DWTTransformer, Fourier methods
### For Symbolic Representation:
- **Fast approximation**: PAA
- **Alphabet-based**: SAX
- **Frequency-based**: SFA, SFAFast
## Best Practices
1. **Fit on training data only**: Avoid data leakage
```python
transformer.fit(X_train)
X_train_tf = transformer.transform(X_train)
X_test_tf = transformer.transform(X_test)
```
2. **Pipeline composition**: Chain transformers for complex workflows
```python
pipeline = CollectionTransformerPipeline([
('imputer', SimpleImputer()),
('scaler', Normalizer()),
('features', RocketTransformer())
])
```
3. **Feature selection**: TSFresh can generate many features; consider selection
```python
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=100)
X_selected = selector.fit_transform(X_features, y)
```
4. **Memory considerations**: Some transformers are memory-intensive on large datasets
- Use MiniRocket instead of ROCKET for speed
- Consider downsampling for very long series
- Use ROCKETGPU for GPU acceleration
5. **Domain knowledge**: Choose transformations matching domain:
- Periodic data: Fourier-based methods
- Noisy data: Smoothing filters
- Spike detection: Wavelet transforms

View File

@@ -195,7 +195,7 @@ For large-scale analyses, use Google Cloud datasets:
```bash
# Install gsutil
pip install gsutil
uv pip install gsutil
# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/
@@ -359,16 +359,16 @@ print(df)
```bash
# Install Biopython for structure access
pip install biopython
uv pip install biopython
# Install requests for API access
pip install requests
uv pip install requests
# For visualization and analysis
pip install numpy matplotlib pandas scipy
uv pip install numpy matplotlib pandas scipy
# For Google Cloud access (optional)
pip install google-cloud-bigquery gsutil
uv pip install google-cloud-bigquery gsutil
```
### 3D-Beacons API Alternative

View File

@@ -0,0 +1,394 @@
---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---
# AnnData
## Overview
AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.
## When to Use This Skill
Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
## Installation
```bash
uv pip install anndata
# With optional dependencies
uv pip install anndata[dev,test,doc]
```
## Quick Start
### Creating an AnnData object
```python
import anndata as ad
import numpy as np
import pandas as pd
# Minimal creation
X = np.random.rand(100, 2000) # 100 cells × 2000 genes
adata = ad.AnnData(X)
# With metadata
obs = pd.DataFrame({
'cell_type': ['T cell', 'B cell'] * 50,
'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({
'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)
```
### Reading data
```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')
# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')
# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')
# 10x Genomics HDF5 output is read via scanpy, not anndata
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```
### Writing data
```python
# Write h5ad file
adata.write_h5ad('output.h5ad')
# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')
# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```
### Basic operations
```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']
# Subset by indices
subset = adata[0:50, 0:100]
# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8
# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```
## Core Capabilities
### 1. Data Structure
Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.
**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices
### 2. Input/Output Operations
Read and write data in various formats with support for compression, backed mode, and cloud storage.
**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization
Common commands:
```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')
# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# Read MTX format
adata = ad.read_mtx('matrix.mtx').T
```
### 3. Concatenation
Combine multiple AnnData objects along observations or variables with flexible join strategies.
**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets
Common commands:
```python
# Concatenate observations (combine samples)
adata = ad.concat(
[adata1, adata2, adata3],
axis=0,
join='inner',
label='batch',
keys=['batch1', 'batch2', 'batch3']
)
# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)
# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
['data1.h5ad', 'data2.h5ad'],
join_obs='outer',
label='dataset'
)
```
### 4. Data Manipulation
Transform, subset, filter, and reorganize data efficiently.
**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering
Common commands:
```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]
# Transpose
adata_T = adata.T
# Copy vs view
view = adata[0:100, :] # View (lightweight reference)
copy = adata[0:100, :].copy() # Independent copy
# Convert strings to categoricals
adata.strings_to_categoricals()
```
### 5. Best Practices
Follow recommended patterns for memory efficiency, performance, and reproducibility.
**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions
Key recommendations:
```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
# Convert strings to categoricals
adata.strings_to_categoricals()
# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')
# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
## Integration with Scverse Ecosystem
AnnData serves as the foundational data structure for the scverse ecosystem:
### Scanpy (Single-cell analysis)
```python
import scanpy as sc
# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```
### Muon (Multimodal data)
```python
import muon as mu
# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
### PyTorch integration
```python
from anndata.experimental import AnnLoader
# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
for batch in dataloader:
X = batch.X
# Train model
```
## Common Workflows
### Single-cell RNA-seq analysis
```python
import anndata as ad
import scanpy as sc
# 1. Load data
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]
# 3. Store raw
adata.raw = adata.copy()
# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]
# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```
### Batch integration
```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with batch labels
adata = ad.concat(
[adata1, adata2, adata3],
label='batch',
keys=['batch1', 'batch2', 'batch3'],
join='inner'
)
# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')
# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Working with large datasets
```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')
# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]
# Load filtered subset
adata_subset = high_quality.to_memory()
# Process subset
process(adata_subset)
# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
chunk = adata[i:i+chunk_size, :].to_memory()
process(chunk)
```
## Troubleshooting
### Out of memory errors
Use backed mode or convert to sparse matrices:
```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')
# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```
### Slow file reading
Use compression and appropriate formats:
```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')
# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```
### Index alignment issues
Always align external data on index:
```python
# Wrong
adata.obs['new_col'] = external_data['values']
# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
## Additional Resources
- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata

View File

@@ -0,0 +1,525 @@
# Best Practices
Guidelines for efficient and effective use of AnnData.
## Memory Management
### Use sparse matrices for sparse data
```python
import numpy as np
from scipy.sparse import csr_matrix
import anndata as ad
# Check data sparsity
data = np.random.rand(1000, 2000)
sparsity = 1 - np.count_nonzero(data) / data.size
print(f"Sparsity: {sparsity:.2%}")
# Convert to sparse if >50% zeros
if sparsity > 0.5:
adata = ad.AnnData(X=csr_matrix(data))
else:
adata = ad.AnnData(X=data)
# Benefits: 10-100x memory reduction for sparse genomics data
```
### Convert strings to categoricals
```python
# Inefficient: string columns use lots of memory
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']
# Efficient: convert to categorical
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
# Convert all string columns
adata.strings_to_categoricals()
# Benefits: 10-50x memory reduction for repeated strings
```
### Use backed mode for large datasets
```python
# Don't load entire dataset into memory
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Work with metadata
filtered = adata[adata.obs['quality'] > 0.8]
# Load only filtered subset
adata_subset = filtered.to_memory()
# Benefits: Work with datasets larger than RAM
```
## Views vs Copies
### Understanding views
```python
# Subsetting creates a view by default
subset = adata[0:100, :]
print(subset.is_view) # True
# Views don't copy data (memory efficient)
# But modifications can affect original
# Check if object is a view
if adata.is_view:
adata = adata.copy() # Make independent
```
### When to use views
```python
# Good: Read-only operations on subsets
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()
# Good: Temporary analysis
temp_subset = adata[:100, :]
result = analyze(temp_subset.X)
```
### When to use copies
```python
# Create independent copy for modifications
adata_filtered = adata[keep_cells, :].copy()
# Safe to modify without affecting original
adata_filtered.obs['new_column'] = values
# Always copy when:
# - Storing subset for later use
# - Modifying subset data
# - Passing to function that modifies data
```
## Data Storage Best Practices
### Choose the right format
**H5AD (HDF5) - Default choice**
```python
adata.write_h5ad('data.h5ad', compression='gzip')
```
- Fast random access
- Supports backed mode
- Good compression
- Best for: Most use cases
**Zarr - Cloud and parallel access**
```python
adata.write_zarr('data.zarr', chunks=(100, 100))
```
- Excellent for cloud storage (S3, GCS)
- Supports parallel I/O
- Good compression
- Best for: Large datasets, cloud workflows, parallel processing
**CSV - Interoperability**
```python
adata.write_csvs('output_dir/')
```
- Human readable
- Compatible with all tools
- Large file sizes, slow
- Best for: Sharing with non-Python tools, small datasets
### Optimize file size
```python
# Before saving, optimize:
# 1. Convert to sparse if appropriate
import numpy as np
from scipy.sparse import csr_matrix, issparse
if not issparse(adata.X):
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.5:
adata.X = csr_matrix(adata.X)
# 2. Convert strings to categoricals
adata.strings_to_categoricals()
# 3. Use compression
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
# Typical results: 5-20x file size reduction
```
## Backed Mode Strategies
### Read-only analysis
```python
# Open in read-only backed mode
adata = ad.read_h5ad('data.h5ad', backed='r')
# Perform filtering without loading data
high_quality = adata[adata.obs['quality_score'] > 0.8]
# Load only filtered data
adata_filtered = high_quality.to_memory()
```
### Read-write modifications
```python
# Open in read-write backed mode
adata = ad.read_h5ad('data.h5ad', backed='r+')
# Modify metadata (written to disk)
adata.obs['new_annotation'] = values
# X remains on disk, modifications saved immediately
```
### Chunked processing
```python
# Process large dataset in chunks
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
results = []
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
chunk = adata[i:i+chunk_size, :].to_memory()
result = process(chunk)
results.append(result)
final_result = combine(results)
```
## Performance Optimization
### Subsetting performance
```python
# Fast: Boolean indexing with arrays
mask = np.array(adata.obs['quality'] > 0.5)
subset = adata[mask, :]
# Slow: Boolean indexing with Series (creates view chain)
subset = adata[adata.obs['quality'] > 0.5, :]
# Fastest: Integer indices
indices = np.where(adata.obs['quality'] > 0.5)[0]
subset = adata[indices, :]
```
### Avoid repeated subsetting
```python
# Inefficient: Multiple subset operations
for cell_type in ['A', 'B', 'C']:
subset = adata[adata.obs['cell_type'] == cell_type]
process(subset)
# Efficient: Group and process
groups = adata.obs.groupby('cell_type').groups
for cell_type, indices in groups.items():
subset = adata[indices, :]
process(subset)
```
### Use chunked operations for large matrices
```python
# Process X in chunks
for chunk in adata.chunked_X(chunk_size=1000):
result = compute(chunk)
# More memory efficient than loading full X
```
## Working with Raw Data
### Store raw before filtering
```python
# Original data with all genes
adata = ad.AnnData(X=counts)
# Store raw before filtering
adata.raw = adata.copy()
# Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']]
# Later: access original data
original_expression = adata.raw.X
all_genes = adata.raw.var_names
```
### When to use raw
```python
# Use raw for:
# - Differential expression on filtered genes
# - Visualization of specific genes not in filtered set
# - Accessing original counts after normalization
# Access raw data
if adata.raw is not None:
gene_expr = adata.raw[:, 'GENE_NAME'].X
else:
gene_expr = adata[:, 'GENE_NAME'].X
```
## Metadata Management
### Naming conventions
```python
# Consistent naming improves usability
# Observation metadata (obs):
# - cell_id, sample_id
# - cell_type, tissue, condition
# - n_genes, n_counts, percent_mito
# - cluster, leiden, louvain
# Variable metadata (var):
# - gene_id, gene_name
# - highly_variable, n_cells
# - mean_expression, dispersion
# Embeddings (obsm):
# - X_pca, X_umap, X_tsne
# - X_diffmap, X_draw_graph_fr
# Follow conventions from scanpy/scverse ecosystem
```
### Document metadata
```python
# Store metadata descriptions in uns
adata.uns['metadata_descriptions'] = {
'cell_type': 'Cell type annotation from automated clustering',
'quality_score': 'QC score from scrublet (0-1, higher is better)',
'batch': 'Experimental batch identifier'
}
# Store processing history
adata.uns['processing_steps'] = [
'Raw counts loaded from 10X',
'Filtered: n_genes > 200, n_counts < 50000',
'Normalized to 10000 counts per cell',
'Log transformed'
]
```
## Reproducibility
### Set random seeds
```python
import numpy as np
# Set seed for reproducible results
np.random.seed(42)
# Document in uns
adata.uns['random_seed'] = 42
```
### Store parameters
```python
# Store analysis parameters in uns
adata.uns['pca'] = {
'n_comps': 50,
'svd_solver': 'arpack',
'random_state': 42
}
adata.uns['neighbors'] = {
'n_neighbors': 15,
'n_pcs': 50,
'metric': 'euclidean',
'method': 'umap'
}
```
### Version tracking
```python
import sys
import anndata
import scanpy
import numpy
# Store versions
adata.uns['versions'] = {
'anndata': anndata.__version__,
'scanpy': scanpy.__version__,
'numpy': numpy.__version__,
'python': sys.version
}
```
## Error Handling
### Check data validity
```python
import numpy as np
from scipy.sparse import issparse
# Verify dimensions
assert adata.n_obs == len(adata.obs)
assert adata.n_vars == len(adata.var)
assert adata.X.shape == (adata.n_obs, adata.n_vars)
# Check for NaN values
has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any()
if has_nan:
print("Warning: Data contains NaN values")
# Check for negative values (if counts expected)
has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any()
if has_negative:
print("Warning: Data contains negative values")
```
### Validate metadata
```python
# Check for missing values
missing_obs = adata.obs.isnull().sum()
if missing_obs.any():
print("Missing values in obs:")
print(missing_obs[missing_obs > 0])
# Verify indices are unique
assert adata.obs_names.is_unique, "Observation names not unique"
assert adata.var_names.is_unique, "Variable names not unique"
# Check metadata alignment
assert len(adata.obs) == adata.n_obs
assert len(adata.var) == adata.n_vars
```
## Integration with Other Tools
### Scanpy integration
```python
import scanpy as sc
# AnnData is native format for scanpy
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Pandas integration
```python
import pandas as pd
# Convert to DataFrame
df = adata.to_df()
# Create from DataFrame
adata = ad.AnnData(df)
# Work with metadata as DataFrames
adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True)
```
### PyTorch integration
```python
from anndata.experimental import AnnLoader
# Create PyTorch DataLoader
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
# Iterate in training loop
for batch in dataloader:
X = batch.X
# Train model on batch
```
## Common Pitfalls
### Pitfall 1: Modifying views
```python
# Wrong: Modifying view can affect original
subset = adata[:100, :]
subset.X = new_data # May modify adata.X!
# Correct: Copy before modifying
subset = adata[:100, :].copy()
subset.X = new_data # Independent copy
```
### Pitfall 2: Index misalignment
```python
# Wrong: Assuming order matches
external_data = pd.read_csv('data.csv')
adata.obs['new_col'] = external_data['values'] # May misalign!
# Correct: Align on index
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
### Pitfall 3: Mixing sparse and dense
```python
# Wrong: Converting sparse to dense uses huge memory
result = adata.X + 1 # Converts sparse to dense!
# Correct: Use sparse operations
from scipy.sparse import issparse
if issparse(adata.X):
result = adata.X.copy()
result.data += 1
```
### Pitfall 4: Not handling views
```python
# Wrong: Assuming subset is independent
subset = adata[mask, :]
del adata # subset may become invalid!
# Correct: Copy when needed
subset = adata[mask, :].copy()
del adata # subset remains valid
```
### Pitfall 5: Ignoring memory constraints
```python
# Wrong: Loading huge dataset into memory
adata = ad.read_h5ad('100GB_file.h5ad') # OOM error!
# Correct: Use backed mode
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
subset = adata[adata.obs['keep']].to_memory()
```
## Workflow Example
Complete best-practices workflow:
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix, issparse
# 1. Load with backed mode if large
adata = ad.read_h5ad('data.h5ad', backed='r')
# 2. Quick metadata check without loading data
print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes")
# 3. Filter based on metadata
high_quality = adata[adata.obs['quality_score'] > 0.8]
# 4. Load filtered subset to memory
adata = high_quality.to_memory()
# 5. Convert to optimal storage types
adata.strings_to_categoricals()
if not issparse(adata.X):
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.5:
adata.X = csr_matrix(adata.X)
# 6. Store raw before filtering genes
adata.raw = adata.copy()
# 7. Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']].copy()
# 8. Document processing
adata.uns['processing'] = {
'filtered': 'quality_score > 0.8',
'n_hvg': adata.n_vars,
'date': '2025-11-03'
}
# 9. Save optimized
adata.write_h5ad('processed.h5ad', compression='gzip')
```

View File

@@ -0,0 +1,396 @@
# Concatenating AnnData Objects
Combine multiple AnnData objects along either observations or variables axis.
## Basic Concatenation
### Concatenate along observations (stack cells/samples)
```python
import anndata as ad
import numpy as np
# Create multiple AnnData objects
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata3 = ad.AnnData(X=np.random.rand(200, 50))
# Concatenate along observations (axis=0, default)
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)
print(adata_combined.shape) # (450, 50)
```
### Concatenate along variables (stack genes/features)
```python
# Create objects with same observations, different variables
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(100, 30))
adata3 = ad.AnnData(X=np.random.rand(100, 70))
# Concatenate along variables (axis=1)
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)
print(adata_combined.shape) # (100, 150)
```
## Join Types
### Inner join (intersection)
Keep only variables/observations present in all objects.
```python
import pandas as pd
# Create objects with different variables
adata1 = ad.AnnData(
X=np.random.rand(100, 50),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
)
adata2 = ad.AnnData(
X=np.random.rand(150, 60),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
)
# Inner join: only genes 10-49 are kept (overlap)
adata_inner = ad.concat([adata1, adata2], join='inner')
print(adata_inner.n_vars) # 40 genes (overlap)
```
### Outer join (union)
Keep all variables/observations, filling missing values.
```python
# Outer join: all genes are kept
adata_outer = ad.concat([adata1, adata2], join='outer')
print(adata_outer.n_vars) # 70 genes (union)
# Missing values are filled with appropriate defaults:
# - 0 for sparse matrices
# - NaN for dense matrices
```
### Fill values for outer joins
```python
# Specify fill value for missing data
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)
```
## Tracking Data Sources
### Add batch labels
```python
# Label which object each observation came from
adata_combined = ad.concat(
[adata1, adata2, adata3],
label='batch', # Column name for labels
keys=['batch1', 'batch2', 'batch3'] # Labels for each object
)
print(adata_combined.obs['batch'].value_counts())
# batch1 100
# batch2 150
# batch3 200
```
### Automatic batch labels
```python
# If keys not provided, uses integer indices
adata_combined = ad.concat(
[adata1, adata2, adata3],
label='dataset'
)
# dataset column contains: 0, 1, 2
```
## Merge Strategies
Control how metadata from different objects is combined using the `merge` parameter.
### merge=None (default for observations)
Exclude metadata on non-concatenation axis.
```python
# When concatenating observations, var metadata must match
adata1.var['gene_type'] = 'protein_coding'
adata2.var['gene_type'] = 'protein_coding'
# var is kept only if identical across all objects
adata_combined = ad.concat([adata1, adata2], merge=None)
```
### merge='same'
Keep metadata that is identical across all objects.
```python
adata1.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata2.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA' # Different
# 'chromosome' is kept (same), 'type' is excluded (different)
adata_combined = ad.concat([adata1, adata2], merge='same')
```
### merge='unique'
Keep metadata columns where each key has exactly one value.
```python
adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
# gene_id is kept (unique values for each key)
adata_combined = ad.concat([adata1, adata2], merge='unique')
```
### merge='first'
Take values from the first object containing each key.
```python
adata1.var['description'] = ['Desc1'] * 50
adata2.var['description'] = ['Desc2'] * 50
# Uses descriptions from adata1
adata_combined = ad.concat([adata1, adata2], merge='first')
```
### merge='only'
Keep metadata that appears in only one object.
```python
adata1.var['adata1_specific'] = [1] * 50
adata2.var['adata2_specific'] = [2] * 50
# Both metadata columns are kept
adata_combined = ad.concat([adata1, adata2], merge='only')
```
## Handling Index Conflicts
### Make indices unique
```python
import pandas as pd
# Create objects with overlapping observation names
adata1 = ad.AnnData(
X=np.random.rand(3, 10),
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
adata2 = ad.AnnData(
X=np.random.rand(3, 10),
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
# Make indices unique by appending batch keys
adata_combined = ad.concat(
[adata1, adata2],
label='batch',
keys=['batch1', 'batch2'],
index_unique='_' # Separator for making indices unique
)
print(adata_combined.obs_names)
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
# 'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']
```
## Concatenating Layers
```python
# Objects with layers
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata1.layers['normalized'] = np.random.rand(100, 50)
adata1.layers['scaled'] = np.random.rand(100, 50)
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata2.layers['normalized'] = np.random.rand(150, 50)
adata2.layers['scaled'] = np.random.rand(150, 50)
# Layers are concatenated automatically if present in all objects
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.layers.keys())
# dict_keys(['normalized', 'scaled'])
```
## Concatenating Multi-dimensional Annotations
### obsm/varm
```python
# Objects with embeddings
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)
# obsm is concatenated along observation axis
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.obsm['X_pca'].shape) # (250, 50)
```
### obsp/varp (pairwise annotations)
```python
from scipy.sparse import csr_matrix
# Pairwise matrices
adata1.obsp['connectivities'] = csr_matrix((100, 100))
adata2.obsp['connectivities'] = csr_matrix((150, 150))
# By default, obsp is NOT concatenated (set pairwise=True to include)
adata_combined = ad.concat([adata1, adata2])
# adata_combined.obsp is empty
# Include pairwise data (creates block diagonal matrix)
adata_combined = ad.concat([adata1, adata2], pairwise=True)
print(adata_combined.obsp['connectivities'].shape) # (250, 250)
```
## Concatenating uns (unstructured)
Unstructured metadata is merged recursively:
```python
adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}
# Using merge='unique' for uns
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
# 'date' is kept (same value), 'batch' might be excluded (different values)
```
## Lazy Concatenation (AnnCollection)
For very large datasets, use lazy concatenation that doesn't load all data:
```python
import anndata as ad
from anndata.experimental import AnnCollection
# Open files in backed mode first (X stays on disk); AnnCollection expects AnnData objects, not paths
files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
adatas = [ad.read_h5ad(f, backed='r') for f in files]
collection = AnnCollection(
    adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)
# Access data lazily
print(collection.n_obs) # Total observations
print(collection.obs.head()) # Metadata loaded, not X
# Convert to regular AnnData when needed (loads all data)
adata = collection.to_adata()
```
### Working with AnnCollection
```python
# Subset without loading data
subset = collection[collection.obs['cell_type'] == 'T cell']
# Iterate through datasets
for adata in collection:
print(adata.shape)
# Access specific dataset
first_dataset = collection[0]
```
## Concatenation on Disk
For datasets too large for memory, concatenate directly on disk:
```python
from anndata.experimental import concat_on_disk
# Concatenate without loading into memory
concat_on_disk(
['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
'combined.h5ad',
join='outer'
)
# Load result in backed mode
adata = ad.read_h5ad('combined.h5ad', backed='r')
```
## Common Concatenation Patterns
### Combine technical replicates
```python
# Multiple runs of the same samples
replicates = [adata_run1, adata_run2, adata_run3]
adata_combined = ad.concat(
replicates,
label='technical_replicate',
keys=['rep1', 'rep2', 'rep3'],
join='inner' # Keep only genes measured in all runs
)
```
### Combine batches from experiment
```python
# Different experimental batches
batches = [adata_batch1, adata_batch2, adata_batch3]
adata_combined = ad.concat(
batches,
label='batch',
keys=['batch1', 'batch2', 'batch3'],
join='outer' # Keep all genes
)
# Later: apply batch correction
```
### Merge multi-modal data
```python
# Different measurement modalities (e.g., RNA + protein)
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
adata_protein = ad.AnnData(X=np.random.rand(100, 50))
# Concatenate along variables to combine modalities
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)
# Add labels to distinguish modalities
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50
```
## Best Practices
1. **Check compatibility before concatenating**
```python
# Verify shapes are compatible
print([adata.n_vars for adata in [adata1, adata2, adata3]])
# Check variable names match
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])
```
2. **Use appropriate join type**
- `inner`: When you need the same features across all samples (most stringent)
- `outer`: When you want to preserve all features (most inclusive)
3. **Track data sources**
Always use `label` and `keys` to track which observations came from which dataset.
4. **Consider memory usage**
- For large datasets, use `AnnCollection` or `concat_on_disk`
- Consider backed mode for the result
5. **Handle batch effects**
Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:
```python
# After concatenation, apply batch correction
import scanpy as sc
sc.pp.combat(adata_combined, key='batch')
```
6. **Validate results**
```python
# Check dimensions
print(adata_combined.shape)
# Check batch distribution
print(adata_combined.obs['batch'].value_counts())
# Verify metadata integrity
print(adata_combined.var.head())
print(adata_combined.obs.head())
```

View File

@@ -0,0 +1,314 @@
# AnnData Object Structure
The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.
## Core Components
### X (Data Matrix)
The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.
```python
import anndata as ad
import numpy as np
# Create with dense array
adata = ad.AnnData(X=np.random.rand(100, 2000))
# Create with sparse matrix (recommended for large, sparse data)
from scipy.sparse import csr_matrix
sparse_data = csr_matrix(np.random.rand(100, 2000))
adata = ad.AnnData(X=sparse_data)
```
Access data:
```python
# Full matrix (caution with large datasets)
full_data = adata.X
# Single observation
obs_data = adata.X[0, :]
# Single variable across all observations
var_data = adata.X[:, 0]
```
### obs (Observation Annotations)
DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.
```python
import pandas as pd
# Create AnnData with observation metadata
obs_df = pd.DataFrame({
'cell_type': ['T cell', 'B cell', 'Monocyte'],
'treatment': ['control', 'treated', 'control'],
'timepoint': [0, 24, 24]
}, index=['cell_1', 'cell_2', 'cell_3'])
adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)
# Access observation metadata
print(adata.obs['cell_type'])
print(adata.obs.loc['cell_1'])
```
### var (Variable Annotations)
DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.
```python
# Create AnnData with variable metadata
var_df = pd.DataFrame({
'gene_name': ['ACTB', 'GAPDH', 'TP53'],
'chromosome': ['7', '12', '17'],
'highly_variable': [True, False, True]
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])
adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)
# Access variable metadata
print(adata.var['gene_name'])
print(adata.var.loc['ENSG00001'])
```
### layers (Alternative Data Representations)
Dictionary storing alternative matrices with the same dimensions as X.
```python
# Store raw counts, normalized data, and scaled data
adata = ad.AnnData(X=np.random.rand(100, 2000))
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()
# Access layers
raw_data = adata.layers['raw_counts']
normalized_data = adata.layers['normalized']
```
Common layer uses:
- `raw_counts`: Original count data before normalization
- `normalized`: Log-normalized or TPM values
- `scaled`: Z-scored values for analysis
- `imputed`: Data after imputation
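Two common ways to consume a layer (a small sketch, assuming the layers created above):
```python
# Pull a layer out as a labeled DataFrame (obs_names x var_names)
norm_df = adata.to_df(layer='normalized')
# Or promote a layer to be the main matrix for downstream tools
adata.X = adata.layers['raw_counts'].copy()
```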
### obsm (Multi-dimensional Observation Annotations)
Dictionary storing multi-dimensional arrays aligned to observations.
```python
# Store PCA coordinates and UMAP embeddings
adata.obsm['X_pca'] = np.random.rand(100, 50) # 50 principal components
adata.obsm['X_umap'] = np.random.rand(100, 2) # 2D UMAP coordinates
adata.obsm['X_tsne'] = np.random.rand(100, 2) # 2D t-SNE coordinates
# Access embeddings
pca_coords = adata.obsm['X_pca']
umap_coords = adata.obsm['X_umap']
```
Common obsm uses:
- `X_pca`: Principal component coordinates
- `X_umap`: UMAP embedding coordinates
- `X_tsne`: t-SNE embedding coordinates
- `X_diffmap`: Diffusion map coordinates
- `protein_expression`: Protein abundance measurements (CITE-seq)
### varm (Multi-dimensional Variable Annotations)
Dictionary storing multi-dimensional arrays aligned to variables.
```python
# Store PCA loadings
adata.varm['PCs'] = np.random.rand(2000, 50) # Loadings for 50 components
adata.varm['gene_modules'] = np.random.rand(2000, 10) # Gene module scores
# Access loadings
pc_loadings = adata.varm['PCs']
```
Common varm uses:
- `PCs`: Principal component loadings
- `gene_modules`: Gene co-expression module assignments
### obsp (Pairwise Observation Relationships)
Dictionary storing sparse matrices representing relationships between observations.
```python
from scipy.sparse import csr_matrix
# Store k-nearest neighbor graph
n_obs = 100
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
# Access graphs
knn_connections = adata.obsp['connectivities']
distances = adata.obsp['distances']
```
Common obsp uses:
- `connectivities`: Cell-cell neighborhood graph
- `distances`: Pairwise distances between cells
### varp (Pairwise Variable Relationships)
Dictionary storing sparse matrices representing relationships between variables.
```python
# Store gene-gene correlation matrix
n_vars = 2000
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
adata.varp['correlations'] = gene_corr
# Access correlations
gene_correlations = adata.varp['correlations']
```
### uns (Unstructured Annotations)
Dictionary storing arbitrary unstructured metadata.
```python
# Store analysis parameters and results
adata.uns['experiment_date'] = '2025-11-03'
adata.uns['pca'] = {
'variance_ratio': [0.15, 0.10, 0.08],
'params': {'n_comps': 50}
}
adata.uns['neighbors'] = {
'params': {'n_neighbors': 15, 'method': 'umap'},
'connectivities_key': 'connectivities'
}
# Access unstructured data
exp_date = adata.uns['experiment_date']
pca_params = adata.uns['pca']['params']
```
Common uns uses:
- Analysis parameters and settings
- Color palettes for plotting
- Cluster information
- Tool-specific metadata
### raw (Original Data Snapshot)
Optional attribute preserving the original data matrix and variable annotations before filtering.
```python
# Create AnnData and store raw state
adata = ad.AnnData(X=np.random.rand(100, 5000))
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]
# Store raw state before filtering
adata.raw = adata.copy()
# Filter to highly variable genes
highly_variable_mask = np.random.rand(5000) > 0.5
adata = adata[:, highly_variable_mask]
# Access original data
original_matrix = adata.raw.X
original_var = adata.raw.var
```
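A common follow-up, assuming the raw snapshot stored above, is to pull the unfiltered data back out:
```python
# Recover a full AnnData object from the raw snapshot
adata_full = adata.raw.to_adata()
print(adata_full.shape)  # (100, 5000) - all genes again
# Or pull raw values only for the genes kept after filtering
raw_subset = adata.raw[:, adata.var_names].X
```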
## Object Properties
```python
# Dimensions
n_observations = adata.n_obs
n_variables = adata.n_vars
shape = adata.shape # (n_obs, n_vars)
# Index information
obs_names = adata.obs_names # Observation identifiers
var_names = adata.var_names # Variable identifiers
# Storage mode
is_view = adata.is_view # True if this is a view of another object
is_backed = adata.isbacked # True if backed by on-disk storage
filename = adata.filename # Path to backing file (if backed)
```
## Creating AnnData Objects
### From arrays and DataFrames
```python
import anndata as ad
import numpy as np
import pandas as pd
# Minimal creation
X = np.random.rand(100, 2000)
adata = ad.AnnData(X)
# With metadata
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)
# With all components
adata = ad.AnnData(
X=X,
obs=obs,
var=var,
layers={'raw': np.random.randint(0, 100, (100, 2000))},
obsm={'X_pca': np.random.rand(100, 50)},
uns={'experiment': 'test'}
)
```
### From DataFrame
```python
# Create from pandas DataFrame (genes as columns, cells as rows)
df = pd.DataFrame(
np.random.rand(100, 50),
columns=[f'Gene_{i}' for i in range(50)],
index=[f'Cell_{i}' for i in range(100)]
)
adata = ad.AnnData(df)
```
## Data Access Patterns
### Vector extraction
```python
# Get observation annotation as array
cell_types = adata.obs_vector('cell_type')
# Get variable values across observations
gene_expression = adata.obs_vector('ACTB') # If ACTB is in var_names
# Get variable annotation as array
gene_names = adata.var_vector('gene_name')
```
### Subsetting
```python
# By index
subset = adata[0:10, 0:100] # First 10 obs, first 100 vars
# By name
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]
# By boolean mask
high_count_cells = adata.obs['total_counts'] > 1000
subset = adata[high_count_cells, :]
# By observation metadata
t_cells = adata[adata.obs['cell_type'] == 'T cell']
```
## Memory Considerations
The AnnData structure is designed for memory efficiency:
- Sparse matrices reduce memory for sparse data
- Views avoid copying data when possible
- Backed mode enables working with data larger than RAM
- Categorical annotations reduce memory for discrete values
```python
# Convert strings to categoricals (more memory efficient)
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
adata.strings_to_categoricals()
# Check if object is a view (doesn't own data)
if adata.is_view:
adata = adata.copy() # Create independent copy
```

View File

@@ -0,0 +1,404 @@
# Input/Output Operations
AnnData provides comprehensive I/O functionality for reading and writing data in various formats.
## Native Formats
### H5AD (HDF5-based)
The recommended native format for AnnData objects, providing efficient storage and fast access.
#### Writing H5AD files
```python
import anndata as ad
# Write to file
adata.write_h5ad('data.h5ad')
# Write with compression
adata.write_h5ad('data.h5ad', compression='gzip')
# Write with specific compression level (0-9, higher = more compression)
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
```
#### Reading H5AD files
```python
# Read entire file into memory
adata = ad.read_h5ad('data.h5ad')
# Read in backed mode (lazy loading for large files)
adata = ad.read_h5ad('data.h5ad', backed='r') # Read-only
adata = ad.read_h5ad('data.h5ad', backed='r+') # Read-write
# Backed mode enables working with datasets larger than RAM
# Only accessed data is loaded into memory
```
#### Backed mode operations
```python
# Open in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Access metadata without loading X into memory
print(adata.obs.head())
print(adata.var.head())
# Subset operations create views
subset = adata[:100, :500] # View, no data loaded
# Load specific data into memory
X_subset = subset.X[:] # Now loads this subset
# Convert entire backed object to memory
adata_memory = adata.to_memory()
```
### Zarr
Hierarchical array storage format, optimized for cloud storage and parallel I/O.
#### Writing Zarr
```python
# Write to Zarr store
adata.write_zarr('data.zarr')
# Write with specific chunks (important for performance)
adata.write_zarr('data.zarr', chunks=(100, 100))
```
#### Reading Zarr
```python
# Read Zarr store
adata = ad.read_zarr('data.zarr')
```
#### Remote Zarr access
```python
import fsspec
# Access Zarr from S3
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
adata = ad.read_zarr(store)
# Access Zarr from URL
store = fsspec.get_mapper('https://example.com/data.zarr')
adata = ad.read_zarr(store)
```
## Alternative Input Formats
### CSV/TSV
```python
# Read CSV (genes as columns, cells as rows)
adata = ad.read_csv('data.csv')
# Read with custom delimiter
adata = ad.read_csv('data.tsv', delimiter='\t')
# Specify that first column is row names
adata = ad.read_csv('data.csv', first_column_names=True)
```
### Excel
```python
# Read a specific sheet from an Excel file (the sheet name is required)
adata = ad.read_excel('data.xlsx', sheet='Sheet1')
```
### Matrix Market (MTX)
Common format for sparse matrices in genomics.
```python
import pandas as pd
# Read MTX (the matrix only; names come from the companion files)
# Typical trio: matrix.mtx, genes.tsv, barcodes.tsv
adata = ad.read_mtx('matrix.mtx')
# MTX files usually store genes as rows, so transpose to cells x genes
adata = adata.T
# Attach gene and barcode names from the companion TSV files
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0].values
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0].values
```
### 10X Genomics formats
```python
# These readers are provided by scanpy rather than anndata
import scanpy as sc
# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# Read 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
# Specify genome if multiple are present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')
```
### Loom
```python
# Read Loom file
adata = ad.read_loom('data.loom')
# Read with specific observation and variable annotations
adata = ad.read_loom(
'data.loom',
obs_names='CellID',
var_names='Gene'
)
```
### Text files
```python
# Read generic text file
adata = ad.read_text('data.txt', delimiter='\t')
# Read with custom parameters
adata = ad.read_text(
'data.txt',
delimiter=',',
first_column_names=True,
dtype='float32'
)
```
### UMI tools
```python
# Read UMI tools format
adata = ad.read_umi_tools('counts.tsv')
```
### HDF5 (generic)
```python
# Read from HDF5 file (not h5ad format)
adata = ad.read_hdf('data.h5', key='dataset')
```
## Alternative Output Formats
### CSV
```python
# Write to CSV files (creates a directory of files)
adata.write_csvs('output_dir/', skip_data=False)
# This creates, among others:
# - output_dir/X.csv   (expression matrix, only when skip_data=False)
# - output_dir/obs.csv (observation annotations)
# - output_dir/var.csv (variable annotations)
# - output_dir/obsm.csv, varm.csv and uns/ (remaining annotations)
# Omit the X matrix (this is the default)
adata.write_csvs('output_dir/')
```
### Loom
```python
# Write to Loom format
adata.write_loom('output.loom')
```
## Reading Specific Elements
For fine-grained control, read specific elements from storage:
```python
from anndata import read_elem
import h5py
with h5py.File('data.h5ad', 'r') as f:
    # Read just observation annotations
    obs = read_elem(f['obs'])
    # Read a specific layer
    layer = read_elem(f['layers/normalized'])
    # Read an unstructured data element
    params = read_elem(f['uns/pca_params'])
```
## Writing Specific Elements
```python
from anndata import write_elem
import h5py
# Write element to existing file
with h5py.File('data.h5ad', 'a') as f:
write_elem(f, 'new_layer', adata.X.copy())
```
## Lazy Operations
For very large datasets, use lazy reading to avoid loading entire datasets:
```python
from anndata.experimental import read_elem_lazy  # older anndata releases: read_elem_as_dask
import h5py
f = h5py.File('large_data.h5ad', 'r')
# Lazy read (returns a dask-backed array; keep the file open while using it)
X_lazy = read_elem_lazy(f['X'])
# Compute only when needed
subset = X_lazy[:100, :100].compute()
```
## Common I/O Patterns
### Convert between formats
```python
# MTX to H5AD
adata = ad.read_mtx('matrix.mtx').T
adata.write_h5ad('data.h5ad')
# CSV to H5AD
adata = ad.read_csv('data.csv')
adata.write_h5ad('data.h5ad')
# H5AD to Zarr
adata = ad.read_h5ad('data.h5ad')
adata.write_zarr('data.zarr')
```
### Load metadata without data
```python
# Backed mode allows inspecting metadata without loading X
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
print(adata.obs.columns)
print(adata.var.columns)
# X is not loaded into memory
```
### Append to existing file
```python
# Open in read-write mode
adata = ad.read_h5ad('data.h5ad', backed='r+')
# Modify metadata (kept in memory)
adata.obs['new_column'] = np.arange(adata.n_obs)
# Persist the metadata changes to the backing file
adata.write()
```
### Download from URL
```python
import anndata as ad
import urllib.request
# read_h5ad expects a local path, so download the file first
url = 'https://example.com/data.h5ad'
urllib.request.urlretrieve(url, 'local_file.h5ad')
adata = ad.read_h5ad('local_file.h5ad')
# For streaming access to remote data, use Zarr with fsspec (see above)
```
## Performance Tips
### Reading
- Use `backed='r'` for large files you only need to query
- Use `backed='r+'` if you need to modify metadata without loading all data
- H5AD format is generally fastest for random access
- Zarr is better for cloud storage and parallel access
- Consider compression for storage, but note it may slow down reading
### Writing
- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'`
- LZF compression is faster but compresses less than GZIP
- For Zarr, tune chunk sizes based on access patterns:
- Larger chunks for sequential reads
- Smaller chunks for random access
- Convert string columns to categorical before writing (smaller files)
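As a rough sketch of the chunking advice above (the chunk shapes are placeholders to tune for the real access pattern):
```python
# Mostly row-wise (per-cell) access downstream: chunk whole rows
adata.write_zarr('rows.zarr', chunks=(1000, adata.n_vars))
# Random access to small blocks: smaller, squarer chunks
adata.write_zarr('blocks.zarr', chunks=(256, 256))
```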
### Memory management
```python
# Convert strings to categoricals (reduces file size and memory)
adata.strings_to_categoricals()
adata.write_h5ad('data.h5ad')
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
if isinstance(adata.X, np.ndarray):
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.5: # If more than 50% zeros
adata.X = csr_matrix(adata.X)
```
## Handling Large Datasets
### Strategy 1: Backed mode
```python
# Work with dataset larger than RAM
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
# Filter based on metadata (fast, no data loading)
filtered = adata[adata.obs['quality_score'] > 0.8]
# Load filtered subset into memory
adata_memory = filtered.to_memory()
```
### Strategy 2: Chunked processing
```python
# Process data in chunks
adata = ad.read_h5ad('large_file.h5ad', backed='r')
chunk_size = 1000
results = []
for i in range(0, adata.n_obs, chunk_size):
chunk = adata[i:i+chunk_size, :].to_memory()
# Process chunk
result = process(chunk)
results.append(result)
```
### Strategy 3: Use AnnCollection
```python
from anndata.experimental import AnnCollection
# Open datasets in backed mode (expression data stays on disk)
adatas = [ad.read_h5ad(f'dataset_{i}.h5ad', backed='r') for i in range(10)]
collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)
# Data is loaded only when a subset is materialized
batch = collection[:1000].to_adata()
```
## Common Issues and Solutions
### Issue: Out of memory when reading
**Solution**: Use backed mode or read in chunks
```python
adata = ad.read_h5ad('file.h5ad', backed='r')
```
### Issue: Slow reading from cloud storage
**Solution**: Use Zarr format with appropriate chunking
```python
adata.write_zarr('data.zarr', chunks=(1000, 1000))
```
### Issue: Large file sizes
**Solution**: Use compression and convert to sparse/categorical
```python
adata.strings_to_categoricals()
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
adata.write_h5ad('compressed.h5ad', compression='gzip')
```
### Issue: Cannot modify backed object
**Solution**: Either load to memory or open in 'r+' mode
```python
# Option 1: Load to memory
adata = adata.to_memory()
# Option 2: Open in read-write mode
adata = ad.read_h5ad('file.h5ad', backed='r+')
```

View File

@@ -0,0 +1,516 @@
# Data Manipulation
Operations for transforming, subsetting, and manipulating AnnData objects.
## Subsetting
### By indices
```python
import anndata as ad
import numpy as np
adata = ad.AnnData(X=np.random.rand(1000, 2000))
# Integer indices
subset = adata[0:100, 0:500] # First 100 obs, first 500 vars
# List of indices
obs_indices = [0, 10, 20, 30, 40]
var_indices = [0, 1, 2, 3, 4]
subset = adata[obs_indices, var_indices]
# Single observation or variable
single_obs = adata[0, :]
single_var = adata[:, 0]
```
### By names
```python
import pandas as pd
# Create with named indices
obs_names = [f'cell_{i}' for i in range(1000)]
var_names = [f'gene_{i}' for i in range(2000)]
adata = ad.AnnData(
X=np.random.rand(1000, 2000),
obs=pd.DataFrame(index=obs_names),
var=pd.DataFrame(index=var_names)
)
# Subset by observation names
subset = adata[['cell_0', 'cell_1', 'cell_2'], :]
# Subset by variable names
subset = adata[:, ['gene_0', 'gene_10', 'gene_20']]
# Both axes
subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']]
```
### By boolean masks
```python
# Create boolean masks
high_count_obs = np.random.rand(1000) > 0.5
high_var_genes = np.random.rand(2000) > 0.7
# Subset using masks
subset = adata[high_count_obs, :]
subset = adata[:, high_var_genes]
subset = adata[high_count_obs, high_var_genes]
```
### By metadata conditions
```python
# Add metadata
adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000)
adata.obs['quality_score'] = np.random.rand(1000)
adata.var['highly_variable'] = np.random.rand(2000) > 0.8
# Filter by cell type
t_cells = adata[adata.obs['cell_type'] == 'A']
# Filter by multiple conditions
high_quality_a_cells = adata[
(adata.obs['cell_type'] == 'A') &
(adata.obs['quality_score'] > 0.7)
]
# Filter by variable metadata
hv_genes = adata[:, adata.var['highly_variable']]
# Complex conditions
filtered = adata[
(adata.obs['quality_score'] > 0.5) &
(adata.obs['cell_type'].isin(['A', 'B'])),
adata.var['highly_variable']
]
```
## Transposition
```python
# Transpose AnnData object (swap observations and variables)
adata_T = adata.T
# Shape changes
print(adata.shape) # (1000, 2000)
print(adata_T.shape) # (2000, 1000)
# obs and var are swapped
print(adata.obs.head()) # Observation metadata
print(adata_T.var.head()) # Same data, now as variable metadata
# Useful when data is in opposite orientation
# Common with some file formats where genes are rows
```
## Copying
### Full copy
```python
# Create independent copy
adata_copy = adata.copy()
# Modifications to copy don't affect original
adata_copy.obs['new_column'] = 1
print('new_column' in adata.obs.columns) # False
```
### Shallow copy
```python
# View (doesn't copy data, modifications affect original)
adata_view = adata[0:100, :]
# Check if object is a view
print(adata_view.is_view) # True
# Convert view to independent copy
adata_independent = adata_view.copy()
print(adata_independent.is_view) # False
```
## Renaming
### Rename observations and variables
```python
# Rename all observations
adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)]
# Rename all variables
adata.var_names = [f'new_gene_{i}' for i in range(adata.n_vars)]
# Make names unique (add suffix to duplicates)
adata.obs_names_make_unique()
adata.var_names_make_unique()
```
### Rename categories
```python
# Create categorical column
adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A'])
# Rename categories
adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C'])
# Or rename with a mapping via pandas (AnnData's method expects a list)
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories({
    'Type_A': 'T_cell',
    'Type_B': 'B_cell',
    'Type_C': 'Monocyte'
})
```
## Type Conversions
### Strings to categoricals
```python
# Convert string columns to categorical (more memory efficient)
adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500
adata.obs['tissue'] = ['brain', 'liver'] * 500
# Convert all string columns to categorical
adata.strings_to_categoricals()
print(adata.obs['cell_type'].dtype) # category
print(adata.obs['tissue'].dtype) # category
```
### Sparse to dense and vice versa
```python
from scipy.sparse import csr_matrix
# Dense to sparse
if not isinstance(adata.X, csr_matrix):
adata.X = csr_matrix(adata.X)
# Sparse to dense
if isinstance(adata.X, csr_matrix):
adata.X = adata.X.toarray()
# Convert layer
adata.layers['normalized'] = csr_matrix(adata.layers['normalized'])
```
## Chunked Operations
Process large datasets in chunks:
```python
# Iterate over X in row chunks; each iteration yields (chunk, start, end)
chunk_size = 100
row_sums = []
for chunk, start, end in adata.chunked_X(chunk_size):
    # Process the chunk (here: per-cell totals)
    row_sums.append(np.asarray(chunk.sum(axis=1)).ravel())
row_sums = np.concatenate(row_sums)
```
## Extracting Vectors
### Get observation vectors
```python
# Get observation metadata as array
cell_types = adata.obs_vector('cell_type')
# Get gene expression across observations
actb_expression = adata.obs_vector('ACTB') # If ACTB in var_names
```
### Get variable vectors
```python
# Get variable metadata as array
gene_names = adata.var_vector('gene_name')
```
## Adding/Modifying Data
### Add observations
```python
# Create new observations
new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars))
new_obs.var_names = adata.var_names
# Concatenate with existing
adata_extended = ad.concat([adata, new_obs], axis=0)
```
### Add variables
```python
# Create new variables
new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100))
new_vars.obs_names = adata.obs_names
# Concatenate with existing
adata_extended = ad.concat([adata, new_vars], axis=1)
```
### Add metadata columns
```python
# Add observation annotation
adata.obs['new_score'] = np.random.rand(adata.n_obs)
# Add variable annotation
adata.var['new_label'] = ['label'] * adata.n_vars
# Add from external data
external_data = pd.read_csv('metadata.csv', index_col=0)
adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column']
```
### Add layers
```python
# Add new layer
adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape)
adata.layers['log_transformed'] = np.log1p(adata.X)
# Replace layer
adata.layers['normalized'] = new_normalized_data
```
### Add embeddings
```python
# Add PCA
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)
# Add UMAP
adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2)
# Add multiple embeddings
adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2)
adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10)
```
### Add pairwise relationships
```python
from scipy.sparse import csr_matrix
# Add nearest neighbor graph
n_obs = adata.n_obs
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
# Add distance matrix
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
```
### Add unstructured data
```python
# Add analysis parameters
adata.uns['pca'] = {
'variance': [0.2, 0.15, 0.1],
'variance_ratio': [0.4, 0.3, 0.2],
'params': {'n_comps': 50}
}
# Add color schemes
adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF']
```
## Removing Data
### Remove observations or variables
```python
# Keep only specific observations
keep_obs = adata.obs['quality_score'] > 0.5
adata = adata[keep_obs, :]
# Remove specific variables
remove_vars = adata.var['low_count']
adata = adata[:, ~remove_vars]
```
### Remove metadata columns
```python
# Remove observation column
adata.obs.drop('unwanted_column', axis=1, inplace=True)
# Remove variable column
adata.var.drop('unwanted_column', axis=1, inplace=True)
```
### Remove layers
```python
# Remove specific layer
del adata.layers['unwanted_layer']
# Remove all layers
adata.layers = {}
```
### Remove embeddings
```python
# Remove specific embedding
del adata.obsm['X_tsne']
# Remove all embeddings
adata.obsm = {}
```
### Remove unstructured data
```python
# Remove specific key
del adata.uns['unwanted_key']
# Remove all unstructured data
adata.uns = {}
```
## Reordering
### Sort observations
```python
# Sort by observation metadata
adata = adata[adata.obs.sort_values('quality_score').index, :]
# Sort by observation names
adata = adata[sorted(adata.obs_names), :]
```
### Sort variables
```python
# Sort by variable metadata
adata = adata[:, adata.var.sort_values('gene_name').index]
# Sort by variable names
adata = adata[:, sorted(adata.var_names)]
```
### Reorder to match external list
```python
# Reorder observations to match external list
desired_order = ['cell_10', 'cell_5', 'cell_20', ...]
adata = adata[desired_order, :]
# Reorder variables
desired_genes = ['TP53', 'ACTB', 'GAPDH', ...]
adata = adata[:, desired_genes]
```
## Data Transformations
### Normalize
```python
# Total count normalization (CPM/TPM-like)
total_counts = adata.X.sum(axis=1)
adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6
# Log transformation
adata.layers['log1p'] = np.log1p(adata.X)
# Z-score normalization
mean = adata.X.mean(axis=0)
std = adata.X.std(axis=0)
adata.layers['scaled'] = (adata.X - mean) / std
```
### Filter
```python
# Filter cells by total counts
total_counts = np.array(adata.X.sum(axis=1)).flatten()
adata.obs['total_counts'] = total_counts
adata = adata[adata.obs['total_counts'] > 1000, :]
# Filter genes by detection rate
detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs
adata.var['detection_rate'] = np.array(detection_rate).flatten()
adata = adata[:, adata.var['detection_rate'] > 0.01]
```
## Working with Views
Views are lightweight references to subsets of data that don't copy the underlying matrix:
```python
# Create view
view = adata[0:100, 0:500]
print(view.is_view) # True
# Views allow read access
data = view.X
# Assigning to a view triggers copy-on-write: the view silently
# becomes an independent copy instead of modifying the original
# Convert view to independent copy
independent = view.copy()
# Force AnnData to be a copy, not a view
adata = adata.copy()
```
## Merging Metadata
```python
# Merge external metadata
external_metadata = pd.read_csv('additional_metadata.csv', index_col=0)
# Join on index (pandas join is a left join by default, keeping all adata observations)
adata.obs = adata.obs.join(external_metadata)
# Equivalent explicit merge
adata.obs = adata.obs.merge(
external_metadata,
left_index=True,
right_index=True,
how='left'
)
```
## Common Manipulation Patterns
### Quality control filtering
```python
# Calculate QC metrics
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.var['n_cells'] = (adata.X > 0).sum(axis=0)
# Filter low-quality cells
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['total_counts'] < 50000, :]
# Filter rarely detected genes
adata = adata[:, adata.var['n_cells'] >= 3]
```
### Select highly variable genes
```python
# Mark highly variable genes
gene_variance = np.var(adata.X, axis=0)
adata.var['variance'] = np.array(gene_variance).flatten()
adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90)
# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']].copy()
```
### Downsample
```python
# Random sampling of observations
np.random.seed(42)
n_sample = 500
sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False)
adata_downsampled = adata[sample_indices, :].copy()
# Stratified sampling by cell type
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
range(adata.n_obs),
test_size=0.2,
stratify=adata.obs['cell_type']
)
adata_train = adata[train_idx, :].copy()
adata_test = adata[test_idx, :].copy()
```
### Split train/test
```python
# Random train/test split
np.random.seed(42)
n_obs = adata.n_obs
train_size = int(0.8 * n_obs)
indices = np.random.permutation(n_obs)
train_indices = indices[:train_size]
test_indices = indices[train_size:]
adata_train = adata[train_indices, :].copy()
adata_test = adata[test_indices, :].copy()
```

View File

@@ -0,0 +1,237 @@
---
name: arboreto
description: Infer gene regulatory networks (GRNs) from gene expression data using scalable algorithms (GRNBoost2, GENIE3). Use when analyzing transcriptomics data (bulk RNA-seq, single-cell RNA-seq) to identify transcription factor-target gene relationships and regulatory interactions. Supports distributed computation for large-scale datasets.
---
# Arboreto
## Overview
Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
**Core capability**: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).
## Quick Start
Install arboreto:
```bash
uv pip install arboreto
```
Basic GRN inference:
```python
import pandas as pd
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Load expression data (genes as columns)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
# Infer regulatory network
network = grnboost2(expression_data=expression_matrix)
# Save results (TF, target, importance)
network.to_csv('network.tsv', sep='\t', index=False, header=False)
```
**Critical**: Always use `if __name__ == '__main__':` guard because Dask spawns new processes.
## Core Capabilities
### 1. Basic GRN Inference
For standard GRN inference workflows including:
- Input data preparation (Pandas DataFrame or NumPy array)
- Running inference with GRNBoost2 or GENIE3
- Filtering by transcription factors
- Output format and interpretation
**See**: `references/basic_inference.md`
**Use the ready-to-run script**: `scripts/basic_grn_inference.py` for standard inference tasks:
```bash
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
```
### 2. Algorithm Selection
Arboreto provides two algorithms:
**GRNBoost2 (Recommended)**:
- Fast gradient boosting-based inference
- Optimized for large datasets (10k+ observations)
- Default choice for most analyses
**GENIE3**:
- Random Forest-based inference
- Original multiple regression approach
- Use for comparison or validation
Quick comparison:
```python
from arboreto.algo import grnboost2, genie3
# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)
# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
```
**For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md`
### 3. Distributed Computing
Scale inference from local multi-core to cluster environments:
**Local (default)** - Uses all available cores automatically:
```python
network = grnboost2(expression_data=matrix)
```
**Custom local client** - Control resources:
```python
from distributed import LocalCluster, Client
local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)
network = grnboost2(expression_data=matrix, client_or_address=client)
client.close()
local_cluster.close()
```
**Cluster computing** - Connect to remote Dask scheduler:
```python
from distributed import Client
client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)
```
**For cluster setup, performance optimization, and large-scale workflows**: `references/distributed_computing.md`
## Installation
```bash
uv pip install arboreto
```
**Dependencies**: scipy, scikit-learn, numpy, pandas, dask, distributed
## Common Use Cases
### Single-Cell RNA-seq Analysis
```python
import pandas as pd
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Load single-cell expression matrix (cells x genes)
sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')
# Infer cell-type-specific regulatory network
network = grnboost2(expression_data=sc_data, seed=42)
# Filter high-confidence links
high_confidence = network[network['importance'] > 0.5]
high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
```
### Bulk RNA-seq with TF Filtering
```python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Load data
expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
tf_names = load_tf_names('human_tfs.txt')
# Infer with TF restriction
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=123
)
network.to_csv('tf_target_network.tsv', sep='\t', index=False)
```
### Comparative Analysis (Multiple Conditions)
```python
import pandas as pd
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Infer networks for different conditions
conditions = ['control', 'treatment_24h', 'treatment_48h']
for condition in conditions:
data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
network = grnboost2(expression_data=data, seed=42)
network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
```
## Output Interpretation
Arboreto returns a DataFrame with regulatory links:
| Column | Description |
|--------|-------------|
| `TF` | Transcription factor (regulator) |
| `target` | Target gene |
| `importance` | Regulatory importance score (higher = stronger) |
**Filtering strategy**:
- Top N links per target gene
- Importance threshold (e.g., > 0.5)
- Statistical significance testing (permutation tests)
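For example, the first two strategies can be applied directly to the returned DataFrame (the cutoff and N here are illustrative):
```python
# network is the DataFrame returned by grnboost2/genie3
# Keep the top 5 regulators per target gene
top_links = (
    network.sort_values('importance', ascending=False)
           .groupby('target', sort=False)
           .head(5)
)
# Or apply a simple importance cutoff
strong_links = network[network['importance'] > 0.5]
```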
## Integration with pySCENIC
Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
```python
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)
# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)
```
## Reproducibility
Always set a seed for reproducible results:
```python
network = grnboost2(expression_data=matrix, seed=777)
```
Run multiple seeds for robustness analysis:
```python
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2
if __name__ == '__main__':
client = Client(LocalCluster())
seeds = [42, 123, 777]
networks = []
for seed in seeds:
net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
networks.append(net)
# Combine networks and filter consensus links
consensus = analyze_consensus(networks)
```
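The `analyze_consensus` call above is left undefined; one minimal way to implement it (a sketch, not part of arboreto) is to keep links recovered in every run and average their importance:
```python
import pandas as pd
def analyze_consensus(networks, min_support=None):
    """Hypothetical helper: keep links found in at least min_support runs."""
    if min_support is None:
        min_support = len(networks)  # require the link in every run
    combined = pd.concat(networks, ignore_index=True)
    stats = (
        combined.groupby(['TF', 'target'], as_index=False)
                .agg(importance=('importance', 'mean'),
                     support=('importance', 'size'))
    )
    return stats[stats['support'] >= min_support]
```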
## Troubleshooting
**Memory errors**: Reduce dataset size by filtering low-variance genes or use distributed computing
**Slow performance**: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list
**Dask errors**: Ensure `if __name__ == '__main__':` guard is present in scripts
**Empty results**: Check data format (genes as columns), verify TF names match gene names

View File

@@ -0,0 +1,138 @@
# GRN Inference Algorithms
Arboreto provides two algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.
## Algorithm Overview
Both algorithms follow the same inference strategy:
1. For each target gene in the dataset, train a regression model
2. Identify the most important features (potential regulators) from the model
3. Emit these features as candidate regulators with importance scores
The key difference is **computational efficiency** and the underlying regression method.
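As a plain illustration of this strategy (not arboreto's implementation: no Dask, no early stopping, and untuned hyperparameters), the per-target regression idea looks roughly like this:
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
def toy_grn(expression, tf_names):
    """Illustrative single-machine sketch of the per-target regression strategy."""
    links = []
    tfs = expression[tf_names]
    for target in expression.columns:
        # A gene cannot be its own regulator
        predictors = tfs.drop(columns=[target], errors='ignore')
        model = GradientBoostingRegressor(n_estimators=50, random_state=0)
        model.fit(predictors.values, expression[target].values)
        # Feature importances become candidate regulatory links
        for tf, importance in zip(predictors.columns, model.feature_importances_):
            links.append((tf, target, importance))
    return pd.DataFrame(links, columns=['TF', 'target', 'importance'])
```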
## GRNBoost2 (Recommended)
**Purpose**: Fast GRN inference for large-scale datasets using gradient boosting.
### When to Use
- **Large datasets**: Tens of thousands of observations (e.g., single-cell RNA-seq)
- **Time-constrained analysis**: Need faster results than GENIE3
- **Default choice**: GRNBoost2 is the flagship algorithm and recommended for most use cases
### Technical Details
- **Method**: Stochastic gradient boosting with early-stopping regularization
- **Performance**: Significantly faster than GENIE3 on large datasets
- **Output**: Same format as GENIE3 (TF-target-importance triplets)
### Usage
```python
from arboreto.algo import grnboost2
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
seed=42 # For reproducibility
)
```
### Parameters
```python
grnboost2(
expression_data, # Required: pandas DataFrame or numpy array
gene_names=None, # Required for numpy arrays
tf_names='all', # List of TF names or 'all'
verbose=False, # Print progress messages
client_or_address='local', # Dask client or scheduler address
seed=None # Random seed for reproducibility
)
```
## GENIE3
**Purpose**: Classic Random Forest-based GRN inference, serving as the conceptual blueprint.
### When to Use
- **Smaller datasets**: When dataset size allows for longer computation
- **Comparison studies**: When comparing with published GENIE3 results
- **Validation**: To validate GRNBoost2 results
### Technical Details
- **Method**: Random Forest or ExtraTrees regression
- **Foundation**: Original multiple regression GRN inference strategy
- **Trade-off**: More computationally expensive but well-established
### Usage
```python
from arboreto.algo import genie3
network = genie3(
expression_data=expression_matrix,
tf_names=tf_names,
seed=42
)
```
### Parameters
```python
genie3(
expression_data, # Required: pandas DataFrame or numpy array
gene_names=None, # Required for numpy arrays
tf_names='all', # List of TF names or 'all'
verbose=False, # Print progress messages
client_or_address='local', # Dask client or scheduler address
seed=None # Random seed for reproducibility
)
```
## Algorithm Comparison
| Feature | GRNBoost2 | GENIE3 |
|---------|-----------|--------|
| **Speed** | Fast (optimized for large data) | Slower |
| **Method** | Gradient boosting | Random Forest |
| **Best for** | Large-scale data (10k+ observations) | Small-medium datasets |
| **Output format** | Same | Same |
| **Inference strategy** | Multiple regression | Multiple regression |
| **Recommended** | Yes (default choice) | For comparison/validation |
## Advanced: Custom Regressor Parameters
For advanced users, the lower-level `arboreto.algo.diy` entry point accepts custom scikit-learn regressor parameters:
```python
from arboreto.algo import diy
# Custom gradient-boosting parameters (GRNBoost2-style)
custom_grnboost2 = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',
    regressor_kwargs={
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1
    }
)
# Custom Random Forest parameters (GENIE3-style)
custom_genie3 = diy(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        'n_estimators': 1000,
        'max_features': 'sqrt'
    }
)
```
## Choosing the Right Algorithm
**Decision guide**:
1. **Start with GRNBoost2** - It's faster and handles large datasets better
2. **Use GENIE3 if**:
- Comparing with existing GENIE3 publications
- Dataset is small-medium sized
- Validating GRNBoost2 results
Both algorithms produce comparable regulatory networks with the same output format, making them interchangeable for most analyses.

View File

@@ -0,0 +1,151 @@
# Basic GRN Inference with Arboreto
## Input Data Requirements
Arboreto requires gene expression data in one of two formats:
### Pandas DataFrame (Recommended)
- **Rows**: Observations (cells, samples, conditions)
- **Columns**: Genes (with gene names as column headers)
- **Format**: Numeric expression values
Example:
```python
import pandas as pd
# Load expression matrix with genes as columns
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
# Columns: ['gene1', 'gene2', 'gene3', ...]
# Rows: observation data
```
### NumPy Array
- **Shape**: (observations, genes)
- **Requirement**: Separately provide gene names list matching column order
Example:
```python
import numpy as np
expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)
with open('expression_data.tsv') as f:
gene_names = [gene.strip() for gene in f.readline().split('\t')]
assert expression_matrix.shape[1] == len(gene_names)
```
## Transcription Factors (TFs)
Optionally provide a list of transcription factor names to restrict regulatory inference:
```python
from arboreto.utils import load_tf_names
# Load from file (one TF per line)
tf_names = load_tf_names('transcription_factors.txt')
# Or define directly
tf_names = ['TF1', 'TF2', 'TF3']
```
If not provided, all genes are considered potential regulators.
## Basic Inference Workflow
### Using Pandas DataFrame
```python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Load expression data
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
# Load transcription factors (optional)
tf_names = load_tf_names('tf_list.txt')
# Run GRN inference
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names # Optional
)
# Save results
network.to_csv('network_output.tsv', sep='\t', index=False, header=False)
```
**Critical**: The `if __name__ == '__main__':` guard is required because Dask spawns new processes internally.
### Using NumPy Array
```python
import numpy as np
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Load expression matrix
expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)
# Extract gene names from header
with open('expression_data.tsv') as f:
gene_names = [gene.strip() for gene in f.readline().split('\t')]
# Verify dimensions match
assert expression_matrix.shape[1] == len(gene_names)
# Run inference with explicit gene names
network = grnboost2(
expression_data=expression_matrix,
gene_names=gene_names,
tf_names=tf_names  # optional, loaded as in the DataFrame example above
)
network.to_csv('network_output.tsv', sep='\t', index=False, header=False)
```
## Output Format
Arboreto returns a Pandas DataFrame with three columns:
| Column | Description |
|--------|-------------|
| `TF` | Transcription factor (regulator) gene name |
| `target` | Target gene name |
| `importance` | Regulatory importance score (higher = stronger regulation) |
Example output:
```
TF1 gene5 0.856
TF2 gene12 0.743
TF1 gene8 0.621
```
## Setting Random Seed
For reproducible results, provide a seed parameter:
```python
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
seed=777
)
```
## Algorithm Selection
Use `grnboost2()` for most cases (faster, handles large datasets):
```python
from arboreto.algo import grnboost2
network = grnboost2(expression_data=expression_matrix)
```
Use `genie3()` for comparison or specific requirements:
```python
from arboreto.algo import genie3
network = genie3(expression_data=expression_matrix)
```
See `references/algorithms.md` for detailed algorithm comparison.

View File

@@ -0,0 +1,242 @@
# Distributed Computing with Arboreto
Arboreto leverages Dask for parallelized computation, enabling efficient GRN inference from single-machine multi-core processing to multi-node cluster environments.
## Computation Architecture
GRN inference is inherently parallelizable:
- Each target gene's regression model can be trained independently
- Arboreto represents computation as a Dask task graph
- Tasks are distributed across available computational resources
## Local Multi-Core Processing (Default)
By default, arboreto uses all available CPU cores on the local machine:
```python
from arboreto.algo import grnboost2
# Automatically uses all local cores
network = grnboost2(expression_data=expression_matrix, tf_names=tf_names)
```
This is sufficient for most use cases and requires no additional configuration.
## Custom Local Dask Client
For fine-grained control over local resources, create a custom Dask client:
```python
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Configure local cluster
local_cluster = LocalCluster(
n_workers=10, # Number of worker processes
threads_per_worker=1, # Threads per worker
memory_limit='8GB' # Memory limit per worker
)
# Create client
custom_client = Client(local_cluster)
# Run inference with custom client
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
client_or_address=custom_client
)
# Clean up
custom_client.close()
local_cluster.close()
```
### Benefits of Custom Client
- **Resource control**: Limit CPU and memory usage
- **Multiple runs**: Reuse same client for different parameter sets
- **Monitoring**: Access Dask dashboard for performance insights
## Multiple Inference Runs with Same Client
Reuse a single Dask client for multiple inference runs with different parameters:
```python
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Initialize client once
local_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(local_cluster)
# Run multiple inferences
network_seed1 = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
client_or_address=client,
seed=666
)
network_seed2 = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
client_or_address=client,
seed=777
)
# Different algorithms with same client
from arboreto.algo import genie3
network_genie3 = genie3(
expression_data=expression_matrix,
tf_names=tf_names,
client_or_address=client
)
# Clean up once
client.close()
local_cluster.close()
```
## Distributed Cluster Computing
For very large datasets, connect to a remote Dask distributed scheduler running on a cluster:
### Step 1: Set Up Dask Scheduler (on cluster head node)
```bash
dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786
```
### Step 2: Start Dask Workers (on cluster compute nodes)
```bash
dask-worker tcp://10.118.224.134:8786
```
### Step 3: Connect from Client
```python
from distributed import Client
from arboreto.algo import grnboost2
if __name__ == '__main__':
# Connect to remote scheduler
scheduler_address = 'tcp://10.118.224.134:8786'
cluster_client = Client(scheduler_address)
# Run inference on cluster
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
client_or_address=cluster_client
)
cluster_client.close()
```
### Cluster Configuration Best Practices
**Worker configuration**:
```bash
# 4 processes per node, 1 thread per process, 16 GB memory per process
dask-worker tcp://scheduler:8786 \
  --nprocs 4 \
  --nthreads 1 \
  --memory-limit 16GB
```
**For large-scale inference**:
- Use more workers with moderate memory rather than fewer workers with large memory
- Set `threads_per_worker=1` to avoid GIL contention in scikit-learn
- Monitor memory usage to prevent workers from being killed
## Monitoring and Debugging
### Dask Dashboard
Access the Dask dashboard for real-time monitoring:
```python
from distributed import Client
client = Client() # Prints dashboard URL
# Dashboard available at: http://localhost:8787/status
```
The dashboard shows:
- **Task progress**: Number of tasks completed/pending
- **Resource usage**: CPU, memory per worker
- **Task stream**: Real-time visualization of computation
- **Performance**: Bottleneck identification
### Verbose Output
Enable verbose logging to track inference progress:
```python
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names,
verbose=True
)
```
## Performance Optimization Tips
### 1. Data Format
- **Use Pandas DataFrame when possible**: More efficient than NumPy for Dask operations
- **Reduce data size**: Filter low-variance genes before inference
### 2. Worker Configuration
- **CPU-bound tasks**: Set `threads_per_worker=1`, increase `n_workers`
- **Memory-bound tasks**: Increase `memory_limit` per worker
### 3. Cluster Setup
- **Network**: Ensure high-bandwidth, low-latency network between nodes
- **Storage**: Use shared filesystem or object storage for large datasets
- **Scheduling**: Allocate dedicated nodes to avoid resource contention
### 4. Transcription Factor Filtering
- **Limit TF list**: Providing specific TF names reduces computation
```python
# Full search (slow)
network = grnboost2(expression_data=matrix)
# Filtered search (faster)
network = grnboost2(expression_data=matrix, tf_names=known_tfs)
```
## Example: Large-Scale Single-Cell Analysis
Complete workflow for processing single-cell RNA-seq data on a cluster:
```python
from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd
if __name__ == '__main__':
# Connect to cluster
client = Client('tcp://cluster-scheduler:8786')
# Load large single-cell dataset (50,000 cells x 20,000 genes)
expression_data = pd.read_csv('scrnaseq_data.tsv', sep='\t')
# Load cell-type-specific TFs
tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()
# Run distributed inference
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
client_or_address=client,
verbose=True,
seed=42
)
# Save results
network.to_csv('grn_results.tsv', sep='\t', index=False)
client.close()
```
This approach enables analysis of datasets that would be impractical on a single machine.

View File

@@ -0,0 +1,97 @@
#!/usr/bin/env python3
"""
Basic GRN inference example using Arboreto.
This script demonstrates the standard workflow for inferring gene regulatory
networks from expression data using GRNBoost2.
Usage:
python basic_grn_inference.py <expression_file> <output_file> [--tf-file TF_FILE] [--seed SEED]
Arguments:
expression_file: Path to expression matrix (TSV format, genes as columns)
output_file: Path for output network (TSV format)
--tf-file: Optional path to transcription factors file (one per line)
--seed: Random seed for reproducibility (default: 777)
"""
import argparse
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
def run_grn_inference(expression_file, output_file, tf_file=None, seed=777):
"""
Run GRN inference using GRNBoost2.
Args:
expression_file: Path to expression matrix TSV file
output_file: Path for output network file
tf_file: Optional path to TF names file
seed: Random seed for reproducibility
"""
print(f"Loading expression data from {expression_file}...")
expression_data = pd.read_csv(expression_file, sep='\t')
print(f"Expression matrix shape: {expression_data.shape}")
print(f"Number of genes: {expression_data.shape[1]}")
print(f"Number of observations: {expression_data.shape[0]}")
# Load TF names if provided
tf_names = 'all'
if tf_file:
print(f"Loading transcription factors from {tf_file}...")
tf_names = load_tf_names(tf_file)
print(f"Number of TFs: {len(tf_names)}")
# Run GRN inference
print(f"Running GRNBoost2 with seed={seed}...")
network = grnboost2(
expression_data=expression_data,
tf_names=tf_names,
seed=seed,
verbose=True
)
# Save results
print(f"Saving network to {output_file}...")
network.to_csv(output_file, sep='\t', index=False, header=False)
print(f"Done! Network contains {len(network)} regulatory links.")
print(f"\nTop 10 regulatory links:")
print(network.head(10).to_string(index=False))
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Infer gene regulatory network using GRNBoost2'
)
parser.add_argument(
'expression_file',
help='Path to expression matrix (TSV format, genes as columns)'
)
parser.add_argument(
'output_file',
help='Path for output network (TSV format)'
)
parser.add_argument(
'--tf-file',
help='Path to transcription factors file (one per line)',
default=None
)
parser.add_argument(
'--seed',
help='Random seed for reproducibility (default: 777)',
type=int,
default=777
)
args = parser.parse_args()
run_grn_inference(
expression_file=args.expression_file,
output_file=args.output_file,
tf_file=args.tf_file,
seed=args.seed
)

View File

@@ -0,0 +1,325 @@
---
name: astropy
description: Comprehensive Python library for astronomy and astrophysics. This skill should be used when working with astronomical data including celestial coordinates, physical units, FITS files, cosmological calculations, time systems, tables, world coordinate systems (WCS), and astronomical data analysis. Use when tasks involve coordinate transformations, unit conversions, FITS file manipulation, cosmological distance calculations, time scale conversions, or astronomical data processing.
---
# Astropy
## Overview
Astropy is the core Python package for astronomy, providing essential functionality for astronomical research and data analysis. Use astropy for coordinate transformations, unit and quantity calculations, FITS file operations, cosmological calculations, precise time handling, tabular data manipulation, and astronomical image processing.
## When to Use This Skill
Use astropy when tasks involve:
- Converting between celestial coordinate systems (ICRS, Galactic, FK5, AltAz, etc.)
- Working with physical units and quantities (converting Jy to mJy, parsecs to km, etc.)
- Reading, writing, or manipulating FITS files (images or tables)
- Cosmological calculations (luminosity distance, lookback time, Hubble parameter)
- Precise time handling with different time scales (UTC, TAI, TT, TDB) and formats (JD, MJD, ISO)
- Table operations (reading catalogs, cross-matching, filtering, joining)
- WCS transformations between pixel and world coordinates
- Astronomical constants and calculations
## Quick Start
```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.time import Time
from astropy.io import fits
from astropy.table import Table
from astropy.cosmology import Planck18
# Units and quantities
distance = 100 * u.pc
distance_km = distance.to(u.km)
# Coordinates
coord = SkyCoord(ra=10.5*u.degree, dec=41.2*u.degree, frame='icrs')
coord_galactic = coord.galactic
# Time
t = Time('2023-01-15 12:30:00')
jd = t.jd # Julian Date
# FITS files
data = fits.getdata('image.fits')
header = fits.getheader('image.fits')
# Tables
table = Table.read('catalog.fits')
# Cosmology
d_L = Planck18.luminosity_distance(z=1.0)
```
## Core Capabilities
### 1. Units and Quantities (`astropy.units`)
Handle physical quantities with units, perform unit conversions, and ensure dimensional consistency in calculations.
**Key operations:**
- Create quantities by multiplying values with units
- Convert between units using `.to()` method
- Perform arithmetic with automatic unit handling
- Use equivalencies for domain-specific conversions (spectral, doppler, parallax)
- Work with logarithmic units (magnitudes, decibels)
**See:** `references/units.md` for comprehensive documentation, unit systems, equivalencies, performance optimization, and unit arithmetic.
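A quick illustration of plain conversions versus the equivalencies mentioned above (the values are arbitrary):
```python
import astropy.units as u
# Plain unit conversion
(100 * u.pc).to(u.km)
# Domain-specific conversions need an equivalency
(500 * u.nm).to(u.GHz, equivalencies=u.spectral())   # wavelength <-> frequency
(10 * u.mas).to(u.pc, equivalencies=u.parallax())    # parallax <-> distance
```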
### 2. Coordinate Systems (`astropy.coordinates`)
Represent celestial positions and transform between different coordinate frames.
**Key operations:**
- Create coordinates with `SkyCoord` in any frame (ICRS, Galactic, FK5, AltAz, etc.)
- Transform between coordinate systems
- Calculate angular separations and position angles
- Match coordinates to catalogs
- Include distance for 3D coordinate operations
- Handle proper motions and radial velocities
- Query named objects from online databases
**See:** `references/coordinates.md` for detailed coordinate frame descriptions, transformations, observer-dependent frames (AltAz), catalog matching, and performance tips.
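A small sketch of separations and catalog matching (the coordinates are made up):
```python
import astropy.units as u
from astropy.coordinates import SkyCoord
c1 = SkyCoord(ra=10.5 * u.deg, dec=41.2 * u.deg)
c2 = SkyCoord(ra=11.0 * u.deg, dec=41.0 * u.deg)
sep = c1.separation(c2)  # angular separation as an Angle
# Match each source to its nearest catalog entry
sources = SkyCoord(ra=[10.1, 10.6] * u.deg, dec=[41.1, 41.3] * u.deg)
catalog = SkyCoord(ra=[10.0, 10.5, 11.0] * u.deg, dec=[41.0, 41.2, 41.4] * u.deg)
idx, d2d, d3d = sources.match_to_catalog_sky(catalog)
```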
### 3. Cosmological Calculations (`astropy.cosmology`)
Perform cosmological calculations using standard cosmological models.
**Key operations:**
- Use built-in cosmologies (Planck18, WMAP9, etc.)
- Create custom cosmological models
- Calculate distances (luminosity, comoving, angular diameter)
- Compute ages and lookback times
- Determine Hubble parameter at any redshift
- Calculate density parameters and volumes
- Perform inverse calculations (find z for given distance)
**See:** `references/cosmology.md` for available models, distance calculations, time calculations, density parameters, and neutrino effects.
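For example, the inverse calculation mentioned above can be done with `z_at_value` (the target distance is arbitrary):
```python
import astropy.units as u
from astropy.cosmology import Planck18, z_at_value
# Forward: distance at a given redshift
d = Planck18.luminosity_distance(1.0)
# Inverse: redshift at which the luminosity distance equals 7 Gpc
z = z_at_value(Planck18.luminosity_distance, 7 * u.Gpc)
```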
### 4. FITS File Handling (`astropy.io.fits`)
Read, write, and manipulate FITS (Flexible Image Transport System) files.
**Key operations:**
- Open FITS files with context managers
- Access HDUs (Header Data Units) by index or name
- Read and modify headers (keywords, comments, history)
- Work with image data (NumPy arrays)
- Handle table data (binary and ASCII tables)
- Create new FITS files (single or multi-extension)
- Use memory mapping for large files
- Access remote FITS files (S3, HTTP)
**See:** `references/fits.md` for comprehensive file operations, header manipulation, image and table handling, multi-extension files, and performance considerations.
### 5. Table Operations (`astropy.table`)
Work with tabular data with support for units, metadata, and various file formats.
**Key operations:**
- Create tables from arrays, lists, or dictionaries
- Read/write tables in multiple formats (FITS, CSV, HDF5, VOTable)
- Access and modify columns and rows
- Sort, filter, and index tables
- Perform database-style operations (join, group, aggregate)
- Stack and concatenate tables
- Work with unit-aware columns (QTable)
- Handle missing data with masking
**See:** `references/tables.md` for table creation, I/O operations, data manipulation, sorting, filtering, joins, grouping, and performance tips.
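A brief sketch of a unit-aware table and a database-style join (column names are illustrative):
```python
from astropy.table import QTable, join
import astropy.units as u

t1 = QTable({'id': [1, 2, 3], 'flux': [1.2, 2.3, 3.4] * u.Jy})
t2 = QTable({'id': [2, 3, 4], 'z': [0.5, 1.1, 2.0]})

matched = join(t1, t2, keys='id')  # inner join on the shared 'id' column
matched.sort('flux')
```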
### 6. Time Handling (`astropy.time`)
Precise time representation and conversion between time scales and formats.
**Key operations:**
- Create Time objects in various formats (ISO, JD, MJD, Unix, etc.)
- Convert between time scales (UTC, TAI, TT, TDB, etc.)
- Perform time arithmetic with TimeDelta
- Calculate sidereal time for observers
- Compute light travel time corrections (barycentric, heliocentric)
- Work with time arrays efficiently
- Handle masked (missing) times
**See:** `references/time.md` for time formats, time scales, conversions, arithmetic, observing features, and precision handling.
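A minimal sketch of format access, scale conversion, and time arithmetic:
```python
from astropy.time import Time, TimeDelta
import astropy.units as u

t = Time('2023-01-15 12:30:00', scale='utc')
print(t.mjd)     # same instant as Modified Julian Date
print(t.tt.iso)  # same instant on the TT scale

t_later = t + TimeDelta(2, format='jd') + 6 * u.hour  # add 2 days and 6 hours
```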
### 7. World Coordinate System (`astropy.wcs`)
Transform between pixel coordinates in images and world coordinates.
**Key operations:**
- Read WCS from FITS headers
- Convert pixel coordinates to world coordinates (and vice versa)
- Calculate image footprints
- Access WCS parameters (reference pixel, projection, scale)
- Create custom WCS objects
**See:** `references/wcs_and_other_modules.md` for WCS operations and transformations.
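A short sketch of the round trip between pixel and world coordinates (assumes a FITS image with a celestial WCS, here called `image.fits`):
```python
from astropy.io import fits
from astropy.wcs import WCS

with fits.open('image.fits') as hdul:
    wcs = WCS(hdul[0].header)

sky = wcs.pixel_to_world(100, 200)   # SkyCoord at pixel (100, 200)
x, y = wcs.world_to_pixel(sky)       # back to pixel coordinates
```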
## Additional Capabilities
The `references/wcs_and_other_modules.md` file also covers:
### NDData and CCDData
Containers for n-dimensional datasets with metadata, uncertainty, masking, and WCS information.
### Modeling
Framework for creating and fitting mathematical models to astronomical data.
### Visualization
Tools for astronomical image display with appropriate stretching and scaling.
### Constants
Physical and astronomical constants with proper units (speed of light, solar mass, Planck constant, etc.).
### Convolution
Image processing kernels for smoothing and filtering.
### Statistics
Robust statistical functions including sigma clipping and outlier rejection.
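As a small illustration of the statistics tools, a sketch of sigma-clipped statistics on synthetic data:
```python
import numpy as np
from astropy.stats import sigma_clipped_stats

data = np.random.normal(100.0, 5.0, size=10_000)
data[:10] = 10_000.0  # inject a few outliers
mean, median, std = sigma_clipped_stats(data, sigma=3.0)  # robust to the outliers
```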
## Installation
```bash
# Install astropy
uv pip install astropy
# With optional dependencies for full functionality
uv pip install astropy[all]
```
## Common Workflows
### Converting Coordinates Between Systems
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create coordinate
c = SkyCoord(ra='05h23m34.5s', dec='-69d45m22s', frame='icrs')
# Transform to galactic
c_gal = c.galactic
print(f"l={c_gal.l.deg}, b={c_gal.b.deg}")
# Transform to alt-az (requires time and location)
from astropy.time import Time
from astropy.coordinates import EarthLocation, AltAz
observing_time = Time('2023-06-15 23:00:00')
observing_location = EarthLocation(lat=40*u.deg, lon=-120*u.deg)
aa_frame = AltAz(obstime=observing_time, location=observing_location)
c_altaz = c.transform_to(aa_frame)
print(f"Alt={c_altaz.alt.deg}, Az={c_altaz.az.deg}")
```
### Reading and Analyzing FITS Files
```python
from astropy.io import fits
import numpy as np
# Open FITS file
with fits.open('observation.fits') as hdul:
# Display structure
hdul.info()
# Get image data and header
data = hdul[1].data
header = hdul[1].header
# Access header values
exptime = header['EXPTIME']
filter_name = header['FILTER']
# Analyze data
mean = np.mean(data)
median = np.median(data)
print(f"Mean: {mean}, Median: {median}")
```
### Cosmological Distance Calculations
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
# Calculate distances at z=1.5
z = 1.5
d_L = Planck18.luminosity_distance(z)
d_A = Planck18.angular_diameter_distance(z)
print(f"Luminosity distance: {d_L}")
print(f"Angular diameter distance: {d_A}")
# Age of universe at that redshift
age = Planck18.age(z)
print(f"Age at z={z}: {age.to(u.Gyr)}")
# Lookback time
t_lookback = Planck18.lookback_time(z)
print(f"Lookback time: {t_lookback.to(u.Gyr)}")
```
### Cross-Matching Catalogs
```python
from astropy.table import Table
from astropy.coordinates import SkyCoord, match_coordinates_sky
import astropy.units as u
# Read catalogs
cat1 = Table.read('catalog1.fits')
cat2 = Table.read('catalog2.fits')
# Create coordinate objects
coords1 = SkyCoord(ra=cat1['RA']*u.degree, dec=cat1['DEC']*u.degree)
coords2 = SkyCoord(ra=cat2['RA']*u.degree, dec=cat2['DEC']*u.degree)
# Find matches
idx, sep, _ = coords1.match_to_catalog_sky(coords2)
# Filter by separation threshold
max_sep = 1 * u.arcsec
matches = sep < max_sep
# Create matched catalogs
cat1_matched = cat1[matches]
cat2_matched = cat2[idx[matches]]
print(f"Found {len(cat1_matched)} matches")
```
## Best Practices
1. **Always use units**: Attach units to quantities to avoid errors and ensure dimensional consistency
2. **Use context managers for FITS files**: Ensures proper file closing
3. **Prefer arrays over loops**: Process multiple coordinates/times as arrays for better performance
4. **Check coordinate frames**: Verify the frame before transformations
5. **Use appropriate cosmology**: Choose the right cosmological model for your analysis
6. **Handle missing data**: Use masked columns for tables with missing values
7. **Specify time scales**: Be explicit about time scales (UTC, TT, TDB) for precise timing
8. **Use QTable for unit-aware tables**: When table columns have units
9. **Check WCS validity**: Verify WCS before using transformations
10. **Cache frequently used values**: Expensive calculations (e.g., cosmological distances) can be cached, as in the sketch below
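One way to apply practice 10 is to cache luminosity distances on a redshift grid and interpolate; this is a sketch, with the grid bounds and resolution chosen arbitrarily:
```python
import numpy as np
import astropy.units as u
from astropy.cosmology import Planck18

# Precompute once on a redshift grid
z_grid = np.linspace(0.001, 3.0, 1000)
dL_grid = Planck18.luminosity_distance(z_grid).to(u.Mpc).value

def fast_luminosity_distance(z):
    """Approximate d_L by interpolating the cached grid."""
    return np.interp(z, z_grid, dL_grid) * u.Mpc
```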
## Documentation and Resources
- Official Astropy Documentation: https://docs.astropy.org/en/stable/
- Tutorials: https://learn.astropy.org/
- GitHub: https://github.com/astropy/astropy
## Reference Files
For detailed information on specific modules:
- `references/units.md` - Units, quantities, conversions, and equivalencies
- `references/coordinates.md` - Coordinate systems, transformations, and catalog matching
- `references/cosmology.md` - Cosmological models and calculations
- `references/fits.md` - FITS file operations and manipulation
- `references/tables.md` - Table creation, I/O, and operations
- `references/time.md` - Time formats, scales, and calculations
- `references/wcs_and_other_modules.md` - WCS, NDData, modeling, visualization, constants, and utilities


@@ -0,0 +1,273 @@
# Astronomical Coordinates (astropy.coordinates)
The `astropy.coordinates` package provides tools for representing celestial coordinates and transforming between different coordinate systems.
## Creating Coordinates with SkyCoord
The high-level `SkyCoord` class is the recommended interface:
```python
from astropy import units as u
from astropy.coordinates import SkyCoord
# Decimal degrees
c = SkyCoord(ra=10.625*u.degree, dec=41.2*u.degree, frame='icrs')
# Sexagesimal strings
c = SkyCoord(ra='00h42m30s', dec='+41d12m00s', frame='icrs')
# Mixed formats
c = SkyCoord('00h42.5m +41d12m', unit=(u.hourangle, u.deg))
# Galactic coordinates
c = SkyCoord(l=120.5*u.degree, b=-23.4*u.degree, frame='galactic')
```
## Array Coordinates
Process multiple coordinates efficiently using arrays:
```python
# Create array of coordinates
coords = SkyCoord(ra=[10, 11, 12]*u.degree,
dec=[41, -5, 42]*u.degree)
# Access individual elements
coords[0]
coords[1:3]
# Array operations
coords.shape
len(coords)
```
## Accessing Components
```python
c = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
# Access coordinates
c.ra # <Longitude 10.68 deg>
c.dec # <Latitude 41.27 deg>
c.ra.hour # Convert to hours
c.ra.hms # Hours, minutes, seconds tuple
c.dec.dms # Degrees, arcminutes, arcseconds tuple
```
## String Formatting
```python
c.to_string('decimal') # '10.68 41.27'
c.to_string('dms') # '10d40m48s 41d16m12s'
c.to_string('hmsdms') # '00h42m43.2s +41d16m12s'
# Custom formatting
c.ra.to_string(unit=u.hour, sep=':', precision=2)
```
## Coordinate Transformations
Transform between reference frames:
```python
c_icrs = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
# Simple transformations (as attributes)
c_galactic = c_icrs.galactic
c_fk5 = c_icrs.fk5
c_fk4 = c_icrs.fk4
# Explicit transformations
c_icrs.transform_to('galactic')
from astropy.coordinates import FK5
c_icrs.transform_to(FK5(equinox='J1975'))  # Custom frame parameters
```
## Common Coordinate Frames
### Celestial Frames
- **ICRS**: International Celestial Reference System (default, most common)
- **FK5**: Fifth Fundamental Catalogue (equinox J2000.0 by default)
- **FK4**: Fourth Fundamental Catalogue (older, requires equinox specification)
- **GCRS**: Geocentric Celestial Reference System
- **CIRS**: Celestial Intermediate Reference System
### Galactic Frames
- **Galactic**: IAU 1958 galactic coordinates
- **Supergalactic**: De Vaucouleurs supergalactic coordinates
- **Galactocentric**: Galactic center-based 3D coordinates
### Horizontal Frames
- **AltAz**: Altitude-azimuth (observer-dependent)
- **HADec**: Hour angle-declination
### Ecliptic Frames
- **GeocentricMeanEcliptic**: Geocentric mean ecliptic
- **BarycentricMeanEcliptic**: Barycentric mean ecliptic
- **HeliocentricMeanEcliptic**: Heliocentric mean ecliptic
## Observer-Dependent Transformations
For altitude-azimuth coordinates, specify observation time and location:
```python
from astropy.time import Time
from astropy.coordinates import EarthLocation, AltAz
# Define observer location
observing_location = EarthLocation(lat=40.8*u.deg, lon=-121.5*u.deg, height=1060*u.m)
# Or use named observatory
observing_location = EarthLocation.of_site('Apache Point Observatory')
# Define observation time
observing_time = Time('2023-01-15 23:00:00')
# Transform to alt-az
aa_frame = AltAz(obstime=observing_time, location=observing_location)
aa = c_icrs.transform_to(aa_frame)
print(f"Altitude: {aa.alt}")
print(f"Azimuth: {aa.az}")
```
## Working with Distances
Add distance information for 3D coordinates:
```python
# With distance
c = SkyCoord(ra=10*u.degree, dec=9*u.degree, distance=770*u.kpc, frame='icrs')
# Access 3D Cartesian coordinates
c.cartesian.x
c.cartesian.y
c.cartesian.z
# Distance from origin
c.distance
# 3D separation
c1 = SkyCoord(ra=10*u.degree, dec=9*u.degree, distance=10*u.pc)
c2 = SkyCoord(ra=11*u.degree, dec=10*u.degree, distance=11.5*u.pc)
sep_3d = c1.separation_3d(c2) # 3D distance
```
## Angular Separation
Calculate on-sky separations:
```python
c1 = SkyCoord(ra=10*u.degree, dec=9*u.degree, frame='icrs')
c2 = SkyCoord(ra=11*u.degree, dec=10*u.degree, frame='fk5')
# Angular separation (handles frame conversion automatically)
sep = c1.separation(c2)
print(f"Separation: {sep.arcsec} arcsec")
# Position angle
pa = c1.position_angle(c2)
```
## Catalog Matching
Match coordinates to catalog sources:
```python
# Single target matching
catalog = SkyCoord(ra=ra_array*u.degree, dec=dec_array*u.degree)
target = SkyCoord(ra=10.5*u.degree, dec=41.2*u.degree)
# Find closest match
idx, sep2d, dist3d = target.match_to_catalog_sky(catalog)
matched_coord = catalog[idx]
# Match with maximum separation constraint
matches = target.separation(catalog) < 1*u.arcsec
```
## Named Objects
Retrieve coordinates from online catalogs:
```python
# Query by name (requires internet)
m31 = SkyCoord.from_name("M31")
crab = SkyCoord.from_name("Crab Nebula")
psr = SkyCoord.from_name("PSR J1012+5307")
```
## Earth Locations
Define observer locations:
```python
# By coordinates
location = EarthLocation(lat=40*u.deg, lon=-120*u.deg, height=1000*u.m)
# By named observatory
keck = EarthLocation.of_site('Keck Observatory')
vlt = EarthLocation.of_site('Paranal Observatory')
# By address (requires internet)
location = EarthLocation.of_address('1002 Holy Grail Court, St. Louis, MO')
# List available observatories
EarthLocation.get_site_names()
```
## Velocity Information
Include proper motion and radial velocity:
```python
# Proper motion
c = SkyCoord(ra=10*u.degree, dec=41*u.degree,
pm_ra_cosdec=15*u.mas/u.yr,
pm_dec=5*u.mas/u.yr,
distance=150*u.pc)
# Radial velocity
c = SkyCoord(ra=10*u.degree, dec=41*u.degree,
radial_velocity=20*u.km/u.s)
# Both
c = SkyCoord(ra=10*u.degree, dec=41*u.degree, distance=150*u.pc,
pm_ra_cosdec=15*u.mas/u.yr, pm_dec=5*u.mas/u.yr,
radial_velocity=20*u.km/u.s)
```
## Representation Types
Switch between coordinate representations:
```python
# Cartesian representation
c = SkyCoord(x=1*u.kpc, y=2*u.kpc, z=3*u.kpc,
representation_type='cartesian', frame='icrs')
# Change representation
c.representation_type = 'cylindrical'
c.rho # Cylindrical radius
c.phi # Azimuthal angle
c.z # Height
# Spherical (default for most frames)
c.representation_type = 'spherical'
```
## Performance Tips
1. **Use arrays, not loops**: Process multiple coordinates as single array
2. **Pre-compute frames**: Reuse frame objects for multiple transformations
3. **Use broadcasting**: Efficiently transform many positions across many times
4. **Enable interpolation**: For dense time sampling, use ErfaAstromInterpolator
```python
# Fast approach
coords = SkyCoord(ra=ra_array*u.degree, dec=dec_array*u.degree)
coords_transformed = coords.transform_to('galactic')
# Slow approach (avoid)
for ra, dec in zip(ra_array, dec_array):
c = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
c_transformed = c.transform_to('galactic')
```
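For tip 4, a minimal sketch using the interpolation context manager (it reuses `coords` from above and the `aa_frame` defined in the observer-dependent section; the 300 s grid spacing is an arbitrary choice):
```python
from astropy.coordinates import erfa_astrom, ErfaAstromInterpolator
import astropy.units as u

# Interpolate the expensive Earth-orientation quantities on a 300 s grid
with erfa_astrom.set(ErfaAstromInterpolator(300 * u.s)):
    coords_altaz = coords.transform_to(aa_frame)
```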


@@ -0,0 +1,307 @@
# Cosmological Calculations (astropy.cosmology)
The `astropy.cosmology` subpackage provides tools for cosmological calculations based on various cosmological models.
## Using Built-in Cosmologies
Preloaded cosmologies based on WMAP and Planck observations:
```python
from astropy.cosmology import Planck18, Planck15, Planck13
from astropy.cosmology import WMAP9, WMAP7, WMAP5
from astropy import units as u
# Use Planck 2018 cosmology
cosmo = Planck18
# Calculate distance to z=4
d = cosmo.luminosity_distance(4)
print(f"Luminosity distance at z=4: {d}")
# Age of universe at z=0
age = cosmo.age(0)
print(f"Current age of universe: {age.to(u.Gyr)}")
```
## Creating Custom Cosmologies
### FlatLambdaCDM (Most Common)
Flat universe with cosmological constant:
```python
from astropy.cosmology import FlatLambdaCDM
# Define cosmology
cosmo = FlatLambdaCDM(
H0=70 * u.km / u.s / u.Mpc, # Hubble constant at z=0
Om0=0.3, # Matter density parameter at z=0
Tcmb0=2.725 * u.K # CMB temperature (optional)
)
```
### LambdaCDM (Non-Flat)
Non-flat universe with cosmological constant:
```python
from astropy.cosmology import LambdaCDM
cosmo = LambdaCDM(
H0=70 * u.km / u.s / u.Mpc,
Om0=0.3,
Ode0=0.7 # Dark energy density parameter
)
```
### wCDM and w0wzCDM
Dark energy with equation of state parameter:
```python
from astropy.cosmology import FlatwCDM, w0wzCDM
# Constant w
cosmo_w = FlatwCDM(H0=70 * u.km/u.s/u.Mpc, Om0=0.3, w0=-0.9)
# Evolving w(z) = w0 + wz * z
cosmo_wz = w0wzCDM(H0=70 * u.km/u.s/u.Mpc, Om0=0.3, Ode0=0.7,
w0=-1.0, wz=0.1)
```
## Distance Calculations
### Comoving Distance
Line-of-sight comoving distance:
```python
d_c = cosmo.comoving_distance(z)
```
### Luminosity Distance
Distance for calculating luminosity from observed flux:
```python
d_L = cosmo.luminosity_distance(z)
# Calculate absolute magnitude from apparent magnitude
M = m - 5*np.log10(d_L.to(u.pc).value) + 5
```
### Angular Diameter Distance
Distance for calculating physical size from angular size:
```python
d_A = cosmo.angular_diameter_distance(z)
# Calculate physical size from angular size
theta = 10 * u.arcsec # Angular size
physical_size = d_A * theta.to(u.radian).value
```
### Comoving Transverse Distance
Transverse comoving distance (equals comoving distance in flat universe):
```python
d_M = cosmo.comoving_transverse_distance(z)
```
### Distance Modulus
```python
dm = cosmo.distmod(z)
# Relates apparent and absolute magnitudes: m - M = dm
```
## Scale Calculations
### kpc per Arcminute
Physical scale at a given redshift:
```python
scale = cosmo.kpc_proper_per_arcmin(z)
# e.g., roughly 490 kpc per arcminute (~8 kpc per arcsecond) at z=1 for Planck18
```
### Comoving Volume
Volume element for survey volume calculations:
```python
vol = cosmo.comoving_volume(z) # Total volume to redshift z
vol_element = cosmo.differential_comoving_volume(z) # dV/dz
```
## Time Calculations
### Age of Universe
Age at a given redshift:
```python
age = cosmo.age(z)
age_now = cosmo.age(0) # Current age
age_at_z1 = cosmo.age(1) # Age at z=1
```
### Lookback Time
Time since photons were emitted:
```python
t_lookback = cosmo.lookback_time(z)
# Time between z and z=0
```
## Hubble Parameter
Hubble parameter as function of redshift:
```python
H_z = cosmo.H(z) # H(z) in km/s/Mpc
E_z = cosmo.efunc(z) # E(z) = H(z)/H0
```
## Density Parameters
Evolution of density parameters with redshift:
```python
Om_z = cosmo.Om(z) # Matter density at z
Ode_z = cosmo.Ode(z) # Dark energy density at z
Ok_z = cosmo.Ok(z) # Curvature density at z
Ogamma_z = cosmo.Ogamma(z) # Photon density at z
Onu_z = cosmo.Onu(z) # Neutrino density at z
```
## Critical and Characteristic Densities
```python
rho_c = cosmo.critical_density(z) # Critical density at z
rho_m = cosmo.critical_density(z) * cosmo.Om(z) # Matter density
```
## Inverse Calculations
Find redshift corresponding to a specific value:
```python
from astropy.cosmology import z_at_value
# Find z at specific lookback time
z = z_at_value(cosmo.lookback_time, 10*u.Gyr)
# Find z at specific luminosity distance
z = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
# Find z at specific age
z = z_at_value(cosmo.age, 1*u.Gyr)
```
## Array Operations
All methods accept array inputs:
```python
import numpy as np
z_array = np.linspace(0, 5, 100)
d_L_array = cosmo.luminosity_distance(z_array)
H_array = cosmo.H(z_array)
age_array = cosmo.age(z_array)
```
## Neutrino Effects
Include massive neutrinos:
```python
from astropy.cosmology import FlatLambdaCDM
# With massive neutrinos
cosmo = FlatLambdaCDM(
H0=70 * u.km/u.s/u.Mpc,
Om0=0.3,
Tcmb0=2.725 * u.K,
Neff=3.04, # Effective number of neutrino species
m_nu=[0., 0., 0.06] * u.eV # Neutrino masses
)
```
Note: including massive neutrinos slows calculations by roughly a factor of 3-4 but gives more accurate results.
## Cloning and Modifying Cosmologies
Cosmology objects are immutable. Create modified copies:
```python
# Clone with different H0
cosmo_new = cosmo.clone(H0=72 * u.km/u.s/u.Mpc)
# Clone with modified name
cosmo_named = cosmo.clone(name="My Custom Cosmology")
```
## Common Use Cases
### Calculating Absolute Magnitude
```python
# From apparent magnitude and redshift
z = 1.5
m_app = 24.5 # Apparent magnitude
d_L = cosmo.luminosity_distance(z)
M_abs = m_app - cosmo.distmod(z).value
```
### Survey Volume Calculations
```python
# Volume between two redshifts
z_min, z_max = 0.5, 1.5
volume = cosmo.comoving_volume(z_max) - cosmo.comoving_volume(z_min)
# Convert to Gpc^3
volume_gpc3 = volume.to(u.Gpc**3)
```
### Physical Size from Angular Size
```python
theta = 1 * u.arcsec # Angular size
z = 2.0
d_A = cosmo.angular_diameter_distance(z)
size_kpc = (d_A * theta.to(u.radian).value).to(u.kpc)
```
### Time Since Big Bang
```python
# Age at specific redshift
z_formation = 6
age_at_formation = cosmo.age(z_formation)
time_since_formation = cosmo.age(0) - age_at_formation
```
## Comparison of Cosmologies
```python
# Compare different models
from astropy.cosmology import Planck18, WMAP9
z = 1.0
print(f"Planck18 d_L: {Planck18.luminosity_distance(z)}")
print(f"WMAP9 d_L: {WMAP9.luminosity_distance(z)}")
```
## Performance Considerations
- Calculations are fast for most purposes
- Massive neutrinos reduce speed significantly
- Array operations are vectorized and efficient
- Results valid for z < 5000-6000 (depends on model)


@@ -0,0 +1,396 @@
# FITS File Handling (astropy.io.fits)
The `astropy.io.fits` module provides comprehensive tools for reading, writing, and manipulating FITS (Flexible Image Transport System) files.
## Opening FITS Files
### Basic File Opening
```python
from astropy.io import fits
# Open file (returns HDUList - list of HDUs)
hdul = fits.open('filename.fits')
# Always close when done
hdul.close()
# Better: use context manager (automatically closes)
with fits.open('filename.fits') as hdul:
hdul.info() # Display file structure
data = hdul[0].data
```
### File Opening Modes
```python
fits.open('file.fits', mode='readonly') # Read-only (default)
fits.open('file.fits', mode='update') # Read and write
fits.open('file.fits', mode='append') # Add HDUs to file
```
### Memory Mapping
For large files, use memory mapping (default behavior):
```python
hdul = fits.open('large_file.fits', memmap=True)
# Only loads data chunks as needed
```
### Remote Files
Access cloud-hosted FITS files:
```python
uri = "s3://bucket-name/image.fits"
with fits.open(uri, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
# Use .section to get cutouts without downloading entire file
cutout = hdul[1].section[100:200, 100:200]
```
## HDU Structure
FITS files contain Header Data Units (HDUs):
- **Primary HDU** (`hdul[0]`): First HDU, always present
- **Extension HDUs** (`hdul[1:]`): Image or table extensions
```python
hdul.info() # Display all HDUs
# Output:
# No. Name Ver Type Cards Dimensions Format
# 0 PRIMARY 1 PrimaryHDU 220 ()
# 1 SCI 1 ImageHDU 140 (1014, 1014) float32
# 2 ERR 1 ImageHDU 51 (1014, 1014) float32
```
## Accessing HDUs
```python
# By index
primary = hdul[0]
extension1 = hdul[1]
# By name
sci = hdul['SCI']
# By name and version number
sci2 = hdul['SCI', 2] # Second SCI extension
```
## Working with Headers
### Reading Header Values
```python
hdu = hdul[0]
header = hdu.header
# Get keyword value (case-insensitive)
observer = header['OBSERVER']
exptime = header['EXPTIME']
# Get with default if missing
filter_name = header.get('FILTER', 'Unknown')
# Access by index
value = header[7] # 8th card's value
```
### Modifying Headers
```python
# Update existing keyword
header['OBSERVER'] = 'Edwin Hubble'
# Add/update with comment
header['OBSERVER'] = ('Edwin Hubble', 'Name of observer')
# Add keyword at specific position
header.insert(5, ('NEWKEY', 'value', 'comment'))
# Add HISTORY and COMMENT
header['HISTORY'] = 'File processed on 2025-01-15'
header['COMMENT'] = 'Note about the data'
# Delete keyword
del header['OLDKEY']
```
### Header Cards
Each keyword is stored as a "card" (80-character record):
```python
# Access full card
card = header.cards[0]
print(f"{card.keyword} = {card.value} / {card.comment}")
# Iterate over all cards
for card in header.cards:
print(f"{card.keyword}: {card.value}")
```
## Working with Image Data
### Reading Image Data
```python
# Get data from HDU
data = hdul[1].data # Returns NumPy array
# Data properties
print(data.shape) # e.g., (1024, 1024)
print(data.dtype) # e.g., float32
print(data.min(), data.max())
# Access specific pixels
pixel_value = data[100, 200]
region = data[100:200, 300:400]
```
### Data Operations
Data is a NumPy array, so use standard NumPy operations:
```python
import numpy as np
# Statistics
mean = np.mean(data)
median = np.median(data)
std = np.std(data)
# Modify data
data[data < 0] = 0 # Clip negative values
data = data * gain + bias # Calibration
# Mathematical operations
log_data = np.log10(data)
from scipy import ndimage  # requires scipy
smoothed = ndimage.gaussian_filter(data, sigma=2)
```
### Cutouts and Sections
Extract regions without loading entire array:
```python
# Section notation [y_start:y_end, x_start:x_end]
cutout = hdul[1].section[500:600, 700:800]
```
## Creating New FITS Files
### Simple Image File
```python
# Create data
data = np.random.random((100, 100))
# Create HDU
hdu = fits.PrimaryHDU(data=data)
# Add header keywords
hdu.header['OBJECT'] = 'Test Image'
hdu.header['EXPTIME'] = 300.0
# Write to file
hdu.writeto('new_image.fits')
# Overwrite if exists
hdu.writeto('new_image.fits', overwrite=True)
```
### Multi-Extension File
```python
# Create primary HDU (can have no data)
primary = fits.PrimaryHDU()
primary.header['TELESCOP'] = 'HST'
# Create image extensions
sci_data = np.ones((100, 100))
sci = fits.ImageHDU(data=sci_data, name='SCI')
err_data = np.ones((100, 100)) * 0.1
err = fits.ImageHDU(data=err_data, name='ERR')
# Combine into HDUList
hdul = fits.HDUList([primary, sci, err])
# Write to file
hdul.writeto('multi_extension.fits')
```
## Working with Table Data
### Reading Tables
```python
# Open table
with fits.open('table.fits') as hdul:
table = hdul[1].data # BinTableHDU or TableHDU
# Access columns
ra = table['RA']
dec = table['DEC']
mag = table['MAG']
# Access rows
first_row = table[0]
subset = table[10:20]
# Column info
cols = hdul[1].columns
print(cols.names)
cols.info()
```
### Creating Tables
```python
# Define columns
col1 = fits.Column(name='ID', format='K', array=[1, 2, 3, 4])
col2 = fits.Column(name='RA', format='D', array=[10.5, 11.2, 12.3, 13.1])
col3 = fits.Column(name='DEC', format='D', array=[41.2, 42.1, 43.5, 44.2])
col4 = fits.Column(name='Name', format='20A',
array=['Star1', 'Star2', 'Star3', 'Star4'])
# Create table HDU
table_hdu = fits.BinTableHDU.from_columns([col1, col2, col3, col4])
table_hdu.name = 'CATALOG'
# Write to file
table_hdu.writeto('catalog.fits', overwrite=True)
```
### Column Formats
Common FITS table column formats:
- `'A'`: Character string (e.g., '20A' for 20 characters)
- `'L'`: Logical (boolean)
- `'B'`: Unsigned byte
- `'I'`: 16-bit integer
- `'J'`: 32-bit integer
- `'K'`: 64-bit integer
- `'E'`: 32-bit floating point
- `'D'`: 64-bit floating point
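A short sketch mapping a few of these format codes to columns (values are illustrative):
```python
from astropy.io import fits
import numpy as np

cols = [
    fits.Column(name='FLAG',  format='L',   array=[True, False]),                      # logical
    fits.Column(name='COUNT', format='J',   array=np.array([10, 20], dtype=np.int32)),  # 32-bit int
    fits.Column(name='FLUX',  format='E',   array=np.array([1.5, 2.5], dtype=np.float32)),
    fits.Column(name='NAME',  format='10A', array=['src_a', 'src_b']),                  # 10-char strings
]
hdu = fits.BinTableHDU.from_columns(cols)
```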
## Modifying Existing Files
### Update Mode
```python
with fits.open('file.fits', mode='update') as hdul:
# Modify header
hdul[0].header['NEWKEY'] = 'value'
# Modify data
hdul[1].data[100, 100] = 999
# Changes automatically saved when context exits
```
### Append Mode
```python
# Add new extension to existing file
new_data = np.random.random((50, 50))
new_hdu = fits.ImageHDU(data=new_data, name='NEW_EXT')
with fits.open('file.fits', mode='append') as hdul:
hdul.append(new_hdu)
```
## Convenience Functions
For quick operations without managing HDU lists:
```python
# Get data only
data = fits.getdata('file.fits', ext=1)
# Get header only
header = fits.getheader('file.fits', ext=0)
# Get both
data, header = fits.getdata('file.fits', ext=1, header=True)
# Get single keyword value
exptime = fits.getval('file.fits', 'EXPTIME', ext=0)
# Set keyword value
fits.setval('file.fits', 'NEWKEY', value='newvalue', ext=0)
# Write simple file
fits.writeto('output.fits', data, header, overwrite=True)
# Append to file
fits.append('file.fits', data, header)
# Display file info
fits.info('file.fits')
```
## Comparing FITS Files
```python
# Print differences between two files
fits.printdiff('file1.fits', 'file2.fits')
# Compare programmatically
diff = fits.FITSDiff('file1.fits', 'file2.fits')
print(diff.report())
```
## Converting Between Formats
### FITS to/from Astropy Table
```python
from astropy.table import Table
# FITS to Table
table = Table.read('catalog.fits')
# Table to FITS
table.write('output.fits', format='fits', overwrite=True)
```
## Best Practices
1. **Always use context managers** (`with` statements) for safe file handling
2. **Avoid modifying structural keywords** (SIMPLE, BITPIX, NAXIS, etc.)
3. **Use memory mapping** for large files to conserve RAM
4. **Use .section** for remote files to avoid full downloads
5. **Check HDU structure** with `.info()` before accessing data
6. **Verify data types** before operations to avoid unexpected behavior
7. **Use convenience functions** for simple one-off operations
## Common Issues
### Handling Non-Standard FITS
Some files violate FITS standards:
```python
# Ignore verification warnings
hdul = fits.open('bad_file.fits', ignore_missing_end=True)
# Fix non-standard files
hdul = fits.open('bad_file.fits')
hdul.verify('fix') # Try to fix issues
hdul.writeto('fixed_file.fits')
```
### Large File Performance
```python
# Use memory mapping (default)
hdul = fits.open('huge_file.fits', memmap=True)
# For write operations with large arrays, use Dask
import dask.array as da
large_array = da.random.random((10000, 10000))
fits.writeto('output.fits', large_array)
```


@@ -0,0 +1,489 @@
# Table Operations (astropy.table)
The `astropy.table` module provides flexible tools for working with tabular data, with support for units, masked values, and various file formats.
## Creating Tables
### Basic Table Creation
```python
from astropy.table import Table, QTable
import astropy.units as u
import numpy as np
# From column arrays
a = [1, 4, 5]
b = [2.0, 5.0, 8.2]
c = ['x', 'y', 'z']
t = Table([a, b, c], names=('id', 'flux', 'name'))
# With units (use QTable)
flux = [1.2, 2.3, 3.4] * u.Jy
wavelength = [500, 600, 700] * u.nm
t = QTable([flux, wavelength], names=('flux', 'wavelength'))
```
### From Lists of Rows
```python
# List of tuples
rows = [(1, 10.5, 'A'), (2, 11.2, 'B'), (3, 12.3, 'C')]
t = Table(rows=rows, names=('id', 'value', 'name'))
# List of dictionaries
rows = [{'id': 1, 'value': 10.5}, {'id': 2, 'value': 11.2}]
t = Table(rows)
```
### From NumPy Arrays
```python
# Structured array
arr = np.array([(1, 2.0, 'x'), (4, 5.0, 'y')],
dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'U10')])
t = Table(arr)
# 2D array with column names
data = np.random.random((100, 3))
t = Table(data, names=['col1', 'col2', 'col3'])
```
### From Pandas DataFrame
```python
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
t = Table.from_pandas(df)
```
## Accessing Table Data
### Basic Access
```python
# Column access
ra_col = t['ra'] # Returns Column object
dec_col = t['dec']
# Row access
first_row = t[0] # Returns Row object
row_slice = t[10:20] # Returns new Table
# Cell access
value = t['ra'][5] # 6th value in 'ra' column
value = t[5]['ra'] # Same thing
# Multiple columns
subset = t['ra', 'dec', 'mag']
```
### Table Properties
```python
len(t) # Number of rows
t.colnames # List of column names
t.dtype # Column data types
t.info # Detailed information
t.meta # Metadata dictionary
```
### Iteration
```python
# Iterate over rows
for row in t:
print(row['ra'], row['dec'])
# Iterate over columns
for colname in t.colnames:
print(t[colname])
```
## Modifying Tables
### Adding Columns
```python
# Add new column
t['new_col'] = [1, 2, 3, 4, 5]
t['calc'] = t['a'] + t['b'] # Calculated column
# Add column with units
t['velocity'] = [10, 20, 30] * u.km / u.s
# Add empty column
from astropy.table import Column
t['empty'] = Column(length=len(t), dtype=float)
# Insert at specific position
t.add_column([7, 8, 9], name='inserted', index=2)
```
### Removing Columns
```python
# Remove single column
t.remove_column('old_col')
# Remove multiple columns
t.remove_columns(['col1', 'col2'])
# Delete syntax
del t['col_name']
# Keep only specific columns
t.keep_columns(['ra', 'dec', 'mag'])
```
### Renaming Columns
```python
t.rename_column('old_name', 'new_name')
# Rename multiple
t.rename_columns(['old1', 'old2'], ['new1', 'new2'])
```
### Adding Rows
```python
# Add single row
t.add_row([1, 2.5, 'new'])
# Add row as dict
t.add_row({'ra': 10.5, 'dec': 41.2, 'mag': 18.5})
# Note: Adding rows one at a time is slow!
# Better to collect rows and create table at once
```
### Modifying Data
```python
# Modify column values
t['flux'] = t['flux'] * gain
t['mag'][t['mag'] < 0] = np.nan
# Modify single cell
t['ra'][5] = 10.5
# Modify entire row
t[0] = [new_id, new_ra, new_dec]
```
## Sorting and Filtering
### Sorting
```python
# Sort by single column
t.sort('mag')
# Sort descending
t.sort('mag', reverse=True)
# Sort by multiple columns
t.sort(['priority', 'mag'])
# Get sorted indices without modifying table
indices = t.argsort('mag')
sorted_table = t[indices]
```
### Filtering
```python
# Boolean indexing
bright = t[t['mag'] < 18]
nearby = t[t['distance'] < 100*u.pc]
# Multiple conditions
selected = t[(t['mag'] < 18) & (t['dec'] > 0)]
# Using numpy functions
high_snr = t[np.abs(t['flux'] / t['error']) > 5]
```
## Reading and Writing Files
### Supported Formats
FITS, HDF5, ASCII (CSV, ECSV, IPAC, etc.), VOTable, Parquet, ASDF
### Reading Files
```python
# Automatic format detection
t = Table.read('catalog.fits')
t = Table.read('data.csv')
t = Table.read('table.vot')
# Specify format explicitly
t = Table.read('data.txt', format='ascii')
t = Table.read('catalog.hdf5', path='/dataset/table')
# Read specific HDU from FITS
t = Table.read('file.fits', hdu=2)
```
### Writing Files
```python
# Automatic format from extension
t.write('output.fits')
t.write('output.csv')
# Specify format
t.write('output.txt', format='ascii.csv')
t.write('output.hdf5', path='/data/table', serialize_meta=True)
# Overwrite existing file
t.write('output.fits', overwrite=True)
```
### ASCII Format Options
```python
# CSV with custom delimiter
t.write('output.csv', format='ascii.csv', delimiter='|')
# Fixed-width format
t.write('output.txt', format='ascii.fixed_width')
# IPAC format
t.write('output.tbl', format='ascii.ipac')
# LaTeX table
t.write('table.tex', format='ascii.latex')
```
## Table Operations
### Stacking Tables (Vertical)
```python
from astropy.table import vstack
# Concatenate tables vertically
t1 = Table([[1, 2], [3, 4]], names=('a', 'b'))
t2 = Table([[5, 6], [7, 8]], names=('a', 'b'))
t_combined = vstack([t1, t2])
```
### Joining Tables (Horizontal)
```python
from astropy.table import hstack
# Concatenate tables horizontally
t1 = Table([[1, 2]], names=['a'])
t2 = Table([[3, 4]], names=['b'])
t_combined = hstack([t1, t2])
```
### Database-Style Joins
```python
from astropy.table import join
# Inner join on common column
t1 = Table([[1, 2, 3], ['a', 'b', 'c']], names=('id', 'data1'))
t2 = Table([[1, 2, 4], ['x', 'y', 'z']], names=('id', 'data2'))
t_joined = join(t1, t2, keys='id')
# Left/right/outer joins
t_joined = join(t1, t2, join_type='left')
t_joined = join(t1, t2, join_type='outer')
```
### Grouping and Aggregating
```python
# Group by column
g = t.group_by('filter')
# Aggregate groups
means = g.groups.aggregate(np.mean)
# Iterate over groups
for group in g.groups:
print(f"Filter: {group['filter'][0]}")
print(f"Mean mag: {np.mean(group['mag'])}")
```
### Unique Rows
```python
from astropy.table import unique
# Get unique rows (keyed on a single column)
t_unique = unique(t, keys='id')
# Unique over multiple columns
t_unique = unique(t, keys=['ra', 'dec'])
```
## Units and Quantities
Use QTable for unit-aware operations:
```python
from astropy.table import QTable
# Create table with units
t = QTable()
t['flux'] = [1.2, 2.3, 3.4] * u.Jy
t['wavelength'] = [500, 600, 700] * u.nm
# Unit conversions
t['flux'].to(u.mJy)
t['wavelength'].to(u.angstrom)
# Calculations preserve units
t['freq'] = t['wavelength'].to(u.Hz, equivalencies=u.spectral())
```
## Masking Missing Data
```python
from astropy.table import MaskedColumn
import numpy as np
# Create masked column
flux = MaskedColumn([1.2, np.nan, 3.4], mask=[False, True, False])
t = Table([flux], names=['flux'])
# Operations automatically handle masks
mean_flux = np.ma.mean(t['flux'])
# Fill masked values
t['flux'].filled(0) # Replace masked with 0
```
## Indexing for Fast Lookup
Create indices for fast row retrieval:
```python
# Add index on column
t.add_index('id')
# Fast lookup by index
row = t.loc[12345] # Find row where id=12345
# Range queries
subset = t.loc[100:200]
```
## Table Metadata
```python
# Set table-level metadata
t.meta['TELESCOPE'] = 'HST'
t.meta['FILTER'] = 'F814W'
t.meta['EXPTIME'] = 300.0
# Set column-level metadata
t['ra'].meta['unit'] = 'deg'
t['ra'].meta['description'] = 'Right Ascension'
t['ra'].description = 'Right Ascension' # Shortcut
```
## Performance Tips
### Fast Table Construction
```python
# SLOW: Adding rows one at a time
t = Table(names=['a', 'b'])
for i in range(1000):
t.add_row([i, i**2])
# FAST: Build from lists
rows = [(i, i**2) for i in range(1000)]
t = Table(rows=rows, names=['a', 'b'])
```
### Memory-Mapped FITS Tables
```python
# Don't load entire table into memory
t = Table.read('huge_catalog.fits', memmap=True)
# Only loads data when accessed
subset = t[10000:10100] # Efficient
```
### Copy vs. View
```python
# Create view (shares data, fast)
t_view = t['ra', 'dec']
# Create copy (independent data)
t_copy = t['ra', 'dec'].copy()
```
## Displaying Tables
```python
# Print to console
print(t)
# Show in interactive browser
t.show_in_browser()
t.show_in_browser(jsviewer=True) # Interactive sorting/filtering
# Paginated viewing
t.more()
# Custom formatting
t['flux'].format = '%.3f'
t['ra'].format = '{:.6f}'
```
## Converting to Other Formats
```python
# To NumPy array
arr = np.array(t)
# To Pandas DataFrame
df = t.to_pandas()
# To dictionary
d = {name: t[name] for name in t.colnames}
```
## Common Use Cases
### Cross-Matching Catalogs
```python
from astropy.coordinates import SkyCoord, match_coordinates_sky
# Create coordinate objects from table columns
coords1 = SkyCoord(t1['ra'], t1['dec'], unit='deg')
coords2 = SkyCoord(t2['ra'], t2['dec'], unit='deg')
# Find matches
idx, sep, _ = coords1.match_to_catalog_sky(coords2)
# Filter by separation
max_sep = 1 * u.arcsec
matches = sep < max_sep
t1_matched = t1[matches]
t2_matched = t2[idx[matches]]
```
### Binning Data
```python
from astropy.table import Table
import numpy as np
# Bin by magnitude
mag_bins = np.arange(10, 20, 0.5)
binned = t.group_by(np.digitize(t['mag'], mag_bins))
counts = binned.groups.aggregate(len)
```


@@ -0,0 +1,404 @@
# Time Handling (astropy.time)
The `astropy.time` module provides robust tools for manipulating times and dates with support for various time scales and formats.
## Creating Time Objects
### Basic Creation
```python
from astropy.time import Time
import astropy.units as u
# ISO format (automatically detected)
t = Time('2023-01-15 12:30:45')
t = Time('2023-01-15T12:30:45')
# Specify format explicitly
t = Time('2023-01-15 12:30:45', format='iso', scale='utc')
# Julian Date
t = Time(2460000.0, format='jd')
# Modified Julian Date
t = Time(59945.0, format='mjd')
# Unix time (seconds since 1970-01-01)
t = Time(1673785845.0, format='unix')
```
### Array of Times
```python
# Multiple times
times = Time(['2023-01-01', '2023-06-01', '2023-12-31'])
# From arrays
import numpy as np
jd_array = np.linspace(2460000, 2460100, 100)
times = Time(jd_array, format='jd')
```
## Time Formats
### Supported Formats
```python
# ISO 8601
t = Time('2023-01-15 12:30:45', format='iso')
t = Time('2023-01-15T12:30:45.123', format='isot')
# Julian dates
t = Time(2460000.0, format='jd') # Julian Date
t = Time(59945.0, format='mjd') # Modified Julian Date
# Decimal year
t = Time(2023.5, format='decimalyear')
t = Time(2023.5, format='jyear') # Julian year
t = Time(2023.5, format='byear') # Besselian year
# Year and day-of-year
t = Time('2023:046', format='yday') # 46th day of 2023
# FITS format
t = Time('2023-01-15T12:30:45', format='fits')
# GPS seconds
t = Time(1000000000.0, format='gps')
# Unix time
t = Time(1673785845.0, format='unix')
# Matplotlib dates
t = Time(738521.0, format='plot_date')
# datetime objects
from datetime import datetime
dt = datetime(2023, 1, 15, 12, 30, 45)
t = Time(dt)
```
## Time Scales
### Available Time Scales
```python
# UTC - Coordinated Universal Time (default)
t = Time('2023-01-15 12:00:00', scale='utc')
# TAI - International Atomic Time
t = Time('2023-01-15 12:00:00', scale='tai')
# TT - Terrestrial Time
t = Time('2023-01-15 12:00:00', scale='tt')
# TDB - Barycentric Dynamical Time
t = Time('2023-01-15 12:00:00', scale='tdb')
# TCG - Geocentric Coordinate Time
t = Time('2023-01-15 12:00:00', scale='tcg')
# TCB - Barycentric Coordinate Time
t = Time('2023-01-15 12:00:00', scale='tcb')
# UT1 - Universal Time
t = Time('2023-01-15 12:00:00', scale='ut1')
```
### Converting Time Scales
```python
t = Time('2023-01-15 12:00:00', scale='utc')
# Convert to different scales
t_tai = t.tai
t_tt = t.tt
t_tdb = t.tdb
t_ut1 = t.ut1
# Check offset
print(f"TAI - UTC = {(t.tai - t.utc).sec} seconds")
# TAI - UTC = 37 seconds (leap seconds)
```
## Format Conversions
### Change Output Format
```python
t = Time('2023-01-15 12:30:45')
# Access in different formats
print(t.jd) # Julian Date
print(t.mjd) # Modified Julian Date
print(t.iso) # ISO format
print(t.isot) # ISO with 'T'
print(t.unix) # Unix time
print(t.decimalyear) # Decimal year
# Change default format
t.format = 'mjd'
print(t) # Displays as MJD
```
### High-Precision Output
```python
# Use subfmt for precision control
t.to_value('mjd', subfmt='float') # Standard float
t.to_value('mjd', subfmt='long') # Extended precision
t.to_value('mjd', subfmt='decimal') # Decimal (highest precision)
t.to_value('mjd', subfmt='str') # String representation
```
## Time Arithmetic
### TimeDelta Objects
```python
from astropy.time import TimeDelta
# Create time difference
dt = TimeDelta(1.0, format='jd') # 1 day
dt = TimeDelta(3600.0, format='sec') # 1 hour
# Subtract times
t1 = Time('2023-01-15')
t2 = Time('2023-02-15')
dt = t2 - t1
print(dt.jd) # 31 days
print(dt.sec) # 2678400 seconds
```
### Adding/Subtracting Time
```python
t = Time('2023-01-15 12:00:00')
# Add TimeDelta
t_future = t + TimeDelta(7, format='jd') # Add 7 days
# Add Quantity
t_future = t + 1*u.hour
t_future = t + 30*u.day
t_future = t + 1*u.year
# Subtract
t_past = t - 1*u.week
```
### Time Ranges
```python
# Create range of times
start = Time('2023-01-01')
end = Time('2023-12-31')
times = start + np.linspace(0, 365, 100) * u.day
# Or using TimeDelta
times = start + TimeDelta(np.linspace(0, 365, 100), format='jd')
```
## Observing-Related Features
### Sidereal Time
```python
from astropy.coordinates import EarthLocation
# Define observer location
location = EarthLocation(lat=40*u.deg, lon=-120*u.deg, height=1000*u.m)
# Create time with location
t = Time('2023-06-15 23:00:00', location=location)
# Calculate sidereal time
lst_apparent = t.sidereal_time('apparent')
lst_mean = t.sidereal_time('mean')
print(f"Local Sidereal Time: {lst_apparent}")
```
### Light Travel Time Corrections
```python
from astropy.coordinates import SkyCoord, EarthLocation
# Define target and observer
target = SkyCoord(ra=10*u.deg, dec=20*u.deg)
location = EarthLocation.of_site('Keck Observatory')
# Observation times
times = Time(['2023-01-01', '2023-06-01', '2023-12-31'],
location=location)
# Calculate light travel time to solar system barycenter
ltt_bary = times.light_travel_time(target, kind='barycentric')
ltt_helio = times.light_travel_time(target, kind='heliocentric')
# Apply correction
times_barycentric = times.tdb + ltt_bary
```
### Earth Rotation Angle
```python
# Earth rotation angle (for celestial to terrestrial transformations)
era = t.earth_rotation_angle()
```
## Handling Missing or Invalid Times
### Masked Times
```python
import numpy as np
# Create times with missing values
times = Time(['2023-01-01', '2023-06-01', '2023-12-31'])
times[1] = np.ma.masked # Mark as missing
# Check for masks
print(times.mask) # [False True False]
# Get unmasked version
times_clean = times.unmasked
# Fill masked values
times_filled = times.filled(Time('2000-01-01'))
```
## Time Precision and Representation
### Internal Representation
Time objects use two 64-bit floats (jd1, jd2) for high precision:
```python
t = Time('2023-01-15 12:30:45.123456789', format='iso', scale='utc')
# Access internal representation
print(t.jd1, t.jd2) # Integer and fractional parts
# This allows sub-nanosecond precision over astronomical timescales
```
### Precision
```python
# High precision for long time intervals
t1 = Time('1900-01-01')
t2 = Time('2100-01-01')
dt = t2 - t1
print(f"Time span: {dt.sec / (365.25 * 86400)} years")
# Maintains precision throughout
```
## Time Formatting
### Custom String Format
```python
t = Time('2023-01-15 12:30:45')
# Strftime-style formatting
t.strftime('%Y-%m-%d %H:%M:%S') # '2023-01-15 12:30:45'
t.strftime('%B %d, %Y') # 'January 15, 2023'
# ISO format subformats
t.iso # '2023-01-15 12:30:45.000'
t.isot # '2023-01-15T12:30:45.000'
t.to_value('iso', subfmt='date_hms') # '2023-01-15 12:30:45.000'
```
## Common Use Cases
### Converting Between Formats
```python
# MJD to ISO
t_mjd = Time(59945.0, format='mjd')
iso_string = t_mjd.iso
# ISO to JD
t_iso = Time('2023-01-15 12:00:00')
jd_value = t_iso.jd
# Unix to ISO
t_unix = Time(1673785845.0, format='unix')
iso_string = t_unix.iso
```
### Time Differences in Various Units
```python
t1 = Time('2023-01-01')
t2 = Time('2023-12-31')
dt = t2 - t1
print(f"Days: {dt.to(u.day)}")
print(f"Hours: {dt.to(u.hour)}")
print(f"Seconds: {dt.sec}")
print(f"Years: {dt.to(u.year)}")
```
### Creating Regular Time Series
```python
# Daily observations for a year
start = Time('2023-01-01')
times = start + np.arange(365) * u.day
# Hourly observations for a day
start = Time('2023-01-15 00:00:00')
times = start + np.arange(24) * u.hour
# Observations every 30 seconds
start = Time('2023-01-15 12:00:00')
times = start + np.arange(1000) * 30 * u.second
```
### Time Zone Handling
```python
# UTC to local time (requires datetime)
t = Time('2023-01-15 12:00:00', scale='utc')
dt_utc = t.to_datetime()
# Convert to specific timezone using pytz
import pytz
eastern = pytz.timezone('US/Eastern')
dt_eastern = dt_utc.replace(tzinfo=pytz.utc).astimezone(eastern)
```
### Barycentric Correction Example
```python
from astropy.coordinates import SkyCoord, EarthLocation
# Target coordinates
target = SkyCoord(ra='23h23m08.55s', dec='+18d24m59.3s')
# Observatory location
location = EarthLocation.of_site('Keck Observatory')
# Observation times (must include location)
times = Time(['2023-01-15 08:30:00', '2023-01-16 08:30:00'],
location=location)
# Calculate barycentric correction
ltt_bary = times.light_travel_time(target, kind='barycentric')
# Apply correction to get barycentric times
times_bary = times.tdb + ltt_bary
# Barycentric radial velocity correction for the same target and times
rv_correction = target.radial_velocity_correction(
    kind='barycentric', obstime=times, location=location).to(u.km / u.s)
```
## Performance Considerations
1. **Array operations are fast**: Process multiple times as arrays
2. **Format conversions are cached**: Repeated access is efficient
3. **Scale conversions may require IERS data**: downloaded automatically when needed (see the sketch after this list)
4. **High precision maintained**: Sub-nanosecond accuracy across astronomical timescales
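For point 3, a small sketch of controlling the automatic IERS download (useful on machines without network access):
```python
from astropy.utils import iers

# Fall back to the bundled IERS tables instead of downloading updates
iers.conf.auto_download = False
```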


@@ -0,0 +1,178 @@
# Units and Quantities (astropy.units)
The `astropy.units` module handles defining, converting between, and performing arithmetic with physical quantities.
## Creating Quantities
Multiply or divide numeric values by built-in units to create Quantity objects:
```python
from astropy import units as u
import numpy as np
# Scalar quantities
distance = 42.0 * u.meter
velocity = 100 * u.km / u.s
# Array quantities
distances = np.array([1., 2., 3.]) * u.m
wavelengths = [500, 600, 700] * u.nm
```
Access components via `.value` and `.unit` attributes:
```python
distance.value # 42.0
distance.unit # Unit("m")
```
## Unit Conversions
Use `.to()` method for conversions:
```python
distance = 1.0 * u.parsec
distance.to(u.km) # <Quantity 30856775814671.914 km>
wavelength = 500 * u.nm
wavelength.to(u.angstrom) # <Quantity 5000. Angstrom>
```
## Arithmetic Operations
Quantities support standard arithmetic with automatic unit management:
```python
# Basic operations
speed = 15.1 * u.meter / (32.0 * u.second) # <Quantity 0.471875 m / s>
area = (5 * u.m) * (3 * u.m) # <Quantity 15. m2>
# Units cancel when appropriate
ratio = (10 * u.m) / (5 * u.m) # <Quantity 2. (dimensionless)>
# Decompose complex units
time = (3.0 * u.kilometer / (130.51 * u.meter / u.second))
time.decompose() # <Quantity 22.986744310780782 s>
```
## Unit Systems
Convert between major unit systems:
```python
# SI to CGS
pressure = 1.0 * u.Pa
pressure.cgs # <Quantity 10. Ba>
# Find equivalent representations
(u.s ** -1).compose() # [Unit("Bq"), Unit("Hz"), ...]
```
## Equivalencies
Domain-specific conversions require equivalencies:
```python
# Spectral equivalency (wavelength ↔ frequency)
wavelength = 1000 * u.nm
wavelength.to(u.Hz, equivalencies=u.spectral())
# <Quantity 2.99792458e+14 Hz>
# Doppler equivalencies
velocity = 1000 * u.km / u.s
velocity.to(u.Hz, equivalencies=u.doppler_optical(500*u.nm))
# Other equivalencies
u.brightness_temperature(500*u.GHz)
u.doppler_radio(1.4*u.GHz)
u.mass_energy()
u.parallax()
```
## Logarithmic Units
Special units for magnitudes, decibels, and dex:
```python
# Magnitudes
flux = -2.5 * u.mag(u.ct / u.s)
# Decibels
power_ratio = 3 * u.dB(u.W)
# Dex (base-10 logarithm)
abundance = 8.5 * u.dex(u.cm**-3)
```
## Common Units
### Length
`u.m, u.km, u.cm, u.mm, u.micron, u.angstrom, u.au, u.pc, u.kpc, u.Mpc, u.lyr`
### Time
`u.s, u.min, u.hour, u.day, u.year, u.Myr, u.Gyr`
### Mass
`u.kg, u.g, u.M_sun, u.M_earth, u.M_jup`
### Temperature
`u.K, u.deg_C`
### Angle
`u.deg, u.arcmin, u.arcsec, u.rad, u.hourangle, u.mas`
### Energy/Power
`u.J, u.erg, u.eV, u.keV, u.MeV, u.GeV, u.W, u.L_sun`
### Frequency
`u.Hz, u.kHz, u.MHz, u.GHz`
### Flux
`u.Jy, u.mJy, u.erg / u.s / u.cm**2`
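These units compose directly; for example, a spectral flux density in millijansky converts to CGS without any equivalency:
```python
from astropy import units as u

flux = 3.5 * u.mJy
flux_cgs = flux.to(u.erg / u.s / u.cm**2 / u.Hz)  # 1 Jy = 1e-23 erg / (s cm2 Hz)
```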
## Performance Optimization
Pre-compute composite units for array operations:
```python
# Slow (creates intermediate quantities)
result = array * u.m / u.s / u.kg / u.sr
# Fast (pre-computed composite unit)
UNIT_COMPOSITE = u.m / u.s / u.kg / u.sr
result = array * UNIT_COMPOSITE
# Fastest (avoid copying with <<)
result = array << UNIT_COMPOSITE # attaches the unit without copying, so it can be orders of magnitude faster
```
## String Formatting
Format quantities with standard Python syntax:
```python
velocity = 15.1 * u.meter / (32.0 * u.second)
f"{velocity:0.03f}" # '0.472 m / s'
f"{velocity:.2e}" # '4.72e-01 m / s'
f"{velocity.unit:FITS}" # 'm s-1'
```
## Defining Custom Units
```python
# Create new unit
bakers_fortnight = u.def_unit('bakers_fortnight', 13 * u.day)
# Enable in string parsing
u.add_enabled_units([bakers_fortnight])
```
## Constants
Access physical constants with units:
```python
from astropy.constants import c, G, M_sun, h, k_B
speed_of_light = c.to(u.km/u.s)
gravitational_constant = G.to(u.m**3 / u.kg / u.s**2)
```


@@ -0,0 +1,373 @@
# WCS and Other Astropy Modules
## World Coordinate System (astropy.wcs)
The WCS module manages transformations between pixel coordinates in images and world coordinates (e.g., celestial coordinates).
### Reading WCS from FITS
```python
from astropy.wcs import WCS
from astropy.io import fits
# Read WCS from FITS header
with fits.open('image.fits') as hdul:
wcs = WCS(hdul[0].header)
```
### Pixel to World Transformations
```python
# Single pixel to world coordinates
world = wcs.pixel_to_world(100, 200) # Returns SkyCoord
print(f"RA: {world.ra}, Dec: {world.dec}")
# Arrays of pixels
import numpy as np
x_pixels = np.array([100, 200, 300])
y_pixels = np.array([150, 250, 350])
world_coords = wcs.pixel_to_world(x_pixels, y_pixels)
```
### World to Pixel Transformations
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Single coordinate
coord = SkyCoord(ra=10.5*u.degree, dec=41.2*u.degree)
x, y = wcs.world_to_pixel(coord)
# Array of coordinates
coords = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[41, 42, 43]*u.degree)
x_pixels, y_pixels = wcs.world_to_pixel(coords)
```
### WCS Information
```python
# Print WCS details
print(wcs)
# Access key properties
print(wcs.wcs.crpix) # Reference pixel
print(wcs.wcs.crval) # Reference value (world coords)
print(wcs.wcs.cd) # CD matrix
print(wcs.wcs.ctype) # Coordinate types
# Pixel scale along each axis (in the WCS units, typically deg/pixel)
from astropy.wcs.utils import proj_plane_pixel_scales
pixel_scale = proj_plane_pixel_scales(wcs)
```
### Creating WCS
```python
from astropy.wcs import WCS
# Create new WCS
wcs = WCS(naxis=2)
wcs.wcs.crpix = [512.0, 512.0] # Reference pixel
wcs.wcs.crval = [10.5, 41.2] # RA, Dec at reference pixel
wcs.wcs.ctype = ['RA---TAN', 'DEC--TAN'] # Projection type
wcs.wcs.cdelt = [-0.0001, 0.0001] # Pixel scale (degrees/pixel)
wcs.wcs.cunit = ['deg', 'deg']
```
### Footprint and Coverage
```python
# Calculate image footprint (corner coordinates)
footprint = wcs.calc_footprint()
# Returns array of [RA, Dec] for each corner
```
## NDData (astropy.nddata)
Container for n-dimensional datasets with metadata, uncertainty, and masking.
### Creating NDData
```python
from astropy.nddata import NDData
import numpy as np
import astropy.units as u
# Basic NDData
data = np.random.random((100, 100))
ndd = NDData(data)
# With units
ndd = NDData(data, unit=u.electron/u.s)
# With uncertainty
from astropy.nddata import StdDevUncertainty
uncertainty = StdDevUncertainty(np.sqrt(data))
ndd = NDData(data, uncertainty=uncertainty, unit=u.electron/u.s)
# With mask
mask = data < 0.1 # Mask low values
ndd = NDData(data, mask=mask)
# With WCS
from astropy.wcs import WCS
ndd = NDData(data, wcs=wcs)
```
### CCDData for CCD Images
```python
from astropy.nddata import CCDData
# Create CCDData
ccd = CCDData(data, unit=u.adu, meta={'object': 'M31'})
# Read from FITS
ccd = CCDData.read('image.fits', unit=u.adu)
# Write to FITS
ccd.write('output.fits', overwrite=True)
```
## Modeling (astropy.modeling)
Framework for creating and fitting models to data.
### Common Models
```python
from astropy.modeling import models, fitting
import numpy as np
# 1D Gaussian
gauss = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
x = np.linspace(0, 10, 100)
y = gauss(x)
# 2D Gaussian
gauss_2d = models.Gaussian2D(amplitude=10, x_mean=50, y_mean=50,
x_stddev=5, y_stddev=3)
# Polynomial
poly = models.Polynomial1D(degree=3)
# Power law
power_law = models.PowerLaw1D(amplitude=10, x_0=1, alpha=2)
```
### Fitting Models to Data
```python
# Generate noisy data
true_model = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
x = np.linspace(0, 10, 100)
y_true = true_model(x)
y_noisy = y_true + np.random.normal(0, 0.5, x.shape)
# Fit model
fitter = fitting.LevMarLSQFitter()
initial_model = models.Gaussian1D(amplitude=8, mean=4, stddev=1.5)
fitted_model = fitter(initial_model, x, y_noisy)
print(f"Fitted amplitude: {fitted_model.amplitude.value}")
print(f"Fitted mean: {fitted_model.mean.value}")
print(f"Fitted stddev: {fitted_model.stddev.value}")
```
### Compound Models
```python
# Add models
double_gauss = (models.Gaussian1D(amplitude=5, mean=3, stddev=1) +
                models.Gaussian1D(amplitude=8, mean=7, stddev=1.5))
# Compose models (pipe the output of one model into the next)
composite = models.Gaussian1D(amplitude=10, mean=5, stddev=1) | models.Scale(factor=2)
```
## Visualization (astropy.visualization)
Tools for visualizing astronomical images and data.
### Image Normalization
```python
from astropy.visualization import simple_norm
import matplotlib.pyplot as plt
# Load image
from astropy.io import fits
data = fits.getdata('image.fits')
# Normalize for display
norm = simple_norm(data, 'sqrt', percent=99)
# Display
plt.imshow(data, norm=norm, cmap='gray', origin='lower')
plt.colorbar()
plt.show()
```
### Stretching and Intervals
```python
from astropy.visualization import (MinMaxInterval, AsinhStretch,
ImageNormalize, ZScaleInterval)
# Z-scale interval
interval = ZScaleInterval()
vmin, vmax = interval.get_limits(data)
# Asinh stretch
stretch = AsinhStretch()
norm = ImageNormalize(data, interval=interval, stretch=stretch)
plt.imshow(data, norm=norm, cmap='gray', origin='lower')
```
### PercentileInterval
```python
from astropy.visualization import PercentileInterval
# Show data between 5th and 95th percentiles
interval = PercentileInterval(90) # 90% of data
vmin, vmax = interval.get_limits(data)
plt.imshow(data, vmin=vmin, vmax=vmax, cmap='gray', origin='lower')
```
## Constants (astropy.constants)
Physical and astronomical constants with units.
```python
from astropy import constants as const
# Speed of light
c = const.c
print(f"c = {c}")
print(f"c in km/s = {c.to(u.km/u.s)}")
# Gravitational constant
G = const.G
# Astronomical constants
M_sun = const.M_sun # Solar mass
R_sun = const.R_sun # Solar radius
L_sun = const.L_sun # Solar luminosity
au = const.au # Astronomical unit
pc = const.pc # Parsec
# Fundamental constants
h = const.h # Planck constant
hbar = const.hbar # Reduced Planck constant
k_B = const.k_B # Boltzmann constant
m_e = const.m_e # Electron mass
m_p = const.m_p # Proton mass
e = const.e # Elementary charge
N_A = const.N_A # Avogadro constant
```
### Using Constants in Calculations
```python
# Calculate Schwarzschild radius
M = 10 * const.M_sun
r_s = 2 * const.G * M / const.c**2
print(f"Schwarzschild radius: {r_s.to(u.km)}")
# Calculate escape velocity
M = const.M_earth
R = const.R_earth
v_esc = np.sqrt(2 * const.G * M / R)
print(f"Earth escape velocity: {v_esc.to(u.km/u.s)}")
```
## Convolution (astropy.convolution)
Convolution kernels for image processing.
```python
from astropy.convolution import Gaussian2DKernel, convolve
# Create Gaussian kernel
kernel = Gaussian2DKernel(x_stddev=2)
# Convolve image
smoothed_image = convolve(data, kernel)
# Handle NaNs
from astropy.convolution import convolve_fft
smoothed = convolve_fft(data, kernel, nan_treatment='interpolate')
```
## Stats (astropy.stats)
Statistical functions for astronomical data.
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
# Sigma clipping
clipped_data = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics with sigma clipping
mean, median, std = sigma_clipped_stats(data, sigma=3.0)
# Robust statistics
from astropy.stats import mad_std, biweight_location, biweight_scale
robust_std = mad_std(data)
robust_mean = biweight_location(data)
robust_scale = biweight_scale(data)
```
## Utils
### Data Downloads
```python
from astropy.utils.data import download_file
# Download file (caches locally)
url = 'https://example.com/data.fits'
local_file = download_file(url, cache=True)
```
### Progress Bars
```python
from astropy.utils.console import ProgressBar
with ProgressBar(len(data_list)) as bar:
    for item in data_list:
        # Process item, then advance the bar
        bar.update()
```
## SAMP (Simple Application Messaging Protocol)
Interoperability with other astronomy tools.
```python
from astropy.samp import SAMPIntegratedClient
# Connect to SAMP hub
client = SAMPIntegratedClient()
client.connect()
# Broadcast table to other applications
message = {
'samp.mtype': 'table.load.votable',
'samp.params': {
'url': 'file:///path/to/table.xml',
'table-id': 'my_table',
'name': 'My Catalog'
}
}
client.notify_all(message)
# Disconnect
client.disconnect()
```


@@ -28,7 +28,7 @@ This skill should be used when:
**Python SDK Installation:**
```python
# Stable release
-pip install benchling-sdk
+uv pip install benchling-sdk
# or with Poetry
poetry add benchling-sdk
```


@@ -0,0 +1,309 @@
---
name: biomni
description: Autonomous biomedical AI agent framework for executing complex research tasks across genomics, drug discovery, molecular biology, and clinical analysis. Use this skill when conducting multi-step biomedical research including CRISPR screening design, single-cell RNA-seq analysis, ADMET prediction, GWAS interpretation, rare disease diagnosis, or lab protocol optimization. Leverages LLM reasoning with code execution and integrated biomedical databases.
---
# Biomni
## Overview
Biomni is an open-source biomedical AI agent framework from Stanford's SNAP lab that autonomously executes complex research tasks across biomedical domains. Use this skill when working on multi-step biological reasoning tasks, analyzing biomedical data, or conducting research spanning genomics, drug discovery, molecular biology, and clinical analysis.
## Core Capabilities
Biomni excels at:
1. **Multi-step biological reasoning** - Autonomous task decomposition and planning for complex biomedical queries
2. **Code generation and execution** - Dynamic analysis pipeline creation for data processing
3. **Knowledge retrieval** - Access to ~11GB of integrated biomedical databases and literature
4. **Cross-domain problem solving** - Unified interface for genomics, proteomics, drug discovery, and clinical tasks
## When to Use This Skill
Use biomni for:
- **CRISPR screening** - Design screens, prioritize genes, analyze knockout effects
- **Single-cell RNA-seq** - Cell type annotation, differential expression, trajectory analysis
- **Drug discovery** - ADMET prediction, target identification, compound optimization
- **GWAS analysis** - Variant interpretation, causal gene identification, pathway enrichment
- **Clinical genomics** - Rare disease diagnosis, variant pathogenicity, phenotype-genotype mapping
- **Lab protocols** - Protocol optimization, literature synthesis, experimental design
## Quick Start
### Installation and Setup
Install Biomni and configure API keys for LLM providers:
```bash
uv pip install biomni --upgrade
```
Configure API keys (store in `.env` file or environment variables):
```bash
export ANTHROPIC_API_KEY="your-key-here"
# Optional: OpenAI, Azure, Google, Groq, AWS Bedrock keys
```
Use `scripts/setup_environment.py` for interactive setup assistance.
### Basic Usage Pattern
```python
from biomni.agent import A1
# Initialize agent with data path and LLM choice
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Execute biomedical task autonomously
agent.go("Your biomedical research question or task")
# Save conversation history and results
agent.save_conversation_history("report.pdf")
```
## Working with Biomni
### 1. Agent Initialization
The A1 class is the primary interface for biomni:
```python
from biomni.agent import A1
from biomni.config import default_config
# Basic initialization
agent = A1(
path='./data', # Path to data lake (~11GB downloaded on first use)
llm='claude-sonnet-4-20250514' # LLM model selection
)
# Advanced configuration
default_config.llm = "gpt-4"
default_config.timeout_seconds = 1200
default_config.max_iterations = 50
```
**Supported LLM Providers:**
- Anthropic Claude (recommended): `claude-sonnet-4-20250514`, `claude-opus-4-20250514`
- OpenAI: `gpt-4`, `gpt-4-turbo`
- Azure OpenAI: via Azure configuration
- Google Gemini: `gemini-2.0-flash-exp`
- Groq: `llama-3.3-70b-versatile`
- AWS Bedrock: Various models via Bedrock API
See `references/llm_providers.md` for detailed LLM configuration instructions.
### 2. Task Execution Workflow
Biomni follows an autonomous agent workflow:
```python
# Step 1: Initialize agent
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Step 2: Execute task with natural language query
result = agent.go("""
Design a CRISPR screen to identify genes regulating autophagy in
HEK293 cells. Prioritize genes based on essentiality and pathway
relevance.
""")
# Step 3: Review generated code and analysis
# Agent autonomously:
# - Decomposes task into sub-steps
# - Retrieves relevant biological knowledge
# - Generates and executes analysis code
# - Interprets results and provides insights
# Step 4: Save results
agent.save_conversation_history("autophagy_screen_report.pdf")
```
### 3. Common Task Patterns
#### CRISPR Screening Design
```python
agent.go("""
Design a genome-wide CRISPR knockout screen for identifying genes
affecting [phenotype] in [cell type]. Include:
1. sgRNA library design
2. Gene prioritization criteria
3. Expected hit genes based on pathway analysis
""")
```
#### Single-Cell RNA-seq Analysis
```python
agent.go("""
Analyze this single-cell RNA-seq dataset:
- Perform quality control and filtering
- Identify cell populations via clustering
- Annotate cell types using marker genes
- Conduct differential expression between conditions
File path: [path/to/data.h5ad]
""")
```
#### Drug ADMET Prediction
```python
agent.go("""
Predict ADMET properties for these drug candidates:
[SMILES strings or compound IDs]
Focus on:
- Absorption (Caco-2 permeability, HIA)
- Distribution (plasma protein binding, BBB penetration)
- Metabolism (CYP450 interaction)
- Excretion (clearance)
- Toxicity (hERG liability, hepatotoxicity)
""")
```
#### GWAS Variant Interpretation
```python
agent.go("""
Interpret GWAS results for [trait/disease]:
- Identify genome-wide significant variants
- Map variants to causal genes
- Perform pathway enrichment analysis
- Predict functional consequences
Summary statistics file: [path/to/gwas_summary.txt]
""")
```
See `references/use_cases.md` for comprehensive task examples across all biomedical domains.
### 4. Data Integration
Biomni integrates ~11GB of biomedical knowledge sources:
- **Gene databases** - Ensembl, NCBI Gene, UniProt
- **Protein structures** - PDB, AlphaFold
- **Clinical datasets** - ClinVar, OMIM, HPO
- **Literature indices** - PubMed abstracts, biomedical ontologies
- **Pathway databases** - KEGG, Reactome, GO
Data is automatically downloaded to the specified `path` on first use.
### 5. MCP Server Integration
Extend biomni with external tools via Model Context Protocol:
```python
# MCP servers can provide:
# - FDA drug databases
# - Web search for literature
# - Custom biomedical APIs
# - Laboratory equipment interfaces
# Configure MCP servers in .biomni/mcp_config.json
```
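As a minimal sketch, assuming a configuration file already exists at the illustrative path below, the agent can be pointed at it via the `mcp_config` parameter (see `references/api_reference.md`):
```python
# Attach configured MCP servers at initialization (the config path is illustrative)
agent = A1(
    path='./data',
    llm='claude-sonnet-4-20250514',
    mcp_config='./.biomni/mcp_config.json'
)
# Tools exposed by those servers are then available to the agent during .go() calls
```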
### 6. Evaluation Framework
Benchmark agent performance on biomedical tasks:
```python
from biomni.eval import BiomniEval1
evaluator = BiomniEval1()
# Evaluate on specific task types
score = evaluator.evaluate(
task_type='crispr_design',
instance_id='test_001',
answer=agent_output
)
# Access evaluation dataset
dataset = evaluator.load_dataset()
```
## Best Practices
### Task Formulation
- **Be specific** - Include biological context, organism, cell type, conditions
- **Specify outputs** - Clearly state desired analysis outputs and formats
- **Provide data paths** - Include file paths for datasets to analyze
- **Set constraints** - Mention time/computational limits if relevant (see the example query below)
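The sketch below shows a query that applies these points; the dataset path, QC thresholds, and experimental details are placeholders rather than values biomni requires:
```python
# A sketch of a well-specified task; paths and thresholds are illustrative
agent.go("""
Analyze the single-cell RNA-seq dataset at ./data/pbmc_treated_vs_control.h5ad
(human PBMCs, 10x Genomics, two conditions: drug-treated vs control).
1. Filter cells with fewer than 200 detected genes or >15% mitochondrial reads
2. Cluster cells and annotate cell types using canonical marker genes
3. Report differentially expressed genes per cell type (treated vs control)
   and write the tables as CSV files under ./results/
Keep total runtime under roughly 30 minutes.
""")
```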
### Security Considerations
⚠️ **Important**: Biomni executes LLM-generated code with full system privileges. For production use:
- Run in isolated environments (Docker, VMs)
- Avoid exposing sensitive credentials
- Review generated code before execution in sensitive contexts
- Use sandboxed execution environments when possible (see the configuration sketch below)
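One way to tighten execution, sketched here using the `sandbox_mode` and `enable_code_execution` options described in the configuration section, is to sandbox the run or to generate code for manual review only:
```python
from biomni.config import default_config
# Option 1: sandboxed code execution (requires additional setup)
default_config.sandbox_mode = True
# Option 2: generate analysis code without running it, so it can be reviewed first
default_config.enable_code_execution = False
```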
### Performance Optimization
- **Choose appropriate LLMs** - Claude Sonnet 4 recommended for balance of speed/quality
- **Set reasonable timeouts** - Adjust `default_config.timeout_seconds` for complex tasks
- **Monitor iterations** - Track `max_iterations` to prevent runaway loops
- **Cache data** - Reuse downloaded data lake across sessions
### Result Documentation
```python
# Always save conversation history for reproducibility
agent.save_conversation_history("results/project_name_YYYYMMDD.pdf")
# Include in reports:
# - Original task description
# - Generated analysis code
# - Results and interpretations
# - Data sources used
```
## Resources
### References
Detailed documentation available in the `references/` directory:
- **`api_reference.md`** - Complete API documentation for A1 class, configuration, and evaluation
- **`llm_providers.md`** - LLM provider setup (Anthropic, OpenAI, Azure, Google, Groq, AWS)
- **`use_cases.md`** - Comprehensive task examples for all biomedical domains
### Scripts
Helper scripts in the `scripts/` directory:
- **`setup_environment.py`** - Interactive environment and API key configuration
- **`generate_report.py`** - Enhanced PDF report generation with custom formatting
### External Resources
- **GitHub**: https://github.com/snap-stanford/biomni
- **Web Platform**: https://biomni.stanford.edu
- **Paper**: https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1
- **Model**: https://huggingface.co/biomni/Biomni-R0-32B-Preview
- **Evaluation Dataset**: https://huggingface.co/datasets/biomni/Eval1
## Troubleshooting
### Common Issues
**Data download fails**
```python
# Re-initialize the agent to retry the data lake download
agent = A1(path='./data', llm='your-llm')
# The first .go() call downloads any missing data
```
**API key errors**
```bash
# Verify environment variables
echo $ANTHROPIC_API_KEY
# Or check .env file in working directory
```
**Timeout on complex tasks**
```python
from biomni.config import default_config
default_config.timeout_seconds = 3600 # 1 hour
```
**Memory issues with large datasets**
- Use streaming for large files
- Process data in chunks
- Increase system memory allocation
### Getting Help
For issues or questions:
- GitHub Issues: https://github.com/snap-stanford/biomni/issues
- Documentation: Check `references/` files for detailed guidance
- Community: Stanford SNAP lab and biomni contributors


@@ -0,0 +1,460 @@
# Biomni API Reference
Comprehensive API documentation for the biomni framework.
## A1 Agent Class
The A1 class is the primary interface for interacting with biomni.
### Initialization
```python
from biomni.agent import A1
agent = A1(
path: str, # Path to data lake directory
llm: str, # LLM model identifier
verbose: bool = True, # Enable verbose logging
mcp_config: str = None # Path to MCP server configuration
)
```
**Parameters:**
- **`path`** (str, required) - Directory path for biomni data lake (~11GB). Data is automatically downloaded on first use if not present.
- **`llm`** (str, required) - LLM model identifier. Options include:
- `'claude-sonnet-4-20250514'` - Recommended for balanced performance
- `'claude-opus-4-20250514'` - Maximum capability
- `'gpt-4'`, `'gpt-4-turbo'` - OpenAI models
- `'gemini-2.0-flash-exp'` - Google Gemini
- `'llama-3.3-70b-versatile'` - Via Groq
- Custom model endpoints via provider configuration
- **`verbose`** (bool, optional, default=True) - Enable detailed logging of agent reasoning, tool use, and code execution.
- **`mcp_config`** (str, optional) - Path to MCP (Model Context Protocol) server configuration file for external tool integration.
**Example:**
```python
# Basic initialization
agent = A1(path='./biomni_data', llm='claude-sonnet-4-20250514')
# With MCP integration
agent = A1(
path='./biomni_data',
llm='claude-sonnet-4-20250514',
mcp_config='./.biomni/mcp_config.json'
)
```
### Core Methods
#### `go(query: str) -> str`
Execute a biomedical research task autonomously.
```python
result = agent.go(query: str)
```
**Parameters:**
- **`query`** (str) - Natural language description of the biomedical task to execute
**Returns:**
- **`str`** - Final answer or analysis result from the agent
**Behavior:**
1. Decomposes query into executable sub-tasks
2. Retrieves relevant knowledge from integrated databases
3. Generates and executes Python code for analysis
4. Iterates on results until task completion
5. Returns final synthesized answer
**Example:**
```python
result = agent.go("""
Identify genes associated with Alzheimer's disease from GWAS data.
Perform pathway enrichment analysis on top hits.
""")
print(result)
```
#### `save_conversation_history(output_path: str, format: str = 'pdf')`
Save complete conversation history including task, reasoning, code, and results.
```python
agent.save_conversation_history(
output_path: str,
format: str = 'pdf'
)
```
**Parameters:**
- **`output_path`** (str) - File path for saved report
- **`format`** (str, optional, default='pdf') - Output format: `'pdf'`, `'html'`, or `'markdown'`
**Example:**
```python
agent.save_conversation_history('reports/alzheimers_gwas_analysis.pdf')
```
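The same call can produce other formats via the `format` parameter documented above, for example:
```python
# Write the history as Markdown instead of the default PDF
agent.save_conversation_history('reports/alzheimers_gwas_analysis.md', format='markdown')
```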
#### `reset()`
Reset agent state and clear conversation history.
```python
agent.reset()
```
Use when starting a new independent task to clear previous context.
**Example:**
```python
# Task 1
agent.go("Analyze dataset A")
agent.save_conversation_history("task1.pdf")
# Reset for fresh context
agent.reset()
# Task 2 - independent of Task 1
agent.go("Analyze dataset B")
```
### Configuration via default_config
Global configuration parameters accessible via `biomni.config.default_config`.
```python
from biomni.config import default_config
# LLM Configuration
default_config.llm = "claude-sonnet-4-20250514"
default_config.llm_temperature = 0.7
# Execution Parameters
default_config.timeout_seconds = 1200 # 20 minutes
default_config.max_iterations = 50 # Max reasoning loops
default_config.max_tokens = 4096 # Max tokens per LLM call
# Code Execution
default_config.enable_code_execution = True
default_config.sandbox_mode = False # Enable for restricted execution
# Data and Caching
default_config.data_cache_dir = "./biomni_cache"
default_config.enable_caching = True
```
**Key Parameters:**
- **`timeout_seconds`** (int, default=1200) - Maximum time for task execution. Increase for complex analyses.
- **`max_iterations`** (int, default=50) - Maximum agent reasoning loops. Prevents infinite loops.
- **`enable_code_execution`** (bool, default=True) - Allow agent to execute generated code. Disable for code generation only.
- **`sandbox_mode`** (bool, default=False) - Enable sandboxed code execution (requires additional setup).
## BiomniEval1 Evaluation Framework
Framework for benchmarking agent performance on biomedical tasks.
### Initialization
```python
from biomni.eval import BiomniEval1
evaluator = BiomniEval1(
dataset_path: str = None, # Path to evaluation dataset
metrics: list = None # Evaluation metrics to compute
)
```
**Example:**
```python
evaluator = BiomniEval1()
```
### Methods
#### `evaluate(task_type: str, instance_id: str, answer: str) -> float`
Evaluate agent answer against ground truth.
```python
score = evaluator.evaluate(
task_type: str, # Task category
instance_id: str, # Specific task instance
answer: str # Agent-generated answer
)
```
**Parameters:**
- **`task_type`** (str) - Task category: `'crispr_design'`, `'scrna_analysis'`, `'gwas_interpretation'`, `'drug_admet'`, `'clinical_diagnosis'`
- **`instance_id`** (str) - Unique identifier for task instance from dataset
- **`answer`** (str) - Agent's answer to evaluate
**Returns:**
- **`float`** - Evaluation score (0.0 to 1.0)
**Example:**
```python
# Generate answer
result = agent.go("Design CRISPR screen for autophagy genes")
# Evaluate
score = evaluator.evaluate(
task_type='crispr_design',
instance_id='autophagy_001',
answer=result
)
print(f"Score: {score:.2f}")
```
#### `load_dataset() -> dict`
Load the Biomni-Eval1 benchmark dataset.
```python
dataset = evaluator.load_dataset()
```
**Returns:**
- **`dict`** - Dictionary with task instances organized by task type
**Example:**
```python
dataset = evaluator.load_dataset()
for task_type, instances in dataset.items():
print(f"{task_type}: {len(instances)} instances")
```
#### `run_benchmark(agent: A1, task_types: list = None) -> dict`
Run full benchmark evaluation on agent.
```python
results = evaluator.run_benchmark(
agent: A1,
task_types: list = None # Specific task types or None for all
)
```
**Returns:**
- **`dict`** - Results with scores, timing, and detailed metrics per task
**Example:**
```python
results = evaluator.run_benchmark(
agent=agent,
task_types=['crispr_design', 'scrna_analysis']
)
print(f"Overall accuracy: {results['mean_score']:.2f}")
print(f"Average time: {results['mean_time']:.1f}s")
```
## Data Lake API
Access integrated biomedical databases programmatically.
### Gene Database Queries
```python
from biomni.data import GeneDB
gene_db = GeneDB(path='./biomni_data')
# Query gene information
gene_info = gene_db.get_gene('BRCA1')
# Returns: {'symbol': 'BRCA1', 'name': '...', 'function': '...', ...}
# Search genes by pathway
pathway_genes = gene_db.search_by_pathway('DNA repair')
# Returns: List of gene symbols in pathway
# Get gene interactions
interactions = gene_db.get_interactions('TP53')
# Returns: List of interacting genes with interaction types
```
### Protein Structure Access
```python
from biomni.data import ProteinDB
protein_db = ProteinDB(path='./biomni_data')
# Get AlphaFold structure
structure = protein_db.get_structure('P38398') # BRCA1 UniProt ID
# Returns: Path to PDB file or structure object
# Search PDB database
pdb_entries = protein_db.search_pdb('kinase', resolution_max=2.5)
# Returns: List of PDB IDs matching criteria
```
### Clinical Data Access
```python
from biomni.data import ClinicalDB
clinical_db = ClinicalDB(path='./biomni_data')
# Query ClinVar variants
variant_info = clinical_db.get_variant('rs429358') # APOE4 variant
# Returns: {'significance': '...', 'disease': '...', 'frequency': ...}
# Search OMIM for disease
disease_info = clinical_db.search_omim('Alzheimer')
# Returns: List of OMIM entries with gene associations
```
### Literature Search
```python
from biomni.data import LiteratureDB
lit_db = LiteratureDB(path='./biomni_data')
# Search PubMed abstracts
papers = lit_db.search('CRISPR screening cancer', max_results=10)
# Returns: List of paper dictionaries with titles, abstracts, PMIDs
# Get citations for paper
citations = lit_db.get_citations('PMID:12345678')
# Returns: List of citing papers
```
## MCP Server Integration
Extend biomni with external tools via Model Context Protocol.
### Configuration Format
Create `.biomni/mcp_config.json`:
```json
{
"servers": {
"fda-drugs": {
"command": "python",
"args": ["-m", "mcp_server_fda"],
"env": {
"FDA_API_KEY": "${FDA_API_KEY}"
}
},
"web-search": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-brave-search"],
"env": {
"BRAVE_API_KEY": "${BRAVE_API_KEY}"
}
}
}
}
```
### Using MCP Tools in Tasks
```python
# Initialize with MCP config
agent = A1(
path='./data',
llm='claude-sonnet-4-20250514',
mcp_config='./.biomni/mcp_config.json'
)
# Agent can now use MCP tools automatically
result = agent.go("""
Search for FDA-approved drugs targeting EGFR.
Get their approval dates and indications.
""")
# Agent uses fda-drugs MCP server automatically
```
## Error Handling
Common exceptions and handling strategies:
```python
from biomni.exceptions import (
BiomniException,
LLMError,
CodeExecutionError,
DataNotFoundError,
TimeoutError
)
try:
result = agent.go("Complex biomedical task")
except TimeoutError:
# Task exceeded timeout_seconds
print("Task timed out. Consider increasing timeout.")
default_config.timeout_seconds = 3600
except CodeExecutionError as e:
# Generated code failed to execute
print(f"Code execution error: {e}")
# Review generated code in conversation history
except DataNotFoundError:
# Required data not in data lake
print("Data not found. Ensure data lake is downloaded.")
except LLMError as e:
# LLM API error
print(f"LLM error: {e}")
# Check API keys and rate limits
```
## Best Practices
### Efficient API Usage
1. **Reuse agent instances** for related tasks to maintain context (see the sketch after this list)
2. **Set appropriate timeouts** based on task complexity
3. **Use caching** to avoid redundant data downloads
4. **Monitor iterations** to detect reasoning loops early
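A short sketch of point 1: reuse one agent across related sub-tasks so context carries over, then reset before unrelated work (methods as documented above).
```python
agent = A1(path='./biomni_data', llm='claude-sonnet-4-20250514')
# Related sub-tasks share context on the same agent instance
agent.go("Identify candidate genes regulating autophagy in HEK293 cells")
agent.go("For the candidates identified above, summarize known knockout phenotypes")
agent.save_conversation_history("reports/autophagy_candidates.pdf")
# Clear context before starting an unrelated analysis
agent.reset()
```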
### Production Deployment
```python
from biomni.agent import A1
from biomni.config import default_config
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
# Production settings
default_config.timeout_seconds = 3600
default_config.max_iterations = 100
default_config.sandbox_mode = True # Enable sandboxing
# Initialize with error handling
try:
agent = A1(path='/data/biomni', llm='claude-sonnet-4-20250514')
result = agent.go(task_query)
agent.save_conversation_history(f'reports/{task_id}.pdf')
except Exception as e:
logging.error(f"Task {task_id} failed: {e}")
# Handle failure appropriately
```
### Memory Management
For large-scale analyses:
```python
# Process datasets in chunks
chunk_results = []
for chunk in dataset_chunks:
agent.reset() # Clear memory between chunks
result = agent.go(f"Analyze chunk: {chunk}")
chunk_results.append(result)
# Combine results
final_result = combine_results(chunk_results)
```

Some files were not shown because too many files have changed in this diff.