Merge pull request #46 from fedorov/update-idc-v1.3.0

update imaging-data-commons skill to v1.3.1
2026-03-27 07:09:27 +08:00 · 2026-02-16 10:24:23 -08:00
parent 3a5f2e2227 5a471d9c36
commit 326b043b8f
6 changed files with 1214 additions and 436 deletions
--- a/scientific-skills/imaging-data-commons/SKILL.md
+++ b/scientific-skills/imaging-data-commons/SKILL.md
@@ -3,9 +3,10 @@ name: imaging-data-commons
 description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
 metadata:
-    version: 1.2.0
+    version: 1.3.1
    skill-author: Andrey Fedorov, @fedorov
-    idc-index: "0.11.7"
+    idc-index: "0.11.9"
    idc-data-version: "v23"
    repository: https://github.com/ImagingDataCommons/idc-claude-skill
 ---
@@ -15,20 +16,39 @@ metadata:
 Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
 **Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
 **Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
-**Check current data scale for the latest version:**
+**CRITICAL - Check package version and upgrade if needed (run this FIRST):**
 ```python
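 import idc_index
 from packaging.version import Version  # assumption: the `packaging` library is installed
 REQUIRED_VERSION = "0.11.9"  # Must match metadata.idc-index in this file
 installed = idc_index.__version__
 # Compare parsed versions; as plain strings, "0.11.10" would sort below "0.11.9"
 if Version(installed) < Version(REQUIRED_VERSION):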
    print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
    import subprocess
    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
    print("Upgrade complete. Restart Python to use new version.")
 else:
    print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
 ```
 **Verify IDC data version and check current data scale:**
 ```python
 from idc_index import IDCClient
 client = IDCClient()
-# get IDC data version
+# Verify IDC data version (should be "v23")
-print(client.get_idc_version())
+print(f"IDC data version: {client.get_idc_version()}")
 # Get collection count and total series
 stats = client.sql_query("""
-    SELECT   
+    SELECT
        COUNT(DISTINCT collection_id) as collections,
        COUNT(DISTINCT analysis_result_id) as analysis_results,
        COUNT(DISTINCT PatientID) as patients,
@@ -54,6 +74,30 @@ print(stats)
 - Checking data licenses before use in research or commercial applications
 - Visualizing medical images in a browser without local DICOM viewer software
 ## Quick Navigation
 **Core Sections (inline):**
 - IDC Data Model - Collection and analysis result hierarchy
 - Index Tables - Available tables and joining patterns
 - Installation - Package setup and version verification
 - Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
 - Best Practices - Usage guidelines
 - Troubleshooting - Common issues and solutions
 **Reference Guides (load on demand):**
 | Guide | When to Load |
 |-------|--------------|
 | `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access |
 | `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) |
 | `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation |
 | `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping |
 | `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping |
 | `dicomweb_guide.md` | DICOMweb endpoints, PACS integration |
 | `digital_pathology_guide.md` | Slide microscopy (SM), annotations (ANN), pathology workflows |
 | `bigquery_guide.md` | Full DICOM metadata, private elements (requires GCP) |
 | `cli_guide.md` | Command-line tools (`idc download`, manifest files) |
 ## IDC Data Model
 IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
@@ -75,6 +119,8 @@ Use `collection_id` to find original imaging data, may include annotations depos
 The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
 **Complete index table documentation:** See https://idc-index.readthedocs.io/en/latest/indices_reference.html to check the available tables and columns quickly, without executing any code.
 **Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
 ### Available Tables
@@ -89,6 +135,9 @@ The `idc-index` package provides multiple metadata index tables, accessible via
 | `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
 | `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
 | `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
 | `ann_index` | 1 row = 1 DICOM ANN series | fetch_index() | Microscopy Bulk Simple Annotations series metadata; references annotated image series |
 | `ann_group_index` | 1 row = 1 annotation group | fetch_index() | Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm |
 | `contrast_index` | 1 row = 1 series with contrast info | fetch_index() | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) |
 **Auto** = loaded automatically when `IDCClient()` is instantiated
 **fetch_index()** = requires `client.fetch_index("table_name")` to load
@@ -107,140 +156,13 @@ The `idc-index` package provides multiple metadata index tables, accessible via
 | `source_DOI` | index, analysis_results_index | Link by publication DOI |
 | `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
 | `Modality` | index, prior_versions_index | Filter by imaging modality |
-| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
+| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata |
 | `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
 | `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
 **Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
-**Example joins:**
+For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Join index with collections_index to get cancer types
 client.fetch_index("collections_index")
 result = client.sql_query("""
    SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE i.Modality = 'MR'
    LIMIT 10
 """)
 # Join index with sm_index for slide microscopy details
 client.fetch_index("sm_index")
 result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
 """)
 # Join seg_index with index to find segmentations and their source images
 client.fetch_index("seg_index")
 result = client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        src.collection_id,
        src.Modality as source_modality,
        src.BodyPartExamined
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
    LIMIT 10
 """)
 ```
 ### Accessing Index Tables
 **Via SQL (recommended for filtering/aggregation):**
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Query the primary index (always available)
 results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
 # Fetch and query additional indices
 client.fetch_index("collections_index")
 collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
 client.fetch_index("analysis_results_index")
 analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
 ```
 **As pandas DataFrames (direct access):**
 ```python
 # Primary index (always available after client initialization)
 df = client.index
 # Fetch and access on-demand indices
 client.fetch_index("sm_index")
 sm_df = client.sm_index
 ```
 ### Discovering Table Schemas (Essential for Query Writing)
 The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
 **DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # List all available indices with descriptions
 for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f"  Installed: {info['installed']}")
    print(f"  Description: {info['description']}")
 # Get complete schema for a specific index (columns, types, descriptions)
 schema = client.indices_overview["index"]["schema"]
 print(f"\nTable: {schema['table_description']}")
 print("\nColumns:")
 for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f"  {col['name']} ({col['type']}): {desc}")
 # Find columns that are DICOM attributes (check description for "DICOM" reference)
 dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
 print(f"\nDICOM-sourced columns: {dicom_cols}")
 ```
 **Alternative: use `get_index_schema()` method:**
 ```python
 schema = client.get_index_schema("index")
 # Returns same schema dict: {'table_description': ..., 'columns': [...]}
 ```
 ### Key Columns in Primary `index` Table
 Most common columns for queries (use `indices_overview` for complete list and descriptions):
 | Column | Type | DICOM | Description |
 |--------|------|-------|-------------|
 | `collection_id` | STRING | No | IDC collection identifier |
 | `analysis_result_id` | STRING | No | Analysis results collection the series belongs to, if any |
 | `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
 | `PatientID` | STRING | Yes | Patient identifier |
 | `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
 | `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
 | `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
 | `BodyPartExamined` | STRING | Yes | Anatomical region |
 | `SeriesDescription` | STRING | Yes | Description of the series |
 | `Manufacturer` | STRING | Yes | Equipment manufacturer |
 | `StudyDate` | STRING | Yes | Date study was performed |
 | `PatientSex` | STRING | Yes | Patient sex |
 | `PatientAge` | STRING | Yes | Patient age at time of study |
 | `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
 | `series_size_MB` | FLOAT | No | Size of series in megabytes |
 | `instanceCount` | INTEGER | No | Number of DICOM instances in series |
 **DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
 ### Clinical Data Access
@@ -301,7 +223,13 @@ pip install --upgrade idc-index
 **Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless you need an older version for reproducibility.
-**Tested with:** idc-index 0.11.7 (IDC data version v23)
+**IMPORTANT:** IDC data version v23 is current. Always verify your version:
 ```python
 print(client.get_idc_version())  # Should return "v23"
 ```
 If you see an older version, upgrade with: `pip install --upgrade idc-index`
 **Tested with:** idc-index 0.11.9 (IDC data version v23)
 **Optional (for data analysis):**
 ```bash
@@ -484,6 +412,15 @@ client.download_from_selection(
 # Results in: ./data/flat/*.dcm
 ```
 **Downloaded file names:**
 Individual DICOM files are named using their CRDC instance UUID: `<crdc_instance_uuid>.dcm` (e.g., `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm`). This UUID-based naming:
 - Enables version tracking (UUIDs change when file content changes)
 - Matches cloud storage organization (`s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm`)
 - Differs from DICOM UIDs (SOPInstanceUID) which are preserved inside the file metadata
 To identify files, use the `crdc_instance_uuid` column in queries or read DICOM metadata (SOPInstanceUID) from the files.
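The naming scheme above can be used to index a download directory without opening any DICOM file. The following is a minimal sketch using only the standard library; the helper names `index_downloaded_files` and `s3_url` are hypothetical (not part of `idc-index`), and the default bucket is taken from the pattern quoted above, so treat it as an assumption for data stored elsewhere.

```python
import uuid
from pathlib import Path


def index_downloaded_files(download_dir):
    """Map crdc_instance_uuid -> file path for every UUID-named .dcm file."""
    mapping = {}
    for path in Path(download_dir).rglob("*.dcm"):
        try:
            # The file stem is expected to be the CRDC instance UUID
            key = str(uuid.UUID(path.stem))
        except ValueError:
            continue  # skip files that do not follow the UUID naming scheme
        mapping[key] = path
    return mapping


def s3_url(crdc_series_uuid, crdc_instance_uuid, bucket="idc-open-data"):
    """Build a cloud storage URL following the layout described above."""
    return f"s3://{bucket}/{crdc_series_uuid}/{crdc_instance_uuid}.dcm"
```

The keys of the returned dictionary can then be matched against the `crdc_instance_uuid` column of a query result to recover series- or patient-level context for each file.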
 ### Command-Line Download
 The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
@@ -705,6 +642,13 @@ For queries requiring full DICOM metadata, complex JOINs, clinical data tables,
 See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
 **Before using BigQuery**, always check if a specialized index table already has the metadata you need:
 1. Use `client.indices_overview` or the [idc-index indices reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html) to discover all available tables and their columns
 2. Fetch the relevant index: `client.fetch_index("table_name")`
 3. Query locally with `client.sql_query()` (free, no GCP account needed)
 Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_group_index` (microscopy annotations), `sm_index` (slide microscopy), `collections_index` (collection metadata). Only use BigQuery if you need private DICOM elements or attributes not in any index.
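Step 1 of the checklist above can be automated by scanning the overview for the columns a query needs. This is a sketch only: `tables_with_columns` is a hypothetical helper, and it assumes the overview is a mapping of table name to a dict whose `schema` entry holds a `columns` list of `{'name': ...}` records, as in the schema discovery examples. The toy dictionary stands in for `client.indices_overview` so the block runs without the package.

```python
def tables_with_columns(overview, wanted):
    """Return {table_name: [matching columns]} for tables holding any wanted column."""
    wanted_lower = {name.lower() for name in wanted}
    hits = {}
    for table, info in overview.items():
        # Tolerate entries whose schema is missing or empty
        columns = (info.get("schema") or {}).get("columns") or []
        found = [c["name"] for c in columns if c["name"].lower() in wanted_lower]
        if found:
            hits[table] = found
    return hits


# Toy stand-in for client.indices_overview, limited to the keys used above
toy_overview = {
    "index": {
        "installed": True,
        "schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "Modality"}]},
    },
    "seg_index": {
        "installed": False,
        "schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "AlgorithmName"}]},
    },
}
print(tables_with_columns(toy_overview, ["AlgorithmName"]))
# prints {'seg_index': ['AlgorithmName']}
```

With a real client, passing `client.indices_overview` as the first argument should show which index to fetch before deciding that BigQuery is needed.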
 ### 8. Tool Selection Guide
 | Task | Tool | Reference |
@@ -782,166 +726,15 @@ sitk.WriteImage(smoothed, "processed_volume.nii.gz")
 ## Common Use Cases
-### Use Case 1: Find and Download Lung CT Scans for Deep Learning
-
-**Objective:** Build training dataset of lung CT scans from NLST collection
-
-**Steps:**
+See `references/use_cases.md` for complete end-to-end workflow examples including:
+- Building deep learning training datasets from lung CT scans
+- Comparing image quality across scanner manufacturers
+- Previewing data in browser before downloading
+- License-aware batch downloads for commercial use
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # 1. Query for lung CT scans with specific criteria
 query = """
 SELECT
  PatientID,
  SeriesInstanceUID,
  SeriesDescription
 FROM index
 WHERE collection_id = 'nlst'
  AND Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND license_short_name = 'CC BY 4.0'
 ORDER BY PatientID
 LIMIT 100
 """
 results = client.sql_query(query)
 print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
 # 2. Download data organized by patient
 client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
 )
 # 3. Save manifest for reproducibility
 results.to_csv('training_manifest.csv', index=False)
 ```
 ### Use Case 2: Query Brain MRI by Manufacturer for Quality Study
 **Objective:** Compare image quality across different MRI scanner manufacturers
 **Steps:**
 ```python
 from idc_index import IDCClient
 import pandas as pd
 client = IDCClient()
 # Query for brain MRI grouped by manufacturer
 query = """
 SELECT
  Manufacturer,
  ManufacturerModelName,
  COUNT(DISTINCT SeriesInstanceUID) as num_series,
  COUNT(DISTINCT PatientID) as num_patients
 FROM index
 WHERE Modality = 'MR'
  AND BodyPartExamined LIKE '%BRAIN%'
 GROUP BY Manufacturer, ManufacturerModelName
 HAVING num_series >= 10
 ORDER BY num_series DESC
 """
 manufacturers = client.sql_query(query)
 print(manufacturers)
 # Download sample from each manufacturer for comparison
 for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']
    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
      AND ManufacturerModelName = '{model}'
      AND Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """
    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
 ```
 ### Use Case 3: Visualize Series Without Downloading
 **Objective:** Preview imaging data before committing to download
 ```python
 from idc_index import IDCClient
 import webbrowser
 client = IDCClient()
 series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
 """)
 # Preview each in browser
 for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
 ```
 For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
 ### Use Case 4: License-Aware Batch Download for Commercial Use
 **Objective:** Download only CC-BY licensed data suitable for commercial applications
 **Steps:**
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Query ONLY for CC BY licensed data (allows commercial use with attribution)
 query = """
 SELECT
  SeriesInstanceUID,
  collection_id,
  PatientID,
  Modality
 FROM index
 WHERE license_short_name LIKE 'CC BY%'
  AND license_short_name NOT LIKE '%NC%'
  AND Modality IN ('CT', 'MR')
  AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
 LIMIT 200
 """
 cc_by_data = client.sql_query(query)
 print(f"Found {len(cc_by_data)} CC BY licensed series")
 print(f"Collections: {cc_by_data['collection_id'].unique()}")
 # Download with license verification
 client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
 )
 # Save license information
 cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
 ```
 ## Best Practices
 - **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
 - **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
 - **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
 - **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
@@ -989,142 +782,14 @@ cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
 ## Common SQL Query Patterns
-Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.
+See `references/sql_patterns.md` for quick-reference SQL patterns including:
 - Filter value discovery (modalities, body parts, manufacturers)
 - Annotation and segmentation queries (including seg_index, ann_index joins)
 - Slide microscopy queries (sm_index patterns)
 - Download size estimation
 - Clinical data linking
-### Discover available filter values
+For segmentation and annotation details, also see `references/digital_pathology_guide.md`.
 ```python
 # What modalities exist?
 client.sql_query("SELECT DISTINCT Modality FROM index")
 # What body parts for a specific modality?
 client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
 """)
 # What manufacturers for MR?
 client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
 """)
 ```
 ### Find annotations and segmentations
 **Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
 ```python
 # Find ALL segmentations and structure sets by DICOM Modality
 # SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
 client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
 """)
 # Find segmentations for a specific collection (includes non-analysis-result items)
 client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
 """)
 # List analysis result collections (curated derived datasets)
 client.fetch_index("analysis_results_index")
 client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
 """)
 # Find analysis results for a specific source collection
 client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
 """)
 # Use seg_index for detailed DICOM Segmentation metadata
 client.fetch_index("seg_index")
 # Get segmentation statistics by algorithm
 client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
 """)
 # Find segmentations for specific source images (e.g., chest CT)
 client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
 """)
 # Find TotalSegmentator results with source image context
 client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
 """)
 ```
 ### Query slide microscopy data
 ```python
 # sm_index has detailed metadata; join with index for collection_id
 client.fetch_index("sm_index")
 client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
           MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
 """)
 ```
 ### Estimate download size
 ```python
 # Size for specific criteria
 client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
 """)
 ```
 ### Link to clinical data
 ```python
 client.fetch_index("clinical_index")
 # Find collections with clinical data and their tables
 client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
 """)
 ```
 See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
 ## Related Skills
@@ -1134,8 +799,7 @@ The following skills complement IDC workflows for downstream analysis and visual
 - **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).
 ### Pathology and Slide Microscopy
- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
+See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).
 - **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.
 ### Metadata Visualization
 - **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
@@ -1159,11 +823,8 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col
 ### Reference Documentation
- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data
+See the Quick Navigation section at the top for the full list of reference guides with decision triggers.
- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
+
 - **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
 - **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries
 - **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
 - **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version)
 ### External Links
--- a/scientific-skills/imaging-data-commons/references/clinical_data_guide.md
+++ b/scientific-skills/imaging-data-commons/references/clinical_data_guide.md
@@ -0,0 +1,324 @@
 # Clinical Data Guide for IDC
 **Tested with:** idc-index 0.11.7 (IDC data version v23)
 Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.
 ## When to Use This Guide
 Use this guide when you need to:
 - Find what clinical metadata is available for a collection
 - Filter patients by clinical criteria (e.g., cancer stage, treatment history)
 - Join clinical attributes with imaging data for cohort selection
 - Understand and decode coded values in clinical tables
 For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns.
 ## Prerequisites
 ```bash
 pip install --upgrade idc-index
 ```
 No BigQuery credentials required - clinical data is packaged with `idc-index`.
 ## Understanding Clinical Data in IDC
 ### What is Clinical Data?
 Clinical data refers to non-imaging information that accompanies medical images:
 - Patient demographics (age, sex, race)
 - Clinical history (diagnoses, surgeries, therapies)
 - Lab tests and pathology results
 - Cancer staging (clinical and pathological)
 - Treatment outcomes
 ### Data Organization
 Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. IDC parses these into queryable tables accessible via `idc-index`.
 **Important characteristics:**
 - Clinical data is **not harmonized** across collections (terms and formats vary)
 - Not all collections have clinical data (check availability first)
 - All data is **anonymized** - `dicom_patient_id` links to imaging
 ### The clinical_index Table
 The `clinical_index` serves as a dictionary/catalog of all available clinical data:
 | Column | Purpose | Use For |
 |--------|---------|---------|
 | `collection_id` | Collection identifier | Filtering by collection |
 | `table_name` | Full BigQuery table reference | BigQuery queries (if needed) |
 | `short_table_name` | Short name | `get_clinical_table()` method |
 | `column` | Column name in table | Selecting data columns |
 | `column_label` | Human-readable description | Searching for concepts |
 | `values` | Observed attribute values for the column | Interpreting coded values |
 ### The `values` Column
 The `values` column contains an array of observed attribute values for the column defined in the `column` field. Each entry has:
 - **option_code**: The actual value observed in that column
 - **option_description**: Human-readable description of that value (from data dictionary if available, otherwise `None`)
 For ACRIN collections, value descriptions come from provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
 **Note:** For columns with >20 unique values, the `values` array is left empty (`[]`) for simplicity.
 ## Core Workflow
 ### Step 1: Fetch Clinical Index
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 client.fetch_index('clinical_index')
 # View available columns
 print(client.clinical_index.columns.tolist())
 ```
 ### Step 2: Discover Available Clinical Data
 ```python
 # List all collections with clinical data
 collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
 print(f"{len(collections_with_clinical)} collections have clinical data")
 # Find clinical attributes for a specific collection
 nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
 nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
 ```
 ### Step 3: Search for Specific Attributes
 ```python
 # Search by keyword in column_label (case-insensitive)
 stage_attrs = client.clinical_index[
     client.clinical_index["column_label"].str.contains("stage", case=False, na=False)
 ]
 stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
 ```
 ### Step 4: Load Clinical Table
 ```python
 # Load table using short_table_name
 nlst_canc_df = client.get_clinical_table("nlst_canc")
 # Examine structure
 print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
 nlst_canc_df.head()
 ```
 ### Step 5: Map Coded Values to Descriptions
 Many clinical attributes use coded values. The `values` column in `clinical_index` contains an array of observed values with their descriptions (when available).
 ```python
 # Get the clinical_index rows for NLST
 nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
 # Get observed values for a specific column
 # Filter to the row for 'clinical_stag' and extract the values array
 clinical_stag_values = nlst_clinical_columns[
    nlst_clinical_columns['column']=='clinical_stag'
 ]['values'].values[0]
 # View the observed values and their descriptions
 print(clinical_stag_values)
 # Output: array([{'option_code': '.M', 'option_description': 'Missing'},
 #                {'option_code': '110', 'option_description': 'Stage IA'},
 #                {'option_code': '120', 'option_description': 'Stage IB'}, ...])
 # Create mapping dictionary from codes to descriptions
 mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values}
 # Apply to DataFrame - convert column to string first for consistent matching
 nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict)
 ```
 ### Step 6: Join with Imaging Data
 The `dicom_patient_id` column links clinical data to imaging. It matches the `PatientID` column in the imaging index.
 ```python
 # Pandas merge approach
 import pandas as pd
 # Get NLST CT imaging data
 nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')]
 # Join with clinical data
 merged = pd.merge(
    nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(),
    nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']],
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
 )
 ```
 ```python
 # SQL join approach
 query = """
 SELECT
  index.PatientID,
  index.StudyInstanceUID,
  index.Modality,
  nlst_canc.clinical_stag
 FROM index
 JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id
 WHERE index.collection_id = 'nlst' AND index.Modality = 'CT'
 """
 results = client.sql_query(query)
 ```
 ## Common Use Cases
 ### Use Case 1: Select Patients by Cancer Stage
 ```python
 from idc_index import IDCClient
 import pandas as pd
 client = IDCClient()
 client.fetch_index('clinical_index')
 # Load clinical table
 nlst_canc = client.get_clinical_table("nlst_canc")
 # Select Stage IV patients (code '400'); compare as strings, as in Step 5
 stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id'].drop_duplicates()
 # Get CT imaging studies for these patients
 stage_iv_studies = pd.merge(
    client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')],
    stage_iv_patients,
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
 )['StudyInstanceUID'].drop_duplicates()
 print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients")
 ```
 ### Use Case 2: Find Collections with Specific Clinical Attributes
 ```python
 # Find collections with chemotherapy information
 chemo_collections = client.clinical_index[
    client.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)
 ]["collection_id"].unique()
 print(f"Collections with chemotherapy data: {list(chemo_collections)}")
 ```
 ### Use Case 3: Examine Observed Values for a Clinical Attribute
 ```python
 # Find what values have been observed for a specific attribute
 chemotherapy_rows = client.clinical_index[
    (client.clinical_index["collection_id"] == "hcc_tace_seg") &
    (client.clinical_index["column"] == "chemotherapy")
 ]
 # Get the observed values array
 values_list = chemotherapy_rows["values"].tolist()
 print(values_list)
 # Output: [[{'option_code': 'Cisplastin', 'option_description': None},
 #           {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]]
 ```
 ### Use Case 4: Generate Viewer URLs for Selected Patients
 ```python
 # Continues from Use Case 1 (requires `stage_iv_patients`)
 # Get studies for a sample Stage IV patient
 sample_patient = stage_iv_patients.iloc[0]
 studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique()
 # Generate viewer URL
 if len(studies) > 0:
    viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0])
    print(viewer_url)
 ```
 ## Key Concepts
 ### column vs column_label
 - **column**: Use for selecting data from tables (programmatic access)
 - **column_label**: Use for searching/understanding what data means (human-readable)
 Some collections (like `c4kc_kits`) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels.
 ### option_code vs option_description
 The `values` array contains observed attribute values:
 - **option_code**: The actual value observed in the column (what you filter on)
 - **option_description**: Human-readable description (from data dictionary if available, otherwise `None`)
 ### dicom_patient_id
 Every clinical table includes `dicom_patient_id`, which matches the `PatientID` column in the imaging index. This is the key for joining clinical and imaging data.
 ## Troubleshooting
 ### Issue: Clinical table not found
 **Cause:** The table name is wrong, or the collection has no such table
 **Solution:** Query clinical_index first to find available tables:
 ```python
 client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique()
 ```
 ### Issue: Empty values array
 **Cause:** The `values` array is left empty when a column has >20 unique values
 **Solution:** Load the clinical table and examine unique values directly:
 ```python
 clinical_df = client.get_clinical_table("table_name")
 clinical_df['column_name'].unique()
 ```
 ### Issue: Coded values not in mapping
 **Cause:** Some values may be missing from the dictionary (e.g., empty strings, special codes like `.M` for missing)
 **Solution:** Handle unmapped values gracefully:
 ```python
 df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing')
 ```
 ### Issue: No matching patients when joining
 **Cause:** Clinical data may include patients without images, or vice versa
 **Solution:** Verify patient overlap before joining:
 ```python
 imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique())
 clinical_patients = set(clinical_df['dicom_patient_id'].unique())
 overlap = imaging_patients & clinical_patients
 print(f"Patients with both imaging and clinical data: {len(overlap)}")
 ```
 ## Resources
 **IDC Documentation:**
 - [Clinical data organization](https://learn.canceridc.dev/data/organization-of-data/clinical) - How clinical data is organized in IDC
 - [Clinical data dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc) - Visual summary of available clinical data
 - [idc-index clinical_index documentation](https://idc-index.readthedocs.io/en/latest/column_descriptions.html#clinical-index)
 **Related Guides:**
 - `bigquery_guide.md` - Advanced clinical queries via BigQuery
 - Main SKILL.md - Core IDC workflows
 **IDC Tutorials:**
 - [clinical_data_intro.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/clinical_data_intro.ipynb)
 - [exploring_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/exploring_clinical_data.ipynb)
 - [nlst_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/nlst_clinical_data.ipynb)
--- a/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md
+++ b/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md
@@ -0,0 +1,254 @@
 # Digital Pathology Guide for IDC
 **Tested with:** IDC data version v23, idc-index 0.11.9
 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
 ## Index Tables for Digital Pathology
 Five specialized index tables provide curated metadata without needing BigQuery:
 | Table | Row Granularity | Description |
 |-------|-----------------|-------------|
 | `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
 | `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
 | `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
 | `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
 | `ann_group_index` | 1 row = 1 annotation group | Annotation group details: `AnnotationGroupLabel`, `GraphicType`, `NumberOfAnnotations`, `AlgorithmName`, property codes |
 All require `client.fetch_index("table_name")` before querying. Use `client.indices_overview` to inspect column schemas programmatically.
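Because every table listed above must be fetched before it can be queried, it can be convenient to fetch them in one call. The sketch below relies only on the `fetch_index()` method described in this guide; the helper `fetch_indices` and the constant `PATHOLOGY_INDICES` are hypothetical names introduced for illustration.

```python
PATHOLOGY_INDICES = [
    "sm_index",
    "sm_instance_index",
    "seg_index",
    "ann_index",
    "ann_group_index",
]


def fetch_indices(client, names=PATHOLOGY_INDICES):
    """Call client.fetch_index() for each table name and return the names fetched."""
    fetched = []
    for name in names:
        client.fetch_index(name)
        fetched.append(name)
    return fetched


# Usage with the real client (requires idc-index and network access):
#   from idc_index import IDCClient
#   fetch_indices(IDCClient())
```

Fetching only the tables a query actually uses avoids unnecessary downloads, so pass a shorter `names` list where possible.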
 ## Slide Microscopy Queries
 ### Basic SM metadata
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # sm_index has detailed metadata; join with index for collection_id
 client.fetch_index("sm_index")
 client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
           MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
 """)
 ```
 ### Find SM series with specific properties
 ```python
 # Find high-resolution slides with specific objective lens power
 client.fetch_index("sm_index")
 client.sql_query("""
    SELECT
        i.collection_id,
        i.PatientID,
        s.ObjectiveLensPower,
        s.min_PixelSpacing_2sf
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE s.ObjectiveLensPower >= 40
    ORDER BY s.min_PixelSpacing_2sf
    LIMIT 20
 """)
 ```
 ## Annotation Queries (ANN)
 DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
 ### Basic annotation discovery
 ```python
 # Find annotation series and their referenced images
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 client.sql_query("""
    SELECT
        a.SeriesInstanceUID as ann_series,
        a.AnnotationCoordinateType,
        a.referenced_SeriesInstanceUID as source_series
    FROM ann_index a
    LIMIT 10
 """)
 ```
 ### Annotation group statistics
 ```python
 # Get annotation group details (graphic types, counts, algorithms)
 client.sql_query("""
    SELECT
        GraphicType,
        SUM(NumberOfAnnotations) as total_annotations,
        COUNT(*) as group_count
    FROM ann_group_index
    GROUP BY GraphicType
    ORDER BY total_annotations DESC
 """)
 ```
 ### Find annotations with source slide context
 ```python
 # Find annotations with their source slide microscopy context
 client.sql_query("""
    SELECT
        i.collection_id,
        g.GraphicType,
        g.AnnotationPropertyType_CodeMeaning,
        g.AlgorithmName,
        g.NumberOfAnnotations
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AlgorithmName IS NOT NULL
    LIMIT 10
 """)
 ```
 ## Segmentations on Slide Microscopy
 DICOM Segmentations (Modality = 'SEG') are used for both radiology (e.g., organ segmentations on CT) and pathology (e.g., tissue region segmentations on whole slide images). Use `seg_index.segmented_SeriesInstanceUID` to find the source series, then filter by source Modality to isolate pathology segmentations.
 ```python
 # Find segmentations whose source is a slide microscopy image
 client.fetch_index("seg_index")
 client.fetch_index("sm_index")
 client.sql_query("""
    SELECT
        seg.SeriesInstanceUID as seg_series,
        seg.AlgorithmName,
        seg.total_segments,
        src.collection_id,
        src.Modality as source_modality
    FROM seg_index seg
    JOIN index src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'SM'
    LIMIT 20
 """)
 ```
 ## Filter by AnnotationGroupLabel
 `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.
 ### Simple label filtering
 ```python
 # Find annotation groups by label (e.g., groups mentioning "blast")
 client.fetch_index("ann_group_index")
 client.sql_query("""
    SELECT
        g.SeriesInstanceUID,
        g.AnnotationGroupLabel,
        g.GraphicType,
        g.NumberOfAnnotations,
        g.AlgorithmName
    FROM ann_group_index g
    WHERE LOWER(g.AnnotationGroupLabel) LIKE '%blast%'
    ORDER BY g.NumberOfAnnotations DESC
 """)
 ```
 ### Label filtering with collection context
 ```python
 # Find annotation groups matching a label within a specific collection
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 client.sql_query("""
    SELECT
        i.collection_id,
        g.AnnotationGroupLabel,
        g.GraphicType,
        g.NumberOfAnnotations,
        g.AnnotationPropertyType_CodeMeaning
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
      AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
    ORDER BY g.NumberOfAnnotations DESC
 """)
 ```
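The label-filtering queries in this section follow one template, so the query string can be generated from a keyword. The following is a sketch under stated assumptions: `label_filter_sql` is a hypothetical helper (not part of `idc-index`), and it only builds the SQL text, which would then be passed to `client.sql_query()`.

```python
def label_filter_sql(keyword, limit=None):
    """Build a case-insensitive AnnotationGroupLabel query string."""
    # Lowercase the keyword to match LOWER(...), and double any single quotes
    # so the keyword cannot end the SQL string literal early.
    safe = keyword.lower().replace("'", "''")
    sql = (
        "SELECT SeriesInstanceUID, AnnotationGroupLabel, GraphicType, NumberOfAnnotations\n"
        "FROM ann_group_index\n"
        f"WHERE LOWER(AnnotationGroupLabel) LIKE '%{safe}%'\n"
        "ORDER BY NumberOfAnnotations DESC"
    )
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


print(label_filter_sql("Blast", limit=5))
```

The helper does not escape the `LIKE` wildcards `%` and `_`, so a keyword containing them is still treated as a pattern.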
 ## Annotations on Slide Microscopy (SM + ANN Cross-Reference)
 When looking for annotations related to slide microscopy data, use both SM and ANN tables together. The `ann_index.referenced_SeriesInstanceUID` links each annotation series to its source slide.
 ```python
 # Find slide microscopy images and their annotations in a collection
 client.fetch_index("sm_index")
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 client.sql_query("""
    SELECT
        i.collection_id,
        s.ObjectiveLensPower,
        g.AnnotationGroupLabel,
        g.NumberOfAnnotations,
        g.GraphicType
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN sm_index s ON a.referenced_SeriesInstanceUID = s.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
    ORDER BY g.NumberOfAnnotations DESC
 """)
 ```
 ## Join Patterns
 ### SM join (slide microscopy details with collection context)
 ```python
 client.fetch_index("sm_index")
 result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
 """)
 ```
 ### ANN join (annotation groups with collection context)
 ```python
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 result = client.sql_query("""
    SELECT
        i.collection_id,
        g.AnnotationGroupLabel,
        g.GraphicType,
        g.NumberOfAnnotations,
        a.referenced_SeriesInstanceUID as source_series
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    LIMIT 10
 """)
 ```
 ## Related Tools
 The following tools work with DICOM format for digital pathology workflows:
 **Python Libraries:**
 - [highdicom](https://github.com/ImagingDataCommons/highdicom) - High-level DICOM abstractions for Python. Create and read DICOM Segmentations (SEG), Structured Reports (SR), and parametric maps for pathology and radiology. Developed by IDC.
 - [wsidicom](https://github.com/imi-bigpicture/wsidicom) - Python package for reading DICOM WSI datasets. Parses metadata into easy-to-use dataclasses for whole slide image analysis.
 - [TIA-Toolbox](https://github.com/TissueImageAnalytics/tiatoolbox) - End-to-end computational pathology library with DICOM support via `DICOMWSIReader`. Provides tile extraction, feature extraction, and pretrained deep learning models.
 - [EZ-WSI-DICOMweb](https://github.com/GoogleCloudPlatform/EZ-WSI-DICOMweb) - Extract image patches from DICOM whole slide images via DICOMweb. Designed for AI/ML workflows with cloud DICOM stores.
 **Viewers:**
 - [Slim](https://github.com/ImagingDataCommons/slim) - Web-based DICOM slide microscopy viewer and annotation tool. Supports brightfield and multiplexed immunofluorescence imaging via DICOMweb. Developed by IDC.
 - [QuPath](https://qupath.github.io/) - Cross-platform open source software for whole slide image analysis. Supports DICOM WSI via Bio-Formats and OpenSlide (v0.4.0+).
 **Conversion:**
 - [dicom_wsi](https://github.com/Steven-N-Hart/dicom_wsi) - Python implementation for converting proprietary WSI formats to DICOM-compliant files.
--- a/scientific-skills/imaging-data-commons/references/index_tables_guide.md
+++ b/scientific-skills/imaging-data-commons/references/index_tables_guide.md
@@ -0,0 +1,146 @@
 # Index Tables Guide for IDC
 **Tested with:** idc-index 0.11.9 (IDC data version v23)
 This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.
 **Complete index table documentation:** https://idc-index.readthedocs.io/en/latest/indices_reference.html
 ## When to Use This Guide
 Load this guide when you need to:
 - Discover table schemas and column types programmatically
 - Access index tables as pandas DataFrames (not via SQL)
 - Understand key columns and join relationships between tables
 For SQL query examples (filter discovery, finding annotations, size estimation), see `references/sql_patterns.md`.
 ## Prerequisites
 ```bash
 pip install --upgrade idc-index
 ```
 ## Accessing Index Tables
 ### Via SQL (recommended for filtering/aggregation)
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Query the primary index (always available)
 results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
 # Fetch and query additional indices
 client.fetch_index("collections_index")
 collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
 client.fetch_index("analysis_results_index")
 analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
 ```
 ### As pandas DataFrames (direct access)
 ```python
 # Primary index (always available after client initialization)
 df = client.index
 # Fetch and access on-demand indices
 client.fetch_index("sm_index")
 sm_df = client.sm_index
 ```
 ## Discovering Table Schemas
 The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
 **DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # List all available indices with descriptions
 for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f"  Installed: {info['installed']}")
    print(f"  Description: {info['description']}")
 # Get complete schema for a specific index (columns, types, descriptions)
 schema = client.indices_overview["index"]["schema"]
 print(f"\nTable: {schema['table_description']}")
 print("\nColumns:")
 for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f"  {col['name']} ({col['type']}): {desc}")
 # Find columns that are DICOM attributes (check description for "DICOM" reference)
 dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in (c.get('description') or '').upper()]
 print(f"\nDICOM-sourced columns: {dicom_cols}")
 ```
 **Alternative: use `get_index_schema()` method:**
 ```python
 schema = client.get_index_schema("index")
 # Returns same schema dict: {'table_description': ..., 'columns': [...]}
 ```
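Once a schema dict is available, searching it by keyword is often quicker than reading the full column list. This is a minimal sketch that assumes the dict shape shown above (`table_description` plus a `columns` list with `name`, `type`, and an optional `description`); `find_columns` is a hypothetical helper and `toy_schema` is made-up data.

```python
def find_columns(schema, keyword):
    """Return (name, type) pairs whose name or description mentions `keyword`."""
    needle = keyword.lower()
    matches = []
    for col in schema["columns"]:
        # Treat a missing or None description as empty text
        text = f"{col['name']} {col.get('description') or ''}".lower()
        if needle in text:
            matches.append((col["name"], col["type"]))
    return matches


# Toy schema with the same shape as the dict described above (not real output)
toy_schema = {
    "table_description": "example table",
    "columns": [
        {"name": "Modality", "type": "STRING", "description": "DICOM Modality attribute"},
        {"name": "series_size_MB", "type": "FLOAT", "description": "Size of series in megabytes"},
        {"name": "collection_id", "type": "STRING"},
    ],
}
print(find_columns(toy_schema, "size"))
```

With a real client, the same function could be applied to `client.get_index_schema("index")`.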
 ## Key Columns Reference
 Most common columns in the primary `index` table (use `indices_overview` for complete list and descriptions):
 | Column | Type | DICOM | Description |
 |--------|------|-------|-------------|
 | `collection_id` | STRING | No | IDC collection identifier |
 | `analysis_result_id` | STRING | No | Analysis results collection that the series belongs to, if applicable |
 | `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
 | `PatientID` | STRING | Yes | Patient identifier |
 | `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
 | `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
 | `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, SEG, ANN, RTSTRUCT, etc.) |
 | `BodyPartExamined` | STRING | Yes | Anatomical region |
 | `SeriesDescription` | STRING | Yes | Description of the series |
 | `Manufacturer` | STRING | Yes | Equipment manufacturer |
 | `StudyDate` | STRING | Yes | Date study was performed |
 | `PatientSex` | STRING | Yes | Patient sex |
 | `PatientAge` | STRING | Yes | Patient age at time of study |
 | `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
 | `series_size_MB` | FLOAT | No | Size of series in megabytes |
 | `instanceCount` | INTEGER | No | Number of DICOM instances in series |
 **DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
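As an example of applying DICOM knowledge to a column value, the DICOM Age String (AS) value representation encodes an age as three digits followed by a unit letter (`D`, `W`, `M`, or `Y`), for instance `045Y`. The sketch below converts such strings to approximate years. It assumes `PatientAge` holds unmodified AS values, which should be verified against the actual data; `age_string_to_years` is a hypothetical helper.

```python
import re

_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")
# Approximate unit lengths in days (average month and year)
_DAYS_PER_UNIT = {"D": 1, "W": 7, "M": 30.4375, "Y": 365.25}


def age_string_to_years(value):
    """Convert a DICOM Age String such as '045Y' to approximate years.

    Returns None for missing or malformed values.
    """
    if not isinstance(value, str):
        return None
    match = _AGE_PATTERN.match(value.strip())
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    return number * _DAYS_PER_UNIT[unit] / 365.25


print(age_string_to_years("045Y"), age_string_to_years("006M"))
```

Returning `None` for malformed input keeps the helper safe to apply across a whole column, for example with `df['PatientAge'].map(age_string_to_years)`.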
 ## Join Column Reference
 Use this table to identify join columns between index tables. Always call `client.fetch_index("table_name")` before using a table in SQL.
 | Table A | Table B | Join Condition |
 |---------|---------|----------------|
 | `index` | `collections_index` | `index.collection_id = collections_index.collection_id` |
 | `index` | `sm_index` | `index.SeriesInstanceUID = sm_index.SeriesInstanceUID` |
 | `index` | `seg_index` | `index.SeriesInstanceUID = seg_index.segmented_SeriesInstanceUID` (links each SEG to its source series) |
 | `index` | `ann_index` | `index.SeriesInstanceUID = ann_index.SeriesInstanceUID` |
 | `ann_index` | `ann_group_index` | `ann_index.SeriesInstanceUID = ann_group_index.SeriesInstanceUID` |
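 | `index` | `clinical_index` | `index.collection_id = clinical_index.collection_id` (collection-level only; for patient-level joins, match `index.PatientID` to `dicom_patient_id` in a table loaded with `get_clinical_table()`) |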
 | `index` | `contrast_index` | `index.SeriesInstanceUID = contrast_index.SeriesInstanceUID` |
 For complete query examples using these joins, see `references/sql_patterns.md`.
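The `seg_index` row in the table above is easy to misread, because `seg_index` carries two series UIDs. The toy example below uses Python's built-in `sqlite3` as a stand-in for the SQL engine behind `client.sql_query()`, with made-up UIDs and a reduced set of columns. It illustrates the join direction only and does not reproduce the real tables. The table name `index` is quoted here because SQLite reserves that word.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE "index" (SeriesInstanceUID TEXT, Modality TEXT, collection_id TEXT);
    CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT);
    -- One CT series and one SEG series derived from it (made-up UIDs)
    INSERT INTO "index" VALUES ('1.2.3', 'CT', 'toy_collection');
    INSERT INTO "index" VALUES ('1.2.4', 'SEG', 'toy_collection');
    INSERT INTO seg_index VALUES ('1.2.4', '1.2.3');
""")

# Join on segmented_SeriesInstanceUID: the index row describes the SOURCE image
source_modality = con.execute("""
    SELECT src.Modality FROM seg_index s
    JOIN "index" src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

# Join on SeriesInstanceUID: the index row describes the SEG series itself
seg_modality = con.execute("""
    SELECT seg.Modality FROM seg_index s
    JOIN "index" seg ON s.SeriesInstanceUID = seg.SeriesInstanceUID
""").fetchall()

print(source_modality, seg_modality)
```

Choose the join column according to whose metadata is needed: the segmented image (for filtering by source modality or body part) or the segmentation series (for its collection, size, or download).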
 ## Troubleshooting
 **Issue:** Column not found in table
 - **Cause:** Column name misspelled or doesn't exist in that table
 - **Solution:** Use `client.indices_overview["table_name"]["schema"]["columns"]` to list available columns
 **Issue:** DataFrame access returns None
 - **Cause:** Index not fetched or property name incorrect
 - **Solution:** Fetch first with `client.fetch_index()`, then access via property matching the index name
 ## Resources
 - Complete index table documentation: https://idc-index.readthedocs.io/en/latest/indices_reference.html
 - `references/sql_patterns.md` for query examples using these tables
 - `references/clinical_data_guide.md` for clinical data workflows
 - `references/digital_pathology_guide.md` for pathology-specific indices
--- a/scientific-skills/imaging-data-commons/references/sql_patterns.md
+++ b/scientific-skills/imaging-data-commons/references/sql_patterns.md
@@ -0,0 +1,207 @@
 # SQL Query Patterns for IDC
 **Tested with:** idc-index 0.11.9 (IDC data version v23)
 Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.
 ## When to Use This Guide
 Load this guide when you need quick-reference SQL patterns for:
 - Discovering available filter values (modalities, body parts, manufacturers)
 - Finding annotations and segmentations across collections
 - Querying slide microscopy and annotation data
 - Estimating download sizes before download
 - Linking imaging data to clinical data
 For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`.
 ## Prerequisites
 ```bash
 pip install --upgrade idc-index
 ```
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 ```
 ## Discover Available Filter Values
 ```python
 # What modalities exist?
 client.sql_query("SELECT DISTINCT Modality FROM index")
 # What body parts for a specific modality?
 client.sql_query("""
    SELECT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
 """)
 # What manufacturers for MR?
 client.sql_query("""
    SELECT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
 """)
 ```
 ## Find Annotations and Segmentations
 **Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
 ```python
 # Find ALL segmentations and structure sets by DICOM Modality
 # SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
 client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
 """)
 # Find segmentations for a specific collection (includes non-analysis-result items)
 client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
 """)
 # List analysis result collections (curated derived datasets)
 client.fetch_index("analysis_results_index")
 client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
 """)
 # Find analysis results for a specific source collection
 client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
 """)
 # Use seg_index for detailed DICOM Segmentation metadata
 client.fetch_index("seg_index")
 # Get segmentation statistics by algorithm
 client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
 """)
 # Find segmentations for specific source images (e.g., chest CT)
 client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
 """)
 # Find TotalSegmentator results with source image context
 client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
 """)
 # Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
 # ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AlgorithmName IS NOT NULL
    LIMIT 10
 """)
 # See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more
 ```
 ## Query Slide Microscopy and Annotation Data
 Use `sm_index` for slide microscopy metadata and `ann_index`/`ann_group_index` for annotations on slides (DICOM ANN objects). Filter annotation groups by `AnnotationGroupLabel` to find annotations by name.
 ```python
 client.fetch_index("sm_index")
 client.fetch_index("ann_index")
 client.fetch_index("ann_group_index")
 # Example: find annotation groups by label within a collection
 client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
    FROM ann_group_index g
    JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
      AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
 """)
 ```
 See `references/digital_pathology_guide.md` for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.
 ## Estimate Download Size
 ```python
 # Size for specific criteria
 client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
 """)
 ```
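The total returned by the size query can be compared with free disk space before starting a download. This is a rough sketch: `format_size_mb` and `fits_on_disk` are hypothetical helpers, the 10% safety margin is arbitrary, and the code assumes `series_size_MB` counts megabytes of 1,000,000 bytes, which this guide does not state.

```python
import shutil


def format_size_mb(total_mb):
    """Render a size given in megabytes as MB, GB, or TB (1000-based steps)."""
    size, unit = float(total_mb), "MB"
    for next_unit in ("GB", "TB"):
        if size < 1000:
            break
        size, unit = size / 1000, next_unit
    return f"{size:.1f} {unit}"


def fits_on_disk(total_mb, path=".", safety_factor=1.1):
    """Return True if `path` has room for total_mb plus a safety margin."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return total_mb * safety_factor <= free_mb


print(format_size_mb(1536.0), fits_on_disk(0))
```

In practice `total_mb` would come from the `total_mb` column of the query result above.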
 ## Link to Clinical Data
 ```python
 client.fetch_index("clinical_index")
 # Find collections with clinical data and their tables
 client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
 """)
 ```
 See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
 ## Troubleshooting
 **Issue:** Query returns error "table not found"
 - **Cause:** Index not fetched before query
 - **Solution:** Call `client.fetch_index("table_name")` before using tables other than the primary `index`
 **Issue:** LIKE pattern not matching expected results
 - **Cause:** Case sensitivity or whitespace
 - **Solution:** Use `LOWER(column)` for case-insensitive matching, `TRIM()` for whitespace
 **Issue:** JOIN returns fewer rows than expected
 - **Cause:** NULL values in join columns or no matching records
 - **Solution:** Use `LEFT JOIN` to include rows without matches, check for NULLs with `IS NOT NULL`
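The difference between `JOIN` and `LEFT JOIN` mentioned in the last issue can be checked on toy data. The sketch below uses Python's built-in `sqlite3` with made-up tables (`images`, `clinical`) rather than the real IDC indices; it only illustrates how unmatched rows disappear from an inner join and reappear with NULLs in a left join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE images (PatientID TEXT);
    CREATE TABLE clinical (dicom_patient_id TEXT, stage TEXT);
    INSERT INTO images VALUES ('P1'), ('P2'), ('P3');
    INSERT INTO clinical VALUES ('P1', 'IA'), ('P2', 'IIB');
""")

inner_rows = con.execute("""
    SELECT i.PatientID, c.stage FROM images i
    JOIN clinical c ON i.PatientID = c.dicom_patient_id
""").fetchall()

left_rows = con.execute("""
    SELECT i.PatientID, c.stage FROM images i
    LEFT JOIN clinical c ON i.PatientID = c.dicom_patient_id
    ORDER BY i.PatientID
""").fetchall()

# Rows dropped by the inner join show up with NULL (None) on the right side
unmatched = [patient for patient, stage in left_rows if stage is None]
print(len(inner_rows), len(left_rows), unmatched)
```

Listing the unmatched keys this way usually shows whether the gap comes from missing records or from mismatched identifiers.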
 ## Resources
 - `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references
 - `references/clinical_data_guide.md` for clinical data patterns and value mapping
 - `references/digital_pathology_guide.md` for pathology-specific queries
 - `references/bigquery_guide.md` for advanced queries requiring full DICOM metadata
--- a/scientific-skills/imaging-data-commons/references/use_cases.md
+++ b/scientific-skills/imaging-data-commons/references/use_cases.md
@@ -0,0 +1,186 @@
 # Common Use Cases for IDC
 **Tested with:** idc-index 0.11.9 (IDC data version v23)
 This guide provides end-to-end workflow examples for common IDC use cases. Each use case walks through the workflow from query to preview or download, with best practices along the way.
 ## When to Use This Guide
 Load this guide when you need:
 - Complete end-to-end workflow examples for training dataset creation
 - Patterns for multi-step data selection and download workflows
 - Examples of license-aware data handling for commercial use
 - Visualization workflows for data preview before download
 For core API patterns (query, download, visualize, citations), see the "Core Capabilities" section in the main SKILL.md.
 ## Prerequisites
 ```bash
 pip install --upgrade idc-index
 ```
 ## Use Case 1: Find and Download Lung CT Scans for Deep Learning
 **Objective:** Build a training dataset of lung CT scans from the NLST collection
 **Steps:**
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # 1. Query for lung CT scans with specific criteria
 query = """
 SELECT
  PatientID,
  SeriesInstanceUID,
  SeriesDescription
 FROM index
 WHERE collection_id = 'nlst'
  AND Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND license_short_name = 'CC BY 4.0'
 ORDER BY PatientID
 LIMIT 100
 """
 results = client.sql_query(query)
 print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
 # 2. Download data organized by patient
 client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
 )
 # 3. Save manifest for reproducibility
 results.to_csv('training_manifest.csv', index=False)
 ```
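After a long download, it may be worth checking that every series in the manifest actually arrived. The helper below is a sketch that relies on one assumption: the `dirTemplate` ends in `%SeriesInstanceUID`, as in the example above, so each series lands in a directory named after its UID. `missing_series` is a hypothetical name, not an `idc-index` function.

```python
from pathlib import Path

def missing_series(download_dir, series_uids):
    """Return the UIDs that have no non-empty directory under download_dir."""
    found = set()
    for path in Path(download_dir).rglob("*"):
        # A series counts as present only if its directory holds at least one entry
        if path.is_dir() and any(path.iterdir()):
            found.add(path.name)
    return [uid for uid in series_uids if uid not in found]

# Usage after the download above (hypothetical follow-up step):
# missing = missing_series("./training_data", results["SeriesInstanceUID"])
# print(f"{len(missing)} series missing or empty")
```

This only confirms that something was written for each series; it does not validate file counts or DICOM content.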
 ## Use Case 2: Query Brain MRI by Manufacturer for Quality Study
 **Objective:** Compare image quality across different MRI scanner manufacturers
 **Steps:**
 ```python
 from idc_index import IDCClient
 import pandas as pd
 client = IDCClient()
 # Query for brain MRI grouped by manufacturer
 query = """
 SELECT
  Manufacturer,
  ManufacturerModelName,
  COUNT(DISTINCT SeriesInstanceUID) as num_series,
  COUNT(DISTINCT PatientID) as num_patients
 FROM index
 WHERE Modality = 'MR'
  AND BodyPartExamined LIKE '%BRAIN%'
 GROUP BY Manufacturer, ManufacturerModelName
 HAVING num_series >= 10
 ORDER BY num_series DESC
 """
 manufacturers = client.sql_query(query)
 print(manufacturers)
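 # Download a small sample for each of the top three manufacturer/model pairs
 for _, row in manufacturers.head(3).iterrows():
    # Skip groups where either value is NULL; they cannot be matched with '='
    if pd.isna(row['Manufacturer']) or pd.isna(row['ManufacturerModelName']):
        continue
    # Escape single quotes so the values are safe inside SQL string literals
    mfr = row['Manufacturer'].replace("'", "''")
    model = row['ManufacturerModelName'].replace("'", "''")
    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
      AND ManufacturerModelName = '{model}'
      AND Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """
    series = client.sql_query(query)
    # Keep the directory name free of spaces and path separators
    safe_mfr = row['Manufacturer'].replace(' ', '_').replace('/', '_')
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{safe_mfr}"
    )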
 ```
 ## Use Case 3: Visualize Series Without Downloading
 **Objective:** Preview imaging data before committing to download
 ```python
 from idc_index import IDCClient
 import webbrowser
 client = IDCClient()
 series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
 """)
 # Preview each in browser
 for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
 ```
 For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
 ## Use Case 4: License-Aware Batch Download for Commercial Use
 **Objective:** Download only CC-BY licensed data suitable for commercial applications
 **Steps:**
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Query ONLY for CC BY licensed data (allows commercial use with attribution)
 query = """
 SELECT
  SeriesInstanceUID,
  collection_id,
  PatientID,
   Modality,
   license_short_name
 FROM index
 WHERE license_short_name LIKE 'CC BY%'
   AND license_short_name NOT LIKE '%NC%'
   AND Modality IN ('CT', 'MR')
   AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
 LIMIT 200
 """
 cc_by_data = client.sql_query(query)
 print(f"Found {len(cc_by_data)} CC BY licensed series")
 print(f"Collections: {cc_by_data['collection_id'].unique()}")
 print(f"Licenses: {cc_by_data['license_short_name'].unique()}")
 # Download the selected series (license filtering happened in the query above)
 client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
 )
 # Save the manifest, including each series' license, for attribution records
 cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
 ```
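The SQL filter above can be backed by a second check on the saved manifest, which is useful when manifests from several runs are merged later. This is a minimal sketch that assumes each manifest row carries a `license_short_name` field (add that column to the `SELECT` if it is absent); `commercial_use_violations` is a hypothetical helper and the toy rows are illustrative values, not real IDC records.

```python
def commercial_use_violations(rows, license_field="license_short_name"):
    """Return rows whose license is missing, not CC BY, or carries an NC clause."""
    violations = []
    for row in rows:
        name = (row.get(license_field) or "").strip().upper()
        if not name.startswith("CC BY") or "NC" in name:
            violations.append(row)
    return violations

# Toy rows standing in for manifest records
manifest = [
    {"SeriesInstanceUID": "1.2.3", "license_short_name": "CC BY 4.0"},
    {"SeriesInstanceUID": "1.2.4", "license_short_name": "CC BY-NC 4.0"},
    {"SeriesInstanceUID": "1.2.5", "license_short_name": None},
]
bad = commercial_use_violations(manifest)
print(f"{len(bad)} of {len(manifest)} series need review before commercial use")
```

With a pandas DataFrame such as `cc_by_data`, the rows can be passed as `cc_by_data.to_dict("records")`.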
 ## Resources
 - Main SKILL.md for core API patterns (query, download, visualize)
 - `references/clinical_data_guide.md` for clinical data integration workflows
 - `references/sql_patterns.md` for additional SQL query patterns
 - `references/index_tables_guide.md` for complex join patterns