diff --git a/scientific-skills/imaging-data-commons/SKILL.md b/scientific-skills/imaging-data-commons/SKILL.md index 2e65b4b..ef55896 100644 --- a/scientific-skills/imaging-data-commons/SKILL.md +++ b/scientific-skills/imaging-data-commons/SKILL.md @@ -3,9 +3,10 @@ name: imaging-data-commons description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses. license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data. metadata: - version: 1.2.0 + version: 1.3.1 skill-author: Andrey Fedorov, @fedorov - idc-index: "0.11.7" + idc-index: "0.11.9" + idc-data-version: "v23" repository: https://github.com/ImagingDataCommons/idc-claude-skill --- @@ -15,20 +16,39 @@ metadata: Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access. +**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`) + **Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index)) -**Check current data scale for the latest version:** +**CRITICAL - Check package version and upgrade if needed (run this FIRST):** + +```python +import idc_index + +REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file +installed = idc_index.__version__ + +if installed < REQUIRED_VERSION: + print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...") + import subprocess + subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True) + print("Upgrade complete. 
Restart Python to use new version.") +else: + print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})") +``` + +**Verify IDC data version and check current data scale:** ```python from idc_index import IDCClient client = IDCClient() -# get IDC data version -print(client.get_idc_version()) +# Verify IDC data version (should be "v23") +print(f"IDC data version: {client.get_idc_version()}") # Get collection count and total series stats = client.sql_query(""" - SELECT + SELECT COUNT(DISTINCT collection_id) as collections, COUNT(DISTINCT analysis_result_id) as analysis_results, COUNT(DISTINCT PatientID) as patients, @@ -54,6 +74,30 @@ print(stats) - Checking data licenses before use in research or commercial applications - Visualizing medical images in a browser without local DICOM viewer software +## Quick Navigation + +**Core Sections (inline):** +- IDC Data Model - Collection and analysis result hierarchy +- Index Tables - Available tables and joining patterns +- Installation - Package setup and version verification +- Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch) +- Best Practices - Usage guidelines +- Troubleshooting - Common issues and solutions + +**Reference Guides (load on demand):** + +| Guide | When to Load | +|-------|--------------| +| `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access | +| `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) | +| `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation | +| `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping | +| `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping | +| `dicomweb_guide.md` | DICOMweb endpoints, PACS integration | +| `digital_pathology_guide.md` | Slide microscopy (SM), annotations (ANN), pathology workflows | +| `bigquery_guide.md` | Full DICOM metadata, private elements (requires 
GCP) | +| `cli_guide.md` | Command-line tools (`idc download`, manifest files) | + ## IDC Data Model IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance): @@ -75,6 +119,8 @@ Use `collection_id` to find original imaging data, may include annotations depos The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames. +**Complete index table documentation:** Use https://idc-index.readthedocs.io/en/latest/indices_reference.html for quick check of available tables and columns without executing any code. + **Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure. ### Available Tables @@ -89,6 +135,9 @@ The `idc-index` package provides multiple metadata index tables, accessible via | `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata | | `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy | | `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series | +| `ann_index` | 1 row = 1 DICOM ANN series | fetch_index() | Microscopy Bulk Simple Annotations series metadata; references annotated image series | +| `ann_group_index` | 1 row = 1 annotation group | fetch_index() | Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm | +| `contrast_index` | 1 row = 1 series with contrast info | fetch_index() | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) | **Auto** = loaded automatically when `IDCClient()` is instantiated **fetch_index()** = requires `client.fetch_index("table_name")` to load @@ -107,140 
+156,13 @@ The `idc-index` package provides multiple metadata index tables, accessible via | `source_DOI` | index, analysis_results_index | Link by publication DOI | | `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier | | `Modality` | index, prior_versions_index | Filter by imaging modality | -| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata | +| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata | | `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) | +| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) | **Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts). 
-**Example joins:** -```python -from idc_index import IDCClient -client = IDCClient() - -# Join index with collections_index to get cancer types -client.fetch_index("collections_index") -result = client.sql_query(""" - SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations - FROM index i - JOIN collections_index c ON i.collection_id = c.collection_id - WHERE i.Modality = 'MR' - LIMIT 10 -""") - -# Join index with sm_index for slide microscopy details -client.fetch_index("sm_index") -result = client.sql_query(""" - SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf - FROM index i - JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID - LIMIT 10 -""") - -# Join seg_index with index to find segmentations and their source images -client.fetch_index("seg_index") -result = client.sql_query(""" - SELECT - s.SeriesInstanceUID as seg_series, - s.AlgorithmName, - s.total_segments, - src.collection_id, - src.Modality as source_modality, - src.BodyPartExamined - FROM seg_index s - JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID - WHERE s.AlgorithmType = 'AUTOMATIC' - LIMIT 10 -""") -``` - -### Accessing Index Tables - -**Via SQL (recommended for filtering/aggregation):** -```python -from idc_index import IDCClient -client = IDCClient() - -# Query the primary index (always available) -results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10") - -# Fetch and query additional indices -client.fetch_index("collections_index") -collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index") - -client.fetch_index("analysis_results_index") -analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5") -``` - -**As pandas DataFrames (direct access):** -```python -# Primary index (always available after client initialization) -df = client.index - -# Fetch and access on-demand indices -client.fetch_index("sm_index") -sm_df = 
client.sm_index -``` - -### Discovering Table Schemas (Essential for Query Writing) - -The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.** - -**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected. - -```python -from idc_index import IDCClient -client = IDCClient() - -# List all available indices with descriptions -for name, info in client.indices_overview.items(): - print(f"\n{name}:") - print(f" Installed: {info['installed']}") - print(f" Description: {info['description']}") - -# Get complete schema for a specific index (columns, types, descriptions) -schema = client.indices_overview["index"]["schema"] -print(f"\nTable: {schema['table_description']}") -print("\nColumns:") -for col in schema['columns']: - desc = col.get('description', 'No description') - # Description indicates if column is from DICOM attribute - print(f" {col['name']} ({col['type']}): {desc}") - -# Find columns that are DICOM attributes (check description for "DICOM" reference) -dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()] -print(f"\nDICOM-sourced columns: {dicom_cols}") -``` - -**Alternative: use `get_index_schema()` method:** -```python -schema = client.get_index_schema("index") -# Returns same schema dict: {'table_description': ..., 'columns': [...]} -``` - -### Key Columns in Primary `index` Table - -Most common columns for queries (use `indices_overview` for complete list and descriptions): - -| Column | Type | DICOM | Description | 
-|--------|------|-------|-------------| -| `collection_id` | STRING | No | IDC collection identifier | -| `analysis_result_id` | STRING | No | If applicable, indicates what analysis results collection given series is part of | -| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) | -| `PatientID` | STRING | Yes | Patient identifier | -| `StudyInstanceUID` | STRING | Yes | DICOM Study UID | -| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing | -| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) | -| `BodyPartExamined` | STRING | Yes | Anatomical region | -| `SeriesDescription` | STRING | Yes | Description of the series | -| `Manufacturer` | STRING | Yes | Equipment manufacturer | -| `StudyDate` | STRING | Yes | Date study was performed | -| `PatientSex` | STRING | Yes | Patient sex | -| `PatientAge` | STRING | Yes | Patient age at time of study | -| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) | -| `series_size_MB` | FLOAT | No | Size of series in megabytes | -| `instanceCount` | INTEGER | No | Number of DICOM instances in series | - -**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats. +For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`. ### Clinical Data Access @@ -301,7 +223,13 @@ pip install --upgrade idc-index **Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility. 
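The version check near the top of SKILL.md compares version strings with `<`, which is a lexicographic comparison and can misorder releases such as `0.11.10` and `0.11.9`. A minimal sketch of a numeric comparison follows, assuming plain dotted numeric versions; the helper names `version_tuple` and `needs_upgrade` are hypothetical and not part of `idc-index`:

```python
def version_tuple(version: str) -> tuple:
    """Convert a dotted version such as '0.11.9' into a tuple of ints.

    Assumes purely numeric components; pre-release suffixes such as
    'rc1' are not handled in this sketch.
    """
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed: str, required: str) -> bool:
    """Return True when the installed version is older than the required one."""
    return version_tuple(installed) < version_tuple(required)


# String comparison misorders these two releases; tuple comparison does not.
print("0.11.10" < "0.11.9")                # True: lexicographic, wrong order
print(needs_upgrade("0.11.10", "0.11.9"))  # False: numeric, correct order
```

Where the `packaging` library is installed, `packaging.version.Version` is likely a better fit, since it also handles pre-release and development suffixes.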
-**Tested with:** idc-index 0.11.7 (IDC data version v23) +**IMPORTANT:** IDC data version v23 is current. Always verify your version: +```python +print(client.get_idc_version()) # Should return "v23" +``` +If you see an older version, upgrade with: `pip install --upgrade idc-index` + +**Tested with:** idc-index 0.11.9 (IDC data version v23) **Optional (for data analysis):** ```bash @@ -484,6 +412,15 @@ client.download_from_selection( # Results in: ./data/flat/*.dcm ``` +**Downloaded file names:** + +Individual DICOM files are named using their CRDC instance UUID: `.dcm` (e.g., `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm`). This UUID-based naming: +- Enables version tracking (UUIDs change when file content changes) +- Matches cloud storage organization (`s3://idc-open-data//.dcm`) +- Differs from DICOM UIDs (SOPInstanceUID) which are preserved inside the file metadata + +To identify files, use the `crdc_instance_uuid` column in queries or read DICOM metadata (SOPInstanceUID) from the files. + ### Command-Line Download The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`. @@ -705,6 +642,13 @@ For queries requiring full DICOM metadata, complex JOINs, clinical data tables, See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization. +**Before using BigQuery**, always check if a specialized index table already has the metadata you need: +1. Use `client.indices_overview` or the [idc-index indices reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html) to discover all available tables and their columns +2. Fetch the relevant index: `client.fetch_index("table_name")` +3. 
Query locally with `client.sql_query()` (free, no GCP account needed) + +Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_group_index` (microscopy annotations), `sm_index` (slide microscopy), `collections_index` (collection metadata). Only use BigQuery if you need private DICOM elements or attributes not in any index. + ### 8. Tool Selection Guide | Task | Tool | Reference | @@ -782,166 +726,15 @@ sitk.WriteImage(smoothed, "processed_volume.nii.gz") ## Common Use Cases -### Use Case 1: Find and Download Lung CT Scans for Deep Learning - -**Objective:** Build training dataset of lung CT scans from NLST collection - -**Steps:** -```python -from idc_index import IDCClient - -client = IDCClient() - -# 1. Query for lung CT scans with specific criteria -query = """ -SELECT - PatientID, - SeriesInstanceUID, - SeriesDescription -FROM index -WHERE collection_id = 'nlst' - AND Modality = 'CT' - AND BodyPartExamined = 'CHEST' - AND license_short_name = 'CC BY 4.0' -ORDER BY PatientID -LIMIT 100 -""" - -results = client.sql_query(query) -print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients") - -# 2. Download data organized by patient -client.download_from_selection( - seriesInstanceUID=list(results['SeriesInstanceUID'].values), - downloadDir="./training_data", - dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID" -) - -# 3. 
Save manifest for reproducibility -results.to_csv('training_manifest.csv', index=False) -``` - -### Use Case 2: Query Brain MRI by Manufacturer for Quality Study - -**Objective:** Compare image quality across different MRI scanner manufacturers - -**Steps:** -```python -from idc_index import IDCClient -import pandas as pd - -client = IDCClient() - -# Query for brain MRI grouped by manufacturer -query = """ -SELECT - Manufacturer, - ManufacturerModelName, - COUNT(DISTINCT SeriesInstanceUID) as num_series, - COUNT(DISTINCT PatientID) as num_patients -FROM index -WHERE Modality = 'MR' - AND BodyPartExamined LIKE '%BRAIN%' -GROUP BY Manufacturer, ManufacturerModelName -HAVING num_series >= 10 -ORDER BY num_series DESC -""" - -manufacturers = client.sql_query(query) -print(manufacturers) - -# Download sample from each manufacturer for comparison -for _, row in manufacturers.head(3).iterrows(): - mfr = row['Manufacturer'] - model = row['ManufacturerModelName'] - - query = f""" - SELECT SeriesInstanceUID - FROM index - WHERE Manufacturer = '{mfr}' - AND ManufacturerModelName = '{model}' - AND Modality = 'MR' - AND BodyPartExamined LIKE '%BRAIN%' - LIMIT 5 - """ - - series = client.sql_query(query) - client.download_from_selection( - seriesInstanceUID=list(series['SeriesInstanceUID'].values), - downloadDir=f"./quality_study/{mfr.replace(' ', '_')}" - ) -``` - -### Use Case 3: Visualize Series Without Downloading - -**Objective:** Preview imaging data before committing to download - -```python -from idc_index import IDCClient -import webbrowser - -client = IDCClient() - -series_list = client.sql_query(""" - SELECT SeriesInstanceUID, PatientID, SeriesDescription - FROM index - WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT' - LIMIT 10 -""") - -# Preview each in browser -for _, row in series_list.iterrows(): - viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID']) - print(f"Patient {row['PatientID']}: {row['SeriesDescription']}") - 
print(f" View at: {viewer_url}") - # webbrowser.open(viewer_url) # Uncomment to open automatically -``` - -For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration. - -### Use Case 4: License-Aware Batch Download for Commercial Use - -**Objective:** Download only CC-BY licensed data suitable for commercial applications - -**Steps:** -```python -from idc_index import IDCClient - -client = IDCClient() - -# Query ONLY for CC BY licensed data (allows commercial use with attribution) -query = """ -SELECT - SeriesInstanceUID, - collection_id, - PatientID, - Modality -FROM index -WHERE license_short_name LIKE 'CC BY%' - AND license_short_name NOT LIKE '%NC%' - AND Modality IN ('CT', 'MR') - AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN') -LIMIT 200 -""" - -cc_by_data = client.sql_query(query) - -print(f"Found {len(cc_by_data)} CC BY licensed series") -print(f"Collections: {cc_by_data['collection_id'].unique()}") - -# Download with license verification -client.download_from_selection( - seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values), - downloadDir="./commercial_dataset", - dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID" -) - -# Save license information -cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False) -``` +See `references/use_cases.md` for complete end-to-end workflow examples including: +- Building deep learning training datasets from lung CT scans +- Comparing image quality across scanner manufacturers +- Previewing data in browser before downloading +- License-aware batch downloads for commercial use ## Best Practices +- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). 
If using an older version, recommend `pip install --upgrade idc-index` - **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC) - **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications - **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure @@ -989,142 +782,14 @@ cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False) ## Common SQL Query Patterns -Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above. +See `references/sql_patterns.md` for quick-reference SQL patterns including: +- Filter value discovery (modalities, body parts, manufacturers) +- Annotation and segmentation queries (including seg_index, ann_index joins) +- Slide microscopy queries (sm_index patterns) +- Download size estimation +- Clinical data linking -### Discover available filter values -```python -# What modalities exist? -client.sql_query("SELECT DISTINCT Modality FROM index") - -# What body parts for a specific modality? -client.sql_query(""" - SELECT DISTINCT BodyPartExamined, COUNT(*) as n - FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL - GROUP BY BodyPartExamined ORDER BY n DESC -""") - -# What manufacturers for MR? -client.sql_query(""" - SELECT DISTINCT Manufacturer, COUNT(*) as n - FROM index WHERE Modality = 'MR' - GROUP BY Manufacturer ORDER BY n DESC -""") -``` - -### Find annotations and segmentations - -**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type. 
- -```python -# Find ALL segmentations and structure sets by DICOM Modality -# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set -client.sql_query(""" - SELECT collection_id, Modality, COUNT(*) as series_count - FROM index - WHERE Modality IN ('SEG', 'RTSTRUCT') - GROUP BY collection_id, Modality - ORDER BY series_count DESC -""") - -# Find segmentations for a specific collection (includes non-analysis-result items) -client.sql_query(""" - SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id - FROM index - WHERE collection_id = 'tcga_luad' AND Modality = 'SEG' -""") - -# List analysis result collections (curated derived datasets) -client.fetch_index("analysis_results_index") -client.sql_query(""" - SELECT analysis_result_id, analysis_result_title, Collections, Modalities - FROM analysis_results_index -""") - -# Find analysis results for a specific source collection -client.sql_query(""" - SELECT analysis_result_id, analysis_result_title - FROM analysis_results_index - WHERE Collections LIKE '%tcga_luad%' -""") - -# Use seg_index for detailed DICOM Segmentation metadata -client.fetch_index("seg_index") - -# Get segmentation statistics by algorithm -client.sql_query(""" - SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count - FROM seg_index - WHERE AlgorithmName IS NOT NULL - GROUP BY AlgorithmName, AlgorithmType - ORDER BY seg_count DESC - LIMIT 10 -""") - -# Find segmentations for specific source images (e.g., chest CT) -client.sql_query(""" - SELECT - s.SeriesInstanceUID as seg_series, - s.AlgorithmName, - s.total_segments, - s.segmented_SeriesInstanceUID as source_series - FROM seg_index s - JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID - WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST' - LIMIT 10 -""") - -# Find TotalSegmentator results with source image context -client.sql_query(""" - SELECT - seg_info.collection_id, - COUNT(DISTINCT s.SeriesInstanceUID) as seg_count, - SUM(s.total_segments) 
as total_segments - FROM seg_index s - JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID - WHERE s.AlgorithmName LIKE '%TotalSegmentator%' - GROUP BY seg_info.collection_id - ORDER BY seg_count DESC -""") -``` - -### Query slide microscopy data -```python -# sm_index has detailed metadata; join with index for collection_id -client.fetch_index("sm_index") -client.sql_query(""" - SELECT i.collection_id, COUNT(*) as slides, - MIN(s.min_PixelSpacing_2sf) as min_resolution - FROM sm_index s - JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID - GROUP BY i.collection_id - ORDER BY slides DESC -""") -``` - -### Estimate download size -```python -# Size for specific criteria -client.sql_query(""" - SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count - FROM index - WHERE collection_id = 'nlst' AND Modality = 'CT' -""") -``` - -### Link to clinical data -```python -client.fetch_index("clinical_index") - -# Find collections with clinical data and their tables -client.sql_query(""" - SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns - FROM clinical_index - GROUP BY collection_id, table_name - ORDER BY collection_id -""") -``` - -See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection. +For segmentation and annotation details, also see `references/digital_pathology_guide.md`. ## Related Skills @@ -1134,8 +799,7 @@ The following skills complement IDC workflows for downstream analysis and visual - **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET). ### Pathology and Slide Microscopy -- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data. 
-- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC. +See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer). ### Metadata Visualization - **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.). @@ -1159,11 +823,8 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col ### Reference Documentation -- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data -- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility -- **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`) -- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries -- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details +See the Quick Navigation section at the top for the full list of reference guides with decision triggers. 
+ - **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version) ### External Links diff --git a/scientific-skills/imaging-data-commons/references/clinical_data_guide.md b/scientific-skills/imaging-data-commons/references/clinical_data_guide.md new file mode 100644 index 0000000..eea3452 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/clinical_data_guide.md @@ -0,0 +1,324 @@ +# Clinical Data Guide for IDC + +**Tested with:** idc-index 0.11.7 (IDC data version v23) + +Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`. + +## When to Use This Guide + +Use this guide when you need to: +- Find what clinical metadata is available for a collection +- Filter patients by clinical criteria (e.g., cancer stage, treatment history) +- Join clinical attributes with imaging data for cohort selection +- Understand and decode coded values in clinical tables + +For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns. + +## Prerequisites + +```bash +pip install --upgrade idc-index +``` + +No BigQuery credentials required - clinical data is packaged with `idc-index`. + +## Understanding Clinical Data in IDC + +### What is Clinical Data? + +Clinical data refers to non-imaging information that accompanies medical images: +- Patient demographics (age, sex, race) +- Clinical history (diagnoses, surgeries, therapies) +- Lab tests and pathology results +- Cancer staging (clinical and pathological) +- Treatment outcomes + +### Data Organization + +Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. 
IDC parses these into queryable tables accessible via `idc-index`.
+
+**Important characteristics:**
+- Clinical data is **not harmonized** across collections (terms and formats vary)
+- Not all collections have clinical data (check availability first)
+- All data is **anonymized** - `dicom_patient_id` links to imaging
+
+### The clinical_index Table
+
+The `clinical_index` serves as a dictionary/catalog of all available clinical data:
+
+| Column | Purpose | Use For |
+|--------|---------|---------|
+| `collection_id` | Collection identifier | Filtering by collection |
+| `table_name` | Full BigQuery table reference | BigQuery queries (if needed) |
+| `short_table_name` | Short name | `get_clinical_table()` method |
+| `column` | Column name in table | Selecting data columns |
+| `column_label` | Human-readable description | Searching for concepts |
+| `values` | Observed attribute values for the column | Interpreting coded values |
+
+### The `values` Column
+
+The `values` column contains an array of observed attribute values for the column defined in the `column` field. Each entry has:
+- **option_code**: The actual value observed in that column
+- **option_description**: Human-readable description of that value (from data dictionary if available, otherwise `None`)
+
+For ACRIN collections, value descriptions come from provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
+
+**Note:** For columns with >20 unique values, the `values` array is left empty (`[]`) for simplicity.
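The structure of the `values` array described above can be turned into a lookup table. The sketch below is illustrative only: `build_value_mapping` is a hypothetical helper (not an `idc-index` function), the sample entries are made up to match the described shape, and falling back to the raw code when the description is `None` is one possible design choice.

```python
def build_value_mapping(values):
    """Map each observed option_code to a readable label.

    `values` is assumed to be a sequence of dicts with 'option_code' and
    'option_description' keys. When the description is None, the code
    itself is used as the label.
    """
    mapping = {}
    for entry in values:
        code = entry["option_code"]
        description = entry["option_description"]
        mapping[code] = description if description is not None else code
    return mapping


# Hypothetical sample entries, shaped like the `values` array.
sample_values = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
    {"option_code": "7", "option_description": None},
]

print(build_value_mapping(sample_values))
# An empty array (columns with more than 20 unique values) yields an empty mapping.
print(build_value_mapping([]))
```

An empty mapping is a signal to inspect the clinical table column directly, since no value dictionary is provided for it.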
+
+## Core Workflow
+
+### Step 1: Fetch Clinical Index
+
+```python
+from idc_index import IDCClient
+
+client = IDCClient()
+client.fetch_index('clinical_index')
+
+# View available columns
+print(client.clinical_index.columns.tolist())
+```
+
+### Step 2: Discover Available Clinical Data
+
+```python
+# List all collections with clinical data
+collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
+print(f"{len(collections_with_clinical)} collections have clinical data")
+
+# Find clinical attributes for a specific collection
+nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
+nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
+```
+
+### Step 3: Search for Specific Attributes
+
+```python
+# Search by keyword in column_label (the pattern matches "Stage" or "stage")
+stage_attrs = client.clinical_index[
+    client.clinical_index["column_label"].str.contains("[Ss]tage", na=False)
+]
+stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
+```
+
+### Step 4: Load Clinical Table
+
+```python
+# Load table using short_table_name
+nlst_canc_df = client.get_clinical_table("nlst_canc")
+
+# Examine structure
+print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
+nlst_canc_df.head()
+```
+
+### Step 5: Map Coded Values to Descriptions
+
+Many clinical attributes use coded values. The `values` column in `clinical_index` contains an array of observed values with their descriptions (when available).
+ +```python +# Get the clinical_index rows for NLST +nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst'] + +# Get observed values for a specific column +# Filter to the row for 'clinical_stag' and extract the values array +clinical_stag_values = nlst_clinical_columns[ + nlst_clinical_columns['column']=='clinical_stag' +]['values'].values[0] + +# View the observed values and their descriptions +print(clinical_stag_values) +# Output: array([{'option_code': '.M', 'option_description': 'Missing'}, +# {'option_code': '110', 'option_description': 'Stage IA'}, +# {'option_code': '120', 'option_description': 'Stage IB'}, ...]) + +# Create mapping dictionary from codes to descriptions +mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values} + +# Apply to DataFrame - convert column to string first for consistent matching +nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict) +``` + +### Step 6: Join with Imaging Data + +The `dicom_patient_id` column links clinical data to imaging. It matches the `PatientID` column in the imaging index. 
+ +```python +# Pandas merge approach +import pandas as pd + +# Get NLST CT imaging data +nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')] + +# Join with clinical data +merged = pd.merge( + nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(), + nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']], + left_on='PatientID', + right_on='dicom_patient_id', + how='inner' +) +``` + +```python +# SQL join approach +query = """ +SELECT + index.PatientID, + index.StudyInstanceUID, + index.Modality, + nlst_canc.clinical_stag +FROM index +JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id +WHERE index.collection_id = 'nlst' AND index.Modality = 'CT' +""" +results = client.sql_query(query) +``` + +## Common Use Cases + +### Use Case 1: Select Patients by Cancer Stage + +```python +from idc_index import IDCClient +import pandas as pd + +client = IDCClient() +client.fetch_index('clinical_index') + +# Load clinical table +nlst_canc = client.get_clinical_table("nlst_canc") + +# Select Stage IV patients (code '400'); compare as strings, as in Step 5, +# so the filter works whether the column is stored as text or as integers +stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id'] + +# Get CT imaging studies for these patients +stage_iv_studies = pd.merge( + client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')], + stage_iv_patients, + left_on='PatientID', + right_on='dicom_patient_id', + how='inner' +)['StudyInstanceUID'].drop_duplicates() + +print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients") +``` + +### Use Case 2: Find Collections with Specific Clinical Attributes + +```python +# Find collections with chemotherapy information (case-insensitive label search) +chemo_collections = client.clinical_index[ + client.clinical_index["column_label"].str.contains("chemotherapy", case=False, na=False) +]["collection_id"].unique() + +print(f"Collections with chemotherapy data: {list(chemo_collections)}") +``` + +### Use Case 3: Examine Observed Values
for a Clinical Attribute + +```python +# Find what values have been observed for a specific attribute +chemotherapy_rows = client.clinical_index[ + (client.clinical_index["collection_id"] == "hcc_tace_seg") & + (client.clinical_index["column"] == "chemotherapy") +] + +# Get the observed values array +values_list = chemotherapy_rows["values"].tolist() +print(values_list) +# Output: [[{'option_code': 'Cisplastin', 'option_description': None}, +# {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]] +``` + +### Use Case 4: Generate Viewer URLs for Selected Patients + +```python +# Reuses `client` and `stage_iv_patients` from Use Case 1 + +# Get studies for a sample Stage IV patient +sample_patient = stage_iv_patients.iloc[0] +studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique() + +# Generate viewer URL +if len(studies) > 0: + viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0]) + print(viewer_url) +``` + +## Key Concepts + +### column vs column_label + +- **column**: Use for selecting data from tables (programmatic access) +- **column_label**: Use for searching/understanding what data means (human-readable) + +Some collections (like `c4kc_kits`) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels. + +### option_code vs option_description + +The `values` array contains observed attribute values: +- **option_code**: The actual value observed in the column (what you filter on) +- **option_description**: Human-readable description (from data dictionary if available, otherwise `None`) + +### dicom_patient_id + +Every clinical table includes `dicom_patient_id`, which matches the `PatientID` column in the imaging index. This is the key for joining clinical and imaging data.
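To make the `column` versus `column_label` distinction concrete, here is a small sketch on made-up catalog rows: search the human-readable label, then return the programmatic names to use when loading and selecting data. The rows, the collection names, and the function `find_columns_by_label` are all hypothetical; they only mimic the `clinical_index` column names listed in this guide.

```python
# Made-up rows mimicking clinical_index columns (not real IDC content)
toy_catalog = [
    {"collection_id": "toy_a", "short_table_name": "toy_a_clinical",
     "column": "clin_stg", "column_label": "Clinical Stage at Diagnosis"},
    {"collection_id": "toy_a", "short_table_name": "toy_a_clinical",
     "column": "age", "column_label": "age"},
    {"collection_id": "toy_b", "short_table_name": "toy_b_followup",
     "column": "tx_chemo", "column_label": None},
]


def find_columns_by_label(catalog, keyword):
    """Return (short_table_name, column) pairs whose label mentions keyword.

    The search runs on column_label (human-readable); the result carries
    column (programmatic). Rows without a label are skipped.
    """
    keyword = keyword.lower()
    return [
        (row["short_table_name"], row["column"])
        for row in catalog
        if row["column_label"] and keyword in row["column_label"].lower()
    ]


matches = find_columns_by_label(toy_catalog, "STAGE")
```

Each returned pair names the table to load and the column to select from it, which is the same division of labor the two fields have in the real `clinical_index`.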
+ +## Troubleshooting + +### Issue: Clinical table not found + +**Cause:** Using wrong table name or table doesn't exist for collection + +**Solution:** Query clinical_index first to find available tables: +```python +client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique() +``` + +### Issue: Empty values array + +**Cause:** The `values` array is left empty when a column has >20 unique values + +**Solution:** Load the clinical table and examine unique values directly: +```python +clinical_df = client.get_clinical_table("table_name") +clinical_df['column_name'].unique() +``` + +### Issue: Coded values not in mapping + +**Cause:** Some values may be missing from the dictionary (e.g., empty strings, special codes like `.M` for missing) + +**Solution:** Handle unmapped values gracefully: +```python +df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing') +``` + +### Issue: No matching patients when joining + +**Cause:** Clinical data may include patients without images, or vice versa + +**Solution:** Verify patient overlap before joining: +```python +imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique()) +clinical_patients = set(clinical_df['dicom_patient_id'].unique()) +overlap = imaging_patients & clinical_patients +print(f"Patients with both imaging and clinical data: {len(overlap)}") +``` + +## Resources + +**IDC Documentation:** +- [Clinical data organization](https://learn.canceridc.dev/data/organization-of-data/clinical) - How clinical data is organized in IDC +- [Clinical data dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc) - Visual summary of available clinical data +- [idc-index clinical_index documentation](https://idc-index.readthedocs.io/en/latest/column_descriptions.html#clinical-index) + +**Related Guides:** +- `bigquery_guide.md` - Advanced clinical queries via BigQuery 
+- Main SKILL.md - Core IDC workflows + +**IDC Tutorials:** +- [clinical_data_intro.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/clinical_data_intro.ipynb) +- [exploring_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/exploring_clinical_data.ipynb) +- [nlst_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/nlst_clinical_data.ipynb) diff --git a/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md b/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md new file mode 100644 index 0000000..ecf0be5 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md @@ -0,0 +1,254 @@ +# Digital Pathology Guide for IDC + +**Tested with:** IDC data version v23, idc-index 0.11.9 + +For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC. + +## Index Tables for Digital Pathology + +Five specialized index tables provide curated metadata without needing BigQuery: + +| Table | Row Granularity | Description | +|-------|-----------------|-------------| +| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions | +| `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images | +| `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. 
Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations | +| `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide | +| `ann_group_index` | 1 row = 1 annotation group | Annotation group details: `AnnotationGroupLabel`, `GraphicType`, `NumberOfAnnotations`, `AlgorithmName`, property codes | + +All require `client.fetch_index("table_name")` before querying. Use `client.indices_overview` to inspect column schemas programmatically. + +## Slide Microscopy Queries + +### Basic SM metadata + +```python +from idc_index import IDCClient +client = IDCClient() + +# sm_index has detailed metadata; join with index for collection_id +client.fetch_index("sm_index") +client.sql_query(""" + SELECT i.collection_id, COUNT(*) as slides, + MIN(s.min_PixelSpacing_2sf) as min_resolution + FROM sm_index s + JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID + GROUP BY i.collection_id + ORDER BY slides DESC +""") +``` + +### Find SM series with specific properties + +```python +# Find high-resolution slides with specific objective lens power +client.fetch_index("sm_index") +client.sql_query(""" + SELECT + i.collection_id, + i.PatientID, + s.ObjectiveLensPower, + s.min_PixelSpacing_2sf + FROM sm_index s + JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID + WHERE s.ObjectiveLensPower >= 40 + ORDER BY s.min_PixelSpacing_2sf + LIMIT 20 +""") +``` + +## Annotation Queries (ANN) + +DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`. 
+ +### Basic annotation discovery + +```python +# Find annotation series and their referenced images +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") + +client.sql_query(""" + SELECT + a.SeriesInstanceUID as ann_series, + a.AnnotationCoordinateType, + a.referenced_SeriesInstanceUID as source_series + FROM ann_index a + LIMIT 10 +""") +``` + +### Annotation group statistics + +```python +# Get annotation group details (graphic types, counts, algorithms) +client.sql_query(""" + SELECT + GraphicType, + SUM(NumberOfAnnotations) as total_annotations, + COUNT(*) as group_count + FROM ann_group_index + GROUP BY GraphicType + ORDER BY total_annotations DESC +""") +``` + +### Find annotations with source slide context + +```python +# Find annotations with their source slide microscopy context +client.sql_query(""" + SELECT + i.collection_id, + g.GraphicType, + g.AnnotationPropertyType_CodeMeaning, + g.AlgorithmName, + g.NumberOfAnnotations + FROM ann_group_index g + JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID + JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID + WHERE g.AlgorithmName IS NOT NULL + LIMIT 10 +""") +``` + +## Segmentations on Slide Microscopy + +DICOM Segmentations (Modality = 'SEG') are used for both radiology (e.g., organ segmentations on CT) and pathology (e.g., tissue region segmentations on whole slide images). Use `seg_index.segmented_SeriesInstanceUID` to find the source series, then filter by source Modality to isolate pathology segmentations. 
+ +```python +# Find segmentations whose source is a slide microscopy image +client.fetch_index("seg_index") +client.fetch_index("sm_index") +client.sql_query(""" + SELECT + seg.SeriesInstanceUID as seg_series, + seg.AlgorithmName, + seg.total_segments, + src.collection_id, + src.Modality as source_modality + FROM seg_index seg + JOIN index src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID + WHERE src.Modality = 'SM' + LIMIT 20 +""") +``` + +## Filter by AnnotationGroupLabel + +`AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search. + +### Simple label filtering + +```python +# Find annotation groups by label (e.g., groups mentioning "blast") +client.fetch_index("ann_group_index") +client.sql_query(""" + SELECT + g.SeriesInstanceUID, + g.AnnotationGroupLabel, + g.GraphicType, + g.NumberOfAnnotations, + g.AlgorithmName + FROM ann_group_index g + WHERE LOWER(g.AnnotationGroupLabel) LIKE '%blast%' + ORDER BY g.NumberOfAnnotations DESC +""") +``` + +### Label filtering with collection context + +```python +# Find annotation groups matching a label within a specific collection +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") +client.sql_query(""" + SELECT + i.collection_id, + g.AnnotationGroupLabel, + g.GraphicType, + g.NumberOfAnnotations, + g.AnnotationPropertyType_CodeMeaning + FROM ann_group_index g + JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID + JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID + WHERE i.collection_id = 'your_collection_id' + AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%' + ORDER BY g.NumberOfAnnotations DESC +""") +``` + +## Annotations on Slide Microscopy (SM + ANN Cross-Reference) + +When looking for annotations related to slide microscopy data, use both SM and ANN tables together. The `ann_index.referenced_SeriesInstanceUID` links each annotation series to its source slide. 
+ +```python +# Find slide microscopy images and their annotations in a collection +client.fetch_index("sm_index") +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") +client.sql_query(""" + SELECT + i.collection_id, + s.ObjectiveLensPower, + g.AnnotationGroupLabel, + g.NumberOfAnnotations, + g.GraphicType + FROM ann_group_index g + JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID + JOIN sm_index s ON a.referenced_SeriesInstanceUID = s.SeriesInstanceUID + JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID + WHERE i.collection_id = 'your_collection_id' + ORDER BY g.NumberOfAnnotations DESC +""") +``` + +## Join Patterns + +### SM join (slide microscopy details with collection context) + +```python +client.fetch_index("sm_index") +result = client.sql_query(""" + SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf + FROM index i + JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID + LIMIT 10 +""") +``` + +### ANN join (annotation groups with collection context) + +```python +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") +result = client.sql_query(""" + SELECT + i.collection_id, + g.AnnotationGroupLabel, + g.GraphicType, + g.NumberOfAnnotations, + a.referenced_SeriesInstanceUID as source_series + FROM ann_group_index g + JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID + JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID + LIMIT 10 +""") +``` + +## Related Tools + +The following tools work with DICOM format for digital pathology workflows: + +**Python Libraries:** +- [highdicom](https://github.com/ImagingDataCommons/highdicom) - High-level DICOM abstractions for Python. Create and read DICOM Segmentations (SEG), Structured Reports (SR), and parametric maps for pathology and radiology. Developed by IDC. +- [wsidicom](https://github.com/imi-bigpicture/wsidicom) - Python package for reading DICOM WSI datasets. 
Parses metadata into easy-to-use dataclasses for whole slide image analysis. +- [TIA-Toolbox](https://github.com/TissueImageAnalytics/tiatoolbox) - End-to-end computational pathology library with DICOM support via `DICOMWSIReader`. Provides tile extraction, feature extraction, and pretrained deep learning models. +- [EZ-WSI-DICOMweb](https://github.com/GoogleCloudPlatform/EZ-WSI-DICOMweb) - Extract image patches from DICOM whole slide images via DICOMweb. Designed for AI/ML workflows with cloud DICOM stores. + +**Viewers:** +- [Slim](https://github.com/ImagingDataCommons/slim) - Web-based DICOM slide microscopy viewer and annotation tool. Supports brightfield and multiplexed immunofluorescence imaging via DICOMweb. Developed by IDC. +- [QuPath](https://qupath.github.io/) - Cross-platform open source software for whole slide image analysis. Supports DICOM WSI via Bio-Formats and OpenSlide (v0.4.0+). + +**Conversion:** +- [dicom_wsi](https://github.com/Steven-N-Hart/dicom_wsi) - Python implementation for converting proprietary WSI formats to DICOM-compliant files. diff --git a/scientific-skills/imaging-data-commons/references/index_tables_guide.md b/scientific-skills/imaging-data-commons/references/index_tables_guide.md new file mode 100644 index 0000000..0ec6dab --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/index_tables_guide.md @@ -0,0 +1,146 @@ +# Index Tables Guide for IDC + +**Tested with:** idc-index 0.11.9 (IDC data version v23) + +This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md. 
+ +**Complete index table documentation:** https://idc-index.readthedocs.io/en/latest/indices_reference.html + +## When to Use This Guide + +Load this guide when you need to: +- Discover table schemas and column types programmatically +- Access index tables as pandas DataFrames (not via SQL) +- Understand key columns and join relationships between tables + +For SQL query examples (filter discovery, finding annotations, size estimation), see `references/sql_patterns.md`. + +## Prerequisites + +```bash +pip install --upgrade idc-index +``` + +## Accessing Index Tables + +### Via SQL (recommended for filtering/aggregation) + +```python +from idc_index import IDCClient +client = IDCClient() + +# Query the primary index (always available) +results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10") + +# Fetch and query additional indices +client.fetch_index("collections_index") +collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index") + +client.fetch_index("analysis_results_index") +analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5") +``` + +### As pandas DataFrames (direct access) + +```python +# Primary index (always available after client initialization) +df = client.index + +# Fetch and access on-demand indices +client.fetch_index("sm_index") +sm_df = client.sm_index +``` + +## Discovering Table Schemas + +The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.** + +**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). 
This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected. + +```python +from idc_index import IDCClient +client = IDCClient() + +# List all available indices with descriptions +for name, info in client.indices_overview.items(): + print(f"\n{name}:") + print(f" Installed: {info['installed']}") + print(f" Description: {info['description']}") + +# Get complete schema for a specific index (columns, types, descriptions) +schema = client.indices_overview["index"]["schema"] +print(f"\nTable: {schema['table_description']}") +print("\nColumns:") +for col in schema['columns']: + desc = col.get('description', 'No description') + # Description indicates if column is from DICOM attribute + print(f" {col['name']} ({col['type']}): {desc}") + +# Find columns that are DICOM attributes (check description for "DICOM" reference) +dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()] +print(f"\nDICOM-sourced columns: {dicom_cols}") +``` + +**Alternative: use `get_index_schema()` method:** +```python +schema = client.get_index_schema("index") +# Returns same schema dict: {'table_description': ..., 'columns': [...]} +``` + +## Key Columns Reference + +Most common columns in the primary `index` table (use `indices_overview` for complete list and descriptions): + +| Column | Type | DICOM | Description | +|--------|------|-------|-------------| +| `collection_id` | STRING | No | IDC collection identifier | +| `analysis_result_id` | STRING | No | If applicable, indicates what analysis results collection given series is part of | +| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) | +| `PatientID` | STRING | Yes | Patient identifier | +| `StudyInstanceUID` | STRING | Yes | DICOM Study UID | +| `SeriesInstanceUID` | STRING | Yes | DICOM Series 
UID — use for downloads/viewing | +| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, SEG, ANN, RTSTRUCT, etc.) | +| `BodyPartExamined` | STRING | Yes | Anatomical region | +| `SeriesDescription` | STRING | Yes | Description of the series | +| `Manufacturer` | STRING | Yes | Equipment manufacturer | +| `StudyDate` | STRING | Yes | Date study was performed | +| `PatientSex` | STRING | Yes | Patient sex | +| `PatientAge` | STRING | Yes | Patient age at time of study | +| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) | +| `series_size_MB` | FLOAT | No | Size of series in megabytes | +| `instanceCount` | INTEGER | No | Number of DICOM instances in series | + +**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats. + +## Join Column Reference + +Use this table to identify join columns between index tables. Always call `client.fetch_index("table_name")` before using a table in SQL. + +| Table A | Table B | Join Condition | +|---------|---------|----------------| +| `index` | `collections_index` | `index.collection_id = collections_index.collection_id` | +| `index` | `sm_index` | `index.SeriesInstanceUID = sm_index.SeriesInstanceUID` | +| `index` | `seg_index` | `index.SeriesInstanceUID = seg_index.segmented_SeriesInstanceUID` | +| `index` | `ann_index` | `index.SeriesInstanceUID = ann_index.SeriesInstanceUID` | +| `ann_index` | `ann_group_index` | `ann_index.SeriesInstanceUID = ann_group_index.SeriesInstanceUID` | +| `index` | `clinical_index` | `index.collection_id = clinical_index.collection_id` (then filter by patient) | +| `index` | `contrast_index` | `index.SeriesInstanceUID = contrast_index.SeriesInstanceUID` | + +For complete query examples using these joins, see `references/sql_patterns.md`. 
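The `seg_index` row in the table above joins on `segmented_SeriesInstanceUID`, which leads to the segmented source image rather than to the SEG series itself. The sketch below illustrates that difference on toy data. It uses Python's built-in `sqlite3` purely as a stand-in so it runs without downloading any index; the rows are made up, and the double quotes around `"index"` are a SQLite requirement that the `client.sql_query()` examples in these guides do not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "index" (SeriesInstanceUID TEXT, Modality TEXT, collection_id TEXT);
    CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT);
    -- One CT series and one SEG series that segments it (made-up identifiers)
    INSERT INTO "index" VALUES
        ('ct-1',  'CT',  'toy_collection'),
        ('seg-1', 'SEG', 'toy_collection');
    INSERT INTO seg_index VALUES ('seg-1', 'ct-1');
""")

# Join on segmented_SeriesInstanceUID: the index row describes the SOURCE image
source_rows = conn.execute("""
    SELECT s.SeriesInstanceUID, src.SeriesInstanceUID, src.Modality
    FROM seg_index s
    JOIN "index" src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

# Join on SeriesInstanceUID: the index row describes the SEG series itself
own_rows = conn.execute("""
    SELECT s.SeriesInstanceUID, own.SeriesInstanceUID, own.Modality
    FROM seg_index s
    JOIN "index" own ON s.SeriesInstanceUID = own.SeriesInstanceUID
""").fetchall()
```

Which join to use depends on the question: filter on the source image (modality, body part) with the first, or look up the segmentation's own collection, size, or license with the second.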
+ +## Troubleshooting + +**Issue:** Column not found in table +- **Cause:** Column name misspelled or doesn't exist in that table +- **Solution:** Use `client.indices_overview["table_name"]["schema"]["columns"]` to list available columns + +**Issue:** DataFrame access returns None +- **Cause:** Index not fetched or property name incorrect +- **Solution:** Fetch first with `client.fetch_index()`, then access via property matching the index name + +## Resources + +- Complete index table documentation: https://idc-index.readthedocs.io/en/latest/indices_reference.html +- `references/sql_patterns.md` for query examples using these tables +- `references/clinical_data_guide.md` for clinical data workflows +- `references/digital_pathology_guide.md` for pathology-specific indices diff --git a/scientific-skills/imaging-data-commons/references/sql_patterns.md b/scientific-skills/imaging-data-commons/references/sql_patterns.md new file mode 100644 index 0000000..bb862cf --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/sql_patterns.md @@ -0,0 +1,207 @@ +# SQL Query Patterns for IDC + +**Tested with:** idc-index 0.11.9 (IDC data version v23) + +Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md. + +## When to Use This Guide + +Load this guide when you need quick-reference SQL patterns for: +- Discovering available filter values (modalities, body parts, manufacturers) +- Finding annotations and segmentations across collections +- Querying slide microscopy and annotation data +- Estimating download sizes before download +- Linking imaging data to clinical data + +For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`. 
+ +## Prerequisites + +```bash +pip install --upgrade idc-index +``` + +```python +from idc_index import IDCClient +client = IDCClient() +``` + +## Discover Available Filter Values + +```python +# What modalities exist? +client.sql_query("SELECT DISTINCT Modality FROM index") + +# What body parts for a specific modality? +client.sql_query(""" + SELECT DISTINCT BodyPartExamined, COUNT(*) as n + FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL + GROUP BY BodyPartExamined ORDER BY n DESC +""") + +# What manufacturers for MR? +client.sql_query(""" + SELECT DISTINCT Manufacturer, COUNT(*) as n + FROM index WHERE Modality = 'MR' + GROUP BY Manufacturer ORDER BY n DESC +""") +``` + +## Find Annotations and Segmentations + +**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type. + +```python +# Find ALL segmentations and structure sets by DICOM Modality +# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set +client.sql_query(""" + SELECT collection_id, Modality, COUNT(*) as series_count + FROM index + WHERE Modality IN ('SEG', 'RTSTRUCT') + GROUP BY collection_id, Modality + ORDER BY series_count DESC +""") + +# Find segmentations for a specific collection (includes non-analysis-result items) +client.sql_query(""" + SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id + FROM index + WHERE collection_id = 'tcga_luad' AND Modality = 'SEG' +""") + +# List analysis result collections (curated derived datasets) +client.fetch_index("analysis_results_index") +client.sql_query(""" + SELECT analysis_result_id, analysis_result_title, Collections, Modalities + FROM analysis_results_index +""") + +# Find analysis results for a specific source collection +client.sql_query(""" + SELECT analysis_result_id, analysis_result_title + FROM analysis_results_index + WHERE Collections 
LIKE '%tcga_luad%' +""") + +# Use seg_index for detailed DICOM Segmentation metadata +client.fetch_index("seg_index") + +# Get segmentation statistics by algorithm +client.sql_query(""" + SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count + FROM seg_index + WHERE AlgorithmName IS NOT NULL + GROUP BY AlgorithmName, AlgorithmType + ORDER BY seg_count DESC + LIMIT 10 +""") + +# Find segmentations for specific source images (e.g., chest CT) +client.sql_query(""" + SELECT + s.SeriesInstanceUID as seg_series, + s.AlgorithmName, + s.total_segments, + s.segmented_SeriesInstanceUID as source_series + FROM seg_index s + JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID + WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST' + LIMIT 10 +""") + +# Find TotalSegmentator results with source image context +client.sql_query(""" + SELECT + seg_info.collection_id, + COUNT(DISTINCT s.SeriesInstanceUID) as seg_count, + SUM(s.total_segments) as total_segments + FROM seg_index s + JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID + WHERE s.AlgorithmName LIKE '%TotalSegmentator%' + GROUP BY seg_info.collection_id + ORDER BY seg_count DESC +""") + +# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations +# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") +client.sql_query(""" + SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id + FROM ann_group_index g + JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID + JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID + WHERE g.AlgorithmName IS NOT NULL + LIMIT 10 +""") +# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more +``` + +## Query Slide Microscopy and Annotation Data + +Use `sm_index` for slide microscopy metadata and `ann_index`/`ann_group_index` for 
annotations on slides (DICOM ANN objects). Filter annotation groups by `AnnotationGroupLabel` to find annotations by name. + +```python +client.fetch_index("sm_index") +client.fetch_index("ann_index") +client.fetch_index("ann_group_index") + +# Example: find annotation groups by label within a collection +client.sql_query(""" + SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations + FROM ann_group_index g + JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID + WHERE i.collection_id = 'your_collection_id' + AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%' +""") +``` + +See `references/digital_pathology_guide.md` for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples. + +## Estimate Download Size + +```python +# Size for specific criteria +client.sql_query(""" + SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count + FROM index + WHERE collection_id = 'nlst' AND Modality = 'CT' +""") +``` + +## Link to Clinical Data + +```python +client.fetch_index("clinical_index") + +# Find collections with clinical data and their tables +client.sql_query(""" + SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns + FROM clinical_index + GROUP BY collection_id, table_name + ORDER BY collection_id +""") +``` + +See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection. 
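Once a clinical table has been identified, linking it to imaging comes down to a join between `PatientID` and `dicom_patient_id`. The toy sketch below shows how the join type changes the result when some patients have no clinical rows. It is an illustration only: it uses Python's built-in `sqlite3` with made-up rows and a hypothetical `toy_clinical` table instead of a real IDC clinical table, and the quoted `"index"` is needed only in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "index" (PatientID TEXT, SeriesInstanceUID TEXT);
    CREATE TABLE toy_clinical (dicom_patient_id TEXT, stage TEXT);
    -- Two imaged patients, only one of whom has a clinical record
    INSERT INTO "index" VALUES ('p1', 's1'), ('p2', 's2');
    INSERT INTO toy_clinical VALUES ('p1', 'IA');
""")

# INNER JOIN keeps only patients present in both tables
inner_rows = conn.execute("""
    SELECT i.PatientID, c.stage
    FROM "index" i
    JOIN toy_clinical c ON i.PatientID = c.dicom_patient_id
    ORDER BY i.PatientID
""").fetchall()

# LEFT JOIN keeps every imaged patient; missing clinical values come back as NULL
left_rows = conn.execute("""
    SELECT i.PatientID, c.stage
    FROM "index" i
    LEFT JOIN toy_clinical c ON i.PatientID = c.dicom_patient_id
    ORDER BY i.PatientID
""").fetchall()
```

Comparing the two row counts is a quick way to see how many imaged patients lack clinical data before committing to a cohort definition.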
+ +## Troubleshooting + +**Issue:** Query returns error "table not found" +- **Cause:** Index not fetched before query +- **Solution:** Call `client.fetch_index("table_name")` before using tables other than the primary `index` + +**Issue:** LIKE pattern not matching expected results +- **Cause:** Case sensitivity or whitespace +- **Solution:** Use `LOWER(column)` for case-insensitive matching, `TRIM()` for whitespace + +**Issue:** JOIN returns fewer rows than expected +- **Cause:** NULL values in join columns or no matching records +- **Solution:** Use `LEFT JOIN` to include rows without matches, check for NULLs with `IS NOT NULL` + +## Resources + +- `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references +- `references/clinical_data_guide.md` for clinical data patterns and value mapping +- `references/digital_pathology_guide.md` for pathology-specific queries +- `references/bigquery_guide.md` for advanced queries requiring full DICOM metadata diff --git a/scientific-skills/imaging-data-commons/references/use_cases.md b/scientific-skills/imaging-data-commons/references/use_cases.md new file mode 100644 index 0000000..51ae985 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/use_cases.md @@ -0,0 +1,186 @@ +# Common Use Cases for IDC + +**Tested with:** idc-index 0.11.9 (IDC data version v23) + +This guide provides complete end-to-end workflow examples for common IDC use cases. Each use case demonstrates the full workflow from query to download with best practices. + +## When to Use This Guide + +Load this guide when you need: +- Complete end-to-end workflow examples for training dataset creation +- Patterns for multi-step data selection and download workflows +- Examples of license-aware data handling for commercial use +- Visualization workflows for data preview before download + +For core API patterns (query, download, visualize, citations), see the "Core Capabilities" section in the main SKILL.md. 
+
+## Prerequisites
+
+```bash
+pip install --upgrade idc-index
+```
+
+## Use Case 1: Find and Download Lung CT Scans for Deep Learning
+
+**Objective:** Build a training dataset of lung CT scans from the NLST collection
+
+**Steps:**
+```python
+from idc_index import IDCClient
+
+client = IDCClient()
+
+# 1. Query for lung CT scans with specific criteria
+query = """
+SELECT
+    PatientID,
+    SeriesInstanceUID,
+    SeriesDescription
+FROM index
+WHERE collection_id = 'nlst'
+  AND Modality = 'CT'
+  AND BodyPartExamined = 'CHEST'
+  AND license_short_name = 'CC BY 4.0'
+ORDER BY PatientID
+LIMIT 100
+"""
+
+results = client.sql_query(query)
+print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
+
+# 2. Download data organized by patient
+client.download_from_selection(
+    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
+    downloadDir="./training_data",
+    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
+)
+
+# 3. Save manifest for reproducibility
+results.to_csv('training_manifest.csv', index=False)
+```
+
+## Use Case 2: Query Brain MRI by Manufacturer for Quality Study
+
+**Objective:** Compare image quality across different MRI scanner manufacturers
+
+**Steps:**
+```python
+from idc_index import IDCClient
+import pandas as pd
+
+client = IDCClient()
+
+# Query for brain MRI grouped by manufacturer.
+# Rows with a missing manufacturer or model are excluded so the values
+# can be interpolated into the per-manufacturer queries below.
+query = """
+SELECT
+    Manufacturer,
+    ManufacturerModelName,
+    COUNT(DISTINCT SeriesInstanceUID) as num_series,
+    COUNT(DISTINCT PatientID) as num_patients
+FROM index
+WHERE Modality = 'MR'
+  AND BodyPartExamined LIKE '%BRAIN%'
+  AND Manufacturer IS NOT NULL
+  AND ManufacturerModelName IS NOT NULL
+GROUP BY Manufacturer, ManufacturerModelName
+HAVING num_series >= 10
+ORDER BY num_series DESC
+"""
+
+manufacturers = client.sql_query(query)
+print(manufacturers)
+
+# Download sample from each manufacturer for comparison
+for _, row in manufacturers.head(3).iterrows():
+    mfr = row['Manufacturer']
+    model = row['ManufacturerModelName']
+
+    # Double any single quotes so the values are valid SQL string literals
+    mfr_sql = mfr.replace("'", "''")
+    model_sql = model.replace("'", "''")
+
+    query = f"""
+    SELECT SeriesInstanceUID
+    FROM index
+    WHERE Manufacturer = '{mfr_sql}'
+      AND ManufacturerModelName = '{model_sql}'
+      AND Modality = 'MR'
+      AND BodyPartExamined LIKE '%BRAIN%'
+    LIMIT 5
+    """
+
+    series = client.sql_query(query)
+    client.download_from_selection(
+        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
+        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
+    )
+```
+
+## Use Case 3: Visualize Series Without Downloading
+
+**Objective:** Preview imaging data before committing to download
+
+```python
+from idc_index import IDCClient
+import webbrowser
+
+client = IDCClient()
+
+series_list = client.sql_query("""
+    SELECT SeriesInstanceUID, PatientID, SeriesDescription
+    FROM index
+    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
+    LIMIT 10
+""")
+
+# Preview each in browser
+for _, row in series_list.iterrows():
+    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
+    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
+    print(f"  View at: {viewer_url}")
+    # webbrowser.open(viewer_url)  # Uncomment to open automatically
+```
+
+For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
+
+## Use Case 4: License-Aware Batch Download for Commercial Use
+
+**Objective:** Download only CC-BY licensed data suitable for commercial applications
+
+**Steps:**
+```python
+from idc_index import IDCClient
+
+client = IDCClient()
+
+# Query ONLY for CC BY licensed data (allows commercial use with attribution)
+query = """
+SELECT
+    SeriesInstanceUID,
+    collection_id,
+    PatientID,
+    Modality,
+    license_short_name
+FROM index
+WHERE license_short_name LIKE 'CC BY%'
+  AND license_short_name NOT LIKE '%NC%'
+  AND Modality IN ('CT', 'MR')
+  AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
+LIMIT 200
+"""
+
+cc_by_data = client.sql_query(query)
+
+print(f"Found {len(cc_by_data)} CC BY licensed series")
+print(f"Collections: {cc_by_data['collection_id'].unique()}")
+print(f"Licenses: {cc_by_data['license_short_name'].unique()}")
+
+# Download the license-filtered selection
+client.download_from_selection(
+    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
+    downloadDir="./commercial_dataset",
+    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
+)
+
+# Save the manifest, including each series' license, for attribution records
+cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
+```
+
+## Resources
+
+- Main SKILL.md for core API patterns (query, download, visualize)
+- `references/clinical_data_guide.md` for clinical data integration workflows
+- `references/sql_patterns.md` for additional SQL query patterns
+- `references/index_tables_guide.md` for complex join patterns