Merge pull request #46 from fedorov/update-idc-v1.3.0

update imaging-data-commons skill to v1.3.1
Timothy Kassis
2026-02-16 10:24:23 -08:00
committed by GitHub
6 changed files with 1214 additions and 436 deletions

@@ -3,9 +3,10 @@ name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
version: 1.2.0
version: 1.3.1
skill-author: Andrey Fedorov, @fedorov
idc-index: "0.11.7"
idc-index: "0.11.9"
idc-data-version: "v23"
repository: https://github.com/ImagingDataCommons/idc-claude-skill
---
@@ -15,20 +16,39 @@ metadata:
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
**Check current data scale for the latest version:**
**CRITICAL - Check package version and upgrade if needed (run this FIRST):**
```python
import idc_index
REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file
def version_tuple(version):
    # Compare numerically; as plain strings "0.11.10" would sort before "0.11.9".
    # Assumes a plain X.Y.Z version string.
    return tuple(int(part) for part in version.split(".")[:3])
installed = idc_index.__version__
if version_tuple(installed) < version_tuple(REQUIRED_VERSION):
    print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
    import subprocess
    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
    print("Upgrade complete. Restart Python to use new version.")
else:
    print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
```
**Verify IDC data version and check current data scale:**
```python
from idc_index import IDCClient
client = IDCClient()
# get IDC data version
print(client.get_idc_version())
# Verify IDC data version (should be "v23")
print(f"IDC data version: {client.get_idc_version()}")
# Get collection count and total series
stats = client.sql_query("""
SELECT
SELECT
COUNT(DISTINCT collection_id) as collections,
COUNT(DISTINCT analysis_result_id) as analysis_results,
COUNT(DISTINCT PatientID) as patients,
@@ -54,6 +74,30 @@ print(stats)
- Checking data licenses before use in research or commercial applications
- Visualizing medical images in a browser without local DICOM viewer software
## Quick Navigation
**Core Sections (inline):**
- IDC Data Model - Collection and analysis result hierarchy
- Index Tables - Available tables and joining patterns
- Installation - Package setup and version verification
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
- Best Practices - Usage guidelines
- Troubleshooting - Common issues and solutions
**Reference Guides (load on demand):**
| Guide | When to Load |
|-------|--------------|
| `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access |
| `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) |
| `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation |
| `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping |
| `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping |
| `dicomweb_guide.md` | DICOMweb endpoints, PACS integration |
| `digital_pathology_guide.md` | Slide microscopy (SM), annotations (ANN), pathology workflows |
| `bigquery_guide.md` | Full DICOM metadata, private elements (requires GCP) |
| `cli_guide.md` | Command-line tools (`idc download`, manifest files) |
## IDC Data Model
IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
@@ -75,6 +119,8 @@ Use `collection_id` to find original imaging data, may include annotations depos
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
**Complete index table documentation:** See https://idc-index.readthedocs.io/en/latest/indices_reference.html for a quick check of available tables and columns without executing any code.
**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
### Available Tables
@@ -89,6 +135,9 @@ The `idc-index` package provides multiple metadata index tables, accessible via
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
| `ann_index` | 1 row = 1 DICOM ANN series | fetch_index() | Microscopy Bulk Simple Annotations series metadata; references annotated image series |
| `ann_group_index` | 1 row = 1 annotation group | fetch_index() | Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm |
| `contrast_index` | 1 row = 1 series with contrast info | fetch_index() | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) |
**Auto** = loaded automatically when `IDCClient()` is instantiated
**fetch_index()** = requires `client.fetch_index("table_name")` to load
@@ -107,140 +156,13 @@ The `idc-index` package provides multiple metadata index tables, accessible via
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
**Example joins:**
```python
from idc_index import IDCClient
client = IDCClient()
# Join index with collections_index to get cancer types
client.fetch_index("collections_index")
result = client.sql_query("""
SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
FROM index i
JOIN collections_index c ON i.collection_id = c.collection_id
WHERE i.Modality = 'MR'
LIMIT 10
""")
# Join index with sm_index for slide microscopy details
client.fetch_index("sm_index")
result = client.sql_query("""
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
FROM index i
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
LIMIT 10
""")
# Join seg_index with index to find segmentations and their source images
client.fetch_index("seg_index")
result = client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
src.collection_id,
src.Modality as source_modality,
src.BodyPartExamined
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE s.AlgorithmType = 'AUTOMATIC'
LIMIT 10
""")
```
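The examples above do not cover `ann_index`. As a rough, self-contained sketch of the join listed in the table (`ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID`), the snippet below uses Python's built-in `sqlite3` with two toy tables in place of the real indices. All UIDs and rows are invented, and the table name `index` is quoted only because SQLite reserves that word. With `idc-index`, the same SELECT would presumably be passed to `client.sql_query()` after `client.fetch_index("ann_index")`.

```python
import sqlite3

# Toy stand-ins for the real tables; every row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "index" (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT);
CREATE TABLE ann_index (SeriesInstanceUID TEXT, referenced_SeriesInstanceUID TEXT);
INSERT INTO "index" VALUES
    ('1.2.3.1', 'toy_collection', 'SM'),
    ('1.2.3.2', 'toy_collection', 'SM'),
    ('1.2.3.9', 'toy_collection', 'ANN');
INSERT INTO ann_index VALUES ('1.2.3.9', '1.2.3.1');
""")

# Link each annotation series to the image series it annotates.
ANN_JOIN = """
SELECT a.SeriesInstanceUID AS ann_series,
       src.SeriesInstanceUID AS source_series,
       src.Modality AS source_modality
FROM ann_index a
JOIN "index" src ON a.referenced_SeriesInstanceUID = src.SeriesInstanceUID
"""
rows = conn.execute(ANN_JOIN).fetchall()
print(rows)
```

Only the annotated slide series appears in the result; the second SM series has no matching row in the toy `ann_index` and is dropped by the inner join.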
### Accessing Index Tables
**Via SQL (recommended for filtering/aggregation):**
```python
from idc_index import IDCClient
client = IDCClient()
# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```
**As pandas DataFrames (direct access):**
```python
# Primary index (always available after client initialization)
df = client.index
# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```
### Discovering Table Schemas (Essential for Query Writing)
The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
```python
from idc_index import IDCClient
client = IDCClient()
# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f" Installed: {info['installed']}")
    print(f" Description: {info['description']}")
# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f" {col['name']} ({col['type']}): {desc}")
# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```
**Alternative: use `get_index_schema()` method:**
```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```
### Key Columns in Primary `index` Table
Most common columns for queries (use `indices_overview` for complete list and descriptions):
| Column | Type | DICOM | Description |
|--------|------|-------|-------------|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, the analysis results collection that the series belongs to |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |
**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
### Clinical Data Access
@@ -301,7 +223,13 @@ pip install --upgrade idc-index
**Important:** Each new IDC data release triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
**Tested with:** idc-index 0.11.7 (IDC data version v23)
**IMPORTANT:** IDC data version v23 is current. Always verify your version:
```python
print(client.get_idc_version()) # Should return "v23"
```
If you see an older version, upgrade with: `pip install --upgrade idc-index`
**Tested with:** idc-index 0.11.9 (IDC data version v23)
**Optional (for data analysis):**
```bash
@@ -484,6 +412,15 @@ client.download_from_selection(
# Results in: ./data/flat/*.dcm
```
**Downloaded file names:**
Individual DICOM files are named using their CRDC instance UUID: `<crdc_instance_uuid>.dcm` (e.g., `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm`). This UUID-based naming:
- Enables version tracking (UUIDs change when file content changes)
- Matches cloud storage organization (`s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm`)
- Differs from DICOM UIDs (SOPInstanceUID) which are preserved inside the file metadata
To identify files, use the `crdc_instance_uuid` column in queries or read DICOM metadata (SOPInstanceUID) from the files.
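As a minimal sketch of the file-name convention above, the helpers below recover the UUID from a `<crdc_instance_uuid>.dcm` path and build a lookup of downloaded files. The function names are hypothetical and not part of `idc-index`; matching the resulting keys against `crdc_instance_uuid` values from a query is left to the caller.

```python
import uuid
from pathlib import Path


def crdc_uuid_from_path(path):
    """Return the UUID encoded in a `<uuid>.dcm` file name, or None if it does not parse."""
    stem = Path(path).stem
    try:
        return str(uuid.UUID(stem))
    except ValueError:
        return None


def index_download_dir(download_dir):
    """Map UUID string -> file path for every UUID-named .dcm file under download_dir."""
    mapping = {}
    for dcm_path in Path(download_dir).rglob("*.dcm"):
        file_uuid = crdc_uuid_from_path(dcm_path)
        if file_uuid is not None:
            mapping[file_uuid] = dcm_path
    return mapping


print(crdc_uuid_from_path("data/flat/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm"))
```

Files whose names are not valid UUIDs are skipped rather than raising, so the lookup also tolerates directories that mix in other `.dcm` files.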
### Command-Line Download
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
@@ -705,6 +642,13 @@ For queries requiring full DICOM metadata, complex JOINs, clinical data tables,
See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
**Before using BigQuery**, always check if a specialized index table already has the metadata you need:
1. Use `client.indices_overview` or the [idc-index indices reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html) to discover all available tables and their columns
2. Fetch the relevant index: `client.fetch_index("table_name")`
3. Query locally with `client.sql_query()` (free, no GCP account needed)
Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_group_index` (microscopy annotations), `sm_index` (slide microscopy), `collections_index` (collection metadata). Only use BigQuery if you need private DICOM elements or attributes not in any index.
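One way to carry out the first step is to scan the schema dictionaries for a column of interest. The sketch below assumes the structure shown elsewhere in this skill (`indices_overview[table]["schema"]["columns"]`, each column a dict with a `name` key); the helper name and the toy dictionary are hypothetical, and with a real client the first argument would be `client.indices_overview`.

```python
def tables_with_column(indices_overview, column_name):
    """Return the names of index tables whose schema lists column_name."""
    matches = []
    for table_name, info in indices_overview.items():
        columns = info.get("schema", {}).get("columns", [])
        if any(col.get("name") == column_name for col in columns):
            matches.append(table_name)
    return matches


# Hypothetical, heavily trimmed stand-in for client.indices_overview
toy_overview = {
    "index": {"schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "Modality"}]}},
    "seg_index": {"schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "AlgorithmName"}]}},
}
print(tables_with_column(toy_overview, "AlgorithmName"))
```

An empty result suggests that no local index carries the attribute, which is the case where BigQuery becomes the fallback.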
### 8. Tool Selection Guide
| Task | Tool | Reference |
@@ -782,166 +726,15 @@ sitk.WriteImage(smoothed, "processed_volume.nii.gz")
## Common Use Cases
### Use Case 1: Find and Download Lung CT Scans for Deep Learning
**Objective:** Build training dataset of lung CT scans from NLST collection
**Steps:**
```python
from idc_index import IDCClient
client = IDCClient()
# 1. Query for lung CT scans with specific criteria
query = """
SELECT
PatientID,
SeriesInstanceUID,
SeriesDescription
FROM index
WHERE collection_id = 'nlst'
AND Modality = 'CT'
AND BodyPartExamined = 'CHEST'
AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""
results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
# 2. Download data organized by patient
client.download_from_selection(
seriesInstanceUID=list(results['SeriesInstanceUID'].values),
downloadDir="./training_data",
dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)
# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```
### Use Case 2: Query Brain MRI by Manufacturer for Quality Study
**Objective:** Compare image quality across different MRI scanner manufacturers
**Steps:**
```python
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
# Query for brain MRI grouped by manufacturer
query = """
SELECT
Manufacturer,
ManufacturerModelName,
COUNT(DISTINCT SeriesInstanceUID) as num_series,
COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""
manufacturers = client.sql_query(query)
print(manufacturers)
# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']
    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
    AND ManufacturerModelName = '{model}'
    AND Modality = 'MR'
    AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """
    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
### Use Case 3: Visualize Series Without Downloading
**Objective:** Preview imaging data before committing to download
```python
from idc_index import IDCClient
import webbrowser
client = IDCClient()
series_list = client.sql_query("""
SELECT SeriesInstanceUID, PatientID, SeriesDescription
FROM index
WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
LIMIT 10
""")
# Preview each in browser
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f" View at: {viewer_url}")
    # webbrowser.open(viewer_url) # Uncomment to open automatically
```
For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
### Use Case 4: License-Aware Batch Download for Commercial Use
**Objective:** Download only CC-BY licensed data suitable for commercial applications
**Steps:**
```python
from idc_index import IDCClient
client = IDCClient()
# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
SeriesInstanceUID,
collection_id,
PatientID,
Modality
FROM index
WHERE license_short_name LIKE 'CC BY%'
AND license_short_name NOT LIKE '%NC%'
AND Modality IN ('CT', 'MR')
AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""
cc_by_data = client.sql_query(query)
print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")
# Download with license verification
client.download_from_selection(
seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
downloadDir="./commercial_dataset",
dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)
# Save license information
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```
See `references/use_cases.md` for complete end-to-end workflow examples including:
- Building deep learning training datasets from lung CT scans
- Comparing image quality across scanner manufacturers
- Previewing data in browser before downloading
- License-aware batch downloads for commercial use
## Best Practices
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
@@ -989,142 +782,14 @@ cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
## Common SQL Query Patterns
Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.
See `references/sql_patterns.md` for quick-reference SQL patterns including:
- Filter value discovery (modalities, body parts, manufacturers)
- Annotation and segmentation queries (including seg_index, ann_index joins)
- Slide microscopy queries (sm_index patterns)
- Download size estimation
- Clinical data linking
### Discover available filter values
```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")
# What body parts for a specific modality?
client.sql_query("""
SELECT DISTINCT BodyPartExamined, COUNT(*) as n
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
GROUP BY BodyPartExamined ORDER BY n DESC
""")
# What manufacturers for MR?
client.sql_query("""
SELECT DISTINCT Manufacturer, COUNT(*) as n
FROM index WHERE Modality = 'MR'
GROUP BY Manufacturer ORDER BY n DESC
""")
```
### Find annotations and segmentations
**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
SELECT collection_id, Modality, COUNT(*) as series_count
FROM index
WHERE Modality IN ('SEG', 'RTSTRUCT')
GROUP BY collection_id, Modality
ORDER BY series_count DESC
""")
# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
FROM index
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
FROM analysis_results_index
""")
# Find analysis results for a specific source collection
client.sql_query("""
SELECT analysis_result_id, analysis_result_title
FROM analysis_results_index
WHERE Collections LIKE '%tcga_luad%'
""")
# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")
# Get segmentation statistics by algorithm
client.sql_query("""
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
FROM seg_index
WHERE AlgorithmName IS NOT NULL
GROUP BY AlgorithmName, AlgorithmType
ORDER BY seg_count DESC
LIMIT 10
""")
# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
s.segmented_SeriesInstanceUID as source_series
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
LIMIT 10
""")
# Find TotalSegmentator results with source image context
client.sql_query("""
SELECT
seg_info.collection_id,
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
SUM(s.total_segments) as total_segments
FROM seg_index s
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
GROUP BY seg_info.collection_id
ORDER BY seg_count DESC
""")
```
### Query slide microscopy data
```python
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
SELECT i.collection_id, COUNT(*) as slides,
MIN(s.min_PixelSpacing_2sf) as min_resolution
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id
ORDER BY slides DESC
""")
```
### Estimate download size
```python
# Size for specific criteria
client.sql_query("""
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
FROM index
WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
### Link to clinical data
```python
client.fetch_index("clinical_index")
# Find collections with clinical data and their tables
client.sql_query("""
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
FROM clinical_index
GROUP BY collection_id, table_name
ORDER BY collection_id
""")
```
See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
For segmentation and annotation details, also see `references/digital_pathology_guide.md`.
## Related Skills
@@ -1134,8 +799,7 @@ The following skills complement IDC workflows for downstream analysis and visual
- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).
### Pathology and Slide Microscopy
- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.
See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).
### Metadata Visualization
- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
@@ -1159,11 +823,8 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col
### Reference Documentation
- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data
- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
- **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries
- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
See the Quick Navigation section at the top for the full list of reference guides with decision triggers.
- **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version)
### External Links

@@ -0,0 +1,324 @@
# Clinical Data Guide for IDC
**Tested with:** idc-index 0.11.7 (IDC data version v23)
Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.
## When to Use This Guide
Use this guide when you need to:
- Find what clinical metadata is available for a collection
- Filter patients by clinical criteria (e.g., cancer stage, treatment history)
- Join clinical attributes with imaging data for cohort selection
- Understand and decode coded values in clinical tables
For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns.
## Prerequisites
```bash
pip install --upgrade idc-index
```
No BigQuery credentials required - clinical data is packaged with `idc-index`.
## Understanding Clinical Data in IDC
### What is Clinical Data?
Clinical data refers to non-imaging information that accompanies medical images:
- Patient demographics (age, sex, race)
- Clinical history (diagnoses, surgeries, therapies)
- Lab tests and pathology results
- Cancer staging (clinical and pathological)
- Treatment outcomes
### Data Organization
Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. IDC parses these into queryable tables accessible via `idc-index`.
**Important characteristics:**
- Clinical data is **not harmonized** across collections (terms and formats vary)
- Not all collections have clinical data (check availability first)
- All data is **anonymized** - `dicom_patient_id` links to imaging
### The clinical_index Table
The `clinical_index` serves as a dictionary/catalog of all available clinical data:
| Column | Purpose | Use For |
|--------|---------|---------|
| `collection_id` | Collection identifier | Filtering by collection |
| `table_name` | Full BigQuery table reference | BigQuery queries (if needed) |
| `short_table_name` | Short name | `get_clinical_table()` method |
| `column` | Column name in table | Selecting data columns |
| `column_label` | Human-readable description | Searching for concepts |
| `values` | Observed attribute values for the column | Interpreting coded values |
### The `values` Column
The `values` column contains an array of observed attribute values for the column defined in the `column` field. Each entry has:
- **option_code**: The actual value observed in that column
- **option_description**: Human-readable description of that value (from data dictionary if available, otherwise `None`)
For ACRIN collections, value descriptions come from provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
**Note:** For columns with >20 unique values, the `values` array is left empty (`[]`) for simplicity.
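A small sketch of how these entries might be turned into a lookup table, covering the two edge cases just described: a description of `None` and an empty `values` array. The helper name and the sample entries are illustrative only (the `999` entry is invented), and falling back to the raw code when no description exists is one possible choice, not something `idc-index` prescribes.

```python
def build_value_mapping(values):
    """Map option_code -> description, falling back to the code itself when none is given."""
    mapping = {}
    for entry in values:
        code = str(entry["option_code"])
        description = entry.get("option_description")
        mapping[code] = description if description is not None else code
    return mapping


sample_values = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
    {"option_code": "999", "option_description": None},  # invented entry without a description
]
print(build_value_mapping(sample_values))
print(build_value_mapping([]))  # columns with >20 unique values carry an empty array
```

An empty mapping signals that the column has to be interpreted from the data itself or from the collection's own documentation.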
## Core Workflow
### Step 1: Fetch Clinical Index
```python
from idc_index import IDCClient
client = IDCClient()
client.fetch_index('clinical_index')
# View available columns
print(client.clinical_index.columns.tolist())
```
### Step 2: Discover Available Clinical Data
```python
# List all collections with clinical data
collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
print(f"{len(collections_with_clinical)} collections have clinical data")
# Find clinical attributes for a specific collection
nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
```
### Step 3: Search for Specific Attributes
```python
# Search by keyword in column_label (case-insensitive)
stage_attrs = client.clinical_index[
    client.clinical_index["column_label"].str.contains("stage", case=False, na=False)
]
stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
```
### Step 4: Load Clinical Table
```python
# Load table using short_table_name
nlst_canc_df = client.get_clinical_table("nlst_canc")
# Examine structure
print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
nlst_canc_df.head()
```
### Step 5: Map Coded Values to Descriptions
Many clinical attributes use coded values. The `values` column in `clinical_index` contains an array of observed values with their descriptions (when available).
```python
# Get the clinical_index rows for NLST
nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
# Get observed values for a specific column
# Filter to the row for 'clinical_stag' and extract the values array
clinical_stag_values = nlst_clinical_columns[
nlst_clinical_columns['column']=='clinical_stag'
]['values'].values[0]
# View the observed values and their descriptions
print(clinical_stag_values)
# Output: array([{'option_code': '.M', 'option_description': 'Missing'},
# {'option_code': '110', 'option_description': 'Stage IA'},
# {'option_code': '120', 'option_description': 'Stage IB'}, ...])
# Create mapping dictionary from codes to descriptions
mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values}
# Apply to DataFrame - convert column to string first for consistent matching
nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict)
```
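When no data dictionary is available, `option_description` can be `None` (see the `hcc_tace_seg` output in Use Case 3), so a plain code-to-description dictionary maps those codes to `None`. The following is a minimal sketch of a more defensive mapping. `build_code_map` and `decode` are hypothetical helper names, not part of `idc-index`, and the sample entries are illustrative rather than a real `values` array.

```python
def build_code_map(values):
    """Map option_code -> option_description, falling back to the code itself
    when no description is available (hypothetical helper)."""
    mapping = {}
    for item in values:
        code = str(item["option_code"])
        description = item.get("option_description")
        mapping[code] = description if description else code
    return mapping


def decode(code, mapping, default="Unknown/Missing"):
    """Look up one raw value; values are compared as strings."""
    return mapping.get(str(code), default)


# Illustrative entries shaped like the `values` array shown above
sample_values = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
    {"option_code": "Cisplatin", "option_description": None},
]

code_map = build_code_map(sample_values)
print(decode(110, code_map))          # integer input is matched as "110"
print(decode("Cisplatin", code_map))  # no description, so the code is returned
print(decode("999", code_map))        # code not present in the dictionary
```

With pandas, the same dictionary can be passed to `Series.map` after `astype(str)`, as in the step above.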
### Step 6: Join with Imaging Data
The `dicom_patient_id` column links clinical data to imaging. It matches the `PatientID` column in the imaging index.
```python
# Pandas merge approach
import pandas as pd
# Get NLST CT imaging data
nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')]
# Join with clinical data
merged = pd.merge(
    nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(),
    nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']],
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
)
```
```python
# SQL join approach
query = """
SELECT
index.PatientID,
index.StudyInstanceUID,
index.Modality,
nlst_canc.clinical_stag
FROM index
JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id
WHERE index.collection_id = 'nlst' AND index.Modality = 'CT'
"""
results = client.sql_query(query)
```
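Because the imaging index has one row per series, a patient-level clinical row is repeated for every matching series in the SQL join, which is why the pandas version drops duplicates first. The sketch below illustrates this with Python's built-in `sqlite3` and made-up rows. It is only a stand-in for the real tables and does not call `idc-index`; the toy table is named `idx` because `index` is a reserved word in SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE idx (PatientID TEXT, StudyInstanceUID TEXT, SeriesInstanceUID TEXT);
CREATE TABLE nlst_canc (dicom_patient_id TEXT, clinical_stag TEXT);
INSERT INTO idx VALUES
  ('P1', 'study-1', 'series-1a'),
  ('P1', 'study-1', 'series-1b'),
  ('P2', 'study-2', 'series-2a');
INSERT INTO nlst_canc VALUES ('P1', '110'), ('P3', '400');
""")

# Series-level join: the single clinical row for P1 is repeated per series
series_rows = con.execute("""
    SELECT idx.SeriesInstanceUID, nlst_canc.clinical_stag
    FROM idx JOIN nlst_canc ON idx.PatientID = nlst_canc.dicom_patient_id
""").fetchall()

# Study-level join: DISTINCT collapses the result to one row per study
study_rows = con.execute("""
    SELECT DISTINCT idx.PatientID, idx.StudyInstanceUID, nlst_canc.clinical_stag
    FROM idx JOIN nlst_canc ON idx.PatientID = nlst_canc.dicom_patient_id
""").fetchall()

print(len(series_rows), "series-level rows;", len(study_rows), "study-level rows")
```

Choosing the level (patient, study, or series) before counting avoids overstating cohort sizes.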
## Common Use Cases
### Use Case 1: Select Patients by Cancer Stage
```python
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
client.fetch_index('clinical_index')
# Load clinical table
nlst_canc = client.get_clinical_table("nlst_canc")
# Select Stage IV patients (code '400'); compare as strings, as in Step 5 of the workflow
stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id']
# Get CT imaging studies for these patients
stage_iv_studies = pd.merge(
    client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')],
    stage_iv_patients,
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
)['StudyInstanceUID'].drop_duplicates()
print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients")
```
### Use Case 2: Find Collections with Specific Clinical Attributes
```python
# Find collections with chemotherapy information
chemo_collections = client.clinical_index[
    client.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)
]["collection_id"].unique()
print(f"Collections with chemotherapy data: {list(chemo_collections)}")
```
### Use Case 3: Examine Observed Values for a Clinical Attribute
```python
# Find what values have been observed for a specific attribute
chemotherapy_rows = client.clinical_index[
    (client.clinical_index["collection_id"] == "hcc_tace_seg") &
    (client.clinical_index["column"] == "chemotherapy")
]
# Get the observed values array
values_list = chemotherapy_rows["values"].tolist()
print(values_list)
# Output: [[{'option_code': 'Cisplastin', 'option_description': None},
# {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]]
```
### Use Case 4: Generate Viewer URLs for Selected Patients
```python
# Reuses `client` and `stage_iv_patients` from Use Case 1
# Get studies for a sample Stage IV patient
sample_patient = stage_iv_patients.iloc[0]
studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique()
# Generate viewer URL
if len(studies) > 0:
    viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0])
    print(viewer_url)
```
## Key Concepts
### column vs column_label
- **column**: Use for selecting data from tables (programmatic access)
- **column_label**: Use for searching/understanding what data means (human-readable)
Some collections (like `c4kc_kits`) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels.
### option_code vs option_description
The `values` array contains observed attribute values:
- **option_code**: The actual value observed in the column (what you filter on)
- **option_description**: Human-readable description (from data dictionary if available, otherwise `None`)
### dicom_patient_id
Every clinical table includes `dicom_patient_id`, which matches the `PatientID` column in the imaging index. This is the key for joining clinical and imaging data.
## Troubleshooting
### Issue: Clinical table not found
**Cause:** Using wrong table name or table doesn't exist for collection
**Solution:** Query clinical_index first to find available tables:
```python
client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique()
```
### Issue: Empty values array
**Cause:** The `values` array is left empty when a column has >20 unique values
**Solution:** Load the clinical table and examine unique values directly:
```python
clinical_df = client.get_clinical_table("table_name")
clinical_df['column_name'].unique()
```
### Issue: Coded values not in mapping
**Cause:** Some values may be missing from the dictionary (e.g., empty strings, special codes like `.M` for missing)
**Solution:** Handle unmapped values gracefully:
```python
df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing')
```
### Issue: No matching patients when joining
**Cause:** Clinical data may include patients without images, or vice versa
**Solution:** Verify patient overlap before joining:
```python
imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique())
clinical_patients = set(clinical_df['dicom_patient_id'].unique())
overlap = imaging_patients & clinical_patients
print(f"Patients with both imaging and clinical data: {len(overlap)}")
```
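The size of the overlap alone does not show which side the unmatched patients come from. A small sketch using plain Python sets can report all three groups; `summarize_overlap` is a hypothetical helper and the patient IDs are invented.

```python
def summarize_overlap(imaging_ids, clinical_ids):
    """Split patient IDs into those present on both sides or on one side only."""
    imaging, clinical = set(imaging_ids), set(clinical_ids)
    return {
        "both": sorted(imaging & clinical),
        "imaging_only": sorted(imaging - clinical),
        "clinical_only": sorted(clinical - imaging),
    }


# Invented IDs; in practice pass the PatientID and dicom_patient_id columns
report = summarize_overlap(["P1", "P2", "P2"], ["P2", "P3"])
for group, ids in report.items():
    print(f"{group}: {len(ids)} patient(s) {ids}")
```

A large `clinical_only` group usually just means those patients have no images in the selected subset, which is expected rather than an error.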
## Resources
**IDC Documentation:**
- [Clinical data organization](https://learn.canceridc.dev/data/organization-of-data/clinical) - How clinical data is organized in IDC
- [Clinical data dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc) - Visual summary of available clinical data
- [idc-index clinical_index documentation](https://idc-index.readthedocs.io/en/latest/column_descriptions.html#clinical-index)
**Related Guides:**
- `bigquery_guide.md` - Advanced clinical queries via BigQuery
- Main SKILL.md - Core IDC workflows
**IDC Tutorials:**
- [clinical_data_intro.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/clinical_data_intro.ipynb)
- [exploring_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/exploring_clinical_data.ipynb)
- [nlst_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/nlst_clinical_data.ipynb)


@@ -0,0 +1,254 @@
# Digital Pathology Guide for IDC
**Tested with:** IDC data version v23, idc-index 0.11.9
For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
## Index Tables for Digital Pathology
Five specialized index tables provide curated metadata without needing BigQuery:
| Table | Row Granularity | Description |
|-------|-----------------|-------------|
| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
| `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
| `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
| `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
| `ann_group_index` | 1 row = 1 annotation group | Annotation group details: `AnnotationGroupLabel`, `GraphicType`, `NumberOfAnnotations`, `AlgorithmName`, property codes |
All require `client.fetch_index("table_name")` before querying. Use `client.indices_overview` to inspect column schemas programmatically.
## Slide Microscopy Queries
### Basic SM metadata
```python
from idc_index import IDCClient
client = IDCClient()
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
SELECT i.collection_id, COUNT(*) as slides,
MIN(s.min_PixelSpacing_2sf) as min_resolution
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id
ORDER BY slides DESC
""")
```
### Find SM series with specific properties
```python
# Find slides scanned with an objective lens power of 40x or higher, finest pixel spacing first
client.fetch_index("sm_index")
client.sql_query("""
SELECT
i.collection_id,
i.PatientID,
s.ObjectiveLensPower,
s.min_PixelSpacing_2sf
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE s.ObjectiveLensPower >= 40
ORDER BY s.min_PixelSpacing_2sf
LIMIT 20
""")
```
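DICOM expresses pixel spacing in millimetres, while pathology resolution is usually quoted in micrometres per pixel. The sketch below converts between the two, assuming that `min_PixelSpacing_2sf` holds a value in millimetres (the column name suggests it is rounded to two significant figures). `mm_to_microns_per_pixel` is a hypothetical helper, and the sample values are invented.

```python
def mm_to_microns_per_pixel(pixel_spacing_mm):
    """Convert a pixel spacing in millimetres to micrometres per pixel."""
    return pixel_spacing_mm * 1000.0


# Invented spacings; smaller values mean higher resolution
for spacing_mm in (0.00025, 0.0005, 0.001):
    microns = mm_to_microns_per_pixel(spacing_mm)
    print(f"{spacing_mm} mm -> {microns:.2f} um/px")
```

Roughly 0.25 um/px is often associated with 40x scans, but the exact relationship depends on the scanner, so treat `ObjectiveLensPower` and pixel spacing as separate filters.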
## Annotation Queries (ANN)
DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
### Basic annotation discovery
```python
# Find annotation series and their referenced images
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
a.SeriesInstanceUID as ann_series,
a.AnnotationCoordinateType,
a.referenced_SeriesInstanceUID as source_series
FROM ann_index a
LIMIT 10
""")
```
### Annotation group statistics
```python
# Get annotation group details (graphic types, counts, algorithms)
client.sql_query("""
SELECT
GraphicType,
SUM(NumberOfAnnotations) as total_annotations,
COUNT(*) as group_count
FROM ann_group_index
GROUP BY GraphicType
ORDER BY total_annotations DESC
""")
```
### Find annotations with source slide context
```python
# Find annotations with their source slide microscopy context
client.sql_query("""
SELECT
i.collection_id,
g.GraphicType,
g.AnnotationPropertyType_CodeMeaning,
g.AlgorithmName,
g.NumberOfAnnotations
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
WHERE g.AlgorithmName IS NOT NULL
LIMIT 10
""")
```
## Segmentations on Slide Microscopy
DICOM Segmentations (Modality = 'SEG') are used for both radiology (e.g., organ segmentations on CT) and pathology (e.g., tissue region segmentations on whole slide images). Use `seg_index.segmented_SeriesInstanceUID` to find the source series, then filter by source Modality to isolate pathology segmentations.
```python
# Find segmentations whose source is a slide microscopy image
client.fetch_index("seg_index")
client.fetch_index("sm_index")
client.sql_query("""
SELECT
seg.SeriesInstanceUID as seg_series,
seg.AlgorithmName,
seg.total_segments,
src.collection_id,
src.Modality as source_modality
FROM seg_index seg
JOIN index src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'SM'
LIMIT 20
""")
```
## Filter by AnnotationGroupLabel
`AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.
### Simple label filtering
```python
# Find annotation groups by label (e.g., groups mentioning "blast")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
g.SeriesInstanceUID,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
g.AlgorithmName
FROM ann_group_index g
WHERE LOWER(g.AnnotationGroupLabel) LIKE '%blast%'
ORDER BY g.NumberOfAnnotations DESC
""")
```
### Label filtering with collection context
```python
# Find annotation groups matching a label within a specific collection
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
i.collection_id,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
g.AnnotationPropertyType_CodeMeaning
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
ORDER BY g.NumberOfAnnotations DESC
""")
```
## Annotations on Slide Microscopy (SM + ANN Cross-Reference)
When looking for annotations related to slide microscopy data, use both SM and ANN tables together. The `ann_index.referenced_SeriesInstanceUID` links each annotation series to its source slide.
```python
# Find slide microscopy images and their annotations in a collection
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
i.collection_id,
s.ObjectiveLensPower,
g.AnnotationGroupLabel,
g.NumberOfAnnotations,
g.GraphicType
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN sm_index s ON a.referenced_SeriesInstanceUID = s.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
ORDER BY g.NumberOfAnnotations DESC
""")
```
## Join Patterns
### SM join (slide microscopy details with collection context)
```python
client.fetch_index("sm_index")
result = client.sql_query("""
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
FROM index i
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
LIMIT 10
""")
```
### ANN join (annotation groups with collection context)
```python
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
result = client.sql_query("""
SELECT
i.collection_id,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
a.referenced_SeriesInstanceUID as source_series
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
LIMIT 10
""")
```
## Related Tools
The following tools work with DICOM format for digital pathology workflows:
**Python Libraries:**
- [highdicom](https://github.com/ImagingDataCommons/highdicom) - High-level DICOM abstractions for Python. Create and read DICOM Segmentations (SEG), Structured Reports (SR), and parametric maps for pathology and radiology. Developed by IDC.
- [wsidicom](https://github.com/imi-bigpicture/wsidicom) - Python package for reading DICOM WSI datasets. Parses metadata into easy-to-use dataclasses for whole slide image analysis.
- [TIA-Toolbox](https://github.com/TissueImageAnalytics/tiatoolbox) - End-to-end computational pathology library with DICOM support via `DICOMWSIReader`. Provides tile extraction, feature extraction, and pretrained deep learning models.
- [EZ-WSI-DICOMweb](https://github.com/GoogleCloudPlatform/EZ-WSI-DICOMweb) - Extract image patches from DICOM whole slide images via DICOMweb. Designed for AI/ML workflows with cloud DICOM stores.
**Viewers:**
- [Slim](https://github.com/ImagingDataCommons/slim) - Web-based DICOM slide microscopy viewer and annotation tool. Supports brightfield and multiplexed immunofluorescence imaging via DICOMweb. Developed by IDC.
- [QuPath](https://qupath.github.io/) - Cross-platform open source software for whole slide image analysis. Supports DICOM WSI via Bio-Formats and OpenSlide (v0.4.0+).
**Conversion:**
- [dicom_wsi](https://github.com/Steven-N-Hart/dicom_wsi) - Python implementation for converting proprietary WSI formats to DICOM-compliant files.


@@ -0,0 +1,146 @@
# Index Tables Guide for IDC
**Tested with:** idc-index 0.11.9 (IDC data version v23)
This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.
**Complete index table documentation:** https://idc-index.readthedocs.io/en/latest/indices_reference.html
## When to Use This Guide
Load this guide when you need to:
- Discover table schemas and column types programmatically
- Access index tables as pandas DataFrames (not via SQL)
- Understand key columns and join relationships between tables
For SQL query examples (filter discovery, finding annotations, size estimation), see `references/sql_patterns.md`.
## Prerequisites
```bash
pip install --upgrade idc-index
```
## Accessing Index Tables
### Via SQL (recommended for filtering/aggregation)
```python
from idc_index import IDCClient
client = IDCClient()
# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```
### As pandas DataFrames (direct access)
```python
# Primary index (always available after client initialization)
df = client.index
# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```
## Discovering Table Schemas
The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
```python
from idc_index import IDCClient
client = IDCClient()
# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f" Installed: {info['installed']}")
    print(f" Description: {info['description']}")
# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f" {col['name']} ({col['type']}): {desc}")
# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in (c.get('description') or '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```
**Alternative: use `get_index_schema()` method:**
```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```
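Since the schema is a plain dictionary, it can be searched by keyword before writing a query. The sketch below assumes the `{'table_description': ..., 'columns': [...]}` shape shown above and runs on a toy schema. `find_columns` is a hypothetical helper, not an `idc-index` function.

```python
def find_columns(schema, keyword):
    """Return (name, type) for columns whose name or description mentions keyword."""
    needle = keyword.lower()
    hits = []
    for col in schema["columns"]:
        # Treat a missing or None description as empty text
        text = f"{col['name']} {col.get('description') or ''}".lower()
        if needle in text:
            hits.append((col["name"], col["type"]))
    return hits


# Toy schema for illustration only; real schemas come from get_index_schema()
toy_schema = {
    "table_description": "Toy schema for illustration only",
    "columns": [
        {"name": "collection_id", "type": "STRING", "description": "IDC collection identifier"},
        {"name": "Modality", "type": "STRING", "description": "DICOM Modality attribute"},
        {"name": "series_size_MB", "type": "FLOAT", "description": None},
    ],
}

print(find_columns(toy_schema, "dicom"))
print(find_columns(toy_schema, "size"))
```

Passing `client.get_index_schema("sm_index")` instead of `toy_schema` should work the same way if the real schema follows this shape.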
## Key Columns Reference
Most common columns in the primary `index` table (use `indices_overview` for complete list and descriptions):
| Column | Type | DICOM | Description |
|--------|------|-------|-------------|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, identifies the analysis results collection that the series belongs to |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, SEG, ANN, RTSTRUCT, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |
**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
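
As an example of applying DICOM format knowledge, the DICOM Age String (AS) value representation is three digits followed by `D`, `W`, `M`, or `Y`. The sketch below converts such a string to approximate years, assuming `PatientAge` values in the index follow that format. `age_string_to_years` is a hypothetical helper, and the month and year lengths are approximations chosen for this illustration.

```python
import re

_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")
# Approximate unit lengths in days (an assumption of this sketch)
_DAYS_PER_UNIT = {"D": 1, "W": 7, "M": 30.4375, "Y": 365.25}


def age_string_to_years(value):
    """Convert a DICOM Age String such as '065Y' to years; None if unparseable."""
    if not isinstance(value, str):
        return None
    match = _AGE_PATTERN.match(value.strip())
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    return number * _DAYS_PER_UNIT[unit] / 365.25


for raw in ("065Y", "018M", "unknown", None):
    print(repr(raw), "->", age_string_to_years(raw))
```

Returning `None` for anything unexpected keeps malformed or missing ages from being silently treated as zero.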
## Join Column Reference
Use this table to identify join columns between index tables. Always call `client.fetch_index("table_name")` before using a table in SQL.
| Table A | Table B | Join Condition |
|---------|---------|----------------|
| `index` | `collections_index` | `index.collection_id = collections_index.collection_id` |
| `index` | `sm_index` | `index.SeriesInstanceUID = sm_index.SeriesInstanceUID` |
| `index` | `seg_index` | `index.SeriesInstanceUID = seg_index.segmented_SeriesInstanceUID` |
| `index` | `ann_index` | `index.SeriesInstanceUID = ann_index.SeriesInstanceUID` |
| `ann_index` | `ann_group_index` | `ann_index.SeriesInstanceUID = ann_group_index.SeriesInstanceUID` |
| `index` | `clinical_index` | `index.collection_id = clinical_index.collection_id` (then filter by patient) |
| `index` | `contrast_index` | `index.SeriesInstanceUID = contrast_index.SeriesInstanceUID` |
For complete query examples using these joins, see `references/sql_patterns.md`.
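
The `seg_index` row in the table joins on `segmented_SeriesInstanceUID`, which attaches each segmentation to the series it segments. Joining on `seg_index.SeriesInstanceUID` instead attaches it to its own entry in `index`. The sketch below contrasts the two using Python's built-in `sqlite3` with invented rows; it is a stand-in only, and the toy table is named `idx` because `index` is reserved in SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE idx (SeriesInstanceUID TEXT, Modality TEXT);
CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT);
INSERT INTO idx VALUES ('ct-1', 'CT'), ('sm-1', 'SM'), ('seg-1', 'SEG'), ('seg-2', 'SEG');
INSERT INTO seg_index VALUES ('seg-1', 'ct-1'), ('seg-2', 'sm-1');
""")

# Join on the segmentation's own UID: describes the SEG series itself
own = con.execute("""
    SELECT s.SeriesInstanceUID, i.Modality
    FROM seg_index s JOIN idx i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    ORDER BY s.SeriesInstanceUID
""").fetchall()

# Join on the segmented UID: describes the source image series
source = con.execute("""
    SELECT s.SeriesInstanceUID, i.Modality
    FROM seg_index s JOIN idx i ON s.segmented_SeriesInstanceUID = i.SeriesInstanceUID
    ORDER BY s.SeriesInstanceUID
""").fetchall()

print("own series:   ", own)
print("source series:", source)
```

Use the first form for attributes of the segmentation (such as its collection or size) and the second to filter by properties of the segmented image.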
## Troubleshooting
**Issue:** Column not found in table
- **Cause:** Column name misspelled or doesn't exist in that table
- **Solution:** Use `client.indices_overview["table_name"]["schema"]["columns"]` to list available columns
**Issue:** DataFrame access returns None
- **Cause:** Index not fetched or property name incorrect
- **Solution:** Fetch first with `client.fetch_index()`, then access via property matching the index name
## Resources
- Complete index table documentation: https://idc-index.readthedocs.io/en/latest/indices_reference.html
- `references/sql_patterns.md` for query examples using these tables
- `references/clinical_data_guide.md` for clinical data workflows
- `references/digital_pathology_guide.md` for pathology-specific indices


@@ -0,0 +1,207 @@
# SQL Query Patterns for IDC
**Tested with:** idc-index 0.11.9 (IDC data version v23)
Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.
## When to Use This Guide
Load this guide when you need quick-reference SQL patterns for:
- Discovering available filter values (modalities, body parts, manufacturers)
- Finding annotations and segmentations across collections
- Querying slide microscopy and annotation data
- Estimating download sizes before download
- Linking imaging data to clinical data
For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`.
## Prerequisites
```bash
pip install --upgrade idc-index
```
```python
from idc_index import IDCClient
client = IDCClient()
```
## Discover Available Filter Values
```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")
# What body parts for a specific modality?
client.sql_query("""
SELECT BodyPartExamined, COUNT(*) as n
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
GROUP BY BodyPartExamined ORDER BY n DESC
""")
# What manufacturers for MR?
client.sql_query("""
SELECT Manufacturer, COUNT(*) as n
FROM index WHERE Modality = 'MR'
GROUP BY Manufacturer ORDER BY n DESC
""")
```
## Find Annotations and Segmentations
**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
SELECT collection_id, Modality, COUNT(*) as series_count
FROM index
WHERE Modality IN ('SEG', 'RTSTRUCT')
GROUP BY collection_id, Modality
ORDER BY series_count DESC
""")
# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
FROM index
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
FROM analysis_results_index
""")
# Find analysis results for a specific source collection
client.sql_query("""
SELECT analysis_result_id, analysis_result_title
FROM analysis_results_index
WHERE Collections LIKE '%tcga_luad%'
""")
# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")
# Get segmentation statistics by algorithm
client.sql_query("""
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
FROM seg_index
WHERE AlgorithmName IS NOT NULL
GROUP BY AlgorithmName, AlgorithmType
ORDER BY seg_count DESC
LIMIT 10
""")
# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
s.segmented_SeriesInstanceUID as source_series
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
LIMIT 10
""")
# Count TotalSegmentator segmentations per collection (joins on the SEG series' own entry in index)
client.sql_query("""
SELECT
seg_info.collection_id,
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
SUM(s.total_segments) as total_segments
FROM seg_index s
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
GROUP BY seg_info.collection_id
ORDER BY seg_count DESC
""")
# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE g.AlgorithmName IS NOT NULL
LIMIT 10
""")
# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more
```
## Query Slide Microscopy and Annotation Data
Use `sm_index` for slide microscopy metadata and `ann_index`/`ann_group_index` for annotations on slides (DICOM ANN objects). Filter annotation groups by `AnnotationGroupLabel` to find annotations by name.
```python
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
# Example: find annotation groups by label within a collection
client.sql_query("""
SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
FROM ann_group_index g
JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
""")
```
See `references/digital_pathology_guide.md` for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.
## Estimate Download Size
```python
# Size for specific criteria
client.sql_query("""
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
FROM index
WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
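The `total_mb` figure can be turned into a rough size and transfer-time estimate before starting a download. The sketch below uses decimal units (1 GB = 1000 MB) and a user-supplied bandwidth; both are assumptions of this example, as is the helper name `describe_download`, and protocol overhead is ignored.

```python
def describe_download(total_mb, bandwidth_mbit_per_s=100.0):
    """Return (size in GB, estimated minutes) for a download of total_mb megabytes."""
    size_gb = total_mb / 1000.0
    # 1 megabyte = 8 megabits
    seconds = (total_mb * 8.0) / bandwidth_mbit_per_s
    return size_gb, seconds / 60.0


# Invented numbers: 25,000 MB over a 200 Mbit/s connection
size_gb, minutes = describe_download(total_mb=25_000, bandwidth_mbit_per_s=200)
print(f"{size_gb:.1f} GB, roughly {minutes:.0f} min at 200 Mbit/s")
```

Comparing the estimate with free disk space first is a cheap way to avoid a failed bulk download.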
## Link to Clinical Data
```python
client.fetch_index("clinical_index")
# Find collections with clinical data and their tables
client.sql_query("""
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as n_columns
FROM clinical_index
GROUP BY collection_id, table_name
ORDER BY collection_id
""")
```
See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
## Troubleshooting
**Issue:** Query returns error "table not found"
- **Cause:** Index not fetched before query
- **Solution:** Call `client.fetch_index("table_name")` before using tables other than the primary `index`
**Issue:** LIKE pattern not matching expected results
- **Cause:** Case sensitivity or whitespace
- **Solution:** Use `LOWER(column)` for case-insensitive matching, `TRIM()` for whitespace
**Issue:** JOIN returns fewer rows than expected
- **Cause:** NULL values in join columns or no matching records
- **Solution:** Use `LEFT JOIN` to include rows without matches, check for NULLs with `IS NOT NULL`
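
The difference between the two join types can be seen on a toy example. The sketch below uses Python's built-in `sqlite3` with invented rows as a stand-in for the index tables (the toy table is named `idx` because `index` is reserved in SQLite); it does not call `idc-index`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE idx (SeriesInstanceUID TEXT, Modality TEXT);
CREATE TABLE sm_index (SeriesInstanceUID TEXT, ObjectiveLensPower REAL);
INSERT INTO idx VALUES ('s1', 'SM'), ('s2', 'SM'), ('s3', 'CT');
INSERT INTO sm_index VALUES ('s1', 40.0);
""")

# INNER JOIN keeps only series present in both tables
inner_rows = con.execute("""
    SELECT i.SeriesInstanceUID, s.ObjectiveLensPower
    FROM idx i JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
""").fetchall()

# LEFT JOIN keeps every idx row; unmatched rows get NULL (None in Python)
left_rows = con.execute("""
    SELECT i.SeriesInstanceUID, s.ObjectiveLensPower
    FROM idx i LEFT JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    ORDER BY i.SeriesInstanceUID
""").fetchall()

unmatched = [uid for uid, power in left_rows if power is None]
print(len(inner_rows), len(left_rows), unmatched)
```

Listing the unmatched keys this way shows whether rows are missing because of the join column or because the second table simply does not cover them.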
## Resources
- `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references
- `references/clinical_data_guide.md` for clinical data patterns and value mapping
- `references/digital_pathology_guide.md` for pathology-specific queries
- `references/bigquery_guide.md` for advanced queries requiring full DICOM metadata


@@ -0,0 +1,186 @@
# Common Use Cases for IDC
**Tested with:** idc-index 0.11.9 (IDC data version v23)
This guide provides complete end-to-end workflow examples for common IDC use cases. Each use case demonstrates the full workflow from query to download with best practices.
## When to Use This Guide
Load this guide when you need:
- Complete end-to-end workflow examples for training dataset creation
- Patterns for multi-step data selection and download workflows
- Examples of license-aware data handling for commercial use
- Visualization workflows for data preview before download
For core API patterns (query, download, visualize, citations), see the "Core Capabilities" section in the main SKILL.md.
## Prerequisites
```bash
pip install --upgrade idc-index
```
## Use Case 1: Find and Download Lung CT Scans for Deep Learning
**Objective:** Build a training dataset of lung CT scans from the NLST collection
**Steps:**
```python
from idc_index import IDCClient
client = IDCClient()
# 1. Query for lung CT scans with specific criteria
query = """
SELECT
PatientID,
SeriesInstanceUID,
SeriesDescription
FROM index
WHERE collection_id = 'nlst'
AND Modality = 'CT'
AND BodyPartExamined = 'CHEST'
AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""
results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
# 2. Download data organized by patient
client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)
# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```
## Use Case 2: Query Brain MRI by Manufacturer for Quality Study
**Objective:** Compare image quality across different MRI scanner manufacturers
**Steps:**
```python
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
# Query for brain MRI grouped by manufacturer
query = """
SELECT
Manufacturer,
ManufacturerModelName,
COUNT(DISTINCT SeriesInstanceUID) as num_series,
COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""
manufacturers = client.sql_query(query)
print(manufacturers)
# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']
    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
    AND ManufacturerModelName = '{model}'
    AND Modality = 'MR'
    AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """
    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
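The loop above interpolates manufacturer and model names directly into the SQL text, which breaks if a value contains a single quote. One common workaround is to double embedded quotes before interpolation. The sketch below shows that approach with two hypothetical helpers, `sql_literal` and `build_series_query`; the manufacturer name is invented, and this is meant for trusted metadata values rather than as a general defense against SQL injection.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal by doubling single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_series_query(manufacturer, model, limit=5):
    """Build the per-manufacturer query with safely quoted literals."""
    return (
        "SELECT SeriesInstanceUID FROM index "
        f"WHERE Manufacturer = {sql_literal(manufacturer)} "
        f"AND ManufacturerModelName = {sql_literal(model)} "
        "AND Modality = 'MR' "
        f"LIMIT {int(limit)}"
    )


# Invented manufacturer name containing an apostrophe
query_text = build_series_query("O'Brien Imaging", "Model X")
print(query_text)
```

The resulting string can be passed to `client.sql_query()` in place of the f-string query.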
## Use Case 3: Visualize Series Without Downloading
**Objective:** Preview imaging data before committing to download
```python
from idc_index import IDCClient
import webbrowser
client = IDCClient()
series_list = client.sql_query("""
SELECT SeriesInstanceUID, PatientID, SeriesDescription
FROM index
WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
LIMIT 10
""")
# Print a viewer link for each series
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
```
For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
## Use Case 4: License-Aware Batch Download for Commercial Use
**Objective:** Download only CC-BY licensed data suitable for commercial applications
**Steps:**
```python
from idc_index import IDCClient
client = IDCClient()
# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
SeriesInstanceUID,
collection_id,
PatientID,
Modality,
license_short_name
FROM index
WHERE license_short_name LIKE 'CC BY%'
AND license_short_name NOT LIKE '%NC%'
AND Modality IN ('CT', 'MR')
AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""
cc_by_data = client.sql_query(query)
print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")
# Download the license-filtered series (the SQL filter above is the only license check)
client.download_from_selection(
seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
downloadDir="./commercial_dataset",
dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)
# Save license information
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```
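Because the SQL filter is the only safeguard in this workflow, it can be worth re-checking license names in Python before redistributing a dataset. The sketch below mirrors the `LIKE 'CC BY%'` / `NOT LIKE '%NC%'` logic of the query; the function names are hypothetical, and it assumes license names follow the `CC BY x.y` / `CC BY-NC x.y` pattern implied by that filter.

```python
def is_commercial_friendly(license_short_name):
    """True for names starting with 'CC BY' that lack the NonCommercial (NC) clause."""
    name = (license_short_name or "").strip().upper()
    return name.startswith("CC BY") and "NC" not in name


def non_commercial_licenses(license_names):
    """Return the sorted, de-duplicated license names that fail the check."""
    return sorted({str(n) for n in license_names if not is_commercial_friendly(n)})
```

When the query selects `license_short_name`, `non_commercial_licenses(cc_by_data['license_short_name'])` should return an empty list; anything else indicates the filter let an unexpected license through.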
## Resources
- Main SKILL.md for core API patterns (query, download, visualize)
- `references/clinical_data_guide.md` for clinical data integration workflows
- `references/sql_patterns.md` for additional SQL query patterns
- `references/index_tables_guide.md` for complex join patterns