mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Merge pull request #46 from fedorov/update-idc-v1.3.0
update imaging-data-commons skill to v1.3.1
@@ -3,9 +3,10 @@ name: imaging-data-commons
|
|||||||
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
|
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
|
||||||
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
|
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
|
||||||
metadata:
|
metadata:
|
||||||
version: 1.2.0
|
version: 1.3.1
|
||||||
skill-author: Andrey Fedorov, @fedorov
|
skill-author: Andrey Fedorov, @fedorov
|
||||||
idc-index: "0.11.7"
|
idc-index: "0.11.9"
|
||||||
|
idc-data-version: "v23"
|
||||||
repository: https://github.com/ImagingDataCommons/idc-claude-skill
|
repository: https://github.com/ImagingDataCommons/idc-claude-skill
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -15,20 +16,39 @@ metadata:
|
|||||||
|
|
||||||
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
|
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
|
||||||
|
|
||||||
|
**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
|
||||||
|
|
||||||
**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
|
**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
|
||||||
|
|
||||||
**Check current data scale for the latest version:**
|
**CRITICAL - Check package version and upgrade if needed (run this FIRST):**
|
||||||
|
|
||||||
|
```python
|
||||||
|
import idc_index
|
||||||
|
|
||||||
|
REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file
|
||||||
|
installed = idc_index.__version__
|
||||||
|
|
||||||
|
if installed < REQUIRED_VERSION:
|
||||||
|
print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
|
||||||
|
import subprocess
|
||||||
|
subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
|
||||||
|
print("Upgrade complete. Restart Python to use new version.")
|
||||||
|
else:
|
||||||
|
print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
|
||||||
|
```
|
||||||
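A note on the version check added in this hunk: `<` on two version strings compares them character by character, so `"0.11.10"` sorts before `"0.11.9"` and a newer install could be reported as outdated. The sketch below shows one way to compare the numeric release components instead. The helper names `version_tuple` and `needs_upgrade` are hypothetical and not part of `idc-index`; pre-release suffixes are simply ignored here, and `packaging.version.Version` is the more complete option when the `packaging` package is available.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading numeric release components, e.g. '0.11.9' -> (0, 11, 9)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_upgrade(installed: str, required: str) -> bool:
    """True when the installed release is numerically older than the required one."""
    return version_tuple(installed) < version_tuple(required)


print(needs_upgrade("0.11.7", "0.11.9"))   # True: 7 < 9
print(needs_upgrade("0.11.10", "0.11.9"))  # False: 10 > 9 when compared as integers
print("0.11.10" < "0.11.9")                # True: the plain string comparison gets this wrong
```

In the skill's snippet, the condition would then read `if needs_upgrade(idc_index.__version__, REQUIRED_VERSION):`, assuming the installed version string starts with a dotted numeric release.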
|
|
||||||
|
**Verify IDC data version and check current data scale:**
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from idc_index import IDCClient
|
from idc_index import IDCClient
|
||||||
client = IDCClient()
|
client = IDCClient()
|
||||||
|
|
||||||
# get IDC data version
|
# Verify IDC data version (should be "v23")
|
||||||
print(client.get_idc_version())
|
print(f"IDC data version: {client.get_idc_version()}")
|
||||||
|
|
||||||
# Get collection count and total series
|
# Get collection count and total series
|
||||||
stats = client.sql_query("""
|
stats = client.sql_query("""
|
||||||
SELECT
|
SELECT
|
||||||
COUNT(DISTINCT collection_id) as collections,
|
COUNT(DISTINCT collection_id) as collections,
|
||||||
COUNT(DISTINCT analysis_result_id) as analysis_results,
|
COUNT(DISTINCT analysis_result_id) as analysis_results,
|
||||||
COUNT(DISTINCT PatientID) as patients,
|
COUNT(DISTINCT PatientID) as patients,
|
||||||
@@ -54,6 +74,30 @@ print(stats)
|
|||||||
- Checking data licenses before use in research or commercial applications
|
- Checking data licenses before use in research or commercial applications
|
||||||
- Visualizing medical images in a browser without local DICOM viewer software
|
- Visualizing medical images in a browser without local DICOM viewer software
|
||||||
|
|
||||||
|
## Quick Navigation
|
||||||
|
|
||||||
|
**Core Sections (inline):**
|
||||||
|
- IDC Data Model - Collection and analysis result hierarchy
|
||||||
|
- Index Tables - Available tables and joining patterns
|
||||||
|
- Installation - Package setup and version verification
|
||||||
|
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
|
||||||
|
- Best Practices - Usage guidelines
|
||||||
|
- Troubleshooting - Common issues and solutions
|
||||||
|
|
||||||
|
**Reference Guides (load on demand):**
|
||||||
|
|
||||||
|
| Guide | When to Load |
|
||||||
|
|-------|--------------|
|
||||||
|
| `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access |
|
||||||
|
| `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) |
|
||||||
|
| `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation |
|
||||||
|
| `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping |
|
||||||
|
| `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping |
|
||||||
|
| `dicomweb_guide.md` | DICOMweb endpoints, PACS integration |
|
||||||
|
| `digital_pathology_guide.md` | Slide microscopy (SM), annotations (ANN), pathology workflows |
|
||||||
|
| `bigquery_guide.md` | Full DICOM metadata, private elements (requires GCP) |
|
||||||
|
| `cli_guide.md` | Command-line tools (`idc download`, manifest files) |
|
||||||
|
|
||||||
## IDC Data Model
|
## IDC Data Model
|
||||||
|
|
||||||
IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
|
IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
|
||||||
@@ -75,6 +119,8 @@ Use `collection_id` to find original imaging data, may include annotations depos
|
|||||||
|
|
||||||
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
|
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
|
||||||
|
|
||||||
|
**Complete index table documentation:** See https://idc-index.readthedocs.io/en/latest/indices_reference.html for a quick check of available tables and columns without executing any code.
|
||||||
|
|
||||||
**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
|
**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
|
||||||
|
|
||||||
### Available Tables
|
### Available Tables
|
||||||
@@ -89,6 +135,9 @@ The `idc-index` package provides multiple metadata index tables, accessible via
|
|||||||
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
|
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
|
||||||
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
|
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
|
||||||
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
|
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
|
||||||
|
| `ann_index` | 1 row = 1 DICOM ANN series | fetch_index() | Microscopy Bulk Simple Annotations series metadata; references annotated image series |
|
||||||
|
| `ann_group_index` | 1 row = 1 annotation group | fetch_index() | Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm |
|
||||||
|
| `contrast_index` | 1 row = 1 series with contrast info | fetch_index() | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) |
|
||||||
|
|
||||||
**Auto** = loaded automatically when `IDCClient()` is instantiated
|
**Auto** = loaded automatically when `IDCClient()` is instantiated
|
||||||
**fetch_index()** = requires `client.fetch_index("table_name")` to load
|
**fetch_index()** = requires `client.fetch_index("table_name")` to load
|
||||||
@@ -107,140 +156,13 @@ The `idc-index` package provides multiple metadata index tables, accessible via
|
|||||||
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
|
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
|
||||||
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
|
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
|
||||||
| `Modality` | index, prior_versions_index | Filter by imaging modality |
|
| `Modality` | index, prior_versions_index | Filter by imaging modality |
|
||||||
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
|
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata |
|
||||||
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
|
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
|
||||||
|
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
|
||||||
|
|
||||||
**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
|
**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
|
||||||
|
|
||||||
**Example joins:**
|
For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# Join index with collections_index to get cancer types
|
|
||||||
client.fetch_index("collections_index")
|
|
||||||
result = client.sql_query("""
|
|
||||||
SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
|
|
||||||
FROM index i
|
|
||||||
JOIN collections_index c ON i.collection_id = c.collection_id
|
|
||||||
WHERE i.Modality = 'MR'
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Join index with sm_index for slide microscopy details
|
|
||||||
client.fetch_index("sm_index")
|
|
||||||
result = client.sql_query("""
|
|
||||||
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
|
|
||||||
FROM index i
|
|
||||||
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Join seg_index with index to find segmentations and their source images
|
|
||||||
client.fetch_index("seg_index")
|
|
||||||
result = client.sql_query("""
|
|
||||||
SELECT
|
|
||||||
s.SeriesInstanceUID as seg_series,
|
|
||||||
s.AlgorithmName,
|
|
||||||
s.total_segments,
|
|
||||||
src.collection_id,
|
|
||||||
src.Modality as source_modality,
|
|
||||||
src.BodyPartExamined
|
|
||||||
FROM seg_index s
|
|
||||||
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
|
||||||
WHERE s.AlgorithmType = 'AUTOMATIC'
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Accessing Index Tables
|
|
||||||
|
|
||||||
**Via SQL (recommended for filtering/aggregation):**
|
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# Query the primary index (always available)
|
|
||||||
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
|
|
||||||
|
|
||||||
# Fetch and query additional indices
|
|
||||||
client.fetch_index("collections_index")
|
|
||||||
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
|
|
||||||
|
|
||||||
client.fetch_index("analysis_results_index")
|
|
||||||
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
|
|
||||||
```
|
|
||||||
|
|
||||||
**As pandas DataFrames (direct access):**
|
|
||||||
```python
|
|
||||||
# Primary index (always available after client initialization)
|
|
||||||
df = client.index
|
|
||||||
|
|
||||||
# Fetch and access on-demand indices
|
|
||||||
client.fetch_index("sm_index")
|
|
||||||
sm_df = client.sm_index
|
|
||||||
```
|
|
||||||
|
|
||||||
### Discovering Table Schemas (Essential for Query Writing)
|
|
||||||
|
|
||||||
The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
|
|
||||||
|
|
||||||
**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
|
|
||||||
|
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# List all available indices with descriptions
|
|
||||||
for name, info in client.indices_overview.items():
|
|
||||||
print(f"\n{name}:")
|
|
||||||
print(f" Installed: {info['installed']}")
|
|
||||||
print(f" Description: {info['description']}")
|
|
||||||
|
|
||||||
# Get complete schema for a specific index (columns, types, descriptions)
|
|
||||||
schema = client.indices_overview["index"]["schema"]
|
|
||||||
print(f"\nTable: {schema['table_description']}")
|
|
||||||
print("\nColumns:")
|
|
||||||
for col in schema['columns']:
|
|
||||||
desc = col.get('description', 'No description')
|
|
||||||
# Description indicates if column is from DICOM attribute
|
|
||||||
print(f" {col['name']} ({col['type']}): {desc}")
|
|
||||||
|
|
||||||
# Find columns that are DICOM attributes (check description for "DICOM" reference)
|
|
||||||
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
|
|
||||||
print(f"\nDICOM-sourced columns: {dicom_cols}")
|
|
||||||
```
|
|
||||||
|
|
||||||
**Alternative: use `get_index_schema()` method:**
|
|
||||||
```python
|
|
||||||
schema = client.get_index_schema("index")
|
|
||||||
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Key Columns in Primary `index` Table
|
|
||||||
|
|
||||||
Most common columns for queries (use `indices_overview` for complete list and descriptions):
|
|
||||||
|
|
||||||
| Column | Type | DICOM | Description |
|
|
||||||
|--------|------|-------|-------------|
|
|
||||||
| `collection_id` | STRING | No | IDC collection identifier |
|
|
||||||
| `analysis_result_id` | STRING | No | If applicable, indicates what analysis results collection given series is part of |
|
|
||||||
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
|
|
||||||
| `PatientID` | STRING | Yes | Patient identifier |
|
|
||||||
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
|
|
||||||
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
|
|
||||||
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
|
|
||||||
| `BodyPartExamined` | STRING | Yes | Anatomical region |
|
|
||||||
| `SeriesDescription` | STRING | Yes | Description of the series |
|
|
||||||
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
|
|
||||||
| `StudyDate` | STRING | Yes | Date study was performed |
|
|
||||||
| `PatientSex` | STRING | Yes | Patient sex |
|
|
||||||
| `PatientAge` | STRING | Yes | Patient age at time of study |
|
|
||||||
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
|
|
||||||
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
|
|
||||||
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |
|
|
||||||
|
|
||||||
**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
|
|
||||||
|
|
||||||
### Clinical Data Access
|
### Clinical Data Access
|
||||||
|
|
||||||
@@ -301,7 +223,13 @@ pip install --upgrade idc-index
|
|||||||
|
|
||||||
**Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
|
**Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
|
||||||
|
|
||||||
**Tested with:** idc-index 0.11.7 (IDC data version v23)
|
**IMPORTANT:** IDC data version v23 is current. Always verify your version:
|
||||||
|
```python
|
||||||
|
print(client.get_idc_version()) # Should return "v23"
|
||||||
|
```
|
||||||
|
If you see an older version, upgrade with: `pip install --upgrade idc-index`
|
||||||
|
|
||||||
|
**Tested with:** idc-index 0.11.9 (IDC data version v23)
|
||||||
|
|
||||||
**Optional (for data analysis):**
|
**Optional (for data analysis):**
|
||||||
```bash
|
```bash
|
||||||
@@ -484,6 +412,15 @@ client.download_from_selection(
|
|||||||
# Results in: ./data/flat/*.dcm
|
# Results in: ./data/flat/*.dcm
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Downloaded file names:**
|
||||||
|
|
||||||
|
Individual DICOM files are named using their CRDC instance UUID: `<crdc_instance_uuid>.dcm` (e.g., `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm`). This UUID-based naming:
|
||||||
|
- Enables version tracking (UUIDs change when file content changes)
|
||||||
|
- Matches cloud storage organization (`s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm`)
|
||||||
|
- Differs from DICOM UIDs (SOPInstanceUID) which are preserved inside the file metadata
|
||||||
|
|
||||||
|
To identify files, use the `crdc_instance_uuid` column in queries or read DICOM metadata (SOPInstanceUID) from the files.
|
||||||
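Because downloaded files are named `<crdc_instance_uuid>.dcm`, the UUID can be recovered from the file name alone, without opening the DICOM file. The following is a minimal sketch using only the standard library. The function names `instance_uuid_from_path` and `index_download_dir` are hypothetical, and the sketch assumes nothing beyond the naming convention described above.

```python
import uuid
from pathlib import Path


def instance_uuid_from_path(path):
    """Return the UUID encoded in a '<uuid>.dcm' file name, or None if the name does not fit."""
    p = Path(path)
    if p.suffix.lower() != ".dcm":
        return None
    try:
        # uuid.UUID validates the string and normalizes it to lowercase hyphenated form
        return str(uuid.UUID(p.stem))
    except ValueError:
        return None


def index_download_dir(download_dir):
    """Map crdc_instance_uuid -> file path for every UUID-named .dcm file under download_dir."""
    mapping = {}
    for path in Path(download_dir).rglob("*.dcm"):
        instance_uuid = instance_uuid_from_path(path)
        if instance_uuid is not None:
            mapping[instance_uuid] = path
    return mapping


print(instance_uuid_from_path("data/flat/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm"))
```

The resulting keys can be matched against `crdc_instance_uuid` values returned by a query. Mapping a file to its SOPInstanceUID still requires reading the DICOM metadata with a DICOM library.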
|
|
||||||
### Command-Line Download
|
### Command-Line Download
|
||||||
|
|
||||||
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
|
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
|
||||||
@@ -705,6 +642,13 @@ For queries requiring full DICOM metadata, complex JOINs, clinical data tables,
|
|||||||
|
|
||||||
See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
|
See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
|
||||||
|
|
||||||
|
**Before using BigQuery**, always check if a specialized index table already has the metadata you need:
|
||||||
|
1. Use `client.indices_overview` or the [idc-index indices reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html) to discover all available tables and their columns
|
||||||
|
2. Fetch the relevant index: `client.fetch_index("table_name")`
|
||||||
|
3. Query locally with `client.sql_query()` (free, no GCP account needed)
|
||||||
|
|
||||||
|
Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_group_index` (microscopy annotations), `sm_index` (slide microscopy), `collections_index` (collection metadata). Only use BigQuery if you need private DICOM elements or attributes not in any index.
|
||||||
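Step 1 of the checklist above can be sketched as a small lookup over `client.indices_overview`. This assumes the dictionary layout used in the schema-discovery example elsewhere in this skill (each entry has a `schema` key holding a `columns` list of dicts with a `name` field); whether every table exposes its schema before being fetched is an assumption. The helper name `tables_with_column` is hypothetical, and the sample dictionary is a hand-written stand-in, not real IDC metadata.

```python
def tables_with_column(indices_overview, column_name):
    """Return the names of index tables whose schema lists column_name."""
    matches = []
    for table_name, info in indices_overview.items():
        # Tolerate entries without a schema rather than raising KeyError
        columns = (info.get("schema") or {}).get("columns", [])
        if any(col.get("name") == column_name for col in columns):
            matches.append(table_name)
    return matches


# Hand-written stand-in for client.indices_overview (illustrative values only)
example_overview = {
    "index": {
        "installed": True,
        "schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "Modality"}]},
    },
    "seg_index": {
        "installed": False,
        "schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "AlgorithmName"}]},
    },
}

print(tables_with_column(example_overview, "AlgorithmName"))      # ['seg_index']
print(tables_with_column(example_overview, "SeriesInstanceUID"))  # ['index', 'seg_index']
```

With a real client, the call would be `tables_with_column(client.indices_overview, "AlgorithmName")`, followed by `client.fetch_index(...)` for any matching table that is not yet installed. BigQuery is only needed when the list comes back empty.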
|
|
||||||
### 8. Tool Selection Guide
|
### 8. Tool Selection Guide
|
||||||
|
|
||||||
| Task | Tool | Reference |
|
| Task | Tool | Reference |
|
||||||
@@ -782,166 +726,15 @@ sitk.WriteImage(smoothed, "processed_volume.nii.gz")
|
|||||||
|
|
||||||
## Common Use Cases
|
## Common Use Cases
|
||||||
|
|
||||||
### Use Case 1: Find and Download Lung CT Scans for Deep Learning
|
See `references/use_cases.md` for complete end-to-end workflow examples including:
|
||||||
|
- Building deep learning training datasets from lung CT scans
|
||||||
**Objective:** Build training dataset of lung CT scans from NLST collection
|
- Comparing image quality across scanner manufacturers
|
||||||
|
- Previewing data in browser before downloading
|
||||||
**Steps:**
|
- License-aware batch downloads for commercial use
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# 1. Query for lung CT scans with specific criteria
|
|
||||||
query = """
|
|
||||||
SELECT
|
|
||||||
PatientID,
|
|
||||||
SeriesInstanceUID,
|
|
||||||
SeriesDescription
|
|
||||||
FROM index
|
|
||||||
WHERE collection_id = 'nlst'
|
|
||||||
AND Modality = 'CT'
|
|
||||||
AND BodyPartExamined = 'CHEST'
|
|
||||||
AND license_short_name = 'CC BY 4.0'
|
|
||||||
ORDER BY PatientID
|
|
||||||
LIMIT 100
|
|
||||||
"""
|
|
||||||
|
|
||||||
results = client.sql_query(query)
|
|
||||||
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
|
|
||||||
|
|
||||||
# 2. Download data organized by patient
|
|
||||||
client.download_from_selection(
|
|
||||||
seriesInstanceUID=list(results['SeriesInstanceUID'].values),
|
|
||||||
downloadDir="./training_data",
|
|
||||||
dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
|
|
||||||
)
|
|
||||||
|
|
||||||
# 3. Save manifest for reproducibility
|
|
||||||
results.to_csv('training_manifest.csv', index=False)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Use Case 2: Query Brain MRI by Manufacturer for Quality Study
|
|
||||||
|
|
||||||
**Objective:** Compare image quality across different MRI scanner manufacturers
|
|
||||||
|
|
||||||
**Steps:**
|
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# Query for brain MRI grouped by manufacturer
|
|
||||||
query = """
|
|
||||||
SELECT
|
|
||||||
Manufacturer,
|
|
||||||
ManufacturerModelName,
|
|
||||||
COUNT(DISTINCT SeriesInstanceUID) as num_series,
|
|
||||||
COUNT(DISTINCT PatientID) as num_patients
|
|
||||||
FROM index
|
|
||||||
WHERE Modality = 'MR'
|
|
||||||
AND BodyPartExamined LIKE '%BRAIN%'
|
|
||||||
GROUP BY Manufacturer, ManufacturerModelName
|
|
||||||
HAVING num_series >= 10
|
|
||||||
ORDER BY num_series DESC
|
|
||||||
"""
|
|
||||||
|
|
||||||
manufacturers = client.sql_query(query)
|
|
||||||
print(manufacturers)
|
|
||||||
|
|
||||||
# Download sample from each manufacturer for comparison
|
|
||||||
for _, row in manufacturers.head(3).iterrows():
|
|
||||||
mfr = row['Manufacturer']
|
|
||||||
model = row['ManufacturerModelName']
|
|
||||||
|
|
||||||
query = f"""
|
|
||||||
SELECT SeriesInstanceUID
|
|
||||||
FROM index
|
|
||||||
WHERE Manufacturer = '{mfr}'
|
|
||||||
AND ManufacturerModelName = '{model}'
|
|
||||||
AND Modality = 'MR'
|
|
||||||
AND BodyPartExamined LIKE '%BRAIN%'
|
|
||||||
LIMIT 5
|
|
||||||
"""
|
|
||||||
|
|
||||||
series = client.sql_query(query)
|
|
||||||
client.download_from_selection(
|
|
||||||
seriesInstanceUID=list(series['SeriesInstanceUID'].values),
|
|
||||||
downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Use Case 3: Visualize Series Without Downloading
|
|
||||||
|
|
||||||
**Objective:** Preview imaging data before committing to download
|
|
||||||
|
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
import webbrowser
|
|
||||||
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
series_list = client.sql_query("""
|
|
||||||
SELECT SeriesInstanceUID, PatientID, SeriesDescription
|
|
||||||
FROM index
|
|
||||||
WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Preview each in browser
|
|
||||||
for _, row in series_list.iterrows():
|
|
||||||
viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
|
|
||||||
print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
|
|
||||||
print(f" View at: {viewer_url}")
|
|
||||||
# webbrowser.open(viewer_url) # Uncomment to open automatically
|
|
||||||
```
|
|
||||||
|
|
||||||
For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
|
|
||||||
|
|
||||||
### Use Case 4: License-Aware Batch Download for Commercial Use
|
|
||||||
|
|
||||||
**Objective:** Download only CC-BY licensed data suitable for commercial applications
|
|
||||||
|
|
||||||
**Steps:**
|
|
||||||
```python
|
|
||||||
from idc_index import IDCClient
|
|
||||||
|
|
||||||
client = IDCClient()
|
|
||||||
|
|
||||||
# Query ONLY for CC BY licensed data (allows commercial use with attribution)
|
|
||||||
query = """
|
|
||||||
SELECT
|
|
||||||
SeriesInstanceUID,
|
|
||||||
collection_id,
|
|
||||||
PatientID,
|
|
||||||
Modality
|
|
||||||
FROM index
|
|
||||||
WHERE license_short_name LIKE 'CC BY%'
|
|
||||||
AND license_short_name NOT LIKE '%NC%'
|
|
||||||
AND Modality IN ('CT', 'MR')
|
|
||||||
AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
|
|
||||||
LIMIT 200
|
|
||||||
"""
|
|
||||||
|
|
||||||
cc_by_data = client.sql_query(query)
|
|
||||||
|
|
||||||
print(f"Found {len(cc_by_data)} CC BY licensed series")
|
|
||||||
print(f"Collections: {cc_by_data['collection_id'].unique()}")
|
|
||||||
|
|
||||||
# Download with license verification
|
|
||||||
client.download_from_selection(
|
|
||||||
seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
|
|
||||||
downloadDir="./commercial_dataset",
|
|
||||||
dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Save license information
|
|
||||||
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Best Practices
|
## Best Practices
|
||||||
|
|
||||||
|
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
|
||||||
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
|
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
|
||||||
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
|
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
|
||||||
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
|
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
|
||||||
@@ -989,142 +782,14 @@ cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
|
|||||||
|
|
||||||
## Common SQL Query Patterns
|
## Common SQL Query Patterns
|
||||||
|
|
||||||
Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.
|
See `references/sql_patterns.md` for quick-reference SQL patterns including:
|
||||||
|
- Filter value discovery (modalities, body parts, manufacturers)
|
||||||
|
- Annotation and segmentation queries (including seg_index, ann_index joins)
|
||||||
|
- Slide microscopy queries (sm_index patterns)
|
||||||
|
- Download size estimation
|
||||||
|
- Clinical data linking
|
||||||
|
|
||||||
### Discover available filter values
|
For segmentation and annotation details, also see `references/digital_pathology_guide.md`.
|
||||||
```python
|
|
||||||
# What modalities exist?
|
|
||||||
client.sql_query("SELECT DISTINCT Modality FROM index")
|
|
||||||
|
|
||||||
# What body parts for a specific modality?
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT DISTINCT BodyPartExamined, COUNT(*) as n
|
|
||||||
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
|
|
||||||
GROUP BY BodyPartExamined ORDER BY n DESC
|
|
||||||
""")
|
|
||||||
|
|
||||||
# What manufacturers for MR?
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT DISTINCT Manufacturer, COUNT(*) as n
|
|
||||||
FROM index WHERE Modality = 'MR'
|
|
||||||
GROUP BY Manufacturer ORDER BY n DESC
|
|
||||||
""")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Find annotations and segmentations
|
|
||||||
|
|
||||||
**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Find ALL segmentations and structure sets by DICOM Modality
|
|
||||||
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT collection_id, Modality, COUNT(*) as series_count
|
|
||||||
FROM index
|
|
||||||
WHERE Modality IN ('SEG', 'RTSTRUCT')
|
|
||||||
GROUP BY collection_id, Modality
|
|
||||||
ORDER BY series_count DESC
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Find segmentations for a specific collection (includes non-analysis-result items)
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
|
|
||||||
FROM index
|
|
||||||
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
|
|
||||||
""")
|
|
||||||
|
|
||||||
# List analysis result collections (curated derived datasets)
|
|
||||||
client.fetch_index("analysis_results_index")
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
|
|
||||||
FROM analysis_results_index
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Find analysis results for a specific source collection
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT analysis_result_id, analysis_result_title
|
|
||||||
FROM analysis_results_index
|
|
||||||
WHERE Collections LIKE '%tcga_luad%'
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Use seg_index for detailed DICOM Segmentation metadata
|
|
||||||
client.fetch_index("seg_index")
|
|
||||||
|
|
||||||
# Get segmentation statistics by algorithm
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
|
|
||||||
FROM seg_index
|
|
||||||
WHERE AlgorithmName IS NOT NULL
|
|
||||||
GROUP BY AlgorithmName, AlgorithmType
|
|
||||||
ORDER BY seg_count DESC
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Find segmentations for specific source images (e.g., chest CT)
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT
|
|
||||||
s.SeriesInstanceUID as seg_series,
|
|
||||||
s.AlgorithmName,
|
|
||||||
s.total_segments,
|
|
||||||
s.segmented_SeriesInstanceUID as source_series
|
|
||||||
FROM seg_index s
|
|
||||||
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
|
||||||
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
|
|
||||||
LIMIT 10
|
|
||||||
""")
|
|
||||||
|
|
||||||
# Find TotalSegmentator results with source image context
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT
|
|
||||||
seg_info.collection_id,
|
|
||||||
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
|
|
||||||
SUM(s.total_segments) as total_segments
|
|
||||||
FROM seg_index s
|
|
||||||
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
|
|
||||||
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
|
|
||||||
GROUP BY seg_info.collection_id
|
|
||||||
ORDER BY seg_count DESC
|
|
||||||
""")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Query slide microscopy data
|
|
||||||
```python
|
|
||||||
# sm_index has detailed metadata; join with index for collection_id
|
|
||||||
client.fetch_index("sm_index")
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT i.collection_id, COUNT(*) as slides,
|
|
||||||
MIN(s.min_PixelSpacing_2sf) as min_resolution
|
|
||||||
FROM sm_index s
|
|
||||||
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
|
|
||||||
GROUP BY i.collection_id
|
|
||||||
ORDER BY slides DESC
|
|
||||||
""")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Estimate download size
|
|
||||||
```python
|
|
||||||
# Size for specific criteria
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
|
|
||||||
FROM index
|
|
||||||
WHERE collection_id = 'nlst' AND Modality = 'CT'
|
|
||||||
""")
|
|
||||||
```
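
Before starting a large download, it can help to compare the estimated size with the free space on the target disk. The sketch below shows one way to do that using only the Python standard library. The helper names `format_size` and `fits_on_disk`, the 10% headroom, and the decimal unit steps (1 GB = 1000 MB) are illustrative assumptions, not part of `idc-index`; `total_mb` stands in for the `total_mb` value returned by a size query like the one above.

```python
import shutil


def format_size(size_mb: float) -> str:
    """Render a size given in megabytes using the largest sensible unit."""
    units = ["MB", "GB", "TB", "PB"]
    size = float(size_mb)
    for unit in units:
        # Stop at the first unit where the number is below 1000,
        # or at the last unit if the value is very large.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def fits_on_disk(size_mb: float, path: str = ".", headroom: float = 0.10) -> bool:
    """Return True if size_mb plus a safety margin fits in the free space at path."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return size_mb * (1 + headroom) <= free_mb


# Hypothetical value; in practice, read it from the query result's total_mb column.
total_mb = 2_500_000.0
print("Estimated download:", format_size(total_mb))
print("Fits on disk:", fits_on_disk(total_mb))
```

If the check fails, narrowing the `WHERE` clause (for example, by patient or study) and re-running the size query is usually simpler than freeing space.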
|
|
||||||
|
|
||||||
### Link to clinical data
|
|
||||||
```python
|
|
||||||
client.fetch_index("clinical_index")
|
|
||||||
|
|
||||||
# Find collections with clinical data and their tables
|
|
||||||
client.sql_query("""
|
|
||||||
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
|
|
||||||
FROM clinical_index
|
|
||||||
GROUP BY collection_id, table_name
|
|
||||||
ORDER BY collection_id
|
|
||||||
""")
|
|
||||||
```
|
|
||||||
|
|
||||||
See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
|
|
||||||
|
|
||||||
## Related Skills
|
||||||
|
|
||||||
@@ -1134,8 +799,7 @@ The following skills complement IDC workflows for downstream analysis and visual
|
|||||||
- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).
|
||||||
|
|
||||||
### Pathology and Slide Microscopy
|
||||||
- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
|
See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).
|
||||||
- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.
|
|
||||||
|
|
||||||
### Metadata Visualization
|
||||||
- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
|
||||||
@@ -1159,11 +823,8 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col
|
|||||||
|
|
||||||
### Reference Documentation
|
||||||
|
|
||||||
- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data
|
See the Quick Navigation section at the top for the full list of reference guides with decision triggers.
|
||||||
- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
|
|
||||||
- **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
|
|
||||||
- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries
|
|
||||||
- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
|
|
||||||
- **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version)
|
||||||
|
|
||||||
### External Links
|
||||||
|
|||||||
@@ -0,0 +1,324 @@
|
|||||||
|
# Clinical Data Guide for IDC
|
||||||
|
|
||||||
|
**Tested with:** idc-index 0.11.7 (IDC data version v23)
|
||||||
|
|
||||||
|
Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.
|
||||||
|
|
||||||
|
## When to Use This Guide
|
||||||
|
|
||||||
|
Use this guide when you need to:
|
||||||
|
- Find what clinical metadata is available for a collection
|
||||||
|
- Filter patients by clinical criteria (e.g., cancer stage, treatment history)
|
||||||
|
- Join clinical attributes with imaging data for cohort selection
|
||||||
|
- Understand and decode coded values in clinical tables
|
||||||
|
|
||||||
|
For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install --upgrade idc-index
|
||||||
|
```
|
||||||
|
|
||||||
|
No BigQuery credentials required - clinical data is packaged with `idc-index`.
|
||||||
|
|
||||||
|
## Understanding Clinical Data in IDC
|
||||||
|
|
||||||
|
### What is Clinical Data?
|
||||||
|
|
||||||
|
Clinical data refers to non-imaging information that accompanies medical images:
|
||||||
|
- Patient demographics (age, sex, race)
|
||||||
|
- Clinical history (diagnoses, surgeries, therapies)
|
||||||
|
- Lab tests and pathology results
|
||||||
|
- Cancer staging (clinical and pathological)
|
||||||
|
- Treatment outcomes
|
||||||
|
|
||||||
|
### Data Organization
|
||||||
|
|
||||||
|
Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. IDC parses these into queryable tables accessible via `idc-index`.
|
||||||
|
|
||||||
|
**Important characteristics:**
|
||||||
|
- Clinical data is **not harmonized** across collections (terms and formats vary)
|
||||||
|
- Not all collections have clinical data (check availability first)
|
||||||
|
- All data is **anonymized**; the `dicom_patient_id` column links clinical records to imaging
|
||||||
|
|
||||||
|
### The clinical_index Table
|
||||||
|
|
||||||
|
The `clinical_index` serves as a dictionary/catalog of all available clinical data:
|
||||||

| Column | Purpose | Use For |
|--------|---------|---------|
| `collection_id` | Collection identifier | Filtering by collection |
| `table_name` | Full BigQuery table reference | BigQuery queries (if needed) |
| `short_table_name` | Short name | `get_clinical_table()` method |
| `column` | Column name in table | Selecting data columns |
| `column_label` | Human-readable description | Searching for concepts |
| `values` | Observed attribute values for the column | Interpreting coded values |

|
### The `values` Column
|
||||||
|
|
||||||
|
The `values` column contains an array of observed attribute values for the column defined in the `column` field. Each entry has:
|
||||||
|
- **option_code**: The actual value observed in that column
|
||||||
|
- **option_description**: Human-readable description of that value (from data dictionary if available, otherwise `None`)
|
||||||
|
|
||||||
|
For ACRIN collections, value descriptions come from provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
|
||||||
|
|
||||||
|
**Note:** For columns with >20 unique values, the `values` array is left empty (`[]`) for simplicity.
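
As a rough illustration of how these entries can be consumed, the sketch below turns a `values` array into a lookup dictionary. It assumes only the entry shape described above (`option_code` and `option_description` keys). The helper name `build_value_mapping` and the sample entries are hand-written for illustration and are not real query output.

```python
def build_value_mapping(values):
    """Map each option_code to its description, falling back to the code itself."""
    mapping = {}
    for entry in values:
        code = str(entry["option_code"])
        description = entry.get("option_description")
        # Keep the raw code when no data-dictionary description is available.
        mapping[code] = description if description is not None else code
    return mapping


# Hand-written entries in the documented shape (not real query output).
example_values = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
    {"option_code": "Cisplatin", "option_description": None},
]

mapping = build_value_mapping(example_values)
print(mapping["110"])
print(mapping["Cisplatin"])

# An empty array (columns with more than 20 unique values) yields an empty mapping.
print(build_value_mapping([]))
```

Falling back to the code keeps undescribed values readable instead of turning them into `None` after mapping.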
|
||||||
|
|
||||||
|
## Core Workflow
|
||||||
|
|
||||||
|
### Step 1: Fetch Clinical Index
|
||||||
|
|
||||||
|
```python
|
||||||
|
from idc_index import IDCClient
|
||||||
|
|
||||||
|
client = IDCClient()
|
||||||
|
client.fetch_index('clinical_index')
|
||||||
|
|
||||||
|
# View available columns
|
||||||
|
print(client.clinical_index.columns.tolist())
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Discover Available Clinical Data
|
||||||
|
|
||||||
|
```python
|
||||||
|
# List all collections with clinical data
|
||||||
|
collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
|
||||||
|
print(f"{len(collections_with_clinical)} collections have clinical data")
|
||||||
|
|
||||||
|
# Find clinical attributes for a specific collection
|
||||||
|
nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
|
||||||
|
nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Search for Specific Attributes
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Search by keyword in column_label (case-insensitive)
|
||||||
|
stage_attrs = client.clinical_index[
|
||||||
|
    client.clinical_index["column_label"].str.contains("stage", case=False, na=False)
|
||||||
|
]
|
||||||
|
stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Load Clinical Table
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Load table using short_table_name
|
||||||
|
nlst_canc_df = client.get_clinical_table("nlst_canc")
|
||||||
|
|
||||||
|
# Examine structure
|
||||||
|
print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
|
||||||
|
nlst_canc_df.head()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 5: Map Coded Values to Descriptions
|
||||||
|
|
||||||
|
Many clinical attributes use coded values. The `values` column in `clinical_index` contains an array of observed values with their descriptions (when available).
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Get the clinical_index rows for NLST
|
||||||
|
nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
|
||||||
|
|
||||||
|
# Get observed values for a specific column
|
||||||
|
# Filter to the row for 'clinical_stag' and extract the values array
|
||||||
|
clinical_stag_values = nlst_clinical_columns[
|
||||||
|
nlst_clinical_columns['column']=='clinical_stag'
|
||||||
|
]['values'].values[0]
|
||||||
|
|
||||||
|
# View the observed values and their descriptions
|
||||||
|
print(clinical_stag_values)
|
||||||
|
# Output: array([{'option_code': '.M', 'option_description': 'Missing'},
|
||||||
|
# {'option_code': '110', 'option_description': 'Stage IA'},
|
||||||
|
# {'option_code': '120', 'option_description': 'Stage IB'}, ...])
|
||||||
|
|
||||||
|
# Create mapping dictionary from codes to descriptions
|
||||||
|
mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values}
|
||||||
|
|
||||||
|
# Apply to DataFrame - convert column to string first for consistent matching
|
||||||
|
nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 6: Join with Imaging Data
|
||||||
|
|
||||||
|
The `dicom_patient_id` column links clinical data to imaging. It matches the `PatientID` column in the imaging index.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Pandas merge approach
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
# Get NLST CT imaging data
|
||||||
|
nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')]
|
||||||
|
|
||||||
|
# Join with clinical data
|
||||||
|
merged = pd.merge(
|
||||||
|
nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(),
|
||||||
|
nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']],
|
||||||
|
left_on='PatientID',
|
||||||
|
right_on='dicom_patient_id',
|
||||||
|
how='inner'
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
```python
|
||||||
|
# SQL join approach
|
||||||
|
query = """
|
||||||
|
SELECT
|
||||||
|
index.PatientID,
|
||||||
|
index.StudyInstanceUID,
|
||||||
|
index.Modality,
|
||||||
|
nlst_canc.clinical_stag
|
||||||
|
FROM index
|
||||||
|
JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id
|
||||||
|
WHERE index.collection_id = 'nlst' AND index.Modality = 'CT'
|
||||||
|
"""
|
||||||
|
results = client.sql_query(query)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Common Use Cases
|
||||||
|
|
||||||
|
### Use Case 1: Select Patients by Cancer Stage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from idc_index import IDCClient
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
client = IDCClient()
|
||||||
|
client.fetch_index('clinical_index')
|
||||||
|
|
||||||
|
# Load clinical table
|
||||||
|
nlst_canc = client.get_clinical_table("nlst_canc")
|
||||||
|
|
||||||
|
# Select Stage IV patients (code '400')
|
||||||
|
stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id']
|
||||||
|
|
||||||
|
# Get CT imaging studies for these patients
|
||||||
|
stage_iv_studies = pd.merge(
|
||||||
|
client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')],
|
||||||
|
stage_iv_patients,
|
||||||
|
left_on='PatientID',
|
||||||
|
right_on='dicom_patient_id',
|
||||||
|
how='inner'
|
||||||
|
)['StudyInstanceUID'].drop_duplicates()
|
||||||
|
|
||||||
|
print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Use Case 2: Find Collections with Specific Clinical Attributes
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find collections with chemotherapy information
|
||||||
|
chemo_collections = client.clinical_index[
|
||||||
|
client.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)
|
||||||
|
]["collection_id"].unique()
|
||||||
|
|
||||||
|
print(f"Collections with chemotherapy data: {list(chemo_collections)}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Use Case 3: Examine Observed Values for a Clinical Attribute
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find what values have been observed for a specific attribute
|
||||||
|
chemotherapy_rows = client.clinical_index[
|
||||||
|
(client.clinical_index["collection_id"] == "hcc_tace_seg") &
|
||||||
|
(client.clinical_index["column"] == "chemotherapy")
|
||||||
|
]
|
||||||
|
|
||||||
|
# Get the observed values array
|
||||||
|
values_list = chemotherapy_rows["values"].tolist()
|
||||||
|
print(values_list)
|
||||||
|
# Output: [[{'option_code': 'Cisplastin', 'option_description': None},
|
||||||
|
# {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Use Case 4: Generate Viewer URLs for Selected Patients
|
||||||
|
|
||||||
|
```python
|
||||||
|
|
||||||
|
|
||||||
|
# Get studies for a sample Stage IV patient (stage_iv_patients is defined in Use Case 1)
|
||||||
|
sample_patient = stage_iv_patients.iloc[0]
|
||||||
|
studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique()
|
||||||
|
|
||||||
|
# Generate viewer URL
|
||||||
|
if len(studies) > 0:
|
||||||
|
viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0])
|
||||||
|
print(viewer_url)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Key Concepts
|
||||||
|
|
||||||
|
### column vs column_label
|
||||||
|
|
||||||
|
- **column**: Use for selecting data from tables (programmatic access)
|
||||||
|
- **column_label**: Use for searching/understanding what data means (human-readable)
|
||||||
|
|
||||||
|
Some collections (like `c4kc_kits`) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels.
|
||||||
|
|
||||||
|
### option_code vs option_description
|
||||||
|
|
||||||
|
The `values` array contains observed attribute values:
|
||||||
|
- **option_code**: The actual value observed in the column (what you filter on)
|
||||||
|
- **option_description**: Human-readable description (from data dictionary if available, otherwise `None`)
|
||||||
|
|
||||||
|
### dicom_patient_id
|
||||||
|
|
||||||
|
Every clinical table includes `dicom_patient_id`, which matches the `PatientID` column in the imaging index. This is the key for joining clinical and imaging data.
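
To make the join semantics concrete, here is a small self-contained sketch. It uses the standard-library `sqlite3` module as a stand-in for the SQL engine behind `client.sql_query`. The table names `imaging_index` and `clinical`, and every row in them, are made up for illustration; only the `PatientID` = `dicom_patient_id` relationship comes from this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE imaging_index (PatientID TEXT, StudyInstanceUID TEXT, Modality TEXT);
    CREATE TABLE clinical (dicom_patient_id TEXT, clinical_stag TEXT);

    -- Made-up imaging rows: P001 has two CT studies, P003 has only MR.
    INSERT INTO imaging_index VALUES
        ('P001', '1.2.3.1', 'CT'),
        ('P001', '1.2.3.2', 'CT'),
        ('P002', '1.2.3.3', 'CT'),
        ('P003', '1.2.3.4', 'MR');

    -- Made-up clinical rows: P999 has clinical data but no images.
    INSERT INTO clinical VALUES
        ('P001', '400'),
        ('P002', '110'),
        ('P999', '400');
""")

# Inner join: keeps only patients present in both tables.
rows = conn.execute("""
    SELECT i.PatientID, i.StudyInstanceUID, c.clinical_stag
    FROM imaging_index AS i
    JOIN clinical AS c ON i.PatientID = c.dicom_patient_id
    WHERE i.Modality = 'CT' AND c.clinical_stag = '400'
    ORDER BY i.StudyInstanceUID
""").fetchall()

for row in rows:
    print(row)
```

Patient `P999` matches the stage filter but has no imaging rows, so the inner join drops it. This is the same effect described under "No matching patients when joining" in Troubleshooting.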
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Issue: Clinical table not found
|
||||||
|
|
||||||
|
**Cause:** The name passed to `get_clinical_table()` is not a valid `short_table_name`, or the collection has no clinical tables
|
||||||
|
|
||||||
|
**Solution:** Query clinical_index first to find available tables:
|
||||||
|
```python
|
||||||
|
client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Issue: Empty values array
|
||||||
|
|
||||||
|
**Cause:** The `values` array is left empty when a column has >20 unique values
|
||||||
|
|
||||||
|
**Solution:** Load the clinical table and examine unique values directly:
|
||||||
|
```python
|
||||||
|
clinical_df = client.get_clinical_table("table_name")
|
||||||
|
clinical_df['column_name'].unique()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Issue: Coded values not in mapping
|
||||||
|
|
||||||
|
**Cause:** Some values may be missing from the dictionary (e.g., empty strings, special codes like `.M` for missing)
|
||||||
|
|
||||||
|
**Solution:** Handle unmapped values gracefully:
|
||||||
|
```python
|
||||||
|
df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing')
|
||||||
|
```
|
||||||
|
|
||||||
|
### Issue: No matching patients when joining
|
||||||
|
|
||||||
|
**Cause:** Clinical data may include patients without images, or vice versa
|
||||||
|
|
||||||
|
**Solution:** Verify patient overlap before joining:
|
||||||
|
```python
|
||||||
|
imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique())
|
||||||
|
clinical_patients = set(clinical_df['dicom_patient_id'].unique())
|
||||||
|
overlap = imaging_patients & clinical_patients
|
||||||
|
print(f"Patients with both imaging and clinical data: {len(overlap)}")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Resources
|
||||||
|
|
||||||
|
**IDC Documentation:**
|
||||||
|
- [Clinical data organization](https://learn.canceridc.dev/data/organization-of-data/clinical) - How clinical data is organized in IDC
|
||||||
|
- [Clinical data dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc) - Visual summary of available clinical data
|
||||||
|
- [idc-index clinical_index documentation](https://idc-index.readthedocs.io/en/latest/column_descriptions.html#clinical-index)
|
||||||
|
|
||||||
|
**Related Guides:**
|
||||||
|
- `bigquery_guide.md` - Advanced clinical queries via BigQuery
|
||||||
|
- Main SKILL.md - Core IDC workflows
|
||||||
|
|
||||||
|
**IDC Tutorials:**
|
||||||
|
- [clinical_data_intro.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/clinical_data_intro.ipynb)
|
||||||
|
- [exploring_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/exploring_clinical_data.ipynb)
|
||||||
|
- [nlst_clinical_data.ipynb](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/nlst_clinical_data.ipynb)
|
||||||
@@ -0,0 +1,254 @@
|
|||||||
|
# Digital Pathology Guide for IDC
|
||||||
|
|
||||||
|
**Tested with:** IDC data version v23, idc-index 0.11.9
|
||||||
|
|
||||||
|
For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
|
||||||
|
|
||||||
|
## Index Tables for Digital Pathology
|
||||||
|
|
||||||
|
Five specialized index tables provide curated metadata without needing BigQuery:
|
||||||

| Table | Row Granularity | Description |
|-------|-----------------|-------------|
| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
| `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
| `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology; filter by source Modality to find pathology-specific segmentations |
| `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
| `ann_group_index` | 1 row = 1 annotation group | Annotation group details: `AnnotationGroupLabel`, `GraphicType`, `NumberOfAnnotations`, `AlgorithmName`, property codes |

|
All require `client.fetch_index("table_name")` before querying. Use `client.indices_overview` to inspect column schemas programmatically.
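
Once a table's schema is in hand, a keyword search over its columns is a quick way to find the right field. The sketch below assumes a schema shaped as a dict with a `columns` list whose items carry `name`, `type`, and optionally `description`. That layout, the helper name `find_columns`, and the sample schema are assumptions for illustration, so check the structure of `client.indices_overview` in your installed version before relying on it.

```python
def find_columns(schema, keyword):
    """Return (name, type, description) tuples whose name or description mentions keyword."""
    keyword = keyword.lower()
    matches = []
    for column in schema.get("columns", []):
        name = column["name"]
        description = column.get("description", "") or ""
        if keyword in name.lower() or keyword in description.lower():
            matches.append((name, column.get("type", ""), description))
    return matches


# Hypothetical stand-in for one table's schema; the real layout may differ.
example_schema = {
    "columns": [
        {"name": "SeriesInstanceUID", "type": "STRING", "description": "DICOM series identifier"},
        {"name": "ObjectiveLensPower", "type": "FLOAT", "description": "Objective lens magnification"},
        {"name": "min_PixelSpacing_2sf", "type": "FLOAT"},
    ]
}

for name, col_type, description in find_columns(example_schema, "lens"):
    print(name, col_type, description)
```

Searching both the name and the description helps when column names are terse and only the description carries the meaning.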
|
||||||
|
|
||||||
|
## Slide Microscopy Queries
|
||||||
|
|
||||||
|
### Basic SM metadata
|
||||||
|
|
||||||
|
```python
|
||||||
|
from idc_index import IDCClient
|
||||||
|
client = IDCClient()
|
||||||
|
|
||||||
|
# sm_index has detailed metadata; join with index for collection_id
|
||||||
|
client.fetch_index("sm_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT i.collection_id, COUNT(*) as slides,
|
||||||
|
MIN(s.min_PixelSpacing_2sf) as min_resolution
|
||||||
|
FROM sm_index s
|
||||||
|
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
GROUP BY i.collection_id
|
||||||
|
ORDER BY slides DESC
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Find SM series with specific properties
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find high-resolution slides with specific objective lens power
|
||||||
|
client.fetch_index("sm_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
i.collection_id,
|
||||||
|
i.PatientID,
|
||||||
|
s.ObjectiveLensPower,
|
||||||
|
s.min_PixelSpacing_2sf
|
||||||
|
FROM sm_index s
|
||||||
|
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
WHERE s.ObjectiveLensPower >= 40
|
||||||
|
ORDER BY s.min_PixelSpacing_2sf
|
||||||
|
LIMIT 20
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Annotation Queries (ANN)
|
||||||
|
|
||||||
|
DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
|
||||||
|
|
||||||
|
### Basic annotation discovery
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find annotation series and their referenced images
|
||||||
|
client.fetch_index("ann_index")
|
||||||
|
client.fetch_index("ann_group_index")
|
||||||
|
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
a.SeriesInstanceUID as ann_series,
|
||||||
|
a.AnnotationCoordinateType,
|
||||||
|
a.referenced_SeriesInstanceUID as source_series
|
||||||
|
FROM ann_index a
|
||||||
|
LIMIT 10
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Annotation group statistics
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Get annotation group details (graphic types, counts, algorithms)
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
GraphicType,
|
||||||
|
SUM(NumberOfAnnotations) as total_annotations,
|
||||||
|
COUNT(*) as group_count
|
||||||
|
FROM ann_group_index
|
||||||
|
GROUP BY GraphicType
|
||||||
|
ORDER BY total_annotations DESC
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Find annotations with source slide context
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find annotations with their source slide microscopy context
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
i.collection_id,
|
||||||
|
g.GraphicType,
|
||||||
|
g.AnnotationPropertyType_CodeMeaning,
|
||||||
|
g.AlgorithmName,
|
||||||
|
g.NumberOfAnnotations
|
||||||
|
FROM ann_group_index g
|
||||||
|
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
|
||||||
|
JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
WHERE g.AlgorithmName IS NOT NULL
|
||||||
|
LIMIT 10
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Segmentations on Slide Microscopy
|
||||||
|
|
||||||
|
DICOM Segmentations (Modality = 'SEG') are used for both radiology (e.g., organ segmentations on CT) and pathology (e.g., tissue region segmentations on whole slide images). Use `seg_index.segmented_SeriesInstanceUID` to find the source series, then filter by source Modality to isolate pathology segmentations.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find segmentations whose source is a slide microscopy image
|
||||||
|
client.fetch_index("seg_index")
|
||||||
|
client.fetch_index("sm_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
seg.SeriesInstanceUID as seg_series,
|
||||||
|
seg.AlgorithmName,
|
||||||
|
seg.total_segments,
|
||||||
|
src.collection_id,
|
||||||
|
src.Modality as source_modality
|
||||||
|
FROM seg_index seg
|
||||||
|
JOIN index src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
||||||
|
WHERE src.Modality = 'SM'
|
||||||
|
LIMIT 20
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Filter by AnnotationGroupLabel
|
||||||
|
|
||||||
|
`AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.
|
||||||
|
|
||||||
|
### Simple label filtering
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find annotation groups by label (e.g., groups mentioning "blast")
|
||||||
|
client.fetch_index("ann_group_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
g.SeriesInstanceUID,
|
||||||
|
g.AnnotationGroupLabel,
|
||||||
|
g.GraphicType,
|
||||||
|
g.NumberOfAnnotations,
|
||||||
|
g.AlgorithmName
|
||||||
|
FROM ann_group_index g
|
||||||
|
WHERE LOWER(g.AnnotationGroupLabel) LIKE '%blast%'
|
||||||
|
ORDER BY g.NumberOfAnnotations DESC
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Label filtering with collection context
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find annotation groups matching a label within a specific collection
|
||||||
|
client.fetch_index("ann_index")
|
||||||
|
client.fetch_index("ann_group_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
i.collection_id,
|
||||||
|
g.AnnotationGroupLabel,
|
||||||
|
g.GraphicType,
|
||||||
|
g.NumberOfAnnotations,
|
||||||
|
g.AnnotationPropertyType_CodeMeaning
|
||||||
|
FROM ann_group_index g
|
||||||
|
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
|
||||||
|
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
WHERE i.collection_id = 'your_collection_id'
|
||||||
|
AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
|
||||||
|
ORDER BY g.NumberOfAnnotations DESC
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Annotations on Slide Microscopy (SM + ANN Cross-Reference)
|
||||||
|
|
||||||
|
When looking for annotations related to slide microscopy data, use both SM and ANN tables together. The `ann_index.referenced_SeriesInstanceUID` links each annotation series to its source slide.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Find slide microscopy images and their annotations in a collection
|
||||||
|
client.fetch_index("sm_index")
|
||||||
|
client.fetch_index("ann_index")
|
||||||
|
client.fetch_index("ann_group_index")
|
||||||
|
client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
i.collection_id,
|
||||||
|
s.ObjectiveLensPower,
|
||||||
|
g.AnnotationGroupLabel,
|
||||||
|
g.NumberOfAnnotations,
|
||||||
|
g.GraphicType
|
||||||
|
FROM ann_group_index g
|
||||||
|
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
|
||||||
|
JOIN sm_index s ON a.referenced_SeriesInstanceUID = s.SeriesInstanceUID
|
||||||
|
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
WHERE i.collection_id = 'your_collection_id'
|
||||||
|
ORDER BY g.NumberOfAnnotations DESC
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Join Patterns
|
||||||
|
|
||||||
|
### SM join (slide microscopy details with collection context)
|
||||||
|
|
||||||
|
```python
|
||||||
|
client.fetch_index("sm_index")
|
||||||
|
result = client.sql_query("""
|
||||||
|
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
|
||||||
|
FROM index i
|
||||||
|
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
|
||||||
|
LIMIT 10
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
### ANN join (annotation groups with collection context)
|
||||||
|
|
||||||
|
```python
|
||||||
|
client.fetch_index("ann_index")
|
||||||
|
client.fetch_index("ann_group_index")
|
||||||
|
result = client.sql_query("""
|
||||||
|
SELECT
|
||||||
|
i.collection_id,
|
||||||
|
g.AnnotationGroupLabel,
|
||||||
|
g.GraphicType,
|
||||||
|
g.NumberOfAnnotations,
|
||||||
|
a.referenced_SeriesInstanceUID as source_series
|
||||||
|
FROM ann_group_index g
|
||||||
|
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
|
||||||
|
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
|
||||||
|
LIMIT 10
|
||||||
|
""")
|
||||||
|
```

## Related Tools

The following tools work with DICOM format for digital pathology workflows:

**Python Libraries:**
- [highdicom](https://github.com/ImagingDataCommons/highdicom) - High-level DICOM abstractions for Python. Create and read DICOM Segmentations (SEG), Structured Reports (SR), and parametric maps for pathology and radiology. Developed by IDC.
- [wsidicom](https://github.com/imi-bigpicture/wsidicom) - Python package for reading DICOM WSI datasets. Parses metadata into easy-to-use dataclasses for whole slide image analysis.
- [TIA-Toolbox](https://github.com/TissueImageAnalytics/tiatoolbox) - End-to-end computational pathology library with DICOM support via `DICOMWSIReader`. Provides tile extraction, feature extraction, and pretrained deep learning models.
- [EZ-WSI-DICOMweb](https://github.com/GoogleCloudPlatform/EZ-WSI-DICOMweb) - Extract image patches from DICOM whole slide images via DICOMweb. Designed for AI/ML workflows with cloud DICOM stores.

**Viewers:**
- [Slim](https://github.com/ImagingDataCommons/slim) - Web-based DICOM slide microscopy viewer and annotation tool. Supports brightfield and multiplexed immunofluorescence imaging via DICOMweb. Developed by IDC.
- [QuPath](https://qupath.github.io/) - Cross-platform open source software for whole slide image analysis. Supports DICOM WSI via Bio-Formats and OpenSlide (v0.4.0+).

**Conversion:**
- [dicom_wsi](https://github.com/Steven-N-Hart/dicom_wsi) - Python implementation for converting proprietary WSI formats to DICOM-compliant files.
@@ -0,0 +1,146 @@
# Index Tables Guide for IDC

**Tested with:** idc-index 0.11.9 (IDC data version v23)

This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.

**Complete index table documentation:** https://idc-index.readthedocs.io/en/latest/indices_reference.html

## When to Use This Guide

Load this guide when you need to:
- Discover table schemas and column types programmatically
- Access index tables as pandas DataFrames (not via SQL)
- Understand key columns and join relationships between tables

For SQL query examples (filter discovery, finding annotations, size estimation), see `references/sql_patterns.md`.

## Prerequisites

```bash
pip install --upgrade idc-index
```

## Accessing Index Tables

### Via SQL (recommended for filtering/aggregation)

```python
from idc_index import IDCClient
client = IDCClient()

# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```

### As pandas DataFrames (direct access)

```python
# Primary index (always available after client initialization)
df = client.index

# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```

## Discovering Table Schemas

The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**

**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying: standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.

```python
from idc_index import IDCClient
client = IDCClient()

# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f" Installed: {info['installed']}")
    print(f" Description: {info['description']}")

# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f" {col['name']} ({col['type']}): {desc}")

# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```

**Alternative: use `get_index_schema()` method:**
```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```
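
When a table has many columns, it can help to search the schema by keyword instead of printing everything. The sketch below assumes only the schema shape described above (`table_description` plus a `columns` list of dicts with `name`, `type`, and an optional `description`). The helper name `find_columns` and the toy schema are hypothetical and exist only so the example runs without `idc-index` installed.

```python
def find_columns(schema, keyword):
    """Return (name, type) pairs for columns whose name or description mentions keyword."""
    needle = keyword.lower()
    matches = []
    for col in schema["columns"]:
        # Description is optional, so fall back to an empty string
        text = f"{col['name']} {col.get('description') or ''}".lower()
        if needle in text:
            matches.append((col["name"], col["type"]))
    return matches


# Toy schema with the same shape as a real one; values are illustrative only
toy_schema = {
    "table_description": "toy table",
    "columns": [
        {"name": "Modality", "type": "STRING", "description": "DICOM Modality attribute"},
        {"name": "collection_id", "type": "STRING", "description": "IDC collection identifier"},
        {"name": "series_size_MB", "type": "FLOAT"},
    ],
}

print(find_columns(toy_schema, "dicom"))
```

With a real client, the same helper could presumably be called as `find_columns(client.get_index_schema("sm_index"), "pixel")` after fetching that index.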

## Key Columns Reference

Most common columns in the primary `index` table (use `indices_overview` for complete list and descriptions):

| Column | Type | DICOM | Description |
|--------|------|-------|-------------|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | Analysis results collection the series belongs to, if applicable |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID; use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, SEG, ANN, RTSTRUCT, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |

**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
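
As one example of applying DICOM knowledge to a column value: the DICOM Age String (AS) value representation is four characters, three digits followed by a unit letter (`D`, `W`, `M`, or `Y`). The sketch below converts such a string to approximate years. It assumes the `PatientAge` column carries the raw AS value, which is worth confirming on a few rows first; the helper name and the average month and year lengths are choices made for this illustration.

```python
import re

# DICOM AS format: nnnD, nnnW, nnnM or nnnY
_AGE_RE = re.compile(r"^(\d{3})([DWMY])$")
# Average unit lengths in days (an approximation chosen for this sketch)
_DAYS_PER_UNIT = {"D": 1, "W": 7, "M": 30.4375, "Y": 365.25}


def dicom_age_to_years(age):
    """Convert a DICOM Age String such as '045Y' to years; return None if unparseable."""
    if not age:
        return None
    match = _AGE_RE.match(age.strip())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value * _DAYS_PER_UNIT[unit] / 365.25


for raw in ("045Y", "006M", "unknown", None):
    print(raw, "->", dicom_age_to_years(raw))
```

Returning `None` for missing or malformed values keeps the helper safe to apply across a whole DataFrame column, for instance with `df["PatientAge"].map(dicom_age_to_years)`.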

## Join Column Reference

Use this table to identify join columns between index tables. Always call `client.fetch_index("table_name")` before using a table in SQL.

| Table A | Table B | Join Condition |
|---------|---------|----------------|
| `index` | `collections_index` | `index.collection_id = collections_index.collection_id` |
| `index` | `sm_index` | `index.SeriesInstanceUID = sm_index.SeriesInstanceUID` |
| `index` | `seg_index` | `index.SeriesInstanceUID = seg_index.segmented_SeriesInstanceUID` (joins to the source image series; join on `seg_index.SeriesInstanceUID` instead for metadata of the segmentation series itself) |
| `index` | `ann_index` | `index.SeriesInstanceUID = ann_index.SeriesInstanceUID` |
| `ann_index` | `ann_group_index` | `ann_index.SeriesInstanceUID = ann_group_index.SeriesInstanceUID` |
| `index` | `clinical_index` | `index.collection_id = clinical_index.collection_id` (then filter by patient) |
| `index` | `contrast_index` | `index.SeriesInstanceUID = contrast_index.SeriesInstanceUID` |

For complete query examples using these joins, see `references/sql_patterns.md`.

## Troubleshooting

**Issue:** Column not found in table
- **Cause:** Column name misspelled or doesn't exist in that table
- **Solution:** Use `client.indices_overview["table_name"]["schema"]["columns"]` to list available columns

**Issue:** DataFrame access returns None
- **Cause:** Index not fetched or property name incorrect
- **Solution:** Fetch first with `client.fetch_index()`, then access via property matching the index name

## Resources

- Complete index table documentation: https://idc-index.readthedocs.io/en/latest/indices_reference.html
- `references/sql_patterns.md` for query examples using these tables
- `references/clinical_data_guide.md` for clinical data workflows
- `references/digital_pathology_guide.md` for pathology-specific indices
@@ -0,0 +1,207 @@
# SQL Query Patterns for IDC

**Tested with:** idc-index 0.11.9 (IDC data version v23)

Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.

## When to Use This Guide

Load this guide when you need quick-reference SQL patterns for:
- Discovering available filter values (modalities, body parts, manufacturers)
- Finding annotations and segmentations across collections
- Querying slide microscopy and annotation data
- Estimating download sizes before download
- Linking imaging data to clinical data

For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`.

## Prerequisites

```bash
pip install --upgrade idc-index
```

```python
from idc_index import IDCClient
client = IDCClient()
```

## Discover Available Filter Values

```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
""")
```

## Find Annotations and Segmentations

**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Count TotalSegmentator segmentations per collection
# (joins on the segmentation series itself to pick up its collection_id)
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")

# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AlgorithmName IS NOT NULL
    LIMIT 10
""")
# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more
```

## Query Slide Microscopy and Annotation Data

Use `sm_index` for slide microscopy metadata and `ann_index`/`ann_group_index` for annotations on slides (DICOM ANN objects). Filter annotation groups by `AnnotationGroupLabel` to find annotations by name.

```python
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")

# Example: find annotation groups by label within a collection
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
    FROM ann_group_index g
    JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
        AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
""")
```

See `references/digital_pathology_guide.md` for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.

## Estimate Download Size

```python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
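
Once `total_mb` is known, a quick check against free disk space can catch an oversized download before it starts. The following is a rough sketch using only the standard library. The helper names, the decimal interpretation of megabytes (1 GB = 1000 MB), and the 20% headroom factor are assumptions of this example rather than anything defined by `idc-index`.

```python
import shutil


def format_size_mb(total_mb):
    """Render a size given in megabytes as MB, GB or TB (assuming 1 GB = 1000 MB)."""
    value, unit = float(total_mb), "MB"
    for next_unit in ("GB", "TB"):
        if value < 1000:
            break
        value, unit = value / 1000, next_unit
    return f"{value:.1f} {unit}"


def fits_on_disk(total_mb, path=".", headroom=1.2):
    """Return True if total_mb, padded by a headroom factor, fits in the free space at path."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return total_mb * headroom <= free_mb


estimate_mb = 1500.0  # stand-in for the total_mb value returned by the query
print(format_size_mb(estimate_mb), "fits:", fits_on_disk(estimate_mb))
```

Passing the intended download directory as `path` checks the volume the data will actually land on.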

## Link to Clinical Data

```python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")
```

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.

## Troubleshooting

**Issue:** Query returns error "table not found"
- **Cause:** Index not fetched before query
- **Solution:** Call `client.fetch_index("table_name")` before using tables other than the primary `index`

**Issue:** LIKE pattern not matching expected results
- **Cause:** Case sensitivity or whitespace
- **Solution:** Use `LOWER(column)` for case-insensitive matching, `TRIM()` for whitespace

**Issue:** JOIN returns fewer rows than expected
- **Cause:** NULL values in join columns or no matching records
- **Solution:** Use `LEFT JOIN` to include rows without matches, check for NULLs with `IS NOT NULL`
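
The difference between `JOIN` and `LEFT JOIN` in the last issue is easiest to see on a toy example. The sketch below uses Python's built-in `sqlite3` purely as a stand-in SQL engine, with made-up tables `toy_index` and `toy_sm_index`; it does not touch IDC data, and the engine behind `client.sql_query` may differ in other respects.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE toy_index (SeriesInstanceUID TEXT, Modality TEXT);
CREATE TABLE toy_sm_index (SeriesInstanceUID TEXT, ObjectiveLensPower INTEGER);
INSERT INTO toy_index VALUES ('1.1', 'SM'), ('1.2', 'SM'), ('1.3', 'CT');
INSERT INTO toy_sm_index VALUES ('1.1', 40);
""")

# Inner join: only series present in both tables survive
inner_rows = con.execute("""
    SELECT i.SeriesInstanceUID, s.ObjectiveLensPower
    FROM toy_index i
    JOIN toy_sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    ORDER BY i.SeriesInstanceUID
""").fetchall()

# Left join: every toy_index row is kept; missing matches show up as NULL (None)
left_rows = con.execute("""
    SELECT i.SeriesInstanceUID, s.ObjectiveLensPower
    FROM toy_index i
    LEFT JOIN toy_sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    ORDER BY i.SeriesInstanceUID
""").fetchall()

# Rows that the inner join silently dropped
unmatched = [uid for uid, power in left_rows if power is None]
print(len(inner_rows), len(left_rows), unmatched)
```

Listing the unmatched keys this way usually shows whether the gap comes from an unfetched or unrelated table, or from series that simply have no entry in the secondary index.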

## Resources

- `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references
- `references/clinical_data_guide.md` for clinical data patterns and value mapping
- `references/digital_pathology_guide.md` for pathology-specific queries
- `references/bigquery_guide.md` for advanced queries requiring full DICOM metadata
186
scientific-skills/imaging-data-commons/references/use_cases.md
Normal file
@@ -0,0 +1,186 @@
# Common Use Cases for IDC

**Tested with:** idc-index 0.11.9 (IDC data version v23)

This guide provides complete end-to-end workflow examples for common IDC use cases. Each use case demonstrates the full workflow from query to download with best practices.

## When to Use This Guide

Load this guide when you need:
- Complete end-to-end workflow examples for training dataset creation
- Patterns for multi-step data selection and download workflows
- Examples of license-aware data handling for commercial use
- Visualization workflows for data preview before download

For core API patterns (query, download, visualize, citations), see the "Core Capabilities" section in the main SKILL.md.

## Prerequisites

```bash
pip install --upgrade idc-index
```

## Use Case 1: Find and Download Lung CT Scans for Deep Learning

**Objective:** Build a training dataset of lung CT scans from the NLST collection

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# 1. Query for lung CT scans with specific criteria
query = """
    SELECT
        PatientID,
        SeriesInstanceUID,
        SeriesDescription
    FROM index
    WHERE collection_id = 'nlst'
        AND Modality = 'CT'
        AND BodyPartExamined = 'CHEST'
        AND license_short_name = 'CC BY 4.0'
    ORDER BY PatientID
    LIMIT 100
"""

results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

# 2. Download data organized by patient
client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)

# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```
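
A saved manifest is only useful for reproducibility if it can be read back and checked. As a small sketch, the snippet below counts series per patient from a manifest with the same columns as the query above, using only the standard library. The helper name `series_per_patient` and the in-memory sample rows are illustrative; real use would pass an open handle to `training_manifest.csv`.

```python
import csv
import io
from collections import Counter


def series_per_patient(manifest_file):
    """Count manifest rows (series) per PatientID from an open CSV file object."""
    reader = csv.DictReader(manifest_file)
    return Counter(row["PatientID"] for row in reader)


# In-memory stand-in for training_manifest.csv; identifiers are made up
sample = io.StringIO(
    "PatientID,SeriesInstanceUID,SeriesDescription\n"
    "P001,1.2.3,chest\n"
    "P001,1.2.4,chest\n"
    "P002,1.2.5,chest\n"
)
counts = series_per_patient(sample)
print(dict(counts))
```

Per-patient counts are also a convenient starting point for splitting training and validation data by patient rather than by series.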

## Use Case 2: Query Brain MRI by Manufacturer for Quality Study

**Objective:** Compare image quality across different MRI scanner manufacturers

**Steps:**
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Query for brain MRI grouped by manufacturer
# (rows without scanner details are excluded so the per-manufacturer queries below can match them)
query = """
    SELECT
        Manufacturer,
        ManufacturerModelName,
        COUNT(DISTINCT SeriesInstanceUID) as num_series,
        COUNT(DISTINCT PatientID) as num_patients
    FROM index
    WHERE Modality = 'MR'
        AND BodyPartExamined LIKE '%BRAIN%'
        AND Manufacturer IS NOT NULL
        AND ManufacturerModelName IS NOT NULL
    GROUP BY Manufacturer, ManufacturerModelName
    HAVING num_series >= 10
    ORDER BY num_series DESC
"""

manufacturers = client.sql_query(query)
print(manufacturers)

# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']

    query = f"""
        SELECT SeriesInstanceUID
        FROM index
        WHERE Manufacturer = '{mfr}'
            AND ManufacturerModelName = '{model}'
            AND Modality = 'MR'
            AND BodyPartExamined LIKE '%BRAIN%'
        LIMIT 5
    """

    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
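
Interpolating values such as manufacturer names straight into SQL text breaks as soon as a value contains a single quote, and free-text names can also contain characters that are awkward in directory names. A hedged sketch of two small helpers that address this is shown below. Both `sql_literal` and `safe_dirname` are hypothetical names introduced here, and doubling single quotes is the standard SQL escaping rule rather than anything specific to `idc-index`.

```python
import re


def sql_literal(value):
    """Quote a value as a SQL string literal, doubling any embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def safe_dirname(name):
    """Replace runs of characters outside letters, digits, dot, dash and underscore with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", str(name))


mfr = "O'Brien Medical Systems"  # made-up value containing a quote
query = f"SELECT SeriesInstanceUID FROM index WHERE Manufacturer = {sql_literal(mfr)} LIMIT 5"
print(query)
print(safe_dirname(mfr))
```

In the loop above, these would replace the hand-written `'{mfr}'` and `mfr.replace(' ', '_')` expressions.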

## Use Case 3: Visualize Series Without Downloading

**Objective:** Preview imaging data before committing to download

```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
""")

# Preview each in browser
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f" View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
```

For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.

## Use Case 4: License-Aware Batch Download for Commercial Use

**Objective:** Download only CC-BY licensed data suitable for commercial applications

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
    SELECT
        SeriesInstanceUID,
        collection_id,
        PatientID,
        Modality,
        license_short_name
    FROM index
    WHERE license_short_name LIKE 'CC BY%'
        AND license_short_name NOT LIKE '%NC%'
        AND Modality IN ('CT', 'MR')
        AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
    LIMIT 200
"""

cc_by_data = client.sql_query(query)

print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")

# Download the license-filtered series
client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)

# Save manifest, including the license of each series
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```
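
The same license rule can be applied in Python, for example to re-check a manifest produced elsewhere. The predicate below is a sketch that mirrors the two `LIKE` conditions in the query above (starts with `CC BY`, does not contain `NC`). The function name is hypothetical, and the sample strings are the license names mentioned in this skill, not an exhaustive list.

```python
def allows_commercial_use(license_short_name):
    """Mirror the SQL filter: LIKE 'CC BY%' AND NOT LIKE '%NC%'."""
    if not license_short_name:
        return False  # treat missing license information as not cleared
    return license_short_name.startswith("CC BY") and "NC" not in license_short_name


for name in ("CC BY 4.0", "CC BY-NC 4.0", None):
    print(name, "->", allows_commercial_use(name))
```

Treating a missing license as not cleared is a deliberately conservative choice for this sketch.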

## Resources

- Main SKILL.md for core API patterns (query, download, visualize)
- `references/clinical_data_guide.md` for clinical data integration workflows
- `references/sql_patterns.md` for additional SQL query patterns
- `references/index_tables_guide.md` for complex join patterns