mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
1151 lines
41 KiB
Markdown
1151 lines
41 KiB
Markdown
---
|
|
name: imaging-data-commons
|
|
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
|
|
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
|
|
metadata:
|
|
skill-author: Andrey Fedorov, @fedorov
|
|
---
|
|
|
|
# Imaging Data Commons
|
|
|
|
## Overview
|
|
|
|
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
|
|
|
|
**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
|
|
|
|
**Check current data scale for the latest version:**
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
client = IDCClient()
|
|
|
|
# get IDC data version
|
|
print(client.get_idc_version())
|
|
|
|
# Get collection count and total series
|
|
stats = client.sql_query("""
|
|
SELECT
|
|
COUNT(DISTINCT collection_id) as collections,
|
|
COUNT(DISTINCT analysis_result_id) as analysis_results,
|
|
COUNT(DISTINCT PatientID) as patients,
|
|
COUNT(DISTINCT StudyInstanceUID) as studies,
|
|
COUNT(DISTINCT SeriesInstanceUID) as series,
|
|
SUM(instanceCount) as instances,
|
|
SUM(series_size_MB)/1000000 as size_TB
|
|
FROM index
|
|
""")
|
|
print(stats)
|
|
```
|
|
|
|
**Core workflow:**
|
|
1. Query metadata → `client.sql_query()`
|
|
2. Download DICOM files → `client.download_from_selection()`
|
|
3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)`
|
|
|
|
## When to Use This Skill
|
|
|
|
- Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
|
|
- Selecting image subsets by cancer type, modality, anatomical site, or other metadata
|
|
- Downloading DICOM data from IDC
|
|
- Checking data licenses before use in research or commercial applications
|
|
- Visualizing medical images in a browser without local DICOM viewer software
|
|
|
|
## IDC Data Model
|
|
|
|
IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
|
|
|
|
- **collection_id**: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection.
|
|
- **analysis_result_id**: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.
|
|
|
|
Use `collection_id` to find original imaging data, may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations.
|
|
|
|
**Key identifiers for queries:**
|
|
| Identifier | Scope | Use for |
|
|
|------------|-------|---------|
|
|
| `collection_id` | Dataset grouping | Filtering by project/study |
|
|
| `PatientID` | Patient | Grouping images by patient |
|
|
| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization |
|
|
| `SeriesInstanceUID` | DICOM series | Grouping of related series, visualization |
|
|
|
|
## Index Tables
|
|
|
|
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
|
|
|
|
**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
|
|
|
|
### Available Tables
|
|
|
|
| Table | Row Granularity | Loaded | Description |
|
|
|-------|-----------------|--------|-------------|
|
|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
|
|
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
|
|
| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
|
|
| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
|
|
| `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
|
|
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
|
|
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
|
|
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
|
|
|
|
**Auto** = loaded automatically when `IDCClient()` is instantiated
|
|
**fetch_index()** = requires `client.fetch_index("table_name")` to load
|
|
|
|
### Joining Tables
|
|
|
|
**Key columns are not explicitly labeled, the following is a subset that can be used in joins.**
|
|
|
|
| Join Column | Tables | Use Case |
|
|
|-------------|--------|----------|
|
|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
|
|
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
|
|
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
|
|
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
|
|
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
|
|
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
|
|
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
|
|
| `Modality` | index, prior_versions_index | Filter by imaging modality |
|
|
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
|
|
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
|
|
|
|
**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
|
|
|
|
**Example joins:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
client = IDCClient()
|
|
|
|
# Join index with collections_index to get cancer types
|
|
client.fetch_index("collections_index")
|
|
result = client.sql_query("""
|
|
SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
|
|
FROM index i
|
|
JOIN collections_index c ON i.collection_id = c.collection_id
|
|
WHERE i.Modality = 'MR'
|
|
LIMIT 10
|
|
""")
|
|
|
|
# Join index with sm_index for slide microscopy details
|
|
client.fetch_index("sm_index")
|
|
result = client.sql_query("""
|
|
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
|
|
FROM index i
|
|
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
|
|
LIMIT 10
|
|
""")
|
|
|
|
# Join seg_index with index to find segmentations and their source images
|
|
client.fetch_index("seg_index")
|
|
result = client.sql_query("""
|
|
SELECT
|
|
s.SeriesInstanceUID as seg_series,
|
|
s.AlgorithmName,
|
|
s.total_segments,
|
|
src.collection_id,
|
|
src.Modality as source_modality,
|
|
src.BodyPartExamined
|
|
FROM seg_index s
|
|
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
|
WHERE s.AlgorithmType = 'AUTOMATIC'
|
|
LIMIT 10
|
|
""")
|
|
```
|
|
|
|
### Accessing Index Tables
|
|
|
|
**Via SQL (recommended for filtering/aggregation):**
|
|
```python
|
|
from idc_index import IDCClient
|
|
client = IDCClient()
|
|
|
|
# Query the primary index (always available)
|
|
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
|
|
|
|
# Fetch and query additional indices
|
|
client.fetch_index("collections_index")
|
|
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
|
|
|
|
client.fetch_index("analysis_results_index")
|
|
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
|
|
```
|
|
|
|
**As pandas DataFrames (direct access):**
|
|
```python
|
|
# Primary index (always available after client initialization)
|
|
df = client.index
|
|
|
|
# Fetch and access on-demand indices
|
|
client.fetch_index("sm_index")
|
|
sm_df = client.sm_index
|
|
```
|
|
|
|
### Discovering Table Schemas (Essential for Query Writing)
|
|
|
|
The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.**
|
|
|
|
**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected.
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
client = IDCClient()
|
|
|
|
# List all available indices with descriptions
|
|
for name, info in client.indices_overview.items():
|
|
print(f"\n{name}:")
|
|
print(f" Installed: {info['installed']}")
|
|
print(f" Description: {info['description']}")
|
|
|
|
# Get complete schema for a specific index (columns, types, descriptions)
|
|
schema = client.indices_overview["index"]["schema"]
|
|
print(f"\nTable: {schema['table_description']}")
|
|
print("\nColumns:")
|
|
for col in schema['columns']:
|
|
desc = col.get('description', 'No description')
|
|
# Description indicates if column is from DICOM attribute
|
|
print(f" {col['name']} ({col['type']}): {desc}")
|
|
|
|
# Find columns that are DICOM attributes (check description for "DICOM" reference)
|
|
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
|
|
print(f"\nDICOM-sourced columns: {dicom_cols}")
|
|
```
|
|
|
|
**Alternative: use `get_index_schema()` method:**
|
|
```python
|
|
schema = client.get_index_schema("index")
|
|
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
|
|
```
|
|
|
|
### Key Columns in Primary `index` Table
|
|
|
|
Most common columns for queries (use `indices_overview` for complete list and descriptions):
|
|
|
|
| Column | Type | DICOM | Description |
|
|
|--------|------|-------|-------------|
|
|
| `collection_id` | STRING | No | IDC collection identifier |
|
|
| `analysis_result_id` | STRING | No | If applicable, indicates what analysis results collection given series is part of |
|
|
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
|
|
| `PatientID` | STRING | Yes | Patient identifier |
|
|
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
|
|
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing |
|
|
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
|
|
| `BodyPartExamined` | STRING | Yes | Anatomical region |
|
|
| `SeriesDescription` | STRING | Yes | Description of the series |
|
|
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
|
|
| `StudyDate` | STRING | Yes | Date study was performed |
|
|
| `PatientSex` | STRING | Yes | Patient sex |
|
|
| `PatientAge` | STRING | Yes | Patient age at time of study |
|
|
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
|
|
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
|
|
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |
|
|
|
|
**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
|
|
|
|
### Clinical Data Access
|
|
|
|
```python
|
|
# Fetch clinical index (also downloads clinical data tables)
|
|
client.fetch_index("clinical_index")
|
|
|
|
# Query clinical index to find available tables and their columns
|
|
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")
|
|
|
|
# Load a specific clinical table as DataFrame
|
|
clinical_df = client.get_clinical_table("table_name")
|
|
```
|
|
|
|
## Data Access Options
|
|
|
|
| Method | Auth Required | Best For |
|
|
|--------|---------------|----------|
|
|
| `idc-index` | No | Key queries and downloads (recommended) |
|
|
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
|
|
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
|
|
| DICOMweb proxy | No | Tool integration via DICOMweb API |
|
|
|
|
**DICOMweb access**
|
|
|
|
IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.
|
|
|
|
| Endpoint | Auth | Use Case |
|
|
|----------|------|----------|
|
|
| Public proxy | No | Testing, moderate queries, daily quota |
|
|
| Google Healthcare | Yes (GCP) | Production use, higher quotas |
|
|
|
|
See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details.
|
|
|
|
## Installation and Setup
|
|
|
|
**Required (for basic access):**
|
|
```bash
|
|
pip install --upgrade idc-index
|
|
```
|
|
|
|
**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility.
|
|
|
|
**Tested with:** idc-index 0.11.7 (IDC data version v23)
|
|
|
|
**Optional (for data analysis):**
|
|
```bash
|
|
pip install pandas numpy pydicom
|
|
```
|
|
|
|
## Core Capabilities
|
|
|
|
### 1. Data Discovery and Exploration
|
|
|
|
Discover what imaging collections and data are available in IDC:
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Get summary statistics from primary index
|
|
query = """
|
|
SELECT
|
|
collection_id,
|
|
COUNT(DISTINCT PatientID) as patients,
|
|
COUNT(DISTINCT SeriesInstanceUID) as series,
|
|
SUM(series_size_MB) as size_mb
|
|
FROM index
|
|
GROUP BY collection_id
|
|
ORDER BY patients DESC
|
|
"""
|
|
collections_summary = client.sql_query(query)
|
|
|
|
# For richer collection metadata, use collections_index
|
|
client.fetch_index("collections_index")
|
|
collections_info = client.sql_query("""
|
|
SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
|
|
FROM collections_index
|
|
""")
|
|
|
|
# For analysis results (annotations, segmentations), use analysis_results_index
|
|
client.fetch_index("analysis_results_index")
|
|
analysis_info = client.sql_query("""
|
|
SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
|
|
FROM analysis_results_index
|
|
""")
|
|
```
|
|
|
|
**`collections_index`** provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types — without needing to aggregate from the primary index.
|
|
|
|
**`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.
|
|
|
|
### 2. Querying Metadata with SQL
|
|
|
|
Query the IDC mini-index using SQL to find specific datasets.
|
|
|
|
**First, explore available values for filter columns:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Check what Modality values exist
|
|
modalities = client.sql_query("""
|
|
SELECT DISTINCT Modality, COUNT(*) as series_count
|
|
FROM index
|
|
GROUP BY Modality
|
|
ORDER BY series_count DESC
|
|
""")
|
|
print(modalities)
|
|
|
|
# Check what BodyPartExamined values exist for MR modality
|
|
body_parts = client.sql_query("""
|
|
SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
|
|
FROM index
|
|
WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
|
|
GROUP BY BodyPartExamined
|
|
ORDER BY series_count DESC
|
|
LIMIT 20
|
|
""")
|
|
print(body_parts)
|
|
```
|
|
|
|
**Then query with validated filter values:**
|
|
```python
|
|
# Find breast MRI scans (use actual values from exploration above)
|
|
results = client.sql_query("""
|
|
SELECT
|
|
collection_id,
|
|
PatientID,
|
|
SeriesInstanceUID,
|
|
Modality,
|
|
SeriesDescription,
|
|
license_short_name
|
|
FROM index
|
|
WHERE Modality = 'MR'
|
|
AND BodyPartExamined = 'BREAST'
|
|
LIMIT 20
|
|
""")
|
|
|
|
# Access results as pandas DataFrame
|
|
for idx, row in results.iterrows():
|
|
print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")
|
|
```
|
|
|
|
**To filter by cancer type, join with `collections_index`:**
|
|
```python
|
|
client.fetch_index("collections_index")
|
|
results = client.sql_query("""
|
|
SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
|
|
FROM index i
|
|
JOIN collections_index c ON i.collection_id = c.collection_id
|
|
WHERE c.CancerTypes LIKE '%Breast%'
|
|
AND i.Modality = 'MR'
|
|
LIMIT 20
|
|
""")
|
|
```
|
|
|
|
**Available metadata fields** (use `client.indices_overview` for complete list):
|
|
- Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
|
|
- Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
|
|
- Clinical: PatientAge, PatientSex, StudyDate
|
|
- Descriptions: StudyDescription, SeriesDescription
|
|
- Licensing: license_short_name
|
|
|
|
**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
|
|
|
|
### 3. Downloading DICOM Files
|
|
|
|
Download imaging data efficiently from IDC's cloud storage:
|
|
|
|
**Download entire collection:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Download small collection (RIDER Pilot ~1GB)
|
|
client.download_from_selection(
|
|
collection_id="rider_pilot",
|
|
downloadDir="./data/rider"
|
|
)
|
|
```
|
|
|
|
**Download specific series:**
|
|
```python
|
|
# First, query for series UIDs
|
|
series_df = client.sql_query("""
|
|
SELECT SeriesInstanceUID
|
|
FROM index
|
|
WHERE Modality = 'CT'
|
|
AND BodyPartExamined = 'CHEST'
|
|
AND collection_id = 'nlst'
|
|
LIMIT 5
|
|
""")
|
|
|
|
# Download only those series
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
|
|
downloadDir="./data/lung_ct"
|
|
)
|
|
```
|
|
|
|
**Custom directory structure:**
|
|
|
|
Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`
|
|
|
|
```python
|
|
# Simplified hierarchy (omit StudyInstanceUID level)
|
|
client.download_from_selection(
|
|
collection_id="tcga_luad",
|
|
downloadDir="./data",
|
|
dirTemplate="%collection_id/%PatientID/%Modality"
|
|
)
|
|
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/
|
|
|
|
# Flat structure (all files in one directory)
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
|
|
downloadDir="./data/flat",
|
|
dirTemplate=""
|
|
)
|
|
# Results in: ./data/flat/*.dcm
|
|
```
|
|
|
|
### Command-Line Download
|
|
|
|
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
|
|
|
|
**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
|
|
|
|
```bash
|
|
# Download entire collection
|
|
idc download rider_pilot --download-dir ./data
|
|
|
|
# Download specific series by UID
|
|
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
|
|
|
|
# Download multiple items (comma-separated)
|
|
idc download "tcga_luad,tcga_lusc" --download-dir ./data
|
|
|
|
# Download from manifest file (auto-detected)
|
|
idc download manifest.txt --download-dir ./data
|
|
```
|
|
|
|
**Options:**
|
|
|
|
| Option | Description |
|
|
|--------|-------------|
|
|
| `--download-dir` | Output directory (default: current directory) |
|
|
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
|
|
| `--log-level` | Verbosity: debug, info, warning, error, critical |
|
|
|
|
**Manifest files:**
|
|
|
|
Manifest files contain S3 URLs (one per line) and can be:
|
|
- Exported from the IDC Portal after cohort selection
|
|
- Shared by collaborators for reproducible data access
|
|
- Generated programmatically from query results
|
|
|
|
Format (one S3 URL per line):
|
|
```
|
|
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
|
|
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
|
|
```
|
|
|
|
**Example: Generate manifest from Python query:**
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Query for series URLs
|
|
results = client.sql_query("""
|
|
SELECT series_aws_url
|
|
FROM index
|
|
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
|
|
""")
|
|
|
|
# Save as manifest file
|
|
with open('ct_manifest.txt', 'w') as f:
|
|
for url in results['series_aws_url']:
|
|
f.write(url + '\n')
|
|
```
|
|
|
|
Then download:
|
|
```bash
|
|
idc download ct_manifest.txt --download-dir ./ct_data
|
|
```
|
|
|
|
### 4. Visualizing IDC Images
|
|
|
|
View DICOM data in browser without downloading:
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
import webbrowser
|
|
|
|
client = IDCClient()
|
|
|
|
# First query to get valid UIDs
|
|
results = client.sql_query("""
|
|
SELECT SeriesInstanceUID, StudyInstanceUID
|
|
FROM index
|
|
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
|
|
LIMIT 1
|
|
""")
|
|
|
|
# View single series
|
|
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
|
|
webbrowser.open(viewer_url)
|
|
|
|
# View all series in a study (useful for multi-series exams like MRI protocols)
|
|
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
|
|
webbrowser.open(viewer_url)
|
|
```
|
|
|
|
The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).
|
|
|
|
### 5. Understanding and Checking Licenses
|
|
|
|
Check data licensing before use (critical for commercial applications):
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Check licenses for all collections
|
|
query = """
|
|
SELECT DISTINCT
|
|
collection_id,
|
|
license_short_name,
|
|
COUNT(DISTINCT SeriesInstanceUID) as series_count
|
|
FROM index
|
|
GROUP BY collection_id, license_short_name
|
|
ORDER BY collection_id
|
|
"""
|
|
|
|
licenses = client.sql_query(query)
|
|
print(licenses)
|
|
```
|
|
|
|
**License types in IDC:**
|
|
- **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution
|
|
- **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only
|
|
- **Custom licenses** (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)
|
|
|
|
**Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
|
|
|
|
### Generating Citations for Attribution
|
|
|
|
The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations:
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Get citations for a collection (APA format by default)
|
|
citations = client.citations_from_selection(collection_id="rider_pilot")
|
|
for citation in citations:
|
|
print(citation)
|
|
|
|
# Get citations for specific series
|
|
results = client.sql_query("""
|
|
SELECT SeriesInstanceUID FROM index
|
|
WHERE collection_id = 'tcga_luad' LIMIT 5
|
|
""")
|
|
citations = client.citations_from_selection(
|
|
seriesInstanceUID=list(results['SeriesInstanceUID'].values)
|
|
)
|
|
|
|
# Alternative format: BibTeX (for LaTeX documents)
|
|
bibtex_citations = client.citations_from_selection(
|
|
collection_id="tcga_luad",
|
|
citation_format=IDCClient.CITATION_FORMAT_BIBTEX
|
|
)
|
|
```
|
|
|
|
**Parameters:**
|
|
- `collection_id`: Filter by collection(s)
|
|
- `patientId`: Filter by patient ID(s)
|
|
- `studyInstanceUID`: Filter by study UID(s)
|
|
- `seriesInstanceUID`: Filter by series UID(s)
|
|
- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants:
|
|
- `CITATION_FORMAT_APA` (default) - APA style
|
|
- `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX
|
|
- `CITATION_FORMAT_JSON` - CSL JSON
|
|
- `CITATION_FORMAT_TURTLE` - RDF Turtle
|
|
|
|
**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.
|
|
|
|
### 6. Batch Processing and Filtering
|
|
|
|
Process large datasets efficiently with filtering:
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
import pandas as pd
|
|
|
|
client = IDCClient()
|
|
|
|
# Find chest CT scans from GE scanners
|
|
query = """
|
|
SELECT
|
|
SeriesInstanceUID,
|
|
PatientID,
|
|
collection_id,
|
|
ManufacturerModelName
|
|
FROM index
|
|
WHERE Modality = 'CT'
|
|
AND BodyPartExamined = 'CHEST'
|
|
AND Manufacturer = 'GE MEDICAL SYSTEMS'
|
|
AND license_short_name = 'CC BY 4.0'
|
|
LIMIT 100
|
|
"""
|
|
|
|
results = client.sql_query(query)
|
|
|
|
# Save manifest for later
|
|
results.to_csv('lung_ct_manifest.csv', index=False)
|
|
|
|
# Download in batches to avoid timeout
|
|
batch_size = 10
|
|
for i in range(0, len(results), batch_size):
|
|
batch = results.iloc[i:i+batch_size]
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
|
|
downloadDir=f"./data/batch_{i//batch_size}"
|
|
)
|
|
```
|
|
|
|
### 7. Advanced Queries with BigQuery
|
|
|
|
For queries requiring full DICOM metadata, complex JOINs, or clinical data tables, use Google BigQuery. Requires GCP account with billing enabled.
|
|
|
|
**Quick reference:**
|
|
- Dataset: `bigquery-public-data.idc_current.*`
|
|
- Main table: `dicom_all` (combined metadata)
|
|
- Full metadata: `dicom_metadata` (all DICOM tags)
|
|
|
|
See `references/bigquery_guide.md` for setup, table schemas, query patterns, and cost optimization.
|
|
|
|
### 8. Tool Selection Guide
|
|
|
|
| Task | Tool | Reference |
|
|
|------|------|-----------|
|
|
| Programmatic queries & downloads | `idc-index` | This document |
|
|
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
|
|
| Complex metadata queries | BigQuery | `references/bigquery_guide.md` |
|
|
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |
|
|
|
|
**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads).
|
|
|
|
### 9. Integration with Analysis Pipelines
|
|
|
|
Integrate IDC data into imaging analysis workflows:
|
|
|
|
**Read downloaded DICOM files:**
|
|
```python
|
|
import pydicom
|
|
import os
|
|
|
|
# Read DICOM files from downloaded series
|
|
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
|
|
|
|
dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
|
|
if f.endswith('.dcm')]
|
|
|
|
# Load first image
|
|
ds = pydicom.dcmread(dicom_files[0])
|
|
print(f"Patient ID: {ds.PatientID}")
|
|
print(f"Modality: {ds.Modality}")
|
|
print(f"Image shape: {ds.pixel_array.shape}")
|
|
```
|
|
|
|
**Build 3D volume from CT series:**
|
|
```python
|
|
import pydicom
|
|
import numpy as np
|
|
from pathlib import Path
|
|
|
|
def load_ct_series(series_path):
|
|
"""Load CT series as 3D numpy array"""
|
|
files = sorted(Path(series_path).glob('*.dcm'))
|
|
slices = [pydicom.dcmread(str(f)) for f in files]
|
|
|
|
# Sort by slice location
|
|
slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
|
|
|
|
# Stack into 3D array
|
|
volume = np.stack([s.pixel_array for s in slices])
|
|
|
|
return volume, slices[0] # Return volume and first slice for metadata
|
|
|
|
volume, metadata = load_ct_series("./data/lung_ct/series_dir")
|
|
print(f"Volume shape: {volume.shape}") # (z, y, x)
|
|
```
|
|
|
|
**Integrate with SimpleITK:**
|
|
```python
|
|
import SimpleITK as sitk
|
|
from pathlib import Path
|
|
|
|
# Read DICOM series
|
|
series_path = "./data/ct_series"
|
|
reader = sitk.ImageSeriesReader()
|
|
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
|
|
reader.SetFileNames(dicom_names)
|
|
image = reader.Execute()
|
|
|
|
# Apply processing
|
|
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
|
|
|
|
# Save as NIfTI
|
|
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
|
|
```
|
|
|
|
## Common Use Cases
|
|
|
|
### Use Case 1: Find and Download Lung CT Scans for Deep Learning
|
|
|
|
**Objective:** Build training dataset of lung CT scans from NLST collection
|
|
|
|
**Steps:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# 1. Query for lung CT scans with specific criteria
|
|
query = """
|
|
SELECT
|
|
PatientID,
|
|
SeriesInstanceUID,
|
|
SeriesDescription
|
|
FROM index
|
|
WHERE collection_id = 'nlst'
|
|
AND Modality = 'CT'
|
|
AND BodyPartExamined = 'CHEST'
|
|
AND license_short_name = 'CC BY 4.0'
|
|
ORDER BY PatientID
|
|
LIMIT 100
|
|
"""
|
|
|
|
results = client.sql_query(query)
|
|
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
|
|
|
|
# 2. Download data organized by patient
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(results['SeriesInstanceUID'].values),
|
|
downloadDir="./training_data",
|
|
dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
|
|
)
|
|
|
|
# 3. Save manifest for reproducibility
|
|
results.to_csv('training_manifest.csv', index=False)
|
|
```
|
|
|
|
### Use Case 2: Query Brain MRI by Manufacturer for Quality Study
|
|
|
|
**Objective:** Compare image quality across different MRI scanner manufacturers
|
|
|
|
**Steps:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
import pandas as pd
|
|
|
|
client = IDCClient()
|
|
|
|
# Query for brain MRI grouped by manufacturer
|
|
query = """
|
|
SELECT
|
|
Manufacturer,
|
|
ManufacturerModelName,
|
|
COUNT(DISTINCT SeriesInstanceUID) as num_series,
|
|
COUNT(DISTINCT PatientID) as num_patients
|
|
FROM index
|
|
WHERE Modality = 'MR'
|
|
AND BodyPartExamined LIKE '%BRAIN%'
|
|
GROUP BY Manufacturer, ManufacturerModelName
|
|
HAVING num_series >= 10
|
|
ORDER BY num_series DESC
|
|
"""
|
|
|
|
manufacturers = client.sql_query(query)
|
|
print(manufacturers)
|
|
|
|
# Download sample from each manufacturer for comparison
|
|
for _, row in manufacturers.head(3).iterrows():
|
|
mfr = row['Manufacturer']
|
|
model = row['ManufacturerModelName']
|
|
|
|
query = f"""
|
|
SELECT SeriesInstanceUID
|
|
FROM index
|
|
WHERE Manufacturer = '{mfr}'
|
|
AND ManufacturerModelName = '{model}'
|
|
AND Modality = 'MR'
|
|
AND BodyPartExamined LIKE '%BRAIN%'
|
|
LIMIT 5
|
|
"""
|
|
|
|
series = client.sql_query(query)
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(series['SeriesInstanceUID'].values),
|
|
downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
|
|
)
|
|
```
|
|
|
|
### Use Case 3: Visualize Series Without Downloading
|
|
|
|
**Objective:** Preview imaging data before committing to download
|
|
|
|
```python
|
|
from idc_index import IDCClient
|
|
import webbrowser
|
|
|
|
client = IDCClient()
|
|
|
|
series_list = client.sql_query("""
|
|
SELECT SeriesInstanceUID, PatientID, SeriesDescription
|
|
FROM index
|
|
WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
|
|
LIMIT 10
|
|
""")
|
|
|
|
# Preview each in browser
|
|
for _, row in series_list.iterrows():
|
|
viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
|
|
print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
|
|
print(f" View at: {viewer_url}")
|
|
# webbrowser.open(viewer_url) # Uncomment to open automatically
|
|
```
|
|
|
|
For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
|
|
|
|
### Use Case 4: License-Aware Batch Download for Commercial Use
|
|
|
|
**Objective:** Download only CC-BY licensed data suitable for commercial applications
|
|
|
|
**Steps:**
|
|
```python
|
|
from idc_index import IDCClient
|
|
|
|
client = IDCClient()
|
|
|
|
# Query ONLY for CC BY licensed data (allows commercial use with attribution)
|
|
query = """
|
|
SELECT
|
|
SeriesInstanceUID,
|
|
collection_id,
|
|
PatientID,
|
|
Modality
|
|
FROM index
|
|
WHERE license_short_name LIKE 'CC BY%'
|
|
AND license_short_name NOT LIKE '%NC%'
|
|
AND Modality IN ('CT', 'MR')
|
|
AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
|
|
LIMIT 200
|
|
"""
|
|
|
|
cc_by_data = client.sql_query(query)
|
|
|
|
print(f"Found {len(cc_by_data)} CC BY licensed series")
|
|
print(f"Collections: {cc_by_data['collection_id'].unique()}")
|
|
|
|
# Download with license verification
|
|
client.download_from_selection(
|
|
seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
|
|
downloadDir="./commercial_dataset",
|
|
dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
|
|
)
|
|
|
|
# Save license information
|
|
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
|
|
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
|
|
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
|
|
- **Use mini-index for simple queries** - Only use BigQuery when you need comprehensive metadata or complex JOINs
|
|
- **Organize downloads with dirTemplate** - Use meaningful directory structures like `%collection_id/%PatientID/%Modality`
|
|
- **Cache query results** - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility
|
|
- **Estimate size first** - Check collection size before downloading - some collection sizes are in terabytes!
|
|
- **Save manifests** - Always save query results with Series UIDs for reproducibility and data provenance
|
|
- **Read documentation** - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/
|
|
- **Use IDC forum** - Search for questons/answers and ask your questions to the IDC maintainers and users at https://discourse.canceridc.dev/
|
|
|
|
## Troubleshooting
|
|
|
|
**Issue: `ModuleNotFoundError: No module named 'idc_index'`**
|
|
- **Cause:** idc-index package not installed
|
|
- **Solution:** Install with `pip install --upgrade idc-index`
|
|
|
|
**Issue: Download fails with connection timeout**
|
|
- **Cause:** Network instability or large download size
|
|
- **Solution:**
|
|
- Download smaller batches (e.g., 10-20 series at a time)
|
|
- Check network connection
|
|
- Use `dirTemplate` to organize downloads by batch
|
|
- Implement retry logic with delays
|
|
|
|
**Issue: `BigQuery quota exceeded` or billing errors**
|
|
- **Cause:** BigQuery requires billing-enabled GCP project
|
|
- **Solution:** Use idc-index mini-index for simple queries (no billing required), or see `references/bigquery_guide.md` for cost optimization tips
|
|
|
|
**Issue: Series UID not found or no data returned**
|
|
- **Cause:** Typo in UID, data not in current IDC version, or wrong field name
|
|
- **Solution:**
|
|
- Check if data is in current IDC version (some old data may be deprecated)
|
|
- Use `LIMIT 5` to test query first
|
|
- Check field names against metadata schema documentation
|
|
|
|
**Issue: Downloaded DICOM files won't open**
|
|
- **Cause:** Corrupted download or incompatible viewer
|
|
- **Solution:**
|
|
- Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
|
|
- Verify file integrity (check file sizes)
|
|
- Use pydicom to validate: `pydicom.dcmread(file, force=True)`
|
|
- Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
|
|
- Re-download the series
|
|
|
|
## Common SQL Query Patterns
|
|
|
|
Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.
|
|
|
|
### Discover available filter values
|
|
```python
|
|
# What modalities exist?
|
|
client.sql_query("SELECT DISTINCT Modality FROM index")
|
|
|
|
# What body parts for a specific modality?
|
|
client.sql_query("""
|
|
SELECT DISTINCT BodyPartExamined, COUNT(*) as n
|
|
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
|
|
GROUP BY BodyPartExamined ORDER BY n DESC
|
|
""")
|
|
|
|
# What manufacturers for MR?
|
|
client.sql_query("""
|
|
SELECT DISTINCT Manufacturer, COUNT(*) as n
|
|
FROM index WHERE Modality = 'MR'
|
|
GROUP BY Manufacturer ORDER BY n DESC
|
|
""")
|
|
```
|
|
|
|
### Find annotations and segmentations
|
|
|
|
**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
|
|
|
|
```python
|
|
# Find ALL segmentations and structure sets by DICOM Modality
|
|
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
|
|
client.sql_query("""
|
|
SELECT collection_id, Modality, COUNT(*) as series_count
|
|
FROM index
|
|
WHERE Modality IN ('SEG', 'RTSTRUCT')
|
|
GROUP BY collection_id, Modality
|
|
ORDER BY series_count DESC
|
|
""")
|
|
|
|
# Find segmentations for a specific collection (includes non-analysis-result items)
|
|
client.sql_query("""
|
|
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
|
|
FROM index
|
|
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
|
|
""")
|
|
|
|
# List analysis result collections (curated derived datasets)
|
|
client.fetch_index("analysis_results_index")
|
|
client.sql_query("""
|
|
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
|
|
FROM analysis_results_index
|
|
""")
|
|
|
|
# Find analysis results for a specific source collection
|
|
client.sql_query("""
|
|
SELECT analysis_result_id, analysis_result_title
|
|
FROM analysis_results_index
|
|
WHERE Collections LIKE '%tcga_luad%'
|
|
""")
|
|
|
|
# Use seg_index for detailed DICOM Segmentation metadata
|
|
client.fetch_index("seg_index")
|
|
|
|
# Get segmentation statistics by algorithm
|
|
client.sql_query("""
|
|
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
|
|
FROM seg_index
|
|
WHERE AlgorithmName IS NOT NULL
|
|
GROUP BY AlgorithmName, AlgorithmType
|
|
ORDER BY seg_count DESC
|
|
LIMIT 10
|
|
""")
|
|
|
|
# Find segmentations for specific source images (e.g., chest CT)
|
|
client.sql_query("""
|
|
SELECT
|
|
s.SeriesInstanceUID as seg_series,
|
|
s.AlgorithmName,
|
|
s.total_segments,
|
|
s.segmented_SeriesInstanceUID as source_series
|
|
FROM seg_index s
|
|
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
|
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
|
|
LIMIT 10
|
|
""")
|
|
|
|
# Find TotalSegmentator results with source image context
|
|
client.sql_query("""
|
|
SELECT
|
|
seg_info.collection_id,
|
|
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
|
|
SUM(s.total_segments) as total_segments
|
|
FROM seg_index s
|
|
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
|
|
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
|
|
GROUP BY seg_info.collection_id
|
|
ORDER BY seg_count DESC
|
|
""")
|
|
```
|
|
|
|
### Query slide microscopy data
|
|
```python
|
|
# sm_index has detailed metadata; join with index for collection_id
|
|
client.fetch_index("sm_index")
|
|
client.sql_query("""
|
|
SELECT i.collection_id, COUNT(*) as slides,
|
|
MIN(s.min_PixelSpacing_2sf) as min_resolution
|
|
FROM sm_index s
|
|
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
|
|
GROUP BY i.collection_id
|
|
ORDER BY slides DESC
|
|
""")
|
|
```
|
|
|
|
### Estimate download size
|
|
```python
|
|
# Size for specific criteria
|
|
client.sql_query("""
|
|
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
|
|
FROM index
|
|
WHERE collection_id = 'nlst' AND Modality = 'CT'
|
|
""")
|
|
```
|
|
|
|
### Link to clinical data
|
|
```python
|
|
client.fetch_index("clinical_index")
|
|
|
|
# Find collections with clinical data and their tables
|
|
client.sql_query("""
|
|
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
|
|
FROM clinical_index
|
|
GROUP BY collection_id, table_name
|
|
ORDER BY collection_id
|
|
""")
|
|
```
|
|
|
|
## Related Skills
|
|
|
|
The following skills complement IDC workflows for downstream analysis and visualization:
|
|
|
|
### DICOM Processing
|
|
- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).
|
|
|
|
### Pathology and Slide Microscopy
|
|
- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
|
|
- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.
|
|
|
|
### Metadata Visualization
|
|
- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
|
|
- **seaborn** - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
|
|
- **plotly** - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.
|
|
|
|
### Data Exploration
|
|
- **exploratory-data-analysis** - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.
|
|
|
|
## Resources
|
|
|
|
### Schema Reference (Primary Source)
|
|
|
|
**Always use `client.indices_overview` for current column schemas.** This ensures accuracy with the installed idc-index version:
|
|
|
|
```python
|
|
# Get all column names and types for any table
|
|
schema = client.indices_overview["index"]["schema"]
|
|
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
|
|
```
|
|
|
|
### Reference Documentation
|
|
|
|
- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries
|
|
- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
|
|
- **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version)
|
|
|
|
### External Links
|
|
|
|
- **IDC Portal**: https://portal.imaging.datacommons.cancer.gov/explore/
|
|
- **Documentation**: https://learn.canceridc.dev/
|
|
- **Tutorials**: https://github.com/ImagingDataCommons/IDC-Tutorials
|
|
- **User Forum**: https://discourse.canceridc.dev/
|
|
- **idc-index GitHub**: https://github.com/ImagingDataCommons/idc-index
|
|
- **Citation**: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180
|