Merge pull request #42 from fedorov/update-idc-skill-to-v1.2.0

Update imaging-data-commons skill to v1.2.0
Committed by Timothy Kassis on 2026-02-05 08:51:06 -08:00 via GitHub
5 changed files with 1012 additions and 17 deletions


@@ -3,7 +3,10 @@ name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
version: 1.2.0
skill-author: Andrey Fedorov, @fedorov
idc-index: "0.11.7"
repository: https://github.com/ImagingDataCommons/idc-claude-skill
---
# Imaging Data Commons
@@ -252,6 +255,8 @@ tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinica
clinical_df = client.get_clinical_table("table_name")
```
See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging.
## Data Access Options
| Method | Auth Required | Best For |
@@ -260,6 +265,21 @@ clinical_df = client.get_clinical_table("table_name")
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |
**Cloud storage organization**
IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.
| Bucket (AWS / GCS) | License | Content |
|--------------------|---------|---------|
| `idc-open-data` / `idc-open-data` | No commercial restriction | >90% of IDC data |
| `idc-open-data-two` / `idc-open-idc1` | No commercial restriction | Collections with potential head scans |
| `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data |
Files are stored as `<crdc_series_uuid>/<crdc_instance_uuid>.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure.
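As an illustration of this layout, the URL conventions can be composed with small helpers (function names are hypothetical; real bucket names and UUIDs come from the index):

```python
def series_s3_url(bucket: str, crdc_series_uuid: str) -> str:
    """S3 URL covering every file in a series folder (series_aws_url style)."""
    return f"s3://{bucket}/{crdc_series_uuid}/*"

def instance_s3_url(bucket: str, crdc_series_uuid: str, crdc_instance_uuid: str) -> str:
    """S3 URL of a single DICOM instance within its series folder."""
    return f"s3://{bucket}/{crdc_series_uuid}/{crdc_instance_uuid}.dcm"
```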
See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning.
**DICOMweb access**
@@ -675,14 +695,15 @@ for i in range(0, len(results), batch_size):
### 7. Advanced Queries with BigQuery
For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled.
**Quick reference:**
- Dataset: `bigquery-public-data.idc_current.*`
- Main table: `dicom_all` (combined metadata)
- Full metadata: `dicom_metadata` (all DICOM tags)
- Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values)
See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
### 8. Tool Selection Guide
@@ -1103,6 +1124,8 @@ client.sql_query("""
""")
```
See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
## Related Skills
The following skills complement IDC workflows for downstream analysis and visualization:
@@ -1136,6 +1159,9 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col
### Reference Documentation
- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data
- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
- **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries
- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
- **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version)
@@ -1148,3 +1174,9 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col
- **User Forum**: https://discourse.canceridc.dev/
- **idc-index GitHub**: https://github.com/ImagingDataCommons/idc-index
- **Citation**: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180
### Skill Updates
The skill version is recorded in the metadata above. To check for updates:
- Visit the [releases page](https://github.com/ImagingDataCommons/idc-claude-skill/releases)
- Watch the repository on GitHub (Watch → Custom → Releases)


@@ -24,6 +24,7 @@ Use BigQuery instead of `idc-index` when you need:
- Complex joins across clinical data tables
- DICOM sequence attributes (nested structures)
- Queries on fields not in the idc-index mini-index
- Private DICOM elements (vendor-specific tags in OtherElements column)
## Accessing IDC in BigQuery
@@ -164,6 +165,190 @@ WHERE src.collection_id = 'qin_prostate_repeatability'
LIMIT 10
```
## Private DICOM Elements
Private DICOM elements are vendor-specific attributes not defined in the DICOM standard. They often contain essential acquisition parameters (like diffusion b-values, gradient directions, or scanner-specific settings) that are critical for image interpretation and analysis.
### Understanding Private Elements
**How private elements work:**
- Private elements use odd-numbered group numbers (e.g., 0019, 0043, 2001)
- Each vendor reserves blocks of 256 elements using Private Creator identifiers at positions (gggg,0010-00FF)
- For example, GE uses Private Creator "GEMS_PARM_01" at (0043,0010) to reserve elements (0043,1000-10FF)
**Standard vs. private tags:** Some parameters exist in both forms:
| Parameter | Standard Tag | GE | Siemens | Philips |
|-----------|--------------|-----|---------|---------|
| Diffusion b-value | (0018,9087) | (0043,1039) | (0019,100C) | (2001,1003) |
| Private Creator | - | GEMS_PARM_01 | SIEMENS CSA HEADER | Philips Imaging |
Older scanners typically populate only private tags; newer scanners may use standard tags. Always check both.
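The check-both pattern can be sketched over a plain tag-to-value mapping (not a DICOM parser; helper name and dict layout are illustrative, with the tags taken from the table above):

```python
# Diffusion b-value tags: the standard tag first, then vendor fallbacks.
B_VALUE_TAGS = [
    "(0018,9087)",  # standard DICOM DiffusionBValue
    "(0043,1039)",  # GE private
    "(0019,100C)",  # Siemens private
    "(2001,1003)",  # Philips private
]

def find_b_value(tags):
    """Return (tag, value) for the first populated b-value tag, standard first."""
    for tag in B_VALUE_TAGS:
        if tag in tags and tags[tag] not in (None, ""):
            return tag, tags[tag]
    return None
```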
**Challenges with private elements:**
- Require manufacturer DICOM Conformance Statements to interpret
- Tag meanings can change between software versions
- May be removed during de-identification for HIPAA compliance
- Value encoding varies (string vs. numeric, different units)
### Accessing Private Elements in BigQuery
Private elements are stored in the `OtherElements` column of `dicom_all` as an array of structs with `Tag` and `Data` fields.
**Tag notation:** DICOM notation (0043,1039) becomes BigQuery format `Tag_00431039`.
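The notation change is mechanical and can be sketched as a small helper (name is illustrative; hex case is preserved as given):

```python
def dicom_to_bq_tag(tag: str) -> str:
    """Convert DICOM tag notation '(gggg,eeee)' to the BigQuery 'Tag_ggggeeee' form."""
    group, element = tag.strip("()").split(",")
    return f"Tag_{group.strip()}{element.strip()}"
```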
### Private Element Query Patterns
#### Discover Available Private Tags
List all non-empty private tags for a collection:
```sql
SELECT
other_elements.Tag,
COUNT(*) AS instance_count,
ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS LIMIT 5) AS sample_values
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
AND Modality = 'MR'
AND ARRAY_LENGTH(other_elements.Data) > 0
AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY other_elements.Tag
ORDER BY instance_count DESC
```
For a specific series:
```sql
SELECT
other_elements.Tag,
ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS) AS values
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE SeriesInstanceUID = '1.3.6.1.4.1.14519.5.2.1.7311.5101.206828891270520544417996275680'
AND ARRAY_LENGTH(other_elements.Data) > 0
AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY other_elements.Tag
```
To identify the Private Creator for a tag, look for the reservation element in the same group. For example, if you find `Tag_00431039`, the Private Creator is at `Tag_00430010` (the tag that reserves block 10xx in group 0043).
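Because the Private Creator sits at (gggg,00bb) where bb is the high byte of the element number, the lookup tag can be derived mechanically (helper name is illustrative):

```python
def private_creator_tag(bq_tag: str) -> str:
    """Given a BigQuery private tag like 'Tag_00431039', return the tag of the
    Private Creator element that reserves its block ('Tag_00430010')."""
    group = bq_tag[4:8]   # 'gggg'
    block = bq_tag[8:10]  # high byte of the element number, e.g. '10'
    return f"Tag_{group}00{block}"
```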
#### Identify Equipment Manufacturer
Determine what equipment produced the data to find the correct DICOM Conformance Statement:
```sql
SELECT DISTINCT Manufacturer, ManufacturerModelName
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'qin_prostate_repeatability'
AND Modality = 'MR'
```
#### Access Private Element Values
Use `UNNEST` to access individual private elements:
```sql
SELECT
SeriesInstanceUID,
SeriesDescription,
other_elements.Data[SAFE_OFFSET(0)] AS b_value
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
AND other_elements.Tag = 'Tag_00431039'
LIMIT 10
```
#### Aggregate Values by Series
Collect all unique values across slices in a series:
```sql
SELECT
SeriesInstanceUID,
ANY_VALUE(SeriesDescription) AS SeriesDescription,
ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)]) AS b_values
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
AND other_elements.Tag = 'Tag_00431039'
GROUP BY SeriesInstanceUID
```
#### Combine Standard and Private Filters
Filter using both standard DICOM attributes and private element values:
```sql
SELECT
PatientID,
SeriesInstanceUID,
ANY_VALUE(SeriesDescription) AS SeriesDescription,
ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)]) AS b_values,
COUNT(DISTINCT SOPInstanceUID) AS n_slices
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
AND Modality = 'MR'
AND other_elements.Tag = 'Tag_00431039'
AND ImageType[SAFE_OFFSET(0)] = 'ORIGINAL'
AND other_elements.Data[SAFE_OFFSET(0)] = '1400'
GROUP BY PatientID, SeriesInstanceUID
ORDER BY PatientID
```
#### Cross-Collection Analysis
Survey usage of a private tag across all IDC collections:
```sql
SELECT
collection_id,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS), ', ') AS values_found,
ARRAY_AGG(DISTINCT Manufacturer IGNORE NULLS) AS manufacturers
FROM `bigquery-public-data.idc_current.dicom_all`,
UNNEST(OtherElements) AS other_elements
WHERE other_elements.Tag = 'Tag_00431039'
AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY collection_id
ORDER BY collection_id
```
### Workflow: Finding and Using Private Tags
1. **Discover available private tags** in your collection using the discovery query above
2. **Identify the manufacturer** to know which conformance statement to consult
3. **Find the DICOM Conformance Statement** from the manufacturer's website (see Resources below)
4. **Search the conformance statement** for the parameter you need (e.g., "b_value", "gradient") to understand what each tag contains
5. **Convert tag to BigQuery format:** (gggg,eeee) → `Tag_ggggeeee`
6. **Query and verify** results visually in the IDC Viewer
### Data Quality Notes
- Some collections show unrealistic values (e.g., b-value "1000000600") indicating encoding issues or different conventions
- IDC data is de-identified; private tags containing PHI may have been removed or modified
- The same tag may have different meanings across software versions
- Always verify query results visually using the [IDC Viewer](https://viewer.imaging.datacommons.cancer.gov/) before large-scale analysis
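A minimal plausibility filter for values returned by the queries above (the cutoff is an assumption you should tune per protocol; clinical diffusion b-values rarely exceed a few thousand s/mm²):

```python
def plausible_b_value(raw, max_b=10000.0):
    """Parse a b-value string; return the float, or None if unparsable or
    outside a plausible range (catches encodings like '1000000600')."""
    try:
        b = float(raw)
    except (TypeError, ValueError):
        return None
    return b if 0 <= b <= max_b else None
```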
### Private Element Resources
**Manufacturer DICOM Conformance Statements:**
- [GE Healthcare MR](https://www.gehealthcare.com/products/interoperability/dicom/magnetic-resonance-imaging-dicom-conformance-statements)
- [Siemens MR](https://www.siemens-healthineers.com/services/it-standards/dicom-conformance-statements-magnetic-resonance)
- [Siemens CT](https://www.siemens-healthineers.com/services/it-standards/dicom-conformance-statements-computed-tomography)
**DICOM Standard:**
- [Part 5 Section 7.8 - Private Data Elements](https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_7.8.html)
- [Part 15 Appendix E - De-identification Profiles](https://dicom.nema.org/medical/dicom/current/output/chtml/part15/chapter_e.html)
**Community Resources:**
- [NAMIC Wiki: DWI/DTI DICOM](https://www.na-mic.org/wiki/NAMIC_Wiki:DTI:DICOM_for_DWI_and_DTI) - comprehensive vendor comparison for diffusion imaging
- [StandardizeBValue](https://github.com/nslay/StandardizeBValue) - tool to extract vendor b-values to standard tags
## Using Query Results with idc-index
Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads):
@@ -220,19 +405,76 @@ print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB")
## Clinical Data
Clinical data is in separate datasets with collection-specific tables. All clinical data available via `idc-index` is also available in BigQuery, with the same content and structure. Use BigQuery when you need complex cross-collection queries or joins that aren't possible with the local `idc-index` tables.
**Datasets:**
- `bigquery-public-data.idc_current_clinical` - current release (for exploration)
- `bigquery-public-data.idc_v{version}_clinical` - versioned datasets (for reproducibility)
Currently there are ~130 clinical tables representing ~70 collections. Not all collections have clinical data (started in IDC v11).
### Clinical Table Naming
Most collections use a single table: `<collection_id>_clinical`
**Exception:** ACRIN collections use multiple tables for different data types (e.g., `acrin_6698_A0`, `acrin_6698_A1`).
### Metadata Tables
Two metadata tables help navigate clinical data:
**table_metadata** - Collection-level information:
```sql
SELECT
collection_id,
table_name,
table_description
FROM `bigquery-public-data.idc_current_clinical.table_metadata`
WHERE collection_id = 'nlst'
```
**column_metadata** - Attribute-level details with value mappings:
```sql
SELECT
collection_id,
table_name,
column,
column_label,
data_type,
values
FROM `bigquery-public-data.idc_current_clinical.column_metadata`
WHERE collection_id = 'nlst'
AND column_label LIKE '%stage%'
```
The `values` field contains observed attribute values with their descriptions (same as in `idc-index` clinical_index).
### Common Clinical Queries
**List available clinical tables:**
```sql
SELECT table_name
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES`
WHERE table_name NOT IN ('table_metadata', 'column_metadata')
```
**Find collections with specific clinical attributes:**
```sql
SELECT DISTINCT collection_id, table_name, column, column_label
FROM `bigquery-public-data.idc_current_clinical.column_metadata`
WHERE LOWER(column_label) LIKE '%chemotherapy%'
```
**Query clinical data for a collection:**
```sql
-- Example: NLST cancer staging data
SELECT
dicom_patient_id,
clinical_stag,
path_stag,
de_stag
FROM `bigquery-public-data.idc_current_clinical.nlst_canc`
WHERE clinical_stag IS NOT NULL
LIMIT 10
```
@@ -240,19 +482,44 @@ LIMIT 10
```sql
SELECT
d.PatientID,
d.SeriesInstanceUID,
d.StudyInstanceUID,
d.Modality,
c.clinical_stag,
c.path_stag
FROM `bigquery-public-data.idc_current.dicom_all` d
JOIN `bigquery-public-data.idc_current_clinical.nlst_canc` c
ON d.PatientID = c.dicom_patient_id
WHERE d.collection_id = 'nlst'
AND d.Modality = 'CT'
AND c.clinical_stag = '400' -- Stage IV
LIMIT 20
```
**Cross-collection clinical search:**
```sql
-- Find all collections with staging information
SELECT
cm.collection_id,
cm.table_name,
cm.column,
cm.column_label
FROM `bigquery-public-data.idc_current_clinical.column_metadata` cm
WHERE LOWER(cm.column_label) LIKE '%stage%'
ORDER BY cm.collection_id
```
### Key Column: dicom_patient_id
Every clinical table includes `dicom_patient_id`, which matches the DICOM `PatientID` attribute in imaging tables. This is the join key between clinical and imaging data.
**Note:** Clinical table schemas vary significantly by collection. Always check available columns first:
```sql
SELECT column_name, data_type
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'nlst_canc'
```
See `references/clinical_data_guide.md` for detailed workflows using `idc-index`, which provides the same clinical data without requiring BigQuery authentication.
## Important Notes


@@ -0,0 +1,272 @@
# idc-index Command Line Interface Guide
The `idc-index` package provides command-line tools for downloading DICOM data from the NCI Imaging Data Commons without writing Python code.
## Installation
```bash
pip install --upgrade idc-index
```
After installation, the `idc` command is available in your terminal.
## Available Commands
| Command | Purpose |
|---------|---------|
| `idc download` | General-purpose download with auto-detection of input type |
| `idc download-from-manifest` | Download from manifest file with validation and progress tracking |
| `idc download-from-selection` | Filter-based download with multiple criteria |
---
## idc download
General-purpose download command that intelligently interprets input. It determines whether the input corresponds to a manifest file path or a list of identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
### Usage
```bash
# Download entire collection
idc download rider_pilot --download-dir ./data
# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data
# Download from manifest file (auto-detected by file extension)
idc download manifest.txt --download-dir ./data
```
### Options
| Option | Description |
|--------|-------------|
| `--download-dir` | Destination directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |
### Directory Template Variables
Use these variables in `--dir-template` to organize downloads:
- `%collection_id` - Collection identifier
- `%PatientID` - Patient identifier
- `%StudyInstanceUID` - Study UID
- `%SeriesInstanceUID` - Series UID
- `%Modality` - Imaging modality (CT, MR, PT, etc.)
**Examples:**
```bash
# Flat structure (all files in one directory)
idc download rider_pilot --download-dir ./data --dir-template ""
# Simplified hierarchy
idc download rider_pilot --download-dir ./data --dir-template "%collection_id/%PatientID/%Modality"
```
---
## idc download-from-manifest
Specialized for downloading from manifest files with built-in validation, progress tracking, and resume capability.
### Usage
```bash
# Basic download from manifest
idc download-from-manifest --manifest-file cohort.txt --download-dir ./data
# With progress bar (manifest validation runs by default)
idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --show-progress-bar
# Resume interrupted download with s5cmd sync
idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync
```
### Options
| Option | Description |
|--------|-------------|
| `--manifest-file` | **Required.** Path to manifest file containing S3 URLs |
| `--download-dir` | **Required.** Destination directory |
| `--validate-manifest` | Validate manifest before download (enabled by default) |
| `--show-progress-bar` | Display download progress |
| `--use-s5cmd-sync` | Enable resumable downloads - skips already-downloaded files |
| `--quiet` | Suppress subprocess output |
| `--dir-template` | Directory hierarchy template |
| `--log-level` | Logging verbosity |
### Manifest File Format
Manifest files contain S3 URLs, one per line:
```
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
```
**How to get a manifest file:**
1. **IDC Portal**: Export cohort selection as manifest
2. **Python query**: Generate from SQL results
```python
from idc_index import IDCClient
client = IDCClient()
results = client.sql_query("""
SELECT series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")
with open('ct_manifest.txt', 'w') as f:
for url in results['series_aws_url']:
f.write(url + '\n')
```
---
## idc download-from-selection
Download data using filter criteria. Filters are applied sequentially.
### Usage
```bash
# Download by collection
idc download-from-selection --collection-id rider_pilot --download-dir ./data
# Download specific series
idc download-from-selection --series-instance-uid "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
# Multiple filters
idc download-from-selection --collection-id nlst --patient-id "100004" --download-dir ./data
# Dry run - see what would be downloaded without actually downloading
idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data
```
### Options
| Option | Description |
|--------|-------------|
| `--download-dir` | **Required.** Destination directory |
| `--collection-id` | Filter by collection identifier |
| `--patient-id` | Filter by patient identifier |
| `--study-instance-uid` | Filter by study UID |
| `--series-instance-uid` | Filter by series UID |
| `--crdc-series-uuid` | Filter by CRDC UUID |
| `--dry-run` | Calculate cohort size without downloading |
| `--show-progress-bar` | Display download progress |
| `--use-s5cmd-sync` | Enable resumable downloads |
| `--dir-template` | Directory hierarchy template |
### Dry Run for Size Estimation
Use `--dry-run` to estimate download size before committing:
```bash
idc download-from-selection --collection-id nlst --dry-run --download-dir ./data
```
This shows:
- Number of series matching filters
- Total download size
- No files are downloaded
---
## Common Workflows
### 1. Download Small Collection for Testing
```bash
# rider_pilot is ~1GB - good for testing
idc download rider_pilot --download-dir ./test_data
```
### 2. Large Dataset with Progress and Resume
```bash
# Use s5cmd sync for large downloads - can resume if interrupted
idc download-from-selection \
--collection-id nlst \
--download-dir ./nlst_data \
--show-progress-bar \
--use-s5cmd-sync
```
### 3. Estimate Size Before Download
```bash
# Check size first
idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data
# Then download if size is acceptable
idc download-from-selection --collection-id tcga_luad --download-dir ./data
```
### 4. Download Specific Modality via Python + CLI
```python
# First, query for series UIDs in Python
from idc_index import IDCClient
client = IDCClient()
results = client.sql_query("""
SELECT SeriesInstanceUID
FROM index
WHERE collection_id = 'nlst'
AND Modality = 'CT'
AND BodyPartExamined = 'CHEST'
LIMIT 50
""")
# Save to manifest
results['SeriesInstanceUID'].to_csv('my_series.csv', index=False, header=False)
```
```bash
# Then download via CLI
idc download my_series.csv --download-dir ./lung_ct
```
---
## Built-in Safety Features
The CLI includes several safety features:
- **Disk space checking**: Verifies sufficient space before starting downloads
- **Manifest validation**: Validates manifest file format by default
- **Progress tracking**: Optional progress bar for monitoring large downloads
- **Resume capability**: Use `--use-s5cmd-sync` to continue interrupted downloads
---
## Troubleshooting
### Download Interrupted
Use `--use-s5cmd-sync` to resume:
```bash
idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync
```
### Connection Timeout
For unstable networks, download in smaller batches using Python to generate multiple manifests, then download sequentially.
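One way to produce those smaller manifests, as a sketch (function name and file-naming scheme are illustrative; the URLs come from an `idc-index` query as shown earlier):

```python
def write_batched_manifests(urls, batch_size=100, prefix="manifest"):
    """Split a list of S3 URLs into numbered manifest files of batch_size
    lines each; returns the list of file names written."""
    names = []
    for i in range(0, len(urls), batch_size):
        name = f"{prefix}_{i // batch_size:03d}.txt"
        with open(name, "w") as f:
            f.write("\n".join(urls[i:i + batch_size]) + "\n")
        names.append(name)
    return names
```

Each resulting file can then be passed to `idc download-from-manifest` in turn.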
---
## See Also
- [idc-index Documentation](https://idc-index.readthedocs.io/)
- [IDC Portal](https://portal.imaging.datacommons.cancer.gov/) - Interactive cohort building
- [IDC Tutorials](https://github.com/ImagingDataCommons/IDC-Tutorials)


@@ -0,0 +1,333 @@
# Cloud Storage Guide for IDC
IDC maintains all DICOM files in public cloud storage buckets mirrored between Google Cloud Storage (GCS) and AWS S3. This guide covers bucket organization, file structure, access methods, and versioning.
## When to Use Direct Cloud Storage Access
Use direct bucket access when you need:
- Maximum download performance with parallel transfers
- Integration with cloud-native workflows (e.g., running analysis on cloud VMs)
- Programmatic access from tools like s5cmd or gsutil
- Access to specific file versions for reproducibility
For most use cases, `idc-index` is simpler and recommended: it uses s5cmd internally to download from these same S3 buckets, handling the UUID lookups automatically. Use direct cloud storage when you need raw file access, custom parallelization, or are building cloud-native pipelines.
## Storage Buckets
IDC organizes data across multiple buckets based on licensing and content type. All buckets are mirrored between AWS and GCS with identical content and file paths.
### Bucket Summary
| Purpose | AWS S3 Bucket | GCS Bucket | License | Content |
|---------|---------------|------------|---------|---------|
| Primary data | `idc-open-data` | `idc-open-data` | No commercial restriction | >90% of IDC data |
| Head scans | `idc-open-data-two` | `idc-open-idc1` | No commercial restriction | Collections potentially containing head imaging |
| Commercial-restricted | `idc-open-data-cr` | `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data |
**Notes:**
- All AWS buckets are in AWS region `us-east-1`
- Prior to IDC v19, GCS used `public-datasets-idc` (now superseded by `idc-open-data`)
- The head scans bucket exists for potential future policy changes regarding facial imaging data
- **Important:** Use `idc-index` to get license information; do not rely on the bucket name!
### Why Multiple Buckets?
1. **Licensing separation**: Data with commercial-use restrictions (CC BY-NC) is isolated in `idc-open-data-cr` / `idc-open-cr` to prevent accidental commercial use
2. **Head scan handling**: Collections labeled by TCIA as potentially containing head scans are in separate buckets (`idc-open-data-two` / `idc-open-idc1`) for potential future policy compliance
3. **Historical reasons**: The bucket structure evolved as IDC grew and partnered with different cloud programs
## File Organization Within Buckets
Files are organized by CRDC UUIDs, not DICOM UIDs. This enables versioning while maintaining consistent paths across cloud providers.
### Directory Structure
```
<bucket>/
└── <crdc_series_uuid>/
├── <crdc_instance_uuid_1>.dcm
├── <crdc_instance_uuid_2>.dcm
└── ...
```
**Example path:**
```
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm
```
- `7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9` = series UUID (folder)
- `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm` = instance UUID (file)
### CRDC UUIDs vs DICOM UIDs
| Identifier Type | Format | Changes When | Use For |
|-----------------|--------|--------------|---------|
| DICOM UID (e.g., SeriesInstanceUID) | Numeric (e.g., `1.3.6.1.4...`) | Never (included in DICOM metadata) | Clinical identification, DICOMweb queries |
| CRDC UUID (e.g., crdc_series_uuid) | UUID (e.g., `e127d258-37c2-...`) | Content changes | File paths, versioning, reproducibility |
**Key insight:** A single DICOM SeriesInstanceUID may have multiple CRDC series UUIDs across IDC versions if the series content changed (instances added/removed, metadata corrected). The CRDC UUID uniquely identifies a specific version of the data.
### Mapping DICOM UIDs to File Paths
Use `idc-index` to get file URLs from DICOM identifiers:
```python
from idc_index import IDCClient
client = IDCClient()
# Get all file URLs for a series
series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302"
urls = client.get_series_file_URLs(seriesInstanceUID=series_uid)
for url in urls[:3]:
print(url)
# Returns S3 URLs like: s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm
```
Or query the index directly for URL columns:
```python
# Get series-level URL (points to folder)
result = client.sql_query("""
SELECT SeriesInstanceUID, series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 3
""")
print(result[['SeriesInstanceUID', 'series_aws_url']])
```
**Available URL column in index:**
- `series_aws_url`: S3 URL to series folder (e.g., `s3://idc-open-data/uuid/*`)
GCS URLs follow the same path structure. For the primary `idc-open-data` bucket the name is identical on both clouds (replace `s3://` with `gs://`); the other buckets have different GCS names (see the bucket table above). When using `idc-index` download methods, GCS access is handled internally.
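Because two of the mirrored buckets have different names on GCS (per the bucket table above), a plain scheme swap is not always enough; a sketch of the translation (helper name is illustrative):

```python
# AWS S3 -> GCS bucket mirrors, per the bucket table above.
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}

def to_gcs_url(s3_url: str) -> str:
    """Translate an IDC S3 URL into its GCS mirror URL, mapping the bucket name."""
    bucket, _, path = s3_url.removeprefix("s3://").partition("/")
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}/{path}"
```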
## Accessing Cloud Storage
All IDC buckets support free egress (no download fees) through partnerships with AWS Open Data and Google Public Data programs. No authentication required.
### AWS S3 Access
**Using AWS CLI (no account required):**
```bash
# List bucket contents
aws s3 ls --no-sign-request s3://idc-open-data/
# List files in a series folder
aws s3 ls --no-sign-request s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/
# Download a single file
aws s3 cp --no-sign-request \
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm \
./local_file.dcm
# Download entire series folder
aws s3 cp --no-sign-request --recursive \
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ \
./series_folder/
```
**Using s5cmd (faster for bulk downloads):**
```bash
# Install s5cmd
# macOS: brew install s5cmd
# Linux: download from https://github.com/peak/s5cmd/releases
# Download specific series
s5cmd --no-sign-request cp 's3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*' ./local_folder/
# Download from manifest file
s5cmd --no-sign-request run manifest.txt
```
**s5cmd manifest format:** The `s5cmd run` command expects one s5cmd command per line, not just URLs:
```
cp s3://idc-open-data/uuid1/instance1.dcm ./local_folder/
cp s3://idc-open-data/uuid1/instance2.dcm ./local_folder/
cp s3://idc-open-data/uuid2/instance3.dcm ./local_folder/
```
IDC Portal exports manifests in this format. When creating manifests programmatically, use `idc-index` download methods (which handle this internally) rather than constructing manifests manually.
### GCS Access
**Using gsutil:**
```bash
# List bucket contents
gsutil ls gs://idc-open-data/
# Download a series folder
gsutil -m cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/
```
**Using gcloud storage (newer CLI):**
```bash
gcloud storage cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/
```
### Python Direct Access
```python
import s3fs
import gcsfs
from idc_index import IDCClient
# First, get a file URL from idc-index
client = IDCClient()
result = client.sql_query("""
SELECT series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 1
""")
# series_aws_url is like: s3://idc-open-data/<uuid>/*
series_url = result['series_aws_url'].iloc[0]
series_path = series_url.replace('s3://', '').rstrip('/*') # e.g., "idc-open-data/<uuid>"
# AWS S3 access
s3 = s3fs.S3FileSystem(anon=True)
files = s3.ls(series_path)
with s3.open(files[0], 'rb') as f:
data = f.read()
# GCS access (same path structure as AWS)
gcs = gcsfs.GCSFileSystem(token='anon')
files = gcs.ls(series_path)
with gcs.open(files[0], 'rb') as f:
data = f.read()
```
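Bytes fetched this way should be DICOM Part 10 files, which begin with a 128-byte preamble followed by the magic bytes `DICM`. A quick sanity check on the downloaded bytes (a sketch; `looks_like_dicom` is a hypothetical helper, not part of idc-index):

```python
def looks_like_dicom(data: bytes) -> bool:
    # DICOM Part 10 files start with a 128-byte preamble, then b"DICM"
    return len(data) >= 132 and data[128:132] == b"DICM"

# e.g., after `data = f.read()` above:
# if not looks_like_dicom(data):
#     raise ValueError("not a DICOM Part 10 file")
```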
## Versioning and Reproducibility
IDC releases new data versions every 2-4 months. The versioning system ensures reproducibility by preserving all historical data.
### How Versioning Works
1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
3. **Cumulative buckets**: All versions coexist in the same buckets; old series folders remain in place when newer versions are released
**Version change scenarios:**
| Change Type | DICOM UID | CRDC UUID | Effect |
|-------------|-----------|-----------|--------|
| New series added | New | New | New folder in bucket |
| Instance added to series | Same | New series UUID | New folder, instances may be duplicated |
| Metadata corrected | Same or new | New | New folder with updated files |
| Series removed | N/A | N/A | Old folder remains, not in current index |
**Data removal caveat:** In rare circumstances (e.g., data owner request, PHI incident), data may be removed from IDC entirely, including from all historical versions.
**BigQuery versioned datasets (metadata only, not file storage):**
For querying version-specific metadata, BigQuery provides versioned datasets:
- `bigquery-public-data.idc_current` — alias to the latest version
- `bigquery-public-data.idc_v23` — specific version (replace 23 with the desired version)

See `bigquery_guide.md` for details.
### Reproducing a Previous Analysis
The simplest way to ensure reproducibility is to save the `crdc_series_uuid` values of the data you use at analysis time:
```python
from idc_index import IDCClient
import json
client = IDCClient()
# Select data for your analysis
selection = client.sql_query("""
SELECT crdc_series_uuid
FROM index
WHERE collection_id = 'tcga_luad'
AND Modality = 'CT'
LIMIT 10
""")
series_uuids = list(selection['crdc_series_uuid'])
# Download the data
client.download_from_selection(crdc_series_uuid=series_uuids, downloadDir="./data")
# Save a manifest for reproducibility
manifest = {
"crdc_series_uuids": series_uuids,
"download_date": "2024-01-15",
"idc_version": client.get_idc_version(),
"description": "CT scans for lung cancer analysis"
}
with open("analysis_manifest.json", "w") as f:
json.dump(manifest, f, indent=2)
# Later, reproduce the exact dataset:
with open("analysis_manifest.json") as f:
manifest = json.load(f)
client.download_from_selection(
    crdc_series_uuid=manifest["crdc_series_uuids"],
    downloadDir="./reproduced_data"
)
```
Since `crdc_series_uuid` identifies an immutable version of each series, saving these UUIDs guarantees you can retrieve the exact same files later.
## Relationship Between Buckets, Versions, and Other Access Methods
### Data Coverage Comparison
| Access Method | Buckets Included | Coverage | Versions |
|---------------|------------------|----------|----------|
| Direct bucket access | All 3 buckets | 100% | All historical |
| `idc-index` download | All 3 buckets | 100% | Current + prior_versions_index |
| IDC Portal | All 3 buckets | 100% | Current only |
| DICOMweb public proxy | All 3 buckets | 100% | Current only |
| Google Healthcare DICOM | `idc-open-data` only | ~96% | Current only |
**Important:** The Google Healthcare API DICOM store only replicates data from `idc-open-data`. Data in `idc-open-data-two` and `idc-open-data-cr` (approximately 4% of total) is not available via Google Healthcare DICOMweb endpoint.
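Given a `series_aws_url` from `idc-index`, the bucket name in the URL tells you whether that series was replicated; a quick check (a sketch, with a hypothetical helper; bucket names taken from the table above):

```python
# Buckets not replicated to the Google Healthcare DICOM store
NOT_IN_GOOGLE_HEALTHCARE = {"idc-open-data-two", "idc-open-data-cr"}

def replicated_to_google_healthcare(series_aws_url: str) -> bool:
    # The bucket name is the first path component after the s3:// scheme
    bucket = series_aws_url.removeprefix("s3://").split("/", 1)[0]
    return bucket not in NOT_IN_GOOGLE_HEALTHCARE

replicated_to_google_healthcare("s3://idc-open-data/some-uuid/*")     # True
replicated_to_google_healthcare("s3://idc-open-data-cr/some-uuid/*")  # False
```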
## Best Practices
- **Use `idc-index` for discovery**: Query metadata first, then access buckets with known UUIDs
- **Downloads default to AWS buckets**: `idc-index` retrieves files from S3 by default; access the GCS mirrors directly if you need data on Google Cloud
- **Save manifests**: Store the `series_aws_url` or `crdc_series_uuid` values for reproducibility
- **Check licenses**: Query `license_short_name` before commercial use; CC-NC data requires non-commercial use
- **Use current version unless reproducing**: The `index` table has current data; use `prior_versions_index` only for exact reproducibility
## Troubleshooting
### Issue: "Access Denied" when accessing buckets
- **Cause:** Using signed requests or wrong bucket name
- **Solution:** Use `--no-sign-request` flag with AWS CLI, or `anon=True` with Python libraries
### Issue: File not found at expected path
- **Cause:** Using DICOM UID instead of CRDC UUID, or data changed in newer version
- **Solution:** Query `idc-index` for current `series_aws_url`, or check `prior_versions_index` for historical paths
### Issue: Downloaded files don't match expected series
- **Cause:** Series was revised in a newer IDC version
- **Solution:** Use `prior_versions_index` to find the exact version you need; compare `crdc_series_uuid` values
### Issue: Some data missing from Google Healthcare DICOMweb
- **Cause:** Google Healthcare only mirrors `idc-open-data` bucket (~96% of data)
- **Solution:** Use IDC public proxy for 100% coverage, or access buckets directly
## Resources
**IDC Documentation:**
- [Files and metadata](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata) - Bucket organization details
- [Data versioning](https://learn.canceridc.dev/data/data-versioning) - Versioning scheme explanation
- [Resolving GUIDs and UUIDs](https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids) - CRDC UUID documentation
- [Direct loading from cloud](https://learn.canceridc.dev/data/downloading-data/direct-loading) - Python examples for cloud access
**AWS Resources:**
- [NCI IDC on AWS Open Data Registry](https://registry.opendata.aws/nci-imaging-data-commons/) - Bucket ARNs and access info
- [s5cmd](https://github.com/peak/s5cmd) - High-performance S3 client (used internally by idc-index)
- [AWS CLI S3 commands](https://docs.aws.amazon.com/cli/latest/reference/s3/) - Standard AWS command-line interface
- [Boto3 S3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html) - AWS SDK for Python
**Google Cloud Resources:**
- [gsutil tool](https://cloud.google.com/storage/docs/gsutil) - Google Cloud Storage command-line tool
- [gcloud storage commands](https://cloud.google.com/sdk/gcloud/reference/storage) - Modern GCS CLI (recommended over gsutil)
- [Google Cloud Storage Python client](https://cloud.google.com/python/docs/reference/storage/latest) - GCS SDK for Python
**Related Guides:**
- `dicomweb_guide.md` - DICOMweb API access (alternative to direct bucket access)
- `bigquery_guide.md` - Advanced metadata queries including versioned datasets

View File

@@ -20,9 +20,12 @@ For most use cases, `idc-index` is simpler and recommended. Use DICOMweb when yo
https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb
```
- **100% data coverage** - Contains all IDC data from all storage buckets
- **Updates immediately** on new IDC releases
- Per-IP daily quota (suitable for testing and moderate use)
- No authentication required
- Read-only access
- Note: "viewer-only-no-downloads" in URL is legacy naming with no functional meaning
### Google Healthcare API (Requires Authentication)
@@ -39,7 +42,81 @@ client = IDCClient()
print(client.get_idc_version()) # e.g., "23" for v23
```
- **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets)
- **Updates 1-2 weeks after** IDC releases
- Requires authentication and provides higher quotas
- Better performance (no proxy routing)
- Each release gets a new versioned store
See [Content Coverage Differences](#content-coverage-differences) and [Authentication](#authentication-for-google-healthcare-api) sections below.
## Content Coverage Differences
**Important:** The two DICOMweb endpoints have different data coverage. The IDC public proxy contains MORE data than the authenticated Google Healthcare endpoint.
### Coverage Summary
| Endpoint | Coverage | Missing Data |
|----------|----------|--------------|
| **IDC Public Proxy** | 100% | None |
| **Google Healthcare API** | ~96% | ~4% (two buckets not replicated) |
### What's Missing from Google Healthcare?
The Google Healthcare DICOM store **only replicates data from the `idc-open-data` S3 bucket**. It does not include data from two additional buckets:
- `idc-open-data-cr`
- `idc-open-data-two`
These missing buckets typically contain several thousand series each, representing approximately 4% of total IDC data. The exact counts vary by IDC version.
See `cloud_storage_guide.md` for details on bucket organization, file structure, and direct access methods.
### Update Timing
- **IDC Public Proxy**: Updates immediately when new IDC versions are released
- **Google Healthcare**: Updates 1-2 weeks after each new IDC version release
Between releases, both endpoints remain current. The 1-2 week delay only occurs during the transition period after a new IDC version is published.
**Warning from IDC documentation:** *"Google-hosted DICOM store may not contain the latest version of IDC data!"* - Check during the weeks following a new release.
### Choosing the Right Endpoint
**Use IDC Public Proxy when:**
- You need complete data coverage (100%)
- You need the absolute latest data immediately after a new version release
- You don't want to set up GCP authentication
- Your usage fits within per-IP quotas (can request increases via support@canceridc.dev)
- You're accessing slide microscopy images frame-by-frame
**Use Google Healthcare API when:**
- The ~4% missing data doesn't affect your use case
- You need higher quotas for heavy usage
- You want better performance (direct access, no proxy routing)
### Checking Your Data Availability
Before choosing an endpoint, verify whether your data might be in the missing buckets:
```python
from idc_index import IDCClient
client = IDCClient()
# Check which buckets contain your collection's data
results = client.sql_query("""
SELECT split_part(series_aws_url, '/', 3) AS bucket, COUNT(*) AS series_count
FROM index
WHERE collection_id = 'your_collection_id'
GROUP BY bucket
""")
print(results)
# If 'idc-open-data-cr' or 'idc-open-data-two' appear in the bucket column,
# that data won't be available via the Google Healthcare endpoint
```
## Implementation Details
@@ -289,8 +366,12 @@ response = requests.get(
- **Solution:** Add delays between requests, reduce `limit` values, or use authenticated endpoint for higher quotas
### Issue: 204 No Content for valid UIDs
- **Cause:** UID may be from an older IDC version not in current data, or data is in buckets not replicated by Google Healthcare
- **Solution:**
- Verify UID exists using `idc-index` query first
- Check if data is in `idc-open-data-cr` or `idc-open-data-two` buckets (not available in Google Healthcare endpoint)
- Switch to IDC public proxy for 100% coverage
- During new version releases, Google Healthcare may lag 1-2 weeks behind
### Issue: Large metadata responses slow to parse
- **Cause:** Series with many instances returns large JSON
@@ -302,7 +383,17 @@ response = requests.get(
## Resources
**IDC Documentation:**
- [IDC DICOM Stores](https://learn.canceridc.dev/data/organization-of-data/dicom-stores) - Data coverage and bucket details
- [IDC DICOMweb Access](https://learn.canceridc.dev/data/downloading-data/dicomweb-access) - Endpoint usage and differences
- [IDC Proxy Policy](https://learn.canceridc.dev/portal/proxy-policy) - Quota policies and usage restrictions
- [IDC User Guide](https://learn.canceridc.dev/) - Complete documentation
**DICOMweb Standards and Tools:**
- [Google Healthcare DICOM Conformance Statement](https://docs.cloud.google.com/healthcare-api/docs/dicom)
- [DICOMweb Standard](https://www.dicomstandard.org/using/dicomweb)
- [dicomweb-client Python library](https://dicomweb-client.readthedocs.io/)
**Related Guides:**
- `cloud_storage_guide.md` - Direct bucket access, file organization, CRDC UUIDs, and versioning
- `bigquery_guide.md` - Advanced metadata queries with full DICOM attributes