From 63801af8e61e1d9a18ba06a1b9697cb1ce1ba66c Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Wed, 4 Feb 2026 14:35:14 -0500 Subject: [PATCH] Update imaging-data-commons skill to v1.2.0 see changes in the changelog upstream: https://github.com/ImagingDataCommons/idc-claude-skill/blob/main/CHANGELOG.md#120---2026-02-04 --- .../imaging-data-commons/SKILL.md | 36 +- .../references/bigquery_guide.md | 287 ++++++++++++++- .../references/cli_guide.md | 272 ++++++++++++++ .../references/cloud_storage_guide.md | 333 ++++++++++++++++++ .../references/dicomweb_guide.md | 101 +++++- 5 files changed, 1012 insertions(+), 17 deletions(-) create mode 100644 scientific-skills/imaging-data-commons/references/cli_guide.md create mode 100644 scientific-skills/imaging-data-commons/references/cloud_storage_guide.md diff --git a/scientific-skills/imaging-data-commons/SKILL.md b/scientific-skills/imaging-data-commons/SKILL.md index 4c2708d..2e65b4b 100644 --- a/scientific-skills/imaging-data-commons/SKILL.md +++ b/scientific-skills/imaging-data-commons/SKILL.md @@ -3,7 +3,10 @@ name: imaging-data-commons description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses. license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data. 
metadata: + version: 1.2.0 skill-author: Andrey Fedorov, @fedorov + idc-index: "0.11.7" + repository: https://github.com/ImagingDataCommons/idc-claude-skill --- # Imaging Data Commons @@ -252,6 +255,8 @@ tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinica clinical_df = client.get_clinical_table("table_name") ``` +See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging. + ## Data Access Options | Method | Auth Required | Best For | @@ -260,6 +265,21 @@ clinical_df = client.get_clinical_table("table_name") | IDC Portal | No | Interactive exploration, manual selection, browser-based download | | BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata | | DICOMweb proxy | No | Tool integration via DICOMweb API | +| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines | + +**Cloud storage organization** + +IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning. + +| Bucket (AWS / GCS) | License | Content | +|--------------------|---------|---------| +| `idc-open-data` / `idc-open-data` | No commercial restriction | >90% of IDC data | +| `idc-open-data-two` / `idc-open-idc1` | No commercial restriction | Collections with potential head scans | +| `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data | + +Files are stored as `/.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure. + +See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning. **DICOMweb access** @@ -675,14 +695,15 @@ for i in range(0, len(results), batch_size): ### 7. 
Advanced Queries with BigQuery -For queries requiring full DICOM metadata, complex JOINs, or clinical data tables, use Google BigQuery. Requires GCP account with billing enabled. +For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled. **Quick reference:** - Dataset: `bigquery-public-data.idc_current.*` - Main table: `dicom_all` (combined metadata) - Full metadata: `dicom_metadata` (all DICOM tags) +- Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values) -See `references/bigquery_guide.md` for setup, table schemas, query patterns, and cost optimization. +See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization. ### 8. Tool Selection Guide @@ -1103,6 +1124,8 @@ client.sql_query(""" """) ``` +See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection. 
+ ## Related Skills The following skills complement IDC workflows for downstream analysis and visualization: @@ -1136,6 +1159,9 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col ### Reference Documentation +- **clinical_data_guide.md** - Clinical/tabular data navigation, value mapping, and joining with imaging data +- **cloud_storage_guide.md** - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility +- **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`) - **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries - **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details - **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version) @@ -1148,3 +1174,9 @@ columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['col - **User Forum**: https://discourse.canceridc.dev/ - **idc-index GitHub**: https://github.com/ImagingDataCommons/idc-index - **Citation**: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180 + +### Skill Updates + +This skill version is available in skill metadata. 
To check for updates: +- Visit the [releases page](https://github.com/ImagingDataCommons/idc-claude-skill/releases) +- Watch the repository on GitHub (Watch → Custom → Releases) diff --git a/scientific-skills/imaging-data-commons/references/bigquery_guide.md b/scientific-skills/imaging-data-commons/references/bigquery_guide.md index 252a3b7..77cdfa6 100644 --- a/scientific-skills/imaging-data-commons/references/bigquery_guide.md +++ b/scientific-skills/imaging-data-commons/references/bigquery_guide.md @@ -24,6 +24,7 @@ Use BigQuery instead of `idc-index` when you need: - Complex joins across clinical data tables - DICOM sequence attributes (nested structures) - Queries on fields not in the idc-index mini-index +- Private DICOM elements (vendor-specific tags in OtherElements column) ## Accessing IDC in BigQuery @@ -164,6 +165,190 @@ WHERE src.collection_id = 'qin_prostate_repeatability' LIMIT 10 ``` +## Private DICOM Elements + +Private DICOM elements are vendor-specific attributes not defined in the DICOM standard. They often contain essential acquisition parameters (like diffusion b-values, gradient directions, or scanner-specific settings) that are critical for image interpretation and analysis. + +### Understanding Private Elements + +**How private elements work:** +- Private elements use odd-numbered group numbers (e.g., 0019, 0043, 2001) +- Each vendor reserves blocks of 256 elements using Private Creator identifiers at positions (gggg,0010-00FF) +- For example, GE uses Private Creator "GEMS_PARM_01" at (0043,0010) to reserve elements (0043,1000-10FF) + +**Standard vs. 
private tags:** Some parameters exist in both forms: +| Parameter | Standard Tag | GE | Siemens | Philips | +|-----------|--------------|-----|---------|---------| +| Diffusion b-value | (0018,9087) | (0043,1039) | (0019,100C) | (2001,1003) | +| Private Creator | - | GEMS_PARM_01 | SIEMENS CSA HEADER | Philips Imaging | + +Older scanners typically populate only private tags; newer scanners may use standard tags. Always check both. + +**Challenges with private elements:** +- Require manufacturer DICOM Conformance Statements to interpret +- Tag meanings can change between software versions +- May be removed during de-identification for HIPAA compliance +- Value encoding varies (string vs. numeric, different units) + +### Accessing Private Elements in BigQuery + +Private elements are stored in the `OtherElements` column of `dicom_all` as an array of structs with `Tag` and `Data` fields. + +**Tag notation:** DICOM notation (0043,1039) becomes BigQuery format `Tag_00431039`. + +### Private Element Query Patterns + +#### Discover Available Private Tags + +List all non-empty private tags for a collection: + +```sql +SELECT + other_elements.Tag, + COUNT(*) AS instance_count, + ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS LIMIT 5) AS sample_values +FROM `bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE collection_id = 'qin_prostate_repeatability' + AND Modality = 'MR' + AND ARRAY_LENGTH(other_elements.Data) > 0 + AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL + AND other_elements.Data[SAFE_OFFSET(0)] != '' +GROUP BY other_elements.Tag +ORDER BY instance_count DESC +``` + +For a specific series: + +```sql +SELECT + other_elements.Tag, + ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS) AS values +FROM `bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE SeriesInstanceUID = '1.3.6.1.4.1.14519.5.2.1.7311.5101.206828891270520544417996275680' + 
AND ARRAY_LENGTH(other_elements.Data) > 0 + AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL + AND other_elements.Data[SAFE_OFFSET(0)] != '' +GROUP BY other_elements.Tag +``` + +To identify the Private Creator for a tag, look for the reservation element in the same group. For example, if you find `Tag_00431039`, the Private Creator is at `Tag_00430010` (the tag that reserves block 10xx in group 0043). + +#### Identify Equipment Manufacturer + +Determine what equipment produced the data to find the correct DICOM Conformance Statement: + +```sql +SELECT DISTINCT Manufacturer, ManufacturerModelName +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE collection_id = 'qin_prostate_repeatability' + AND Modality = 'MR' +``` + +#### Access Private Element Values + +Use `UNNEST` to access individual private elements: + +```sql +SELECT + SeriesInstanceUID, + SeriesDescription, + other_elements.Data[SAFE_OFFSET(0)] AS b_value +FROM `bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE collection_id = 'qin_prostate_repeatability' + AND other_elements.Tag = 'Tag_00431039' +LIMIT 10 +``` + +#### Aggregate Values by Series + +Collect all unique values across slices in a series: + +```sql +SELECT + SeriesInstanceUID, + ANY_VALUE(SeriesDescription) AS SeriesDescription, + ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)]) AS b_values +FROM `bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE collection_id = 'qin_prostate_repeatability' + AND other_elements.Tag = 'Tag_00431039' +GROUP BY SeriesInstanceUID +``` + +#### Combine Standard and Private Filters + +Filter using both standard DICOM attributes and private element values: + +```sql +SELECT + PatientID, + SeriesInstanceUID, + ANY_VALUE(SeriesDescription) AS SeriesDescription, + ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)]) AS b_values, + COUNT(DISTINCT SOPInstanceUID) AS n_slices +FROM 
`bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE collection_id = 'qin_prostate_repeatability' + AND Modality = 'MR' + AND other_elements.Tag = 'Tag_00431039' + AND ImageType[SAFE_OFFSET(0)] = 'ORIGINAL' + AND other_elements.Data[SAFE_OFFSET(0)] = '1400' +GROUP BY PatientID, SeriesInstanceUID +ORDER BY PatientID +``` + +#### Cross-Collection Analysis + +Survey usage of a private tag across all IDC collections: + +```sql +SELECT + collection_id, + ARRAY_TO_STRING(ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS), ', ') AS values_found, + ARRAY_AGG(DISTINCT Manufacturer IGNORE NULLS) AS manufacturers +FROM `bigquery-public-data.idc_current.dicom_all`, + UNNEST(OtherElements) AS other_elements +WHERE other_elements.Tag = 'Tag_00431039' + AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL + AND other_elements.Data[SAFE_OFFSET(0)] != '' +GROUP BY collection_id +ORDER BY collection_id +``` + +### Workflow: Finding and Using Private Tags + +1. **Discover available private tags** in your collection using the discovery query above +2. **Identify the manufacturer** to know which conformance statement to consult +3. **Find the DICOM Conformance Statement** from the manufacturer's website (see Resources below) +4. **Search the conformance statement** for the parameter you need (e.g., "b_value", "gradient") to understand what each tag contains +5. **Convert tag to BigQuery format:** (gggg,eeee) → `Tag_ggggeeee` +6. 
**Query and verify** results visually in the IDC Viewer + +### Data Quality Notes + +- Some collections show unrealistic values (e.g., b-value "1000000600") indicating encoding issues or different conventions +- IDC data is de-identified; private tags containing PHI may have been removed or modified +- The same tag may have different meanings across software versions +- Always verify query results visually using the [IDC Viewer](https://viewer.imaging.datacommons.cancer.gov/) before large-scale analysis + +### Private Element Resources + +**Manufacturer DICOM Conformance Statements:** +- [GE Healthcare MR](https://www.gehealthcare.com/products/interoperability/dicom/magnetic-resonance-imaging-dicom-conformance-statements) +- [Siemens MR](https://www.siemens-healthineers.com/services/it-standards/dicom-conformance-statements-magnetic-resonance) +- [Siemens CT](https://www.siemens-healthineers.com/services/it-standards/dicom-conformance-statements-computed-tomography) + +**DICOM Standard:** +- [Part 5 Section 7.8 - Private Data Elements](https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_7.8.html) +- [Part 15 Appendix E - De-identification Profiles](https://dicom.nema.org/medical/dicom/current/output/chtml/part15/chapter_e.html) + +**Community Resources:** +- [NAMIC Wiki: DWI/DTI DICOM](https://www.na-mic.org/wiki/NAMIC_Wiki:DTI:DICOM_for_DWI_and_DTI) - comprehensive vendor comparison for diffusion imaging +- [StandardizeBValue](https://github.com/nslay/StandardizeBValue) - tool to extract vendor b-values to standard tags + ## Using Query Results with idc-index Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads): @@ -220,19 +405,76 @@ print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB") ## Clinical Data -Clinical data is in separate datasets with collection-specific tables. Not all collections have clinical data (started in IDC v11). 
+Clinical data is in separate datasets with collection-specific tables. All clinical data available via `idc-index` is also available in BigQuery, with the same content and structure. Use BigQuery when you need complex cross-collection queries or joins that aren't possible with the local `idc-index` tables. + +**Datasets:** +- `bigquery-public-data.idc_current_clinical` - current release (for exploration) +- `bigquery-public-data.idc_v{version}_clinical` - versioned datasets (for reproducibility) + +Currently there are ~130 clinical tables representing ~70 collections. Not all collections have clinical data (started in IDC v11). + +### Clinical Table Naming + +Most collections use a single table: `_clinical` + +**Exception:** ACRIN collections use multiple tables for different data types (e.g., `acrin_6698_A0`, `acrin_6698_A1`, etc.). + +### Metadata Tables + +Two metadata tables help navigate clinical data: + +**table_metadata** - Collection-level information: +```sql +SELECT + collection_id, + table_name, + table_description +FROM `bigquery-public-data.idc_current_clinical.table_metadata` +WHERE collection_id = 'nlst' +``` + +**column_metadata** - Attribute-level details with value mappings: +```sql +SELECT + collection_id, + table_name, + column, + column_label, + data_type, + values +FROM `bigquery-public-data.idc_current_clinical.column_metadata` +WHERE collection_id = 'nlst' + AND column_label LIKE '%stage%' +``` + +The `values` field contains observed attribute values with their descriptions (same as in `idc-index` clinical_index). 
+ +### Common Clinical Queries **List available clinical tables:** ```sql SELECT table_name FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES` +WHERE table_name NOT IN ('table_metadata', 'column_metadata') +``` + +**Find collections with specific clinical attributes:** +```sql +SELECT DISTINCT collection_id, table_name, column, column_label +FROM `bigquery-public-data.idc_current_clinical.column_metadata` +WHERE LOWER(column_label) LIKE '%chemotherapy%' ``` **Query clinical data for a collection:** ```sql --- Example: TCGA-LUAD clinical data -SELECT * -FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` +-- Example: NLST cancer staging data +SELECT + dicom_patient_id, + clinical_stag, + path_stag, + de_stag +FROM `bigquery-public-data.idc_current_clinical.nlst_canc` +WHERE clinical_stag IS NOT NULL LIMIT 10 ``` @@ -240,19 +482,44 @@ LIMIT 10 ```sql SELECT d.PatientID, - d.SeriesInstanceUID, + d.StudyInstanceUID, d.Modality, - c.age_at_diagnosis, - c.pathologic_stage + c.clinical_stag, + c.path_stag FROM `bigquery-public-data.idc_current.dicom_all` d -JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c +JOIN `bigquery-public-data.idc_current_clinical.nlst_canc` c ON d.PatientID = c.dicom_patient_id -WHERE d.collection_id = 'tcga_luad' +WHERE d.collection_id = 'nlst' AND d.Modality = 'CT' + AND c.clinical_stag = '400' -- Stage IV LIMIT 20 ``` -**Note:** Clinical table schemas vary by collection. Check column names with `INFORMATION_SCHEMA.COLUMNS` before querying. 
+**Cross-collection clinical search:** +```sql +-- Find all collections with staging information +SELECT + cm.collection_id, + cm.table_name, + cm.column, + cm.column_label +FROM `bigquery-public-data.idc_current_clinical.column_metadata` cm +WHERE LOWER(cm.column_label) LIKE '%stage%' +ORDER BY cm.collection_id +``` + +### Key Column: dicom_patient_id + +Every clinical table includes `dicom_patient_id`, which matches the DICOM `PatientID` attribute in imaging tables. This is the join key between clinical and imaging data. + +**Note:** Clinical table schemas vary significantly by collection. Always check available columns first: +```sql +SELECT column_name, data_type +FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.COLUMNS` +WHERE table_name = 'nlst_canc' +``` + +See `references/clinical_data_guide.md` for detailed workflows using `idc-index`, which provides the same clinical data without requiring BigQuery authentication. ## Important Notes diff --git a/scientific-skills/imaging-data-commons/references/cli_guide.md b/scientific-skills/imaging-data-commons/references/cli_guide.md new file mode 100644 index 0000000..448d104 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/cli_guide.md @@ -0,0 +1,272 @@ +# idc-index Command Line Interface Guide + +The `idc-index` package provides command-line tools for downloading DICOM data from the NCI Imaging Data Commons without writing Python code. + +## Installation + +```bash +pip install --upgrade idc-index +``` + +After installation, the `idc` command is available in your terminal. 
+ +## Available Commands + +| Command | Purpose | +|---------|---------| +| `idc download` | General-purpose download with auto-detection of input type | +| `idc download-from-manifest` | Download from manifest file with validation and progress tracking | +| `idc download-from-selection` | Filter-based download with multiple criteria | + +--- + +## idc download + +General-purpose download command that intelligently interprets input. It determines whether the input corresponds to a manifest file path or a list of identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid). + +### Usage + +```bash +# Download entire collection +idc download rider_pilot --download-dir ./data + +# Download specific series by UID +idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data + +# Download multiple items (comma-separated) +idc download "tcga_luad,tcga_lusc" --download-dir ./data + +# Download from manifest file (auto-detected by file extension) +idc download manifest.txt --download-dir ./data +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--download-dir` | Destination directory (default: current directory) | +| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) | +| `--log-level` | Verbosity: debug, info, warning, error, critical | + +### Directory Template Variables + +Use these variables in `--dir-template` to organize downloads: + +- `%collection_id` - Collection identifier +- `%PatientID` - Patient identifier +- `%StudyInstanceUID` - Study UID +- `%SeriesInstanceUID` - Series UID +- `%Modality` - Imaging modality (CT, MR, PT, etc.) 
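As a rough illustration of how these variables turn into a directory path, the sketch below expands a template against the metadata of one series. It is not the `idc-index` implementation: `expand_dir_template` and the sample metadata values are hypothetical, and only the variable names and the default template come from the tables above.

```python
# Illustrative only: NOT idc-index's code, just a sketch of how the documented
# %variables could map one series' metadata onto a directory path.
TEMPLATE_VARIABLES = (
    "collection_id",
    "PatientID",
    "StudyInstanceUID",
    "SeriesInstanceUID",
    "Modality",
)


def expand_dir_template(template: str, metadata: dict) -> str:
    """Replace each %variable that appears in the template with its value."""
    path = template
    # Substitute longer names first so no name is mistaken for a prefix of another.
    for name in sorted(TEMPLATE_VARIABLES, key=len, reverse=True):
        placeholder = f"%{name}"
        if placeholder in path:
            path = path.replace(placeholder, str(metadata[name]))
    return path


# Hypothetical metadata for a single series (values are made up).
example_metadata = {
    "collection_id": "rider_pilot",
    "PatientID": "PAT-001",
    "StudyInstanceUID": "1.2.3",
    "SeriesInstanceUID": "1.2.3.4",
    "Modality": "CT",
}

# Default template as documented in the options table.
default_template = (
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"
)

print(expand_dir_template(default_template, example_metadata))
print(expand_dir_template("%collection_id/%PatientID/%Modality", example_metadata))
```

An empty template expands to an empty string, which corresponds to the flat layout where all files land directly in the download directory.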
+ +**Examples:** + +```bash +# Flat structure (all files in one directory) +idc download rider_pilot --download-dir ./data --dir-template "" + +# Simplified hierarchy +idc download rider_pilot --download-dir ./data --dir-template "%collection_id/%PatientID/%Modality" +``` + +--- + +## idc download-from-manifest + +Specialized for downloading from manifest files with built-in validation, progress tracking, and resume capability. + +### Usage + +```bash +# Basic download from manifest +idc download-from-manifest --manifest-file cohort.txt --download-dir ./data + +# With progress bar and validation +idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --show-progress-bar + +# Resume interrupted download with s5cmd sync +idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--manifest-file` | **Required.** Path to manifest file containing S3 URLs | +| `--download-dir` | **Required.** Destination directory | +| `--validate-manifest` | Validate manifest before download (enabled by default) | +| `--show-progress-bar` | Display download progress | +| `--use-s5cmd-sync` | Enable resumable downloads - skips already-downloaded files | +| `--quiet` | Suppress subprocess output | +| `--dir-template` | Directory hierarchy template | +| `--log-level` | Logging verbosity | + +### Manifest File Format + +Manifest files contain S3 URLs, one per line: + +``` +s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/* +s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/* +``` + +**How to get a manifest file:** + +1. **IDC Portal**: Export cohort selection as manifest +2. 
**Python query**: Generate from SQL results + +```python +from idc_index import IDCClient + +client = IDCClient() +results = client.sql_query(""" + SELECT series_aws_url + FROM index + WHERE collection_id = 'rider_pilot' AND Modality = 'CT' +""") + +with open('ct_manifest.txt', 'w') as f: + for url in results['series_aws_url']: + f.write(url + '\n') +``` + +--- + +## idc download-from-selection + +Download data using filter criteria. Filters are applied sequentially. + +### Usage + +```bash +# Download by collection +idc download-from-selection --collection-id rider_pilot --download-dir ./data + +# Download specific series +idc download-from-selection --series-instance-uid "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data + +# Multiple filters +idc download-from-selection --collection-id nlst --patient-id "100004" --download-dir ./data + +# Dry run - see what would be downloaded without actually downloading +idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--download-dir` | **Required.** Destination directory | +| `--collection-id` | Filter by collection identifier | +| `--patient-id` | Filter by patient identifier | +| `--study-instance-uid` | Filter by study UID | +| `--series-instance-uid` | Filter by series UID | +| `--crdc-series-uuid` | Filter by CRDC UUID | +| `--dry-run` | Calculate cohort size without downloading | +| `--show-progress-bar` | Display download progress | +| `--use-s5cmd-sync` | Enable resumable downloads | +| `--dir-template` | Directory hierarchy template | + +### Dry Run for Size Estimation + +Use `--dry-run` to estimate download size before committing: + +```bash +idc download-from-selection --collection-id nlst --dry-run --download-dir ./data +``` + +This shows: +- Number of series matching filters +- Total download size +- No files are downloaded + +--- + +## Common Workflows + +### 1. 
Download Small Collection for Testing + +```bash +# rider_pilot is ~1GB - good for testing +idc download rider_pilot --download-dir ./test_data +``` + +### 2. Large Dataset with Progress and Resume + +```bash +# Use s5cmd sync for large downloads - can resume if interrupted +idc download-from-selection \ + --collection-id nlst \ + --download-dir ./nlst_data \ + --show-progress-bar \ + --use-s5cmd-sync +``` + +### 3. Estimate Size Before Download + +```bash +# Check size first +idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data + +# Then download if size is acceptable +idc download-from-selection --collection-id tcga_luad --download-dir ./data +``` + +### 4. Download Specific Modality via Python + CLI + +```python +# First, query for series UIDs in Python +from idc_index import IDCClient + +client = IDCClient() +results = client.sql_query(""" + SELECT SeriesInstanceUID + FROM index + WHERE collection_id = 'nlst' + AND Modality = 'CT' + AND BodyPartExamined = 'CHEST' + LIMIT 50 +""") + +# Save to manifest +results['SeriesInstanceUID'].to_csv('my_series.csv', index=False, header=False) +``` + +```bash +# Then download via CLI +idc download my_series.csv --download-dir ./lung_ct +``` + +--- + +## Built-in Safety Features + +The CLI includes several safety features: + +- **Disk space checking**: Verifies sufficient space before starting downloads +- **Manifest validation**: Validates manifest file format by default +- **Progress tracking**: Optional progress bar for monitoring large downloads +- **Resume capability**: Use `--use-s5cmd-sync` to continue interrupted downloads + +--- + +## Troubleshooting + +### Download Interrupted + +Use `--use-s5cmd-sync` to resume: + +```bash +idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync +``` + +### Connection Timeout + +For unstable networks, download in smaller batches using Python to generate multiple manifests, then download sequentially. 
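The batching advice above could look something like the following sketch, which splits a list of series URLs into several smaller manifest files. The helper name `write_manifest_batches`, the file naming scheme, and the batch size are assumptions made for illustration; the URLs would normally come from the `series_aws_url` column returned by `client.sql_query(...)`.

```python
import tempfile
from pathlib import Path


def write_manifest_batches(urls, out_dir, batch_size=100, prefix="manifest_part"):
    """Write the URLs into numbered manifest files of at most batch_size lines each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    manifest_paths = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        manifest_file = out_path / f"{prefix}_{start // batch_size:03d}.txt"
        # One S3 URL per line, matching the manifest format described earlier.
        manifest_file.write_text("\n".join(batch) + "\n")
        manifest_paths.append(manifest_file)
    return manifest_paths


# Example series URLs copied from the samples in these guides.
example_urls = [
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*",
    "s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*",
    "s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*",
]

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    for path in write_manifest_batches(example_urls, tmp_dir, batch_size=2):
        print(path.name, len(path.read_text().splitlines()), "series")
```

Each resulting file can then be passed in turn to `idc download-from-manifest --manifest-file <file> --download-dir ./data --use-s5cmd-sync`, so that an interrupted batch can be rerun without re-downloading completed files.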
+ +--- + +## See Also + +- [idc-index Documentation](https://idc-index.readthedocs.io/) +- [IDC Portal](https://portal.imaging.datacommons.cancer.gov/) - Interactive cohort building +- [IDC Tutorials](https://github.com/ImagingDataCommons/IDC-Tutorials) diff --git a/scientific-skills/imaging-data-commons/references/cloud_storage_guide.md b/scientific-skills/imaging-data-commons/references/cloud_storage_guide.md new file mode 100644 index 0000000..d15e677 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/cloud_storage_guide.md @@ -0,0 +1,333 @@ +# Cloud Storage Guide for IDC + +IDC maintains all DICOM files in public cloud storage buckets mirrored between Google Cloud Storage (GCS) and AWS S3. This guide covers bucket organization, file structure, access methods, and versioning. + +## When to Use Direct Cloud Storage Access + +Use direct bucket access when you need: +- Maximum download performance with parallel transfers +- Integration with cloud-native workflows (e.g., running analysis on cloud VMs) +- Programmatic access from tools like s5cmd or gsutil +- Access to specific file versions for reproducibility + +For most use cases, `idc-index` is simpler and recommended: it uses s5cmd internally to download from these same S3 buckets, handling the UUID lookups automatically. Use direct cloud storage when you need raw file access, custom parallelization, or are building cloud-native pipelines. + +## Storage Buckets + +IDC organizes data across multiple buckets based on licensing and content type. All buckets are mirrored between AWS and GCS with identical content and file paths. 
+ +### Bucket Summary + +| Purpose | AWS S3 Bucket | GCS Bucket | License | Content | +|---------|---------------|------------|---------|---------| +| Primary data | `idc-open-data` | `idc-open-data` | No commercial restriction | >90% of IDC data | +| Head scans | `idc-open-data-two` | `idc-open-idc1` | No commercial restriction | Collections potentially containing head imaging | +| Commercial-restricted | `idc-open-data-cr` | `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data | + +**Notes:** +- All AWS buckets are in AWS region `us-east-1` +- Prior to IDC v19, GCS used `public-datasets-idc` (now superseded by `idc-open-data`) +- The head scans bucket exists for potential future policy changes regarding facial imaging data +- **Important** Use `idc-index` to get license information - do not rely on bucket name! + +### Why Multiple Buckets? + +1. **Licensing separation**: Data with commercial-use restrictions (CC BY-NC) is isolated in `idc-open-data-cr` / `idc-open-cr` to prevent accidental commercial use +2. **Head scan handling**: Collections labeled by TCIA as potentially containing head scans are in separate buckets (`idc-open-data-two` / `idc-open-idc1`) for potential future policy compliance +3. **Historical reasons**: The bucket structure evolved as IDC grew and partnered with different cloud programs + +## File Organization Within Buckets + +Files are organized by CRDC UUIDs, not DICOM UIDs. This enables versioning while maintaining consistent paths across cloud providers. + +### Directory Structure + +``` +/ +└── / + ├── .dcm + ├── .dcm + └── ... 
+``` + +**Example path:** +``` +s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm +``` + +- `7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9` = series UUID (folder) +- `0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm` = instance UUID (file) + +### CRDC UUIDs vs DICOM UIDs + +| Identifier Type | Format | Changes When | Use For | +|-----------------|--------|--------------|---------| +| DICOM UID (e.g., SeriesInstanceUID) | Numeric (e.g., `1.3.6.1.4...`) | Never (included in DICOM metadata) | Clinical identification, DICOMweb queries | +| CRDC UUID (e.g., crdc_series_uuid) | UUID (e.g., `e127d258-37c2-...`) | Content changes | File paths, versioning, reproducibility | + +**Key insight:** A single DICOM SeriesInstanceUID may have multiple CRDC series UUIDs across IDC versions if the series content changed (instances added/removed, metadata corrected). The CRDC UUID uniquely identifies a specific version of the data. + +### Mapping DICOM UIDs to File Paths + +Use `idc-index` to get file URLs from DICOM identifiers: + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Get all file URLs for a series +series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302" +urls = client.get_series_file_URLs(seriesInstanceUID=series_uid) + +for url in urls[:3]: + print(url) +# Returns S3 URLs like: s3://idc-open-data//.dcm +``` + +Or query the index directly for URL columns: + +```python +# Get series-level URL (points to folder) +result = client.sql_query(""" + SELECT SeriesInstanceUID, series_aws_url + FROM index + WHERE collection_id = 'rider_pilot' AND Modality = 'CT' + LIMIT 3 +""") + +print(result[['SeriesInstanceUID', 'series_aws_url']]) +``` + +**Available URL column in index:** +- `series_aws_url`: S3 URL to series folder (e.g., `s3://idc-open-data/uuid/*`) + +GCS URLs follow the same path structure: replace `s3://` with `gs://` and, for the two buckets whose names differ between AWS and GCS, substitute the GCS bucket name listed in the Bucket Summary (e.g., `gs://idc-open-data/uuid/*`). 
When using `idc-index` download methods, GCS access is handled internally. + +## Accessing Cloud Storage + +All IDC buckets support free egress (no download fees) through partnerships with AWS Open Data and Google Public Data programs. No authentication required. + +### AWS S3 Access + +**Using AWS CLI (no account required):** +```bash +# List bucket contents +aws s3 ls --no-sign-request s3://idc-open-data/ + +# List files in a series folder +aws s3 ls --no-sign-request s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ + +# Download a single file +aws s3 cp --no-sign-request \ + s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm \ + ./local_file.dcm + +# Download entire series folder +aws s3 cp --no-sign-request --recursive \ + s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ \ + ./series_folder/ +``` + +**Using s5cmd (faster for bulk downloads):** +```bash +# Install s5cmd +# macOS: brew install s5cmd +# Linux: download from https://github.com/peak/s5cmd/releases + +# Download specific series +s5cmd --no-sign-request cp 's3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*' ./local_folder/ + +# Download from manifest file +s5cmd --no-sign-request run manifest.txt +``` + +**s5cmd manifest format:** The `s5cmd run` command expects one s5cmd command per line, not just URLs: +``` +cp s3://idc-open-data/uuid1/instance1.dcm ./local_folder/ +cp s3://idc-open-data/uuid1/instance2.dcm ./local_folder/ +cp s3://idc-open-data/uuid2/instance3.dcm ./local_folder/ +``` + +IDC Portal exports manifests in this format. When creating manifests programmatically, use `idc-index` download methods (which handle this internally) rather than constructing manifests manually. 
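The manifest format described above can be illustrated with a minimal sketch. It assumes series-level URLs shaped like the `series_aws_url` values (`s3://<bucket>/<series_uuid>/*`). The helper name `build_s5cmd_manifest` and the per-series destination layout are hypothetical and not part of `idc-index` or s5cmd; the `idc-index` download methods remain the recommended route.

```python
def build_s5cmd_manifest(series_urls, dest_root="./local_folder"):
    """Turn series-level S3 URLs into one `s5cmd run` command per line.

    Hypothetical helper for illustration only; not part of idc-index or s5cmd.
    """
    lines = []
    for url in series_urls:
        if not url.startswith("s3://"):
            raise ValueError(f"not an S3 URL: {url}")
        # Drop a trailing wildcard and slash so only the series folder remains.
        folder = url.removesuffix("*").rstrip("/")
        # The last path segment is the CRDC series UUID (the folder name).
        series_uuid = folder.rsplit("/", 1)[-1]
        # Each manifest line is a full s5cmd command, not a bare URL.
        lines.append(f"cp {folder}/* {dest_root.rstrip('/')}/{series_uuid}/")
    return lines


if __name__ == "__main__":
    urls = ["s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*"]
    print("\n".join(build_s5cmd_manifest(urls)))
```

Writing the returned lines to `manifest.txt` gives a file in the shape that `s5cmd --no-sign-request run manifest.txt` expects. Requires Python 3.9+ for `str.removesuffix`.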
+ +### GCS Access + +**Using gsutil:** +```bash +# List bucket contents +gsutil ls gs://idc-open-data/ + +# Download a series folder +gsutil -m cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/ +``` + +**Using gcloud storage (newer CLI):** +```bash +gcloud storage cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/ +``` + +### Python Direct Access + +```python +import s3fs +import gcsfs +from idc_index import IDCClient + +# First, get a file URL from idc-index +client = IDCClient() +result = client.sql_query(""" + SELECT series_aws_url + FROM index + WHERE collection_id = 'rider_pilot' AND Modality = 'CT' + LIMIT 1 +""") +# series_aws_url is like: s3://idc-open-data//* +series_url = result['series_aws_url'].iloc[0] +series_path = series_url.replace('s3://', '').rstrip('/*') # e.g., "idc-open-data/" + +# AWS S3 access +s3 = s3fs.S3FileSystem(anon=True) +files = s3.ls(series_path) +with s3.open(files[0], 'rb') as f: + data = f.read() + +# GCS access (same path structure as AWS) +gcs = gcsfs.GCSFileSystem(token='anon') +files = gcs.ls(series_path) +with gcs.open(files[0], 'rb') as f: + data = f.read() +``` + +## Versioning and Reproducibility + +IDC releases new data versions every 2-4 months. The versioning system ensures reproducibility by preserving all historical data. + +### How Versioning Works + +1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time +2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible +3. 
**Cumulative buckets**: All versions coexist in the same buckets; old series folders remain alongside newer ones + +**Version change scenarios:** +| Change Type | DICOM UID | CRDC UUID | Effect | +|-------------|-----------|-----------|--------| +| New series added | New | New | New folder in bucket | +| Instance added to series | Same | New series UUID | New folder, instances may be duplicated | +| Metadata corrected | Same or new | New | New folder with updated files | +| Series removed | N/A | N/A | Old folder remains, not in current index | + +**Data removal caveat:** In rare circumstances (e.g., data owner request, PHI incident), data may be removed from IDC entirely, including from all historical versions. + +**BigQuery versioned datasets (metadata only, not file storage):** + +For querying version-specific metadata, BigQuery provides versioned tables. See `bigquery_guide.md` for details. +- `bigquery-public-data.idc_current` - alias to latest version +- `bigquery-public-data.idc_v23` - specific version (replace 23 with desired version) + +### Reproducing a Previous Analysis + +The simplest way to ensure reproducibility is to save the `crdc_series_uuid` values of the data you use at analysis time: + +```python +from idc_index import IDCClient +import json + +client = IDCClient() + +# Select data for your analysis +selection = client.sql_query(""" + SELECT crdc_series_uuid + FROM index + WHERE collection_id = 'tcga_luad' + AND Modality = 'CT' + LIMIT 10 +""") +series_uuids = list(selection['crdc_series_uuid']) + +# Download the data (filter by CRDC UUID, not by DICOM SeriesInstanceUID) +client.download_from_selection(crdc_series_uuid=series_uuids, downloadDir="./data") + +# Save a manifest for reproducibility +manifest = { + "crdc_series_uuids": series_uuids, + "download_date": "2024-01-15", + "idc_version": client.get_idc_version(), + "description": "CT scans for lung cancer analysis" +} +with open("analysis_manifest.json", "w") as f: + json.dump(manifest, f, indent=2) + +# Later, reproduce the exact dataset: +with
open("analysis_manifest.json") as f: + manifest = json.load(f) +client.download_from_selection( + crdc_series_uuid=manifest["crdc_series_uuids"], + downloadDir="./reproduced_data" +) +``` + +Since `crdc_series_uuid` identifies an immutable version of each series, saving these UUIDs lets you retrieve the exact same files later, barring the rare data removals described in the caveat above. + +## Relationship Between Buckets, Versions, and Other Access Methods + +### Data Coverage Comparison + +| Access Method | Buckets Included | Coverage | Versions | +|---------------|------------------|----------|----------| +| Direct bucket access | All 3 buckets | 100% | All historical | +| `idc-index` download | All 3 buckets | 100% | Current + prior_versions_index | +| IDC Portal | All 3 buckets | 100% | Current only | +| DICOMweb public proxy | All 3 buckets | 100% | Current only | +| Google Healthcare DICOM | `idc-open-data` only | ~96% | Current only | + +**Important:** The Google Healthcare API DICOM store only replicates data from `idc-open-data`. Data in `idc-open-data-two` and `idc-open-data-cr` (approximately 4% of total) is not available via the Google Healthcare DICOMweb endpoint.
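As a rough sketch of how the coverage rules above could be applied to a single `series_aws_url`, the snippet below extracts the bucket name, reports whether that bucket is the one replicated to the Google Healthcare DICOM store, and derives the matching GCS URL. The bucket names come from the bucket summary table in this guide; the function names and the mapping constant are hypothetical, not part of `idc-index`.

```python
from urllib.parse import urlparse

# AWS bucket -> GCS bucket, as listed in the bucket summary table of this guide.
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}

# Per this guide, only this bucket is replicated to the Google Healthcare DICOM store.
GOOGLE_HEALTHCARE_BUCKETS = {"idc-open-data"}


def bucket_of(series_aws_url):
    """Return the bucket name from a URL like s3://<bucket>/<series_uuid>/*."""
    return urlparse(series_aws_url).netloc


def in_google_healthcare_store(series_aws_url):
    """True if the series lives in a bucket replicated to Google Healthcare."""
    return bucket_of(series_aws_url) in GOOGLE_HEALTHCARE_BUCKETS


def to_gcs_url(series_aws_url):
    """Translate an S3 series URL to its GCS counterpart (bucket names may differ)."""
    parsed = urlparse(series_aws_url)
    return f"gs://{AWS_TO_GCS_BUCKET[parsed.netloc]}{parsed.path}"


if __name__ == "__main__":
    url = "s3://idc-open-data-cr/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*"
    print(bucket_of(url), in_google_healthcare_store(url), to_gcs_url(url))
```

Note that for the two non-primary buckets the GCS bucket name differs from the AWS one, so swapping only the `s3://` scheme for `gs://` is enough only for `idc-open-data`.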
+ +## Best Practices + +- **Use `idc-index` for discovery**: Query metadata first, then access buckets with known UUIDs +- **Download defaults to AWS buckets**: request GCS if needed +- **Save manifests**: Store the `series_aws_url` or `crdc_series_uuid` values for reproducibility +- **Check licenses**: Query `license_short_name` before commercial use; CC-NC data requires non-commercial use +- **Use current version unless reproducing**: The `index` table has current data; use `prior_versions_index` only for exact reproducibility + +## Troubleshooting + +### Issue: "Access Denied" when accessing buckets +- **Cause:** Using signed requests or wrong bucket name +- **Solution:** Use `--no-sign-request` flag with AWS CLI, or `anon=True` with Python libraries + +### Issue: File not found at expected path +- **Cause:** Using DICOM UID instead of CRDC UUID, or data changed in newer version +- **Solution:** Query `idc-index` for current `series_aws_url`, or check `prior_versions_index` for historical paths + +### Issue: Downloaded files don't match expected series +- **Cause:** Series was revised in a newer IDC version +- **Solution:** Use `prior_versions_index` to find the exact version you need; compare `crdc_series_uuid` values + +### Issue: Some data missing from Google Healthcare DICOMweb +- **Cause:** Google Healthcare only mirrors `idc-open-data` bucket (~96% of data) +- **Solution:** Use IDC public proxy for 100% coverage, or access buckets directly + +## Resources + +**IDC Documentation:** +- [Files and metadata](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata) - Bucket organization details +- [Data versioning](https://learn.canceridc.dev/data/data-versioning) - Versioning scheme explanation +- [Resolving GUIDs and UUIDs](https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids) - CRDC UUID documentation +- [Direct loading from cloud](https://learn.canceridc.dev/data/downloading-data/direct-loading) - Python examples for cloud 
access + +**AWS Resources:** +- [NCI IDC on AWS Open Data Registry](https://registry.opendata.aws/nci-imaging-data-commons/) - Bucket ARNs and access info +- [s5cmd](https://github.com/peak/s5cmd) - High-performance S3 client (used internally by idc-index) +- [AWS CLI S3 commands](https://docs.aws.amazon.com/cli/latest/reference/s3/) - Standard AWS command-line interface +- [Boto3 S3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html) - AWS SDK for Python + +**Google Cloud Resources:** +- [gsutil tool](https://cloud.google.com/storage/docs/gsutil) - Google Cloud Storage command-line tool +- [gcloud storage commands](https://cloud.google.com/sdk/gcloud/reference/storage) - Modern GCS CLI (recommended over gsutil) +- [Google Cloud Storage Python client](https://cloud.google.com/python/docs/reference/storage/latest) - GCS SDK for Python + +**Related Guides:** +- `dicomweb_guide.md` - DICOMweb API access (alternative to direct bucket access) +- `bigquery_guide.md` - Advanced metadata queries including versioned datasets diff --git a/scientific-skills/imaging-data-commons/references/dicomweb_guide.md b/scientific-skills/imaging-data-commons/references/dicomweb_guide.md index 248e80c..0c0be78 100644 --- a/scientific-skills/imaging-data-commons/references/dicomweb_guide.md +++ b/scientific-skills/imaging-data-commons/references/dicomweb_guide.md @@ -20,9 +20,12 @@ For most use cases, `idc-index` is simpler and recommended. 
Use DICOMweb when yo https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb ``` +- **100% data coverage** - Contains all IDC data from all storage buckets - Points to the latest IDC version automatically -- Daily quota applies (suitable for testing and moderate use) +- **Updates immediately** on new IDC releases +- Per-IP daily quota (suitable for testing and moderate use) - No authentication required +- Read-only access - Note: "viewer-only-no-downloads" in URL is legacy naming with no functional meaning ### Google Healthcare API (Requires Authentication) @@ -39,7 +42,81 @@ client = IDCClient() print(client.get_idc_version()) # e.g., "23" for v23 ``` -The Google Healthcare endpoint requires authentication and provides higher quotas. See [Authentication](#authentication-for-google-healthcare-api) section below. +- **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets) +- **Updates 1-2 weeks after** IDC releases +- Requires authentication and provides higher quotas +- Better performance (no proxy routing) +- Each release gets a new versioned store + +See [Content Coverage Differences](#content-coverage-differences) and [Authentication](#authentication-for-google-healthcare-api) sections below. + +## Content Coverage Differences + +**Important:** The two DICOMweb endpoints have different data coverage. The IDC public proxy contains MORE data than the authenticated Google Healthcare endpoint. + +### Coverage Summary + +| Endpoint | Coverage | Missing Data | +|----------|----------|--------------| +| **IDC Public Proxy** | 100% | None | +| **Google Healthcare API** | ~96% | ~4% (two buckets not replicated) | + +### What's Missing from Google Healthcare? + +The Google Healthcare DICOM store **only replicates data from the `idc-open-data` S3 bucket**. 
It does not include data from two additional buckets: + +- `idc-open-data-cr` +- `idc-open-data-two` + +These missing buckets typically contain several thousand series each, representing approximately 4% of total IDC data. The exact counts vary by IDC version. + +See `cloud_storage_guide.md` for details on bucket organization, file structure, and direct access methods. + +### Update Timing + +- **IDC Public Proxy**: Updates immediately when new IDC versions are released +- **Google Healthcare**: Updates 1-2 weeks after each new IDC version release + +Between releases, both endpoints remain current. The 1-2 week delay only occurs during the transition period after a new IDC version is published. + +**Warning from IDC documentation:** *"Google-hosted DICOM store may not contain the latest version of IDC data!"* - Check during the weeks following a new release. + +### Choosing the Right Endpoint + +**Use IDC Public Proxy when:** +- You need complete data coverage (100%) +- You need the absolute latest data immediately after a new version release +- You don't want to set up GCP authentication +- Your usage fits within per-IP quotas (can request increases via support@canceridc.dev) +- You're accessing slide microscopy images frame-by-frame + +**Use Google Healthcare API when:** +- The ~4% missing data doesn't affect your use case +- You need higher quotas for heavy usage +- You want better performance (direct access, no proxy routing) + +### Checking Your Data Availability + +Before choosing an endpoint, verify whether your data might be in the missing buckets: + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Count series per bucket for your collection +# series_aws_url looks like s3://bucket/series_uuid/*, so the third '/'-separated field is the bucket +results = client.sql_query(""" + SELECT split_part(series_aws_url, '/', 3) AS bucket, COUNT(*) AS series_count + FROM index + WHERE collection_id = 'your_collection_id' + GROUP BY bucket +""") + +print(results) + +# Look for buckets named 'idc-open-data-cr' or 'idc-open-data-two' +# If present, that data
won't be available in Google Healthcare endpoint +``` ## Implementation Details @@ -289,8 +366,12 @@ response = requests.get( - **Solution:** Add delays between requests, reduce `limit` values, or use authenticated endpoint for higher quotas ### Issue: 204 No Content for valid UIDs -- **Cause:** UID may be from an older IDC version not in current data -- **Solution:** Verify UID exists using `idc-index` query first. The proxy points to the latest IDC version. +- **Cause:** UID may be from an older IDC version not in current data, or data is in buckets not replicated by Google Healthcare +- **Solution:** + - Verify UID exists using `idc-index` query first + - Check if data is in `idc-open-data-cr` or `idc-open-data-two` buckets (not available in Google Healthcare endpoint) + - Switch to IDC public proxy for 100% coverage + - During new version releases, Google Healthcare may lag 1-2 weeks behind ### Issue: Large metadata responses slow to parse - **Cause:** Series with many instances returns large JSON @@ -302,7 +383,17 @@ response = requests.get( ## Resources +**IDC Documentation:** +- [IDC DICOM Stores](https://learn.canceridc.dev/data/organization-of-data/dicom-stores) - Data coverage and bucket details +- [IDC DICOMweb Access](https://learn.canceridc.dev/data/downloading-data/dicomweb-access) - Endpoint usage and differences +- [IDC Proxy Policy](https://learn.canceridc.dev/portal/proxy-policy) - Quota policies and usage restrictions +- [IDC User Guide](https://learn.canceridc.dev/) - Complete documentation + +**DICOMweb Standards and Tools:** - [Google Healthcare DICOM Conformance Statement](https://docs.cloud.google.com/healthcare-api/docs/dicom) - [DICOMweb Standard](https://www.dicomstandard.org/using/dicomweb) - [dicomweb-client Python library](https://dicomweb-client.readthedocs.io/) -- [IDC Documentation](https://learn.canceridc.dev/) + +**Related Guides:** +- `cloud_storage_guide.md` - Direct bucket access, file organization, CRDC UUIDs, and versioning 
+- `bigquery_guide.md` - Advanced metadata queries with full DICOM attributes