diff --git a/scientific-skills/imaging-data-commons/SKILL.md b/scientific-skills/imaging-data-commons/SKILL.md new file mode 100644 index 0000000..d66a882 --- /dev/null +++ b/scientific-skills/imaging-data-commons/SKILL.md @@ -0,0 +1,1092 @@ +--- +name: imaging-data-commons +description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses. +license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data. +metadata: + skill-author: Andrey Fedorov, @fedorov +--- + +# Imaging Data Commons + +## Overview + +Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access. + +**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index)) + +**Check current data scale for the latest version:** + +```python +from idc_index import IDCClient +client = IDCClient() + +# get IDC data version +print(client.get_idc_version()) + +# Get collection count and total series +stats = client.sql_query(""" + SELECT + COUNT(DISTINCT collection_id) as collections, + COUNT(DISTINCT analysis_result_id) as analysis_results, + COUNT(DISTINCT PatientID) as patients, + COUNT(DISTINCT StudyInstanceUID) as studies, + COUNT(DISTINCT SeriesInstanceUID) as series, + SUM(instanceCount) as instances, + SUM(series_size_MB)/1000000 as size_TB + FROM index +""") +print(stats) +``` + +**Core workflow:** +1. Query metadata → `client.sql_query()` +2. Download DICOM files → `client.download_from_selection()` +3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)` + +## When to Use This Skill + +- Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images +- Selecting image subsets by cancer type, modality, anatomical site, or other metadata +- Downloading DICOM data from IDC +- Checking data licenses before use in research or commercial applications +- Visualizing medical images in a browser without local DICOM viewer software + +## IDC Data Model + +IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance): + +- **collection_id**: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection. +- **analysis_result_id**: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections. + +Use `collection_id` to find original imaging data, may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations. + +**Key identifiers for queries:** +| Identifier | Scope | Use for | +|------------|-------|---------| +| `collection_id` | Dataset grouping | Filtering by project/study | +| `PatientID` | Patient | Grouping images by patient | +| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization | +| `SeriesInstanceUID` | DICOM series | Grouping of related series, visualization | + +## Index Tables + +The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames. + +**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure. + +### Available Tables + +| Table | Row Granularity | Loaded | Description | +|-------|-----------------|--------|-------------| +| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data | +| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data | +| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions | +| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) | +| `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections | +| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata | +| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy | + +**Auto** = loaded automatically when `IDCClient()` is instantiated +**fetch_index()** = requires `client.fetch_index("table_name")` to load + +### Joining Tables + +**Key columns are not explicitly labeled, the following is a subset that can be used in joins.** + +| Join Column | Tables | Use Case | +|-------------|--------|----------| +| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data | +| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details | +| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data | +| `PatientID` | index, prior_versions_index | Link patients across current and historical data | +| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) | +| `source_DOI` | index, analysis_results_index | Link by publication DOI | +| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier | +| `Modality` | index, prior_versions_index | Filter by imaging modality | + +**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts). + +**Example joins:** +```python +from idc_index import IDCClient +client = IDCClient() + +# Join index with collections_index to get cancer types +client.fetch_index("collections_index") +result = client.sql_query(""" + SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations + FROM index i + JOIN collections_index c ON i.collection_id = c.collection_id + WHERE i.Modality = 'MR' + LIMIT 10 +""") + +# Join index with sm_index for slide microscopy details +client.fetch_index("sm_index") +result = client.sql_query(""" + SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf + FROM index i + JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID + LIMIT 10 +""") +``` + +### Accessing Index Tables + +**Via SQL (recommended for filtering/aggregation):** +```python +from idc_index import IDCClient +client = IDCClient() + +# Query the primary index (always available) +results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10") + +# Fetch and query additional indices +client.fetch_index("collections_index") +collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index") + +client.fetch_index("analysis_results_index") +analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5") +``` + +**As pandas DataFrames (direct access):** +```python +# Primary index (always available after client initialization) +df = client.index + +# Fetch and access on-demand indices +client.fetch_index("sm_index") +sm_df = client.sm_index +``` + +### Discovering Table Schemas (Essential for Query Writing) + +The `indices_overview` dictionary contains complete schema information for all tables. **Always consult this when writing queries or exploring data structure.** + +**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` work as expected. + +```python +from idc_index import IDCClient +client = IDCClient() + +# List all available indices with descriptions +for name, info in client.indices_overview.items(): + print(f"\n{name}:") + print(f" Installed: {info['installed']}") + print(f" Description: {info['description']}") + +# Get complete schema for a specific index (columns, types, descriptions) +schema = client.indices_overview["index"]["schema"] +print(f"\nTable: {schema['table_description']}") +print("\nColumns:") +for col in schema['columns']: + desc = col.get('description', 'No description') + # Description indicates if column is from DICOM attribute + print(f" {col['name']} ({col['type']}): {desc}") + +# Find columns that are DICOM attributes (check description for "DICOM" reference) +dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()] +print(f"\nDICOM-sourced columns: {dicom_cols}") +``` + +**Alternative: use `get_index_schema()` method:** +```python +schema = client.get_index_schema("index") +# Returns same schema dict: {'table_description': ..., 'columns': [...]} +``` + +### Key Columns in Primary `index` Table + +Most common columns for queries (use `indices_overview` for complete list and descriptions): + +| Column | Type | DICOM | Description | +|--------|------|-------|-------------| +| `collection_id` | STRING | No | IDC collection identifier | +| `analysis_result_id` | STRING | No | If applicable, indicates what analysis results collection given series is part of | +| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) | +| `PatientID` | STRING | Yes | Patient identifier | +| `StudyInstanceUID` | STRING | Yes | DICOM Study UID | +| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID — use for downloads/viewing | +| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) | +| `BodyPartExamined` | STRING | Yes | Anatomical region | +| `SeriesDescription` | STRING | Yes | Description of the series | +| `Manufacturer` | STRING | Yes | Equipment manufacturer | +| `StudyDate` | STRING | Yes | Date study was performed | +| `PatientSex` | STRING | Yes | Patient sex | +| `PatientAge` | STRING | Yes | Patient age at time of study | +| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) | +| `series_size_MB` | FLOAT | No | Size of series in megabytes | +| `instanceCount` | INTEGER | No | Number of DICOM instances in series | + +**DICOM = Yes**: Column value extracted from the DICOM attribute with the same name. Refer to the [DICOM standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html) for numeric tag mappings. Use standard DICOM knowledge for expected values and formats. + +### Clinical Data Access + +```python +# Fetch clinical index (also downloads clinical data tables) +client.fetch_index("clinical_index") + +# Query clinical index to find available tables and their columns +tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index") + +# Load a specific clinical table as DataFrame +clinical_df = client.get_clinical_table("table_name") +``` + +## Data Access Options + +| Method | Auth Required | Best For | +|--------|---------------|----------| +| `idc-index` | No | Key queries and downloads (recommended) | +| IDC Portal | No | Interactive exploration, manual selection, browser-based download | +| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata | +| DICOMweb proxy | No | Tool integration via DICOMweb API | + +**DICOMweb access** + +IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools. + +| Endpoint | Auth | Use Case | +|----------|------|----------| +| Public proxy | No | Testing, moderate queries, daily quota | +| Google Healthcare | Yes (GCP) | Production use, higher quotas | + +See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details. + +## Installation and Setup + +**Required (for basic access):** +```bash +pip install --upgrade idc-index +``` + +**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility. + +**Tested with:** idc-index 0.11.5 (IDC data version v23) + +**Optional (for data analysis):** +```bash +pip install pandas numpy pydicom +``` + +## Core Capabilities + +### 1. Data Discovery and Exploration + +Discover what imaging collections and data are available in IDC: + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Get summary statistics from primary index +query = """ +SELECT + collection_id, + COUNT(DISTINCT PatientID) as patients, + COUNT(DISTINCT SeriesInstanceUID) as series, + SUM(series_size_MB) as size_mb +FROM index +GROUP BY collection_id +ORDER BY patients DESC +""" +collections_summary = client.sql_query(query) + +# For richer collection metadata, use collections_index +client.fetch_index("collections_index") +collections_info = client.sql_query(""" + SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData + FROM collections_index +""") + +# For analysis results (annotations, segmentations), use analysis_results_index +client.fetch_index("analysis_results_index") +analysis_info = client.sql_query(""" + SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities + FROM analysis_results_index +""") +``` + +**`collections_index`** provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types — without needing to aggregate from the primary index. + +**`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities. + +### 2. Querying Metadata with SQL + +Query the IDC mini-index using SQL to find specific datasets. + +**First, explore available values for filter columns:** +```python +from idc_index import IDCClient + +client = IDCClient() + +# Check what Modality values exist +modalities = client.sql_query(""" + SELECT DISTINCT Modality, COUNT(*) as series_count + FROM index + GROUP BY Modality + ORDER BY series_count DESC +""") +print(modalities) + +# Check what BodyPartExamined values exist for MR modality +body_parts = client.sql_query(""" + SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count + FROM index + WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL + GROUP BY BodyPartExamined + ORDER BY series_count DESC + LIMIT 20 +""") +print(body_parts) +``` + +**Then query with validated filter values:** +```python +# Find breast MRI scans (use actual values from exploration above) +results = client.sql_query(""" + SELECT + collection_id, + PatientID, + SeriesInstanceUID, + Modality, + SeriesDescription, + license_short_name + FROM index + WHERE Modality = 'MR' + AND BodyPartExamined = 'BREAST' + LIMIT 20 +""") + +# Access results as pandas DataFrame +for idx, row in results.iterrows(): + print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}") +``` + +**To filter by cancer type, join with `collections_index`:** +```python +client.fetch_index("collections_index") +results = client.sql_query(""" + SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality + FROM index i + JOIN collections_index c ON i.collection_id = c.collection_id + WHERE c.CancerTypes LIKE '%Breast%' + AND i.Modality = 'MR' + LIMIT 20 +""") +``` + +**Available metadata fields** (use `client.indices_overview` for complete list): +- Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID +- Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName +- Clinical: PatientAge, PatientSex, StudyDate +- Descriptions: StudyDescription, SeriesDescription +- Licensing: license_short_name + +**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table. + +### 3. Downloading DICOM Files + +Download imaging data efficiently from IDC's cloud storage: + +**Download entire collection:** +```python +from idc_index import IDCClient + +client = IDCClient() + +# Download small collection (RIDER Pilot ~1GB) +client.download_from_selection( + collection_id="rider_pilot", + downloadDir="./data/rider" +) +``` + +**Download specific series:** +```python +# First, query for series UIDs +series_df = client.sql_query(""" + SELECT SeriesInstanceUID + FROM index + WHERE Modality = 'CT' + AND BodyPartExamined = 'CHEST' + AND collection_id = 'nlst' + LIMIT 5 +""") + +# Download only those series +client.download_from_selection( + seriesInstanceUID=list(series_df['SeriesInstanceUID'].values), + downloadDir="./data/lung_ct" +) +``` + +**Custom directory structure:** + +Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID` + +```python +# Simplified hierarchy (omit StudyInstanceUID level) +client.download_from_selection( + collection_id="tcga_luad", + downloadDir="./data", + dirTemplate="%collection_id/%PatientID/%Modality" +) +# Results in: ./data/tcga_luad/TCGA-05-4244/CT/ + +# Flat structure (all files in one directory) +client.download_from_selection( + seriesInstanceUID=list(series_df['SeriesInstanceUID'].values), + downloadDir="./data/flat", + dirTemplate="" +) +# Results in: ./data/flat/*.dcm +``` + +### Command-Line Download + +The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`. + +**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid). + +```bash +# Download entire collection +idc download rider_pilot --download-dir ./data + +# Download specific series by UID +idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data + +# Download multiple items (comma-separated) +idc download "tcga_luad,tcga_lusc" --download-dir ./data + +# Download from manifest file (auto-detected) +idc download manifest.txt --download-dir ./data +``` + +**Options:** + +| Option | Description | +|--------|-------------| +| `--download-dir` | Output directory (default: current directory) | +| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) | +| `--log-level` | Verbosity: debug, info, warning, error, critical | + +**Manifest files:** + +Manifest files contain S3 URLs (one per line) and can be: +- Exported from the IDC Portal after cohort selection +- Shared by collaborators for reproducible data access +- Generated programmatically from query results + +Format (one S3 URL per line): +``` +s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/* +s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/* +``` + +**Example: Generate manifest from Python query:** + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Query for series URLs +results = client.sql_query(""" + SELECT series_aws_url + FROM index + WHERE collection_id = 'rider_pilot' AND Modality = 'CT' +""") + +# Save as manifest file +with open('ct_manifest.txt', 'w') as f: + for url in results['series_aws_url']: + f.write(url + '\n') +``` + +Then download: +```bash +idc download ct_manifest.txt --download-dir ./ct_data +``` + +### 4. Visualizing IDC Images + +View DICOM data in browser without downloading: + +```python +from idc_index import IDCClient +import webbrowser + +client = IDCClient() + +# First query to get valid UIDs +results = client.sql_query(""" + SELECT SeriesInstanceUID, StudyInstanceUID + FROM index + WHERE collection_id = 'rider_pilot' AND Modality = 'CT' + LIMIT 1 +""") + +# View single series +viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID']) +webbrowser.open(viewer_url) + +# View all series in a study (useful for multi-series exams like MRI protocols) +viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID']) +webbrowser.open(viewer_url) +``` + +The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session). + +### 5. Understanding and Checking Licenses + +Check data licensing before use (critical for commercial applications): + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Check licenses for all collections +query = """ +SELECT DISTINCT + collection_id, + license_short_name, + COUNT(DISTINCT SeriesInstanceUID) as series_count +FROM index +GROUP BY collection_id, license_short_name +ORDER BY collection_id +""" + +licenses = client.sql_query(query) +print(licenses) +``` + +**License types in IDC:** +- **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution +- **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only +- **Custom licenses** (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions) + +**Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata. + +### Generating Citations for Attribution + +The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations: + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Get citations for a collection (APA format by default) +citations = client.citations_from_selection(collection_id="rider_pilot") +for citation in citations: + print(citation) + +# Get citations for specific series +results = client.sql_query(""" + SELECT SeriesInstanceUID FROM index + WHERE collection_id = 'tcga_luad' LIMIT 5 +""") +citations = client.citations_from_selection( + seriesInstanceUID=list(results['SeriesInstanceUID'].values) +) + +# Alternative format: BibTeX (for LaTeX documents) +bibtex_citations = client.citations_from_selection( + collection_id="tcga_luad", + citation_format=IDCClient.CITATION_FORMAT_BIBTEX +) +``` + +**Parameters:** +- `collection_id`: Filter by collection(s) +- `patientId`: Filter by patient ID(s) +- `studyInstanceUID`: Filter by study UID(s) +- `seriesInstanceUID`: Filter by series UID(s) +- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants: + - `CITATION_FORMAT_APA` (default) - APA style + - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX + - `CITATION_FORMAT_JSON` - CSL JSON + - `CITATION_FORMAT_TURTLE` - RDF Turtle + +**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements. + +### 6. Batch Processing and Filtering + +Process large datasets efficiently with filtering: + +```python +from idc_index import IDCClient +import pandas as pd + +client = IDCClient() + +# Find chest CT scans from GE scanners +query = """ +SELECT + SeriesInstanceUID, + PatientID, + collection_id, + ManufacturerModelName +FROM index +WHERE Modality = 'CT' + AND BodyPartExamined = 'CHEST' + AND Manufacturer = 'GE MEDICAL SYSTEMS' + AND license_short_name = 'CC BY 4.0' +LIMIT 100 +""" + +results = client.sql_query(query) + +# Save manifest for later +results.to_csv('lung_ct_manifest.csv', index=False) + +# Download in batches to avoid timeout +batch_size = 10 +for i in range(0, len(results), batch_size): + batch = results.iloc[i:i+batch_size] + client.download_from_selection( + seriesInstanceUID=list(batch['SeriesInstanceUID'].values), + downloadDir=f"./data/batch_{i//batch_size}" + ) +``` + +### 7. Advanced Queries with BigQuery + +For queries requiring full DICOM metadata, complex JOINs, or clinical data tables, use Google BigQuery. Requires GCP account with billing enabled. + +**Quick reference:** +- Dataset: `bigquery-public-data.idc_current.*` +- Main table: `dicom_all` (combined metadata) +- Full metadata: `dicom_metadata` (all DICOM tags) + +See `references/bigquery_guide.md` for setup, table schemas, query patterns, and cost optimization. + +### 8. Tool Selection Guide + +| Task | Tool | Reference | +|------|------|-----------| +| Programmatic queries & downloads | `idc-index` | This document | +| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ | +| Complex metadata queries | BigQuery | `references/bigquery_guide.md` | +| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser | + +**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads). + +### 9. Integration with Analysis Pipelines + +Integrate IDC data into imaging analysis workflows: + +**Read downloaded DICOM files:** +```python +import pydicom +import os + +# Read DICOM files from downloaded series +series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..." + +dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir) + if f.endswith('.dcm')] + +# Load first image +ds = pydicom.dcmread(dicom_files[0]) +print(f"Patient ID: {ds.PatientID}") +print(f"Modality: {ds.Modality}") +print(f"Image shape: {ds.pixel_array.shape}") +``` + +**Build 3D volume from CT series:** +```python +import pydicom +import numpy as np +from pathlib import Path + +def load_ct_series(series_path): + """Load CT series as 3D numpy array""" + files = sorted(Path(series_path).glob('*.dcm')) + slices = [pydicom.dcmread(str(f)) for f in files] + + # Sort by slice location + slices.sort(key=lambda x: float(x.ImagePositionPatient[2])) + + # Stack into 3D array + volume = np.stack([s.pixel_array for s in slices]) + + return volume, slices[0] # Return volume and first slice for metadata + +volume, metadata = load_ct_series("./data/lung_ct/series_dir") +print(f"Volume shape: {volume.shape}") # (z, y, x) +``` + +**Integrate with SimpleITK:** +```python +import SimpleITK as sitk +from pathlib import Path + +# Read DICOM series +series_path = "./data/ct_series" +reader = sitk.ImageSeriesReader() +dicom_names = reader.GetGDCMSeriesFileNames(series_path) +reader.SetFileNames(dicom_names) +image = reader.Execute() + +# Apply processing +smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5) + +# Save as NIfTI +sitk.WriteImage(smoothed, "processed_volume.nii.gz") +``` + +## Common Use Cases + +### Use Case 1: Find and Download Lung CT Scans for Deep Learning + +**Objective:** Build training dataset of lung CT scans from NLST collection + +**Steps:** +```python +from idc_index import IDCClient + +client = IDCClient() + +# 1. Query for lung CT scans with specific criteria +query = """ +SELECT + PatientID, + SeriesInstanceUID, + SeriesDescription +FROM index +WHERE collection_id = 'nlst' + AND Modality = 'CT' + AND BodyPartExamined = 'CHEST' + AND license_short_name = 'CC BY 4.0' +ORDER BY PatientID +LIMIT 100 +""" + +results = client.sql_query(query) +print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients") + +# 2. Download data organized by patient +client.download_from_selection( + seriesInstanceUID=list(results['SeriesInstanceUID'].values), + downloadDir="./training_data", + dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID" +) + +# 3. Save manifest for reproducibility +results.to_csv('training_manifest.csv', index=False) +``` + +### Use Case 2: Query Brain MRI by Manufacturer for Quality Study + +**Objective:** Compare image quality across different MRI scanner manufacturers + +**Steps:** +```python +from idc_index import IDCClient +import pandas as pd + +client = IDCClient() + +# Query for brain MRI grouped by manufacturer +query = """ +SELECT + Manufacturer, + ManufacturerModelName, + COUNT(DISTINCT SeriesInstanceUID) as num_series, + COUNT(DISTINCT PatientID) as num_patients +FROM index +WHERE Modality = 'MR' + AND BodyPartExamined LIKE '%BRAIN%' +GROUP BY Manufacturer, ManufacturerModelName +HAVING num_series >= 10 +ORDER BY num_series DESC +""" + +manufacturers = client.sql_query(query) +print(manufacturers) + +# Download sample from each manufacturer for comparison +for _, row in manufacturers.head(3).iterrows(): + mfr = row['Manufacturer'] + model = row['ManufacturerModelName'] + + query = f""" + SELECT SeriesInstanceUID + FROM index + WHERE Manufacturer = '{mfr}' + AND ManufacturerModelName = '{model}' + AND Modality = 'MR' + AND BodyPartExamined LIKE '%BRAIN%' + LIMIT 5 + """ + + series = client.sql_query(query) + client.download_from_selection( + seriesInstanceUID=list(series['SeriesInstanceUID'].values), + downloadDir=f"./quality_study/{mfr.replace(' ', '_')}" + ) +``` + +### Use Case 3: Visualize Series Without Downloading + +**Objective:** Preview imaging data before committing to download + +```python +from idc_index import IDCClient +import webbrowser + +client = IDCClient() + +series_list = client.sql_query(""" + SELECT SeriesInstanceUID, PatientID, SeriesDescription + FROM index + WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT' + LIMIT 10 +""") + +# Preview each in browser +for _, row in series_list.iterrows(): + viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID']) + print(f"Patient {row['PatientID']}: {row['SeriesDescription']}") + print(f" View at: {viewer_url}") + # webbrowser.open(viewer_url) # Uncomment to open automatically +``` + +For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration. + +### Use Case 4: License-Aware Batch Download for Commercial Use + +**Objective:** Download only CC-BY licensed data suitable for commercial applications + +**Steps:** +```python +from idc_index import IDCClient + +client = IDCClient() + +# Query ONLY for CC BY licensed data (allows commercial use with attribution) +query = """ +SELECT + SeriesInstanceUID, + collection_id, + PatientID, + Modality +FROM index +WHERE license_short_name LIKE 'CC BY%' + AND license_short_name NOT LIKE '%NC%' + AND Modality IN ('CT', 'MR') + AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN') +LIMIT 200 +""" + +cc_by_data = client.sql_query(query) + +print(f"Found {len(cc_by_data)} CC BY licensed series") +print(f"Collections: {cc_by_data['collection_id'].unique()}") + +# Download with license verification +client.download_from_selection( + seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values), + downloadDir="./commercial_dataset", + dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID" +) + +# Save license information +cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False) +``` + +## Best Practices + +- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC) +- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications +- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure +- **Use mini-index for simple queries** - Only use BigQuery when you need comprehensive metadata or complex JOINs +- **Organize downloads with dirTemplate** - Use meaningful directory structures like `%collection_id/%PatientID/%Modality` +- **Cache query results** - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility +- **Estimate size first** - Check collection size before downloading - some collection sizes are in terabytes! +- **Save manifests** - Always save query results with Series UIDs for reproducibility and data provenance +- **Read documentation** - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/ +- **Use IDC forum** - Search for questons/answers and ask your questions to the IDC maintainers and users at https://discourse.canceridc.dev/ + +## Troubleshooting + +**Issue: `ModuleNotFoundError: No module named 'idc_index'`** +- **Cause:** idc-index package not installed +- **Solution:** Install with `pip install --upgrade idc-index` + +**Issue: Download fails with connection timeout** +- **Cause:** Network instability or large download size +- **Solution:** + - Download smaller batches (e.g., 10-20 series at a time) + - Check network connection + - Use `dirTemplate` to organize downloads by batch + - Implement retry logic with delays + +**Issue: `BigQuery quota exceeded` or billing errors** +- **Cause:** BigQuery requires billing-enabled GCP project +- **Solution:** Use idc-index mini-index for simple queries (no billing required), or see `references/bigquery_guide.md` for cost optimization tips + +**Issue: Series UID not found or no data returned** +- **Cause:** Typo in UID, data not in current IDC version, or wrong field name +- **Solution:** + - Check if data is in current IDC version (some old data may be deprecated) + - Use `LIMIT 5` to test query first + - Check field names against metadata schema documentation + +**Issue: Downloaded DICOM files won't open** +- **Cause:** Corrupted download or incompatible viewer +- **Solution:** + - Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools + - Verify file integrity (check file sizes) + - Use pydicom to validate: `pydicom.dcmread(file, force=True)` + - Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath) + - Re-download the series + +## Common SQL Query Patterns + +Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above. + +### Discover available filter values +```python +# What modalities exist? +client.sql_query("SELECT DISTINCT Modality FROM index") + +# What body parts for a specific modality? +client.sql_query(""" + SELECT DISTINCT BodyPartExamined, COUNT(*) as n + FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL + GROUP BY BodyPartExamined ORDER BY n DESC +""") + +# What manufacturers for MR? +client.sql_query(""" + SELECT DISTINCT Manufacturer, COUNT(*) as n + FROM index WHERE Modality = 'MR' + GROUP BY Manufacturer ORDER BY n DESC +""") +``` + +### Find annotations and segmentations + +**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type. + +```python +# Find ALL segmentations and structure sets by DICOM Modality +# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set +client.sql_query(""" + SELECT collection_id, Modality, COUNT(*) as series_count + FROM index + WHERE Modality IN ('SEG', 'RTSTRUCT') + GROUP BY collection_id, Modality + ORDER BY series_count DESC +""") + +# Find segmentations for a specific collection (includes non-analysis-result items) +client.sql_query(""" + SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id + FROM index + WHERE collection_id = 'tcga_luad' AND Modality = 'SEG' +""") + +# List analysis result collections (curated derived datasets) +client.fetch_index("analysis_results_index") +client.sql_query(""" + SELECT analysis_result_id, analysis_result_title, Collections, Modalities + FROM analysis_results_index +""") + +# Find analysis results for a specific source collection +client.sql_query(""" + SELECT analysis_result_id, analysis_result_title + FROM analysis_results_index + WHERE Collections LIKE '%tcga_luad%' +""") +``` + +### Query slide microscopy data +```python +# sm_index has detailed metadata; join with index for collection_id +client.fetch_index("sm_index") +client.sql_query(""" + SELECT i.collection_id, COUNT(*) as slides, + MIN(s.min_PixelSpacing_2sf) as min_resolution + FROM sm_index s + JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID + GROUP BY i.collection_id + ORDER BY slides DESC +""") +``` + +### Estimate download size +```python +# Size for specific criteria +client.sql_query(""" + SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count + FROM index + WHERE collection_id = 'nlst' AND Modality = 'CT' +""") +``` + +### Link to clinical data +```python +client.fetch_index("clinical_index") + +# Find collections with clinical data and their tables +client.sql_query(""" + SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns + FROM clinical_index + GROUP BY collection_id, table_name + ORDER BY collection_id +""") +``` + +## Related Skills + +The following skills complement IDC workflows for downstream analysis and visualization: + +### DICOM Processing +- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET). + +### Pathology and Slide Microscopy +- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data. +- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC. + +### Metadata Visualization +- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.). +- **seaborn** - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults. +- **plotly** - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics. + +### Data Exploration +- **exploratory-data-analysis** - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis. + +## Resources + +### Schema Reference (Primary Source) + +**Always use `client.indices_overview` for current column schemas.** This ensures accuracy with the installed idc-index version: + +```python +# Get all column names and types for any table +schema = client.indices_overview["index"]["schema"] +columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']] +``` + +### Reference Documentation + +- **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries +- **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details +- **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of the installed version) + +### External Links + +- **IDC Portal**: https://portal.imaging.datacommons.cancer.gov/explore/ +- **Documentation**: https://learn.canceridc.dev/ +- **Tutorials**: https://github.com/ImagingDataCommons/IDC-Tutorials +- **User Forum**: https://discourse.canceridc.dev/ +- **idc-index GitHub**: https://github.com/ImagingDataCommons/idc-index +- **Citation**: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180 diff --git a/scientific-skills/imaging-data-commons/references/bigquery_guide.md b/scientific-skills/imaging-data-commons/references/bigquery_guide.md new file mode 100644 index 0000000..252a3b7 --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/bigquery_guide.md @@ -0,0 +1,289 @@ +# BigQuery Guide for IDC + +**Tested with:** IDC data version v23 + +For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins. + +## Prerequisites + +**Requirements:** +1. Google account +2. Google Cloud project with billing enabled (first 1 TB/month free) +3. `google-cloud-bigquery` Python package or BigQuery console access + +**Authentication setup:** +```bash +# Install Google Cloud SDK, then: +gcloud auth application-default login +``` + +## When to Use BigQuery + +Use BigQuery instead of `idc-index` when you need: +- Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index) +- Complex joins across clinical data tables +- DICOM sequence attributes (nested structures) +- Queries on fields not in the idc-index mini-index + +## Accessing IDC in BigQuery + +### Dataset Structure + +All IDC tables are in the `bigquery-public-data` BigQuery project. + +**Current version (recommended for exploration):** +- `bigquery-public-data.idc_current.*` +- `bigquery-public-data.idc_current_clinical.*` + +**Versioned datasets (recommended for reproducibility):** + +- `bigquery-public-data.idc_v{IDC version}.*` +- `bigquery-public-data.idc_v{IDC version}_clinical.*` + +Always use versioned datasets for reproducible research! + +## Key Tables + +### dicom_all +Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from `dicom_metadata` plus collection and administrative metadata. See [dicom_all.sql](https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/sql/dicom_all.sql) for the exact derivation. + +```sql +SELECT + collection_id, + PatientID, + StudyInstanceUID, + SeriesInstanceUID, + Modality, + BodyPartExamined, + SeriesDescription, + gcs_url, + license_short_name +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE Modality = 'CT' + AND BodyPartExamined = 'CHEST' +LIMIT 10 +``` + +### Derived Tables + +**segmentations** - DICOM Segmentation objects +```sql +SELECT * +FROM `bigquery-public-data.idc_current.segmentations` +LIMIT 10 +``` + +**measurement_groups** - SR TID1500 measurement groups +**qualitative_measurements** - Coded evaluations +**quantitative_measurements** - Numeric measurements + +### Collection Metadata + +**original_collections_metadata** - Collection-level descriptions + +```sql +SELECT + collection_id, + CancerTypes, + TumorLocations, + Subjects, + src.source_doi, + src.ImageTypes, + src.license.license_short_name +FROM `bigquery-public-data.idc_current.original_collections_metadata`, +UNNEST(Sources) AS src +WHERE CancerTypes LIKE '%Lung%' +``` + +## Common Query Patterns + +### Find Collections by Criteria + +```sql +SELECT + collection_id, + COUNT(DISTINCT PatientID) as patient_count, + COUNT(DISTINCT SeriesInstanceUID) as series_count, + ARRAY_AGG(DISTINCT Modality) as modalities +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE BodyPartExamined LIKE '%BRAIN%' +GROUP BY collection_id +HAVING patient_count > 50 +ORDER BY patient_count DESC +``` + +### Get Download URLs + +```sql +SELECT + SeriesInstanceUID, + gcs_url +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE collection_id = 'rider_pilot' + AND Modality = 'CT' +``` + +### Find Studies with Multiple Modalities + +```sql +SELECT + StudyInstanceUID, + ARRAY_AGG(DISTINCT Modality) as modalities, + COUNT(DISTINCT SeriesInstanceUID) as series_count +FROM `bigquery-public-data.idc_current.dicom_all` +GROUP BY StudyInstanceUID +HAVING ARRAY_LENGTH(ARRAY_AGG(DISTINCT Modality)) > 1 +LIMIT 100 +``` + +### License Filtering + +```sql +SELECT + collection_id, + license_short_name, + COUNT(*) as instance_count +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE license_short_name = 'CC BY 4.0' +GROUP BY collection_id, license_short_name +``` + +### Find Segmentations with Source Images + +```sql +SELECT + src.collection_id, + seg.SeriesInstanceUID as seg_series, + seg.SegmentedPropertyType, + src.SeriesInstanceUID as source_series, + src.Modality as source_modality +FROM `bigquery-public-data.idc_current.segmentations` seg +JOIN `bigquery-public-data.idc_current.dicom_all` src + ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID +WHERE src.collection_id = 'qin_prostate_repeatability' +LIMIT 10 +``` + +## Using Query Results with idc-index + +Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads): + +```python +from google.cloud import bigquery +from idc_index import IDCClient + +# Initialize BigQuery client +# Requires: pip install google-cloud-bigquery +# Auth: gcloud auth application-default login +# Project: needed for billing even on public datasets (free tier applies) +bq_client = bigquery.Client(project="your-gcp-project-id") + +# Query for series with specific criteria +query = """ +SELECT DISTINCT SeriesInstanceUID +FROM `bigquery-public-data.idc_current.dicom_all` +WHERE collection_id = 'tcga_luad' + AND Modality = 'CT' + AND Manufacturer = 'GE MEDICAL SYSTEMS' +LIMIT 100 +""" + +df = bq_client.query(query).to_dataframe() +print(f"Found {len(df)} GE CT series") + +# Download with idc-index (no GCP auth required) +idc_client = IDCClient() +idc_client.download_from_selection( + seriesInstanceUID=list(df['SeriesInstanceUID'].values), + downloadDir="./tcga_luad_thin_ct" +) +``` + +## Cost and Optimization + +**Pricing:** $5 per TB scanned (first 1 TB/month free). Most users stay within free tier. + +**Minimize data scanned:** +- Select only needed columns (not `SELECT *`) +- Filter early with `WHERE` clauses +- Use `LIMIT` when testing +- Use `dicom_all` instead of `dicom_metadata` when possible (smaller) +- Preview queries in BQ console (free, shows bytes to scan) + +**Check cost before running:** +```python +query_job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True)) +print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB") +``` + +**Use materialized tables:** IDC provides both views (`table_name_view`) and materialized tables (`table_name`). Always use the materialized tables (faster, lower cost). + +## Clinical Data + +Clinical data is in separate datasets with collection-specific tables. Not all collections have clinical data (started in IDC v11). + +**List available clinical tables:** +```sql +SELECT table_name +FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES` +``` + +**Query clinical data for a collection:** +```sql +-- Example: TCGA-LUAD clinical data +SELECT * +FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` +LIMIT 10 +``` + +**Join clinical with imaging data:** +```sql +SELECT + d.PatientID, + d.SeriesInstanceUID, + d.Modality, + c.age_at_diagnosis, + c.pathologic_stage +FROM `bigquery-public-data.idc_current.dicom_all` d +JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c + ON d.PatientID = c.dicom_patient_id +WHERE d.collection_id = 'tcga_luad' + AND d.Modality = 'CT' +LIMIT 20 +``` + +**Note:** Clinical table schemas vary by collection. Check column names with `INFORMATION_SCHEMA.COLUMNS` before querying. + +## Important Notes + +- Tables are read-only (public dataset) +- Schema changes between IDC versions +- Use versioned datasets for reproducibility +- Some DICOM sequences >15 levels deep are not extracted +- Very large sequences (>1MB) may be truncated +- Always check data license before use + +## Common Errors + +**Issue: Billing must be enabled** +- Cause: BigQuery requires a billing-enabled GCP project +- Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead + +**Issue: Query exceeds resource limits** +- Cause: Query scans too much data or is too complex +- Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries + +**Issue: Column not found** +- Cause: Field name typo or not in selected table +- Solution: Check table schema first with `INFORMATION_SCHEMA.COLUMNS` + +**Issue: Permission denied** +- Cause: Not authenticated to Google Cloud +- Solution: Run `gcloud auth application-default login` or set GOOGLE_APPLICATION_CREDENTIALS + +## Resources + +- [Understanding the BigQuery DICOM schema](https://docs.cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) +- [BigQuery Query Syntax](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax) +- [Kaggle Intro to SQL](https://www.kaggle.com/learn/intro-to-sql) +- [Sample BigQuery queries of IDC data](https://github.com/ImagingDataCommons/idc-bigquery-cookbook) diff --git a/scientific-skills/imaging-data-commons/references/dicomweb_guide.md b/scientific-skills/imaging-data-commons/references/dicomweb_guide.md new file mode 100644 index 0000000..248e80c --- /dev/null +++ b/scientific-skills/imaging-data-commons/references/dicomweb_guide.md @@ -0,0 +1,308 @@ +# DICOMweb Guide for IDC + +IDC provides DICOMweb access through Google Cloud Healthcare API DICOM stores. This guide covers the implementation specifics and usage patterns. + +## When to Use DICOMweb + +Use DICOMweb when you need: +- Integration with PACS systems or DICOMweb-compatible tools +- Streaming metadata without downloading full files +- Building custom viewers or web applications +- Using existing DICOMweb client libraries (OHIF, dicomweb-client, etc.) + +For most use cases, `idc-index` is simpler and recommended. Use DICOMweb when you specifically need the DICOMweb protocol. + +## Endpoints + +### Public Proxy (No Authentication) + +``` +https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb +``` + +- Points to the latest IDC version automatically +- Daily quota applies (suitable for testing and moderate use) +- No authentication required +- Note: "viewer-only-no-downloads" in URL is legacy naming with no functional meaning + +### Google Healthcare API (Requires Authentication) + +``` +https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v{VERSION}/dicomWeb +``` + +Replace `{VERSION}` with the IDC release number. To find the current version: + +```python +from idc_index import IDCClient +client = IDCClient() +print(client.get_idc_version()) # e.g., "23" for v23 +``` + +The Google Healthcare endpoint requires authentication and provides higher quotas. See [Authentication](#authentication-for-google-healthcare-api) section below. + +## Implementation Details + +IDC DICOMweb is provided through Google Cloud Healthcare API DICOM stores. The implementation follows DICOM PS3.18 Web Services with specific characteristics documented in the [Google Healthcare DICOM conformance statement](https://docs.cloud.google.com/healthcare-api/docs/dicom). + +### Supported Operations + +| Service | Description | Supported | +|---------|-------------|-----------| +| QIDO-RS | Search for DICOM objects | Yes | +| WADO-RS | Retrieve DICOM objects and metadata | Yes | +| STOW-RS | Store DICOM objects | No (IDC is read-only) | + +**Not supported:** URI Service, Worklist Service, Non-Patient Instance Service, Capabilities Transactions + +### Searchable DICOM Tags (QIDO-RS) + +The implementation supports a limited set of searchable tags: + +| Level | Searchable Tags | +|-------|-----------------| +| Study | StudyInstanceUID, PatientName, PatientID, AccessionNumber, ReferringPhysicianName, StudyDate | +| Series | All study tags + SeriesInstanceUID, Modality | +| Instance | All series tags + SOPInstanceUID | + +**Important:** Only exact matching is supported, except for: +- StudyDate: supports range queries +- PatientName: supports fuzzy matching + +### Query Limitations + +- Maximum results: 5,000 for studies/series searches; 50,000 for instances +- Maximum offset: 1,000,000 +- DICOM sequence tags larger than ~1 MB are not returned in metadata (BulkDataURI provided instead) + +## Code Examples + +All examples use the public proxy endpoint. For authenticated access to Google Healthcare, see the [authentication section](#authentication-for-google-healthcare-api). + +### Finding UIDs with idc-index + +Use `idc-index` to discover data, then use DICOMweb for metadata access: + +```python +from idc_index import IDCClient + +client = IDCClient() + +# Find studies of interest +results = client.sql_query(""" + SELECT StudyInstanceUID, SeriesInstanceUID, PatientID, Modality + FROM index + WHERE collection_id = 'tcga_luad' AND Modality = 'CT' + LIMIT 5 +""") + +# Use these UIDs with DICOMweb +study_uid = results.iloc[0]['StudyInstanceUID'] +series_uid = results.iloc[0]['SeriesInstanceUID'] +print(f"Study: {study_uid}") +print(f"Series: {series_uid}") +``` + +### QIDO-RS: Search by UID + +```python +import requests + +base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb" + +# Search for a specific study +study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440" +response = requests.get( + f"{base_url}/studies", + params={"StudyInstanceUID": study_uid}, + headers={"Accept": "application/dicom+json"} +) + +if response.status_code == 200: + studies = response.json() + print(f"Found {len(studies)} study") +``` + +### QIDO-RS: List Series in a Study + +```python +import requests + +base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb" +study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440" + +response = requests.get( + f"{base_url}/studies/{study_uid}/series", + headers={"Accept": "application/dicom+json"} +) + +if response.status_code == 200: + series_list = response.json() + for series in series_list: + # DICOM tags are returned as hex codes + series_uid = series.get("0020000E", {}).get("Value", [None])[0] + modality = series.get("00080060", {}).get("Value", [None])[0] + description = series.get("0008103E", {}).get("Value", [""])[0] + print(f"{modality}: {description}") +``` + +### QIDO-RS: List Instances in a Series + +```python +import requests + +base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb" +study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440" +series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302" + +response = requests.get( + f"{base_url}/studies/{study_uid}/series/{series_uid}/instances", + params={"limit": 10}, + headers={"Accept": "application/dicom+json"} +) + +if response.status_code == 200: + instances = response.json() + print(f"Found {len(instances)} instances") + for inst in instances[:3]: + sop_uid = inst.get("00080018", {}).get("Value", [None])[0] + print(f" SOPInstanceUID: {sop_uid}") +``` + +### WADO-RS: Retrieve Series Metadata + +```python +import requests + +base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb" +study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440" +series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302" + +response = requests.get( + f"{base_url}/studies/{study_uid}/series/{series_uid}/metadata", + headers={"Accept": "application/dicom+json"} +) + +if response.status_code == 200: + instances = response.json() + print(f"Retrieved metadata for {len(instances)} instances") + + # Extract image dimensions from first instance + if instances: + inst = instances[0] + rows = inst.get("00280010", {}).get("Value", [None])[0] + cols = inst.get("00280011", {}).get("Value", [None])[0] + print(f"Image dimensions: {rows} x {cols}") +``` + +### Combined Workflow: idc-index Discovery + DICOMweb Metadata + +```python +from idc_index import IDCClient +import requests + +# Use idc-index for efficient discovery +idc = IDCClient() +results = idc.sql_query(""" + SELECT StudyInstanceUID, SeriesInstanceUID, Modality, SeriesDescription + FROM index + WHERE collection_id = 'nlst' AND Modality = 'CT' + LIMIT 1 +""") + +study_uid = results.iloc[0]['StudyInstanceUID'] +series_uid = results.iloc[0]['SeriesInstanceUID'] +print(f"Found: {results.iloc[0]['SeriesDescription']}") + +# Use DICOMweb to stream metadata without downloading files +base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb" + +response = requests.get( + f"{base_url}/studies/{study_uid}/series/{series_uid}/metadata", + headers={"Accept": "application/dicom+json"} +) + +if response.status_code == 200: + metadata = response.json() + print(f"Retrieved metadata for {len(metadata)} instances without downloading files") +``` + +## Common DICOM Tags Reference + +DICOMweb returns tags as hexadecimal codes. Common tags: + +| Tag | Name | Description | +|-----|------|-------------| +| 00080018 | SOPInstanceUID | Unique instance identifier | +| 00080020 | StudyDate | Date study was performed | +| 00080060 | Modality | Imaging modality (CT, MR, PT, etc.) | +| 0008103E | SeriesDescription | Description of series | +| 00100020 | PatientID | Patient identifier | +| 0020000D | StudyInstanceUID | Unique study identifier | +| 0020000E | SeriesInstanceUID | Unique series identifier | +| 00280010 | Rows | Image height in pixels | +| 00280011 | Columns | Image width in pixels | + +## Authentication for Google Healthcare API + +To use the Google Healthcare endpoint with higher quotas: + +```python +from google.auth import default +from google.auth.transport.requests import Request +import requests + +# Get credentials (requires gcloud auth) +credentials, project = default() +credentials.refresh(Request()) + +# Build authenticated request +base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb" + +response = requests.get( + f"{base_url}/studies", + params={"limit": 5}, + headers={ + "Authorization": f"Bearer {credentials.token}", + "Accept": "application/dicom+json" + } +) +``` + +**Prerequisites:** +1. Google Cloud SDK installed (`gcloud`) +2. Authenticated: `gcloud auth application-default login` +3. Account has access to public Google Cloud datasets + +## Troubleshooting + +### Issue: 400 Bad Request on search queries +- **Cause:** Using unsupported search parameters. The implementation only supports specific DICOM tags for filtering. +- **Solution:** Use UID-based queries (StudyInstanceUID, SeriesInstanceUID). For filtering by Modality or other attributes, use `idc-index` to discover UIDs first, then query DICOMweb with specific UIDs. + +### Issue: 403 Forbidden on Google Healthcare endpoint +- **Cause:** Missing authentication or insufficient permissions +- **Solution:** Run `gcloud auth application-default login` and ensure your account has access + +### Issue: 429 Too Many Requests +- **Cause:** Rate limit exceeded +- **Solution:** Add delays between requests, reduce `limit` values, or use authenticated endpoint for higher quotas + +### Issue: 204 No Content for valid UIDs +- **Cause:** UID may be from an older IDC version not in current data +- **Solution:** Verify UID exists using `idc-index` query first. The proxy points to the latest IDC version. + +### Issue: Large metadata responses slow to parse +- **Cause:** Series with many instances returns large JSON +- **Solution:** Use `limit` parameter on instance queries, or query specific instances by SOPInstanceUID + +### Issue: Response missing expected attributes +- **Cause:** DICOM sequences larger than ~1 MB are excluded from metadata responses +- **Solution:** Retrieve the full DICOM instance using WADO-RS instance retrieval if you need all attributes + +## Resources + +- [Google Healthcare DICOM Conformance Statement](https://docs.cloud.google.com/healthcare-api/docs/dicom) +- [DICOMweb Standard](https://www.dicomstandard.org/using/dicomweb) +- [dicomweb-client Python library](https://dicomweb-client.readthedocs.io/) +- [IDC Documentation](https://learn.canceridc.dev/)