Merge pull request #35 from fedorov/add-idc-clean

Added Imaging Data Commons skill
2026-03-27 07:09:27 +08:00 · 2026-01-25 10:17:32 -08:00
parent cd537c1af6 79a598e060
commit a31cf4dd97
4 changed files with 1748 additions and 0 deletions
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -41,6 +41,7 @@
        "./scientific-skills/gget",
        "./scientific-skills/gtars",
        "./scientific-skills/histolab",
        "./scientific-skills/imaging-data-commons",
        "./scientific-skills/hypogenic",
        "./scientific-skills/lamindb",
        "./scientific-skills/markitdown",
--- a/scientific-skills/imaging-data-commons/SKILL.md
+++ b/scientific-skills/imaging-data-commons/SKILL.md
--- a/scientific-skills/imaging-data-commons/references/bigquery_guide.md
+++ b/scientific-skills/imaging-data-commons/references/bigquery_guide.md
@@ -0,0 +1,289 @@
 # BigQuery Guide for IDC
 **Tested with:** IDC data version v23
 For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.
 ## Prerequisites
 **Requirements:**
 1. Google account
 2. Google Cloud project with billing enabled (first 1 TB/month free)
 3. `google-cloud-bigquery` Python package or BigQuery console access
 **Authentication setup:**
 ```bash
 # Install Google Cloud SDK, then:
 gcloud auth application-default login
 ```
 ## When to Use BigQuery
 Use BigQuery instead of `idc-index` when you need:
 - Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index)
 - Complex joins across clinical data tables
 - DICOM sequence attributes (nested structures)
 - Queries on fields not in the idc-index mini-index
 ## Accessing IDC in BigQuery
 ### Dataset Structure
 All IDC tables are in the `bigquery-public-data` BigQuery project.
 **Current version (recommended for exploration):**
 - `bigquery-public-data.idc_current.*`
 - `bigquery-public-data.idc_current_clinical.*`
 **Versioned datasets (recommended for reproducibility):**
 - `bigquery-public-data.idc_v{IDC version}.*`
 - `bigquery-public-data.idc_v{IDC version}_clinical.*`
 Always use versioned datasets for reproducible research!
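The naming scheme above can be captured in a small helper so that a pinned version is used consistently across queries. This is a minimal sketch: `idc_table` is a hypothetical helper name, not part of `idc-index` or the BigQuery client library.

```python
from typing import Optional, Union


def idc_table(table: str,
              version: Optional[Union[int, str]] = None,
              clinical: bool = False) -> str:
    """Build a fully qualified IDC table name (hypothetical helper).

    version=None selects the moving `idc_current` dataset; an explicit
    version (23, "23" or "v23") selects the pinned `idc_v{version}` dataset.
    """
    if version is None:
        dataset = "idc_current"
    else:
        number = str(version).strip().lower().lstrip("v")
        if not number.isdigit():
            raise ValueError(f"Unexpected IDC version: {version!r}")
        dataset = f"idc_v{number}"
    if clinical:
        dataset += "_clinical"
    return f"bigquery-public-data.{dataset}.{table}"


# Pinned table reference for a reproducible query
print(idc_table("dicom_all", version=23))
```

Keeping the version in one place makes it easier to rerun an analysis against the same IDC release later.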
 ## Key Tables
 ### dicom_all
 Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from `dicom_metadata` plus collection and administrative metadata. See [dicom_all.sql](https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/sql/dicom_all.sql) for the exact derivation.
 ```sql
 SELECT 
  collection_id,
  PatientID,
  StudyInstanceUID, 
  SeriesInstanceUID,
  Modality,
  BodyPartExamined,
  SeriesDescription,
  gcs_url,
  license_short_name
 FROM `bigquery-public-data.idc_current.dicom_all`
 WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
 LIMIT 10
 ```
 ### Derived Tables
 **segmentations** - DICOM Segmentation objects
 ```sql
 SELECT *
 FROM `bigquery-public-data.idc_current.segmentations`
 LIMIT 10
 ```
 **measurement_groups** - SR TID1500 measurement groups
 **qualitative_measurements** - Coded evaluations
 **quantitative_measurements** - Numeric measurements
 ### Collection Metadata
 **original_collections_metadata** - Collection-level descriptions
 ```sql
 SELECT
  collection_id,
  CancerTypes,
  TumorLocations,
  Subjects,
  src.source_doi,
  src.ImageTypes,
  src.license.license_short_name
 FROM `bigquery-public-data.idc_current.original_collections_metadata`,
 UNNEST(Sources) AS src
 WHERE CancerTypes LIKE '%Lung%'
 ```
 ## Common Query Patterns
 ### Find Collections by Criteria
 ```sql
 SELECT 
  collection_id,
  COUNT(DISTINCT PatientID) as patient_count,
  COUNT(DISTINCT SeriesInstanceUID) as series_count,
  ARRAY_AGG(DISTINCT Modality) as modalities
 FROM `bigquery-public-data.idc_current.dicom_all`
 WHERE BodyPartExamined LIKE '%BRAIN%'
 GROUP BY collection_id
 HAVING patient_count > 50
 ORDER BY patient_count DESC
 ```
 ### Get Download URLs
 ```sql
 SELECT
  SeriesInstanceUID,
  gcs_url
 FROM `bigquery-public-data.idc_current.dicom_all`
 WHERE collection_id = 'rider_pilot'
  AND Modality = 'CT'
 ```
 ### Find Studies with Multiple Modalities
 ```sql
 SELECT
  StudyInstanceUID,
  ARRAY_AGG(DISTINCT Modality) as modalities,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
 FROM `bigquery-public-data.idc_current.dicom_all`
 GROUP BY StudyInstanceUID
 HAVING ARRAY_LENGTH(ARRAY_AGG(DISTINCT Modality)) > 1
 LIMIT 100
 ```
 ### License Filtering
 ```sql
 SELECT
  collection_id,
  license_short_name,
  COUNT(*) as instance_count
 FROM `bigquery-public-data.idc_current.dicom_all`
 WHERE license_short_name = 'CC BY 4.0'
 GROUP BY collection_id, license_short_name
 ```
 ### Find Segmentations with Source Images
 ```sql
 SELECT
  src.collection_id,
  seg.SeriesInstanceUID as seg_series,
  seg.SegmentedPropertyType,
  src.SeriesInstanceUID as source_series,
  src.Modality as source_modality
 FROM `bigquery-public-data.idc_current.segmentations` seg
 JOIN `bigquery-public-data.idc_current.dicom_all` src
  ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
 WHERE src.collection_id = 'qin_prostate_repeatability'
 LIMIT 10
 ```
 ## Using Query Results with idc-index
 Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads):
 ```python
 from google.cloud import bigquery
 from idc_index import IDCClient
 # Initialize BigQuery client
 # Requires: pip install google-cloud-bigquery
 # Auth: gcloud auth application-default login
 # Project: needed for billing even on public datasets (free tier applies)
 bq_client = bigquery.Client(project="your-gcp-project-id")
 # Query for series with specific criteria
 query = """
 SELECT DISTINCT SeriesInstanceUID
 FROM `bigquery-public-data.idc_current.dicom_all`
 WHERE collection_id = 'tcga_luad'
  AND Modality = 'CT'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
 LIMIT 100
 """
 df = bq_client.query(query).to_dataframe()
 print(f"Found {len(df)} GE CT series")
 # Download with idc-index (no GCP auth required)
 idc_client = IDCClient()
 idc_client.download_from_selection(
    seriesInstanceUID=list(df['SeriesInstanceUID'].values),
    downloadDir="./tcga_luad_ge_ct"
 )
 ```
 ## Cost and Optimization
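 **Pricing:** On-demand pricing is about $6.25 per TB scanned at the time of writing (first 1 TB/month free); check the BigQuery pricing page for current rates. Most users stay within the free tier.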
 **Minimize data scanned:**
 - Select only needed columns (not `SELECT *`)
 - Filter early with `WHERE` clauses
 - Use `LIMIT` when testing
 - Use `dicom_all` instead of `dicom_metadata` when possible (smaller)
 - Preview queries in BQ console (free, shows bytes to scan)
 **Check cost before running:**
 ```python
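 # Reuses `bq_client` and `query` from the idc-index example above
 job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
 query_job = bq_client.query(query, job_config=job_config)
 print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB")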
 ```
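To put a dry-run estimate in context, the byte count can be compared against the monthly free tier. The following is a rough sketch only: `free_tier_usage` is a hypothetical helper, the 1 TB allowance is taken from the prerequisites in this guide, and the amount already used this month is something you would have to supply yourself.

```python
FREE_TIER_BYTES = 1e12  # 1 TB per month, as stated in the prerequisites


def free_tier_usage(bytes_processed: float, bytes_already_used: float = 0.0) -> dict:
    """Summarize how a dry-run estimate relates to the monthly free tier."""
    if bytes_processed < 0 or bytes_already_used < 0:
        raise ValueError("Byte counts must be non-negative")
    remaining = max(FREE_TIER_BYTES - bytes_already_used, 0.0)
    return {
        "scan_gb": bytes_processed / 1e9,
        "share_of_free_tier": bytes_processed / FREE_TIER_BYTES,
        "fits_in_remaining_free_tier": bytes_processed <= remaining,
    }


# Example: a 25 GB scan after 990 GB were already used this month
report = free_tier_usage(25e9, bytes_already_used=990e9)
print(report)
```

In practice `bytes_processed` would come from `query_job.total_bytes_processed` on a dry-run job.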
 **Use materialized tables:** IDC provides both views (`table_name_view`) and materialized tables (`table_name`). Always use the materialized tables (faster, lower cost).
 ## Clinical Data
 Clinical data is stored in separate datasets with collection-specific tables. Clinical tables were introduced in IDC v11, and not all collections have them.
 **List available clinical tables:**
 ```sql
 SELECT table_name
 FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES`
 ```
 **Query clinical data for a collection:**
 ```sql
 -- Example: TCGA-LUAD clinical data
 SELECT *
 FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical`
 LIMIT 10
 ```
 **Join clinical with imaging data:**
 ```sql
 SELECT
  d.PatientID,
  d.SeriesInstanceUID,
  d.Modality,
  c.age_at_diagnosis,
  c.pathologic_stage
 FROM `bigquery-public-data.idc_current.dicom_all` d
 JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c
  ON d.PatientID = c.dicom_patient_id
 WHERE d.collection_id = 'tcga_luad'
  AND d.Modality = 'CT'
 LIMIT 20
 ```
 **Note:** Clinical table schemas vary by collection. Check column names with `INFORMATION_SCHEMA.COLUMNS` before querying.
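A schema check like the one suggested in the note could be generated as follows. This is a sketch under assumptions: `columns_query` is a hypothetical helper, and the table name used in the example is the one from the queries above, which may not exist in every IDC version.

```python
import re

# Accept only plain identifiers so the generated SQL cannot be altered by the input
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def columns_query(table_name: str, dataset: str = "idc_current_clinical") -> str:
    """Return SQL that lists column names and types for one table."""
    for name in (table_name, dataset):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Not a plain identifier: {name!r}")
    return (
        "SELECT column_name, data_type\n"
        f"FROM `bigquery-public-data.{dataset}.INFORMATION_SCHEMA.COLUMNS`\n"
        f"WHERE table_name = '{table_name}'\n"
        "ORDER BY ordinal_position"
    )


sql = columns_query("tcga_luad_clinical")
print(sql)
# With a BigQuery client such as `bq_client` from the idc-index example,
# the query could then be run with: bq_client.query(sql).to_dataframe()
```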
 ## Important Notes
 - Tables are read-only (public dataset)
 - Schema changes between IDC versions
 - Use versioned datasets for reproducibility
 - Some DICOM sequences >15 levels deep are not extracted
 - Very large sequences (>1MB) may be truncated
 - Always check data license before use
 ## Common Errors
 **Issue: Billing must be enabled**
 - Cause: BigQuery requires a billing-enabled GCP project
 - Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead
 **Issue: Query exceeds resource limits**
 - Cause: Query scans too much data or is too complex
 - Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries
 **Issue: Column not found**
 - Cause: Field name typo or not in selected table
 - Solution: Check table schema first with `INFORMATION_SCHEMA.COLUMNS`
 **Issue: Permission denied**
 - Cause: Not authenticated to Google Cloud
 - Solution: Run `gcloud auth application-default login` or set GOOGLE_APPLICATION_CREDENTIALS
 ## Resources
 - [Understanding the BigQuery DICOM schema](https://docs.cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema)
 - [BigQuery Query Syntax](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)
 - [Kaggle Intro to SQL](https://www.kaggle.com/learn/intro-to-sql)
 - [Sample BigQuery queries of IDC data](https://github.com/ImagingDataCommons/idc-bigquery-cookbook)
--- a/scientific-skills/imaging-data-commons/references/dicomweb_guide.md
+++ b/scientific-skills/imaging-data-commons/references/dicomweb_guide.md
@@ -0,0 +1,308 @@
 # DICOMweb Guide for IDC
 IDC provides DICOMweb access through Google Cloud Healthcare API DICOM stores. This guide covers the implementation specifics and usage patterns.
 ## When to Use DICOMweb
 Use DICOMweb when you need:
 - Integration with PACS systems or DICOMweb-compatible tools
 - Streaming metadata without downloading full files
 - Building custom viewers or web applications
 - Using existing DICOMweb client libraries (OHIF, dicomweb-client, etc.)
 For most use cases, `idc-index` is simpler and recommended. Use DICOMweb when you specifically need the DICOMweb protocol.
 ## Endpoints
 ### Public Proxy (No Authentication)
 ```
 https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb
 ```
 - Points to the latest IDC version automatically
 - Daily quota applies (suitable for testing and moderate use)
 - No authentication required
 - Note: "viewer-only-no-downloads" in URL is legacy naming with no functional meaning
 ### Google Healthcare API (Requires Authentication)
 ```
 https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v{VERSION}/dicomWeb
 ```
 Replace `{VERSION}` with the IDC release number. To find the current version:
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 print(client.get_idc_version())  # e.g., "v23"; drop any leading "v" when filling in {VERSION}
 ```
 The Google Healthcare endpoint requires authentication and provides higher quotas. See [Authentication](#authentication-for-google-healthcare-api) section below.
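Filling in the store URL by hand is error-prone, so a small formatter may help. This is a sketch: `healthcare_dicomweb_url` is a hypothetical helper, and it assumes the `idc-store-v{VERSION}` naming shown above holds for the version you pass in.

```python
from typing import Union


def healthcare_dicomweb_url(version: Union[int, str]) -> str:
    """Build the Google Healthcare DICOMweb base URL for an IDC version.

    Accepts 23, "23" or "v23" so the result does not depend on whether
    the version string carries a leading "v".
    """
    number = str(version).strip().lower().lstrip("v")
    if not number.isdigit():
        raise ValueError(f"Unexpected IDC version: {version!r}")
    return (
        "https://healthcare.googleapis.com/v1/projects/nci-idc-data"
        "/locations/us-central1/datasets/idc"
        f"/dicomStores/idc-store-v{number}/dicomWeb"
    )


print(healthcare_dicomweb_url("v23"))
```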
 ## Implementation Details
 IDC DICOMweb is provided through Google Cloud Healthcare API DICOM stores. The implementation follows DICOM PS3.18 Web Services with specific characteristics documented in the [Google Healthcare DICOM conformance statement](https://docs.cloud.google.com/healthcare-api/docs/dicom).
 ### Supported Operations
 | Service | Description | Supported |
 |---------|-------------|-----------|
 | QIDO-RS | Search for DICOM objects | Yes |
 | WADO-RS | Retrieve DICOM objects and metadata | Yes |
 | STOW-RS | Store DICOM objects | No (IDC is read-only) |
 **Not supported:** URI Service, Worklist Service, Non-Patient Instance Service, Capabilities Transactions
 ### Searchable DICOM Tags (QIDO-RS)
 The implementation supports a limited set of searchable tags:
 | Level | Searchable Tags |
 |-------|-----------------|
 | Study | StudyInstanceUID, PatientName, PatientID, AccessionNumber, ReferringPhysicianName, StudyDate |
 | Series | All study tags + SeriesInstanceUID, Modality |
 | Instance | All series tags + SOPInstanceUID |
 **Important:** Only exact matching is supported, except for:
 - StudyDate: supports range queries
 - PatientName: supports fuzzy matching
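The table and matching rules above can be checked before a request is sent, which avoids a round trip that would end in an error. The sketch below is illustrative: `qido_params` is a hypothetical helper, the `YYYYMMDD-YYYYMMDD` range form and the `fuzzymatching` parameter come from the DICOMweb standard, and whether a given endpoint honors them should be verified against that endpoint.

```python
from typing import Dict, Optional, Tuple

STUDY_TAGS = frozenset({
    "StudyInstanceUID", "PatientName", "PatientID",
    "AccessionNumber", "ReferringPhysicianName", "StudyDate",
})
SEARCHABLE_TAGS = {
    "study": STUDY_TAGS,
    "series": STUDY_TAGS | {"SeriesInstanceUID", "Modality"},
    "instance": STUDY_TAGS | {"SeriesInstanceUID", "Modality", "SOPInstanceUID"},
}


def qido_params(level: str,
                study_date_range: Optional[Tuple[str, str]] = None,
                fuzzy_patient_name: bool = False,
                **filters: str) -> Dict[str, str]:
    """Build QIDO-RS query parameters, rejecting tags that are not searchable."""
    if level not in SEARCHABLE_TAGS:
        raise ValueError(f"Unknown level: {level!r}")
    unsupported = sorted(set(filters) - SEARCHABLE_TAGS[level])
    if unsupported:
        raise ValueError(f"Not searchable at {level} level: {unsupported}")
    params = dict(filters)
    if study_date_range is not None:
        start, end = study_date_range
        params["StudyDate"] = f"{start}-{end}"  # DICOM range matching
    if fuzzy_patient_name:
        params["fuzzymatching"] = "true"  # standard QIDO-RS query parameter
    return params


print(qido_params("series", Modality="CT", study_date_range=("20200101", "20201231")))
```

The returned dictionary can be passed as `params=` to `requests.get`.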
 ### Query Limitations
 - Maximum results: 5,000 for studies/series searches; 50,000 for instances
 - Maximum offset: 1,000,000
 - DICOM sequence tags larger than ~1 MB are not returned in metadata (BulkDataURI provided instead)
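Because each search response is capped, larger result sets have to be collected page by page with `limit` and `offset`. One possible shape for that loop is sketched below. `iter_pages` is a hypothetical helper and the fetch function is a stand-in; a real one would issue a QIDO-RS request with the given `limit` and `offset` and return the parsed JSON list.

```python
from typing import Callable, Iterator, List

MAX_OFFSET = 1_000_000  # upper bound on `offset` listed above


def iter_pages(fetch_page: Callable[[int, int], List[dict]],
               page_size: int = 1000) -> Iterator[dict]:
    """Yield records page by page until a short (or empty) page is returned."""
    offset = 0
    while offset <= MAX_OFFSET:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break  # last page reached
        offset += page_size


# Stand-in for a real request; returns slices of an in-memory list
fake_records = [{"n": i} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> List[dict]:
    return fake_records[offset:offset + limit]


print(len(list(iter_pages(fake_fetch, page_size=10))))
```

Passing the fetch function in as an argument keeps the paging logic independent of the HTTP client and easy to test offline.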
 ## Code Examples
 All examples use the public proxy endpoint. For authenticated access to Google Healthcare, see the [authentication section](#authentication-for-google-healthcare-api).
 ### Finding UIDs with idc-index
 Use `idc-index` to discover data, then use DICOMweb for metadata access:
 ```python
 from idc_index import IDCClient
 client = IDCClient()
 # Find studies of interest
 results = client.sql_query("""
    SELECT StudyInstanceUID, SeriesInstanceUID, PatientID, Modality
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'CT'
    LIMIT 5
 """)
 # Use these UIDs with DICOMweb
 study_uid = results.iloc[0]['StudyInstanceUID']
 series_uid = results.iloc[0]['SeriesInstanceUID']
 print(f"Study: {study_uid}")
 print(f"Series: {series_uid}")
 ```
 ### QIDO-RS: Search by UID
 ```python
 import requests
 base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb"
 # Search for a specific study
 study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440"
 response = requests.get(
    f"{base_url}/studies",
    params={"StudyInstanceUID": study_uid},
    headers={"Accept": "application/dicom+json"}
 )
 if response.status_code == 200:
    studies = response.json()
    print(f"Found {len(studies)} matching studies")
 ```
 ### QIDO-RS: List Series in a Study
 ```python
 import requests
 base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb"
 study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440"
 response = requests.get(
    f"{base_url}/studies/{study_uid}/series",
    headers={"Accept": "application/dicom+json"}
 )
 if response.status_code == 200:
    series_list = response.json()
    for series in series_list:
        # DICOM tags are returned as hex codes
        series_uid = series.get("0020000E", {}).get("Value", [None])[0]
        modality = series.get("00080060", {}).get("Value", [None])[0]
        description = series.get("0008103E", {}).get("Value", [""])[0]
        print(f"{modality}: {description} ({series_uid})")
 ```
 ### QIDO-RS: List Instances in a Series
 ```python
 import requests
 base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb"
 study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440"
 series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302"
 response = requests.get(
    f"{base_url}/studies/{study_uid}/series/{series_uid}/instances",
    params={"limit": 10},
    headers={"Accept": "application/dicom+json"}
 )
 if response.status_code == 200:
    instances = response.json()
    print(f"Found {len(instances)} instances")
    for inst in instances[:3]:
        sop_uid = inst.get("00080018", {}).get("Value", [None])[0]
        print(f"  SOPInstanceUID: {sop_uid}")
 ```
 ### WADO-RS: Retrieve Series Metadata
 ```python
 import requests
 base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb"
 study_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.307623500513044641407722230440"
 series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302"
 response = requests.get(
    f"{base_url}/studies/{study_uid}/series/{series_uid}/metadata",
    headers={"Accept": "application/dicom+json"}
 )
 if response.status_code == 200:
    instances = response.json()
    print(f"Retrieved metadata for {len(instances)} instances")
    # Extract image dimensions from first instance
    if instances:
        inst = instances[0]
        rows = inst.get("00280010", {}).get("Value", [None])[0]
        cols = inst.get("00280011", {}).get("Value", [None])[0]
        print(f"Image dimensions: {rows} x {cols}")
 ```
 ### Combined Workflow: idc-index Discovery + DICOMweb Metadata
 ```python
 from idc_index import IDCClient
 import requests
 # Use idc-index for efficient discovery
 idc = IDCClient()
 results = idc.sql_query("""
    SELECT StudyInstanceUID, SeriesInstanceUID, Modality, SeriesDescription
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
    LIMIT 1
 """)
 study_uid = results.iloc[0]['StudyInstanceUID']
 series_uid = results.iloc[0]['SeriesInstanceUID']
 print(f"Found: {results.iloc[0]['SeriesDescription']}")
 # Use DICOMweb to stream metadata without downloading files
 base_url = "https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb"
 response = requests.get(
    f"{base_url}/studies/{study_uid}/series/{series_uid}/metadata",
    headers={"Accept": "application/dicom+json"}
 )
 if response.status_code == 200:
    metadata = response.json()
    print(f"Retrieved metadata for {len(metadata)} instances without downloading files")
 ```
 ## Common DICOM Tags Reference
 DICOMweb returns tags as hexadecimal codes. Common tags:
 | Tag | Name | Description |
 |-----|------|-------------|
 | 00080018 | SOPInstanceUID | Unique instance identifier |
 | 00080020 | StudyDate | Date study was performed |
 | 00080060 | Modality | Imaging modality (CT, MR, PT, etc.) |
 | 0008103E | SeriesDescription | Description of series |
 | 00100020 | PatientID | Patient identifier |
 | 0020000D | StudyInstanceUID | Unique study identifier |
 | 0020000E | SeriesInstanceUID | Unique series identifier |
 | 00280010 | Rows | Image height in pixels |
 | 00280011 | Columns | Image width in pixels |
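Looking up values by keyword rather than by hex code tends to make client code easier to read. The sketch below maps the keywords from the table to their tags. `tag_value` is a hypothetical helper, and the sample record contains made-up values purely to show the DICOM JSON shape.

```python
from typing import Any

# Keyword -> hex tag, copied from the table above
TAGS = {
    "SOPInstanceUID": "00080018",
    "StudyDate": "00080020",
    "Modality": "00080060",
    "SeriesDescription": "0008103E",
    "PatientID": "00100020",
    "StudyInstanceUID": "0020000D",
    "SeriesInstanceUID": "0020000E",
    "Rows": "00280010",
    "Columns": "00280011",
}


def tag_value(record: dict, keyword: str, default: Any = None) -> Any:
    """Return the first value of a tag, or `default` if absent or empty."""
    element = record.get(TAGS[keyword], {})
    values = element.get("Value") or []
    return values[0] if values else default


# Made-up record in DICOM JSON form
sample = {
    "00080060": {"vr": "CS", "Value": ["CT"]},
    "00280010": {"vr": "US", "Value": [512]},
    "0008103E": {"vr": "LO"},  # attribute present but without a value
}
print(tag_value(sample, "Modality"), tag_value(sample, "Rows"),
      tag_value(sample, "SeriesDescription", default=""))
```

Unlike chained `.get(...)[0]` lookups, this also copes with attributes that are present but carry no `Value` list or an empty one.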
 ## Authentication for Google Healthcare API
 To use the Google Healthcare endpoint with higher quotas:
 ```python
 from google.auth import default
 from google.auth.transport.requests import Request
 import requests
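 # Get Application Default Credentials (requires `gcloud auth application-default login`)
 credentials, project = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
 credentials.refresh(Request())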
 # Build authenticated request
 base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb"
 response = requests.get(
    f"{base_url}/studies",
    params={"limit": 5},
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Accept": "application/dicom+json"
    }
 )
 ```
 **Prerequisites:**
 1. Google Cloud SDK installed (`gcloud`)
 2. Authenticated: `gcloud auth application-default login`
 3. Account has access to public Google Cloud datasets
 ## Troubleshooting
 ### Issue: 400 Bad Request on search queries
 - **Cause:** Using unsupported search parameters. The implementation only supports specific DICOM tags for filtering.
 - **Solution:** Restrict search parameters to the searchable tags listed above; UID-based queries (StudyInstanceUID, SeriesInstanceUID) are the most reliable. To filter on attributes outside that set, use `idc-index` to discover UIDs first, then query DICOMweb with specific UIDs.
 ### Issue: 403 Forbidden on Google Healthcare endpoint
 - **Cause:** Missing authentication or insufficient permissions
 - **Solution:** Run `gcloud auth application-default login` and ensure your account has access
 ### Issue: 429 Too Many Requests
 - **Cause:** Rate limit exceeded
 - **Solution:** Add delays between requests, reduce `limit` values, or use authenticated endpoint for higher quotas
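Adding delays between requests is often done with exponential backoff. The following is a minimal sketch, not a recommendation from IDC: `get_with_backoff` is a hypothetical helper, the delay schedule is an arbitrary choice, and the fake responses exist only so the example runs offline. In real use `send` could be something like `lambda: requests.get(url, headers=headers)`.

```python
import time
from typing import Any, Callable


def get_with_backoff(send: Callable[[], Any],
                     max_attempts: int = 5,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `send()` until the response is not a 429 or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = send()
        if getattr(response, "status_code", None) != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay
    return response  # still rate-limited after the final attempt


class FakeResponse:
    """Offline stand-in for an HTTP response object."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


responses = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
delays = []  # record delays instead of actually sleeping
result = get_with_backoff(lambda: next(responses), sleep=delays.append)
print(result.status_code, delays)
```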
 ### Issue: 204 No Content for valid UIDs
 - **Cause:** UID may be from an older IDC version not in current data
 - **Solution:** Verify UID exists using `idc-index` query first. The proxy points to the latest IDC version.
 ### Issue: Large metadata responses slow to parse
 - **Cause:** A series with many instances returns a large JSON response
 - **Solution:** Use `limit` parameter on instance queries, or query specific instances by SOPInstanceUID
 ### Issue: Response missing expected attributes
 - **Cause:** DICOM sequences larger than ~1 MB are excluded from metadata responses
 - **Solution:** Retrieve the full DICOM instance using WADO-RS instance retrieval if you need all attributes
 ## Resources
 - [Google Healthcare DICOM Conformance Statement](https://docs.cloud.google.com/healthcare-api/docs/dicom)
 - [DICOMweb Standard](https://www.dicomstandard.org/using/dicomweb)
 - [dicomweb-client Python library](https://dicomweb-client.readthedocs.io/)
 - [IDC Documentation](https://learn.canceridc.dev/)