From 0c4a7eaf16c3fbb6c90d992cff3ce3c962562830 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov
Date: Wed, 4 Mar 2026 17:26:57 -0500
Subject: [PATCH] Update imaging-data-commons skill to v1.4.0

Release notes:
- Bump idc-index requirement to 0.11.10
- digital_pathology_guide.md: add "Filter by specimen preparation" section with H&E staining and FFPE/frozen embedding query examples (array column syntax)
- digital_pathology_guide.md: add "Identifying Tumor vs Normal Slides" section covering primaryAnatomicStructureModifier_CodeMeaning (all SM collections) and TCGA barcode parsing via ContainerIdentifier (TCGA-specific)
- digital_pathology_guide.md: add "Finding Pre-Computed Analysis Results" section for discovering derived datasets (nuclei segmentations, TIL maps) via analysis_results_index
- digital_pathology_guide.md: document per-annotation measurements in DICOM ANN objects (extraction via highdicom post-download, link to tutorial notebook)
- digital_pathology_guide.md: update sm_index description with new columns (container/slide ID, tissue type, anatomic structure, diagnosis)

Co-Authored-By: Claude Sonnet 4.6
---
 .../imaging-data-commons/SKILL.md             |   8 +-
 .../references/digital_pathology_guide.md     | 153 +++++++++++++++++-
 2 files changed, 155 insertions(+), 6 deletions(-)

diff --git a/scientific-skills/imaging-data-commons/SKILL.md b/scientific-skills/imaging-data-commons/SKILL.md
index ef55896..95fddaa 100644
--- a/scientific-skills/imaging-data-commons/SKILL.md
+++ b/scientific-skills/imaging-data-commons/SKILL.md
@@ -3,9 +3,9 @@ name: imaging-data-commons
 description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
 metadata:
-    version: 1.3.1
+    version: 1.4.0
     skill-author: Andrey Fedorov, @fedorov
-    idc-index: "0.11.9"
+    idc-index: "0.11.10"
     idc-data-version: "v23"
     repository: https://github.com/ImagingDataCommons/idc-claude-skill
 ---
@@ -25,7 +25,7 @@ Use the `idc-index` Python package to query and download public cancer imaging d
 ```python
 import idc_index

-REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file
+REQUIRED_VERSION = "0.11.10" # Must match metadata.idc-index in this file
 installed = idc_index.__version__

 if installed < REQUIRED_VERSION:
@@ -229,7 +229,7 @@ print(client.get_idc_version()) # Should return "v23"
 ```
 If you see an older version, upgrade with: `pip install --upgrade idc-index`

-**Tested with:** idc-index 0.11.9 (IDC data version v23)
+**Tested with:** idc-index 0.11.10 (IDC data version v23)

 **Optional (for data analysis):**
 ```bash
diff --git a/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md b/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md
index ecf0be5..c574074 100644
--- a/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md
+++ b/scientific-skills/imaging-data-commons/references/digital_pathology_guide.md
@@ -1,6 +1,6 @@
 # Digital Pathology Guide for IDC

-**Tested with:** IDC data version v23, idc-index 0.11.9
+**Tested with:** IDC data version v23, idc-index 0.11.10

 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
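The version check in SKILL.md evaluates `installed < REQUIRED_VERSION` on plain strings, which orders versions character by character. With the requirement now at 0.11.10 this matters, because `"0.11.9" < "0.11.10"` is `False` and an outdated install would pass the check. A minimal sketch of a numeric comparison follows, assuming versions consist only of dot-separated integers; the helper `parse_version` is hypothetical and not part of idc-index, and a version parser such as `packaging.version.Version` would be the sturdier choice if pre-release tags can occur.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.11.10' into (0, 11, 10). Assumes numeric dotted components only."""
    return tuple(int(part) for part in version.split("."))


REQUIRED_VERSION = "0.11.10"
installed = "0.11.9"  # stand-in for idc_index.__version__

# Lexicographic string comparison: '9' sorts after '1', so the older version is missed.
string_says_outdated = installed < REQUIRED_VERSION

# Numeric tuple comparison: (0, 11, 9) < (0, 11, 10).
tuple_says_outdated = parse_version(installed) < parse_version(REQUIRED_VERSION)

print(string_says_outdated, tuple_says_outdated)  # False True
```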
@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:

 | Table | Row Granularity | Description |
 |-------|-----------------|-------------|
-| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
+| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
 | `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
 | `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
 | `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
 """)
 ```

+### Filter by specimen preparation
+
+The `sm_index` includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E) — use `array_to_string()` with `LIKE` or `list_contains()` to filter.
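The two filtering styles are not interchangeable: `array_to_string(...) LIKE '%hematoxylin%'` is a substring match over the joined text, while `list_contains()` needs an exact element value. A plain-Python analogy, using the H&E example above as a made-up row rather than real `sm_index` data (the variable names are illustrative only):

```python
# Hypothetical staining values for one slide, mirroring the H&E example.
staining = ["hematoxylin stain", "water soluble eosin stain"]

# Analogue of array_to_string(col, ', ') LIKE '%hematoxylin%':
# substring match on the joined text.
joined = ", ".join(staining)
substring_match = "hematoxylin" in joined

# Analogue of list_contains(col, value): exact element match,
# so a partial name does not match but the full code meaning does.
exact_partial = "hematoxylin" in staining
exact_full = "hematoxylin stain" in staining

print(substring_match, exact_partial, exact_full)  # True False True
```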
+
+```python
+# Find H&E-stained slides in a collection
+client.fetch_index("sm_index")
+client.sql_query("""
+    SELECT
+        i.PatientID,
+        s.staining_usingSubstance_CodeMeaning as staining,
+        s.embeddingMedium_CodeMeaning as embedding,
+        s.tissueFixative_CodeMeaning as fixative
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+      AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
+    LIMIT 10
+""")
+```
+
+```python
+# Compare FFPE vs frozen slides across collections
+client.sql_query("""
+    SELECT
+        i.collection_id,
+        s.embeddingMedium_CodeMeaning as embedding,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    GROUP BY i.collection_id, embedding
+    ORDER BY i.collection_id, slide_count DESC
+""")
+```
+
+## Identifying Tumor vs Normal Slides
+
+The `sm_index` table provides two ways to identify tissue type:
+
+| Column | Use Case |
+|--------|----------|
+| `primaryAnatomicStructureModifier_CodeMeaning` | Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
+| `ContainerIdentifier` | Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
+
+### Using structured tissue type metadata
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("sm_index")
+
+# Discover tissue type values across all SM data
+client.sql_query("""
+    SELECT
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
+    GROUP BY tissue_type
+    ORDER BY slide_count DESC
+""")
+```
+
+#### Example: Tumor vs normal slides in TCGA-BRCA
+
+```python
+# Tissue type breakdown for TCGA-BRCA
+client.sql_query("""
+    SELECT
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count,
+        COUNT(DISTINCT i.PatientID) as patient_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    GROUP BY tissue_type
+    ORDER BY slide_count DESC
+""")
+# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
+```
+
+### Using TCGA barcode (TCGA collections only)
+
+For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
+
+```python
+# Parse sample type from TCGA barcode
+client.sql_query("""
+    SELECT
+        SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    GROUP BY sample_type_code, tissue_type
+    ORDER BY sample_type_code
+""")
+# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
+```
+
+The barcode approach catches cases where structured metadata is NULL (e.g., `06` = Metastatic slides have `primaryAnatomicStructureModifier_CodeMeaning` = NULL in TCGA-BRCA).
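The same classification can also be done client-side once barcodes are available as Python strings. A minimal sketch, assuming barcodes follow the `TCGA-XX-XXXX-NNA-...` layout described above; the function `classify_tcga_sample` is hypothetical and not part of idc-index, and the `control` range (20-29) is taken from the TCGA sample type code table rather than from this guide.

```python
def classify_tcga_sample(barcode: str) -> str:
    """Classify a TCGA barcode as tumor, normal, or control from its sample type code."""
    parts = barcode.split("-")
    # The sample type code is the first two characters of the fourth field (e.g. '01A').
    if len(parts) < 4 or parts[0] != "TCGA" or not parts[3][:2].isdecimal():
        return "unknown"
    code = int(parts[3][:2])
    if 1 <= code <= 9:
        return "tumor"      # includes 06 = metastatic
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


print(classify_tcga_sample("TCGA-E9-A3X8-01A-03-TSC"))  # tumor
```

Returning `"unknown"` instead of raising keeps the helper usable on non-TCGA `ContainerIdentifier` values, which do not follow the barcode layout.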
+
 ## Annotation Queries (ANN)

 DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
 """)
 ```

+## Finding Pre-Computed Analysis Results
+
+IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("analysis_results_index")
+
+# Find analysis results that include pathology annotations or segmentations
+client.sql_query("""
+    SELECT
+        ar.analysis_result_id,
+        ar.analysis_result_title,
+        ar.Modalities,
+        ar.Subjects,
+        ar.Collections
+    FROM analysis_results_index ar
+    WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
+    ORDER BY ar.Subjects DESC
+""")
+```
+
+### Find analysis results for a specific slide
+
+```python
+# Find all derived data (annotations, segmentations) for TCGA-BRCA slides
+client.fetch_index("ann_index")
+client.sql_query("""
+    SELECT
+        i.analysis_result_id,
+        i.PatientID,
+        a.referenced_SeriesInstanceUID as source_slide,
+        g.AnnotationGroupLabel,
+        g.NumberOfAnnotations,
+        g.AlgorithmName
+    FROM ann_group_index g
+    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
+    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    LIMIT 10
+""")
+```
+
+Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
+
 ## Filter by AnnotationGroupLabel

 `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.