Merge pull request #80 from fedorov/update-idc-1.4.0

Update imaging-data-commons skill to v1.4.0
This commit is contained in:
Timothy Kassis
2026-03-05 14:06:03 -08:00
committed by GitHub
2 changed files with 155 additions and 6 deletions

View File

@@ -3,9 +3,9 @@ name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses. description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data. license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata: metadata:
version: 1.3.1 version: 1.4.0
skill-author: Andrey Fedorov, @fedorov skill-author: Andrey Fedorov, @fedorov
idc-index: "0.11.9" idc-index: "0.11.10"
idc-data-version: "v23" idc-data-version: "v23"
repository: https://github.com/ImagingDataCommons/idc-claude-skill repository: https://github.com/ImagingDataCommons/idc-claude-skill
--- ---
@@ -25,7 +25,7 @@ Use the `idc-index` Python package to query and download public cancer imaging d
```python ```python
import idc_index import idc_index
REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file REQUIRED_VERSION = "0.11.10" # Must match metadata.idc-index in this file
installed = idc_index.__version__ installed = idc_index.__version__
if installed < REQUIRED_VERSION: if installed < REQUIRED_VERSION:
@@ -229,7 +229,7 @@ print(client.get_idc_version()) # Should return "v23"
``` ```
If you see an older version, upgrade with: `pip install --upgrade idc-index` If you see an older version, upgrade with: `pip install --upgrade idc-index`
**Tested with:** idc-index 0.11.9 (IDC data version v23) **Tested with:** idc-index 0.11.10 (IDC data version v23)
**Optional (for data analysis):** **Optional (for data analysis):**
```bash ```bash

View File

@@ -1,6 +1,6 @@
# Digital Pathology Guide for IDC # Digital Pathology Guide for IDC
**Tested with:** IDC data version v23, idc-index 0.11.9 **Tested with:** IDC data version v23, idc-index 0.11.10
For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC. For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:
| Table | Row Granularity | Description | | Table | Row Granularity | Description |
|-------|-----------------|-------------| |-------|-----------------|-------------|
| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions | | `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
| `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images | | `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
| `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations | | `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
| `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide | | `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
""") """)
``` ```
### Filter by specimen preparation
The `sm_index` includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E) — use `array_to_string()` with `LIKE` or `list_contains()` to filter.
```python
# Find H&E-stained slides in a collection
client.fetch_index("sm_index")
client.sql_query("""
SELECT
i.PatientID,
s.staining_usingSubstance_CodeMeaning as staining,
s.embeddingMedium_CodeMeaning as embedding,
s.tissueFixative_CodeMeaning as fixative
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
LIMIT 10
""")
```
```python
# Compare FFPE vs frozen slides across collections
client.sql_query("""
SELECT
i.collection_id,
s.embeddingMedium_CodeMeaning as embedding,
COUNT(*) as slide_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id, embedding
ORDER BY i.collection_id, slide_count DESC
""")
```
## Identifying Tumor vs Normal Slides
The `sm_index` table provides two ways to identify tissue type:
| Column | Use Case |
|--------|----------|
| `primaryAnatomicStructureModifier_CodeMeaning` | Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
| `ContainerIdentifier` | Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
### Using structured tissue type metadata
```python
from idc_index import IDCClient
client = IDCClient()
client.fetch_index("sm_index")
# Discover tissue type values across all SM data
client.sql_query("""
SELECT
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count
FROM sm_index s
WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
GROUP BY tissue_type
ORDER BY slide_count DESC
""")
```
#### Example: Tumor vs normal slides in TCGA-BRCA
```python
# Tissue type breakdown for TCGA-BRCA
client.sql_query("""
SELECT
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count,
COUNT(DISTINCT i.PatientID) as patient_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
GROUP BY tissue_type
ORDER BY slide_count DESC
""")
# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
```
### Using TCGA barcode (TCGA collections only)
For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
```python
# Parse sample type from TCGA barcode
client.sql_query("""
SELECT
SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
GROUP BY sample_type_code, tissue_type
ORDER BY sample_type_code
""")
# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
```
The barcode approach catches cases where structured metadata is NULL (e.g., `06` = Metastatic slides have `primaryAnatomicStructureModifier_CodeMeaning` = NULL in TCGA-BRCA).
## Annotation Queries (ANN) ## Annotation Queries (ANN)
DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`. DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
""") """)
``` ```
## Finding Pre-Computed Analysis Results
IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
```python
from idc_index import IDCClient
client = IDCClient()
client.fetch_index("analysis_results_index")
# Find analysis results that include pathology annotations or segmentations
client.sql_query("""
SELECT
ar.analysis_result_id,
ar.analysis_result_title,
ar.Modalities,
ar.Subjects,
ar.Collections
FROM analysis_results_index ar
WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
ORDER BY ar.Subjects DESC
""")
```
### Find analysis results for a specific slide
```python
# Find all derived data (annotations, segmentations) for TCGA-BRCA slides
client.fetch_index("ann_index")
client.sql_query("""
SELECT
i.analysis_result_id,
i.PatientID,
a.referenced_SeriesInstanceUID as source_slide,
g.AnnotationGroupLabel,
g.NumberOfAnnotations,
g.AlgorithmName
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
LIMIT 10
""")
```
Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
## Filter by AnnotationGroupLabel ## Filter by AnnotationGroupLabel
`AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search. `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.