Release notes: - Bump idc-index requirement to 0.11.10 - digital_pathology_guide.md: add "Filter by specimen preparation" section with H&E staining and FFPE/frozen embedding query examples (array column syntax) - digital_pathology_guide.md: add "Identifying Tumor vs Normal Slides" section covering primaryAnatomicStructureModifier_CodeMeaning (all SM collections) and TCGA barcode parsing via ContainerIdentifier (TCGA-specific) - digital_pathology_guide.md: add "Finding Pre-Computed Analysis Results" section for discovering derived datasets (nuclei segmentations, TIL maps) via analysis_results_index - digital_pathology_guide.md: document per-annotation measurements in DICOM ANN objects (extraction via highdicom post-download, link to tutorial notebook) - digital_pathology_guide.md: update sm_index description with new columns (container/slide ID, tissue type, anatomic structure, diagnosis) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
15 KiB
Digital Pathology Guide for IDC
Tested with: IDC data version v23, idc-index 0.11.10
For general IDC queries and downloads, use idc-index (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
Index Tables for Digital Pathology
Five specialized index tables provide curated metadata without needing BigQuery:
| Table | Row Granularity | Description |
|---|---|---|
sm_index |
1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
sm_instance_index |
1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
seg_index |
1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
ann_index |
1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes referenced_SeriesInstanceUID linking to the annotated slide |
ann_group_index |
1 row = 1 annotation group | Annotation group details: AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName, property codes |
All require client.fetch_index("table_name") before querying. Use client.indices_overview to inspect column schemas programmatically.
Slide Microscopy Queries
Basic SM metadata
from idc_index import IDCClient
client = IDCClient()
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
SELECT i.collection_id, COUNT(*) as slides,
MIN(s.min_PixelSpacing_2sf) as min_resolution
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id
ORDER BY slides DESC
""")
Find SM series with specific properties
# Find high-resolution slides with specific objective lens power
client.fetch_index("sm_index")
client.sql_query("""
SELECT
i.collection_id,
i.PatientID,
s.ObjectiveLensPower,
s.min_PixelSpacing_2sf
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE s.ObjectiveLensPower >= 40
ORDER BY s.min_PixelSpacing_2sf
LIMIT 20
""")
Filter by specimen preparation
The sm_index includes staining, embedding, and fixative metadata. These columns are arrays (e.g., [hematoxylin stain, water soluble eosin stain] for H&E) — use array_to_string() with LIKE or list_contains() to filter.
# Find H&E-stained slides in a collection
client.fetch_index("sm_index")
client.sql_query("""
SELECT
i.PatientID,
s.staining_usingSubstance_CodeMeaning as staining,
s.embeddingMedium_CodeMeaning as embedding,
s.tissueFixative_CodeMeaning as fixative
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
LIMIT 10
""")
# Compare FFPE vs frozen slides across collections
client.sql_query("""
SELECT
i.collection_id,
s.embeddingMedium_CodeMeaning as embedding,
COUNT(*) as slide_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id, embedding
ORDER BY i.collection_id, slide_count DESC
""")
Identifying Tumor vs Normal Slides
The sm_index table provides two ways to identify tissue type:
| Column | Use Case |
|---|---|
primaryAnatomicStructureModifier_CodeMeaning |
Structured tissue type from DICOM specimen metadata (e.g., Neoplasm, Primary, Normal, Tumor, Neoplasm, Metastatic). Works across all collections with SM data. |
ContainerIdentifier |
Slide/container identifier. For TCGA collections, contains the TCGA barcode where the sample type code (positions 14-15) encodes tissue origin: 01-09 = tumor, 10-19 = normal. |
Using structured tissue type metadata
from idc_index import IDCClient
client = IDCClient()
client.fetch_index("sm_index")
# Discover tissue type values across all SM data
client.sql_query("""
SELECT
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count
FROM sm_index s
WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
GROUP BY tissue_type
ORDER BY slide_count DESC
""")
Example: Tumor vs normal slides in TCGA-BRCA
# Tissue type breakdown for TCGA-BRCA
client.sql_query("""
SELECT
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count,
COUNT(DISTINCT i.PatientID) as patient_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
GROUP BY tissue_type
ORDER BY slide_count DESC
""")
# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
Using TCGA barcode (TCGA collections only)
For TCGA collections, ContainerIdentifier contains the slide barcode (e.g., TCGA-E9-A3X8-01A-03-TSC). Extract the sample type code to classify tissue:
# Parse sample type from TCGA barcode
client.sql_query("""
SELECT
SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
COUNT(*) as slide_count
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
GROUP BY sample_type_code, tissue_type
ORDER BY sample_type_code
""")
# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
The barcode approach catches cases where structured metadata is NULL (e.g., 06 = Metastatic slides have primaryAnatomicStructureModifier_CodeMeaning = NULL in TCGA-BRCA).
Annotation Queries (ANN)
DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations on slide microscopy images. They appear in ann_index (series-level) and ann_group_index (group-level detail). Each ANN series references the slide it annotates via referenced_SeriesInstanceUID.
Basic annotation discovery
# Find annotation series and their referenced images
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
a.SeriesInstanceUID as ann_series,
a.AnnotationCoordinateType,
a.referenced_SeriesInstanceUID as source_series
FROM ann_index a
LIMIT 10
""")
Annotation group statistics
# Get annotation group details (graphic types, counts, algorithms)
client.sql_query("""
SELECT
GraphicType,
SUM(NumberOfAnnotations) as total_annotations,
COUNT(*) as group_count
FROM ann_group_index
GROUP BY GraphicType
ORDER BY total_annotations DESC
""")
Find annotations with source slide context
# Find annotations with their source slide microscopy context
client.sql_query("""
SELECT
i.collection_id,
g.GraphicType,
g.AnnotationPropertyType_CodeMeaning,
g.AlgorithmName,
g.NumberOfAnnotations
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
WHERE g.AlgorithmName IS NOT NULL
LIMIT 10
""")
Segmentations on Slide Microscopy
DICOM Segmentations (Modality = 'SEG') are used for both radiology (e.g., organ segmentations on CT) and pathology (e.g., tissue region segmentations on whole slide images). Use seg_index.segmented_SeriesInstanceUID to find the source series, then filter by source Modality to isolate pathology segmentations.
# Find segmentations whose source is a slide microscopy image
client.fetch_index("seg_index")
client.fetch_index("sm_index")
client.sql_query("""
SELECT
seg.SeriesInstanceUID as seg_series,
seg.AlgorithmName,
seg.total_segments,
src.collection_id,
src.Modality as source_modality
FROM seg_index seg
JOIN index src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'SM'
LIMIT 20
""")
Finding Pre-Computed Analysis Results
IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by analysis_result_id in the main index table. Use analysis_results_index to discover what's available for pathology.
from idc_index import IDCClient
client = IDCClient()
client.fetch_index("analysis_results_index")
# Find analysis results that include pathology annotations or segmentations
client.sql_query("""
SELECT
ar.analysis_result_id,
ar.analysis_result_title,
ar.Modalities,
ar.Subjects,
ar.Collections
FROM analysis_results_index ar
WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
ORDER BY ar.Subjects DESC
""")
Find analysis results for a specific slide
# Find all derived data (annotations, segmentations) for TCGA-BRCA slides
client.fetch_index("ann_index")
client.sql_query("""
SELECT
i.analysis_result_id,
i.PatientID,
a.referenced_SeriesInstanceUID as source_slide,
g.AnnotationGroupLabel,
g.NumberOfAnnotations,
g.AlgorithmName
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'tcga_brca'
LIMIT 10
""")
Annotation objects can also contain per-annotation measurements (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using highdicom (ann.get_annotation_groups(), group.get_measurements()). See the microscopy_dicom_ann_intro tutorial for a worked example including spatial analysis and cellularity computation.
Filter by AnnotationGroupLabel
AnnotationGroupLabel is the most direct column for finding annotation groups by name or semantic content. Use LIKE with wildcards for text search.
Simple label filtering
# Find annotation groups by label (e.g., groups mentioning "blast")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
g.SeriesInstanceUID,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
g.AlgorithmName
FROM ann_group_index g
WHERE LOWER(g.AnnotationGroupLabel) LIKE '%blast%'
ORDER BY g.NumberOfAnnotations DESC
""")
Label filtering with collection context
# Find annotation groups matching a label within a specific collection
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
i.collection_id,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
g.AnnotationPropertyType_CodeMeaning
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
ORDER BY g.NumberOfAnnotations DESC
""")
Annotations on Slide Microscopy (SM + ANN Cross-Reference)
When looking for annotations related to slide microscopy data, use both SM and ANN tables together. The ann_index.referenced_SeriesInstanceUID links each annotation series to its source slide.
# Find slide microscopy images and their annotations in a collection
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT
i.collection_id,
s.ObjectiveLensPower,
g.AnnotationGroupLabel,
g.NumberOfAnnotations,
g.GraphicType
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN sm_index s ON a.referenced_SeriesInstanceUID = s.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
ORDER BY g.NumberOfAnnotations DESC
""")
Join Patterns
SM join (slide microscopy details with collection context)
client.fetch_index("sm_index")
result = client.sql_query("""
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
FROM index i
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
LIMIT 10
""")
ANN join (annotation groups with collection context)
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
result = client.sql_query("""
SELECT
i.collection_id,
g.AnnotationGroupLabel,
g.GraphicType,
g.NumberOfAnnotations,
a.referenced_SeriesInstanceUID as source_series
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
LIMIT 10
""")
Related Tools
The following tools work with DICOM format for digital pathology workflows:
Python Libraries:
- highdicom - High-level DICOM abstractions for Python. Create and read DICOM Segmentations (SEG), Structured Reports (SR), and parametric maps for pathology and radiology. Developed by IDC.
- wsidicom - Python package for reading DICOM WSI datasets. Parses metadata into easy-to-use dataclasses for whole slide image analysis.
- TIA-Toolbox - End-to-end computational pathology library with DICOM support via
DICOMWSIReader. Provides tile extraction, feature extraction, and pretrained deep learning models. - EZ-WSI-DICOMweb - Extract image patches from DICOM whole slide images via DICOMweb. Designed for AI/ML workflows with cloud DICOM stores.
Viewers:
- Slim - Web-based DICOM slide microscopy viewer and annotation tool. Supports brightfield and multiplexed immunofluorescence imaging via DICOMweb. Developed by IDC.
- QuPath - Cross-platform open source software for whole slide image analysis. Supports DICOM WSI via Bio-Formats and OpenSlide (v0.4.0+).
Conversion:
- dicom_wsi - Python implementation for converting proprietary WSI formats to DICOM-compliant files.