mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Merge pull request #80 from fedorov/update-idc-1.4.0
Update imaging-data-commons skill to v1.4.0
@@ -3,9 +3,9 @@ name: imaging-data-commons
 description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
 metadata:
-  version: 1.3.1
+  version: 1.4.0
   skill-author: Andrey Fedorov, @fedorov
-  idc-index: "0.11.9"
+  idc-index: "0.11.10"
   idc-data-version: "v23"
   repository: https://github.com/ImagingDataCommons/idc-claude-skill
 ---
@@ -25,7 +25,7 @@ Use the `idc-index` Python package to query and download public cancer imaging d
 ```python
 import idc_index
 
-REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file
+REQUIRED_VERSION = "0.11.10" # Must match metadata.idc-index in this file
 installed = idc_index.__version__
 
 if installed < REQUIRED_VERSION:
@@ -229,7 +229,7 @@ print(client.get_idc_version()) # Should return "v23"
 ```
 If you see an older version, upgrade with: `pip install --upgrade idc-index`
 
-**Tested with:** idc-index 0.11.9 (IDC data version v23)
+**Tested with:** idc-index 0.11.10 (IDC data version v23)
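
Dotted version strings such as `"0.11.9"` and `"0.11.10"` do not order correctly under plain string comparison, because Python compares them character by character. The sketch below shows a numeric comparison instead. `parse_version` is a hypothetical helper, not part of idc-index, and it assumes purely numeric, dot-separated versions; for pre-release or suffixed versions, `packaging.version.Version` is likely the more robust choice.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a purely numeric dotted version such as "0.11.10" into ints."""
    return tuple(int(part) for part in version.split("."))


REQUIRED_VERSION = "0.11.10"
installed = "0.11.9"  # stand-in for idc_index.__version__ (assumed value)

# Character-by-character comparison: "9" sorts after "1", so the older
# release is not recognised as outdated.
string_says_outdated = installed < REQUIRED_VERSION

# Integer-tuple comparison: (0, 11, 9) < (0, 11, 10).
numeric_says_outdated = parse_version(installed) < parse_version(REQUIRED_VERSION)

print(string_says_outdated, numeric_says_outdated)
```

With these stand-in values the string comparison misses the outdated install, while the tuple comparison flags it.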
 
 **Optional (for data analysis):**
 ```bash
@@ -1,6 +1,6 @@
 # Digital Pathology Guide for IDC
 
-**Tested with:** IDC data version v23, idc-index 0.11.9
+**Tested with:** IDC data version v23, idc-index 0.11.10
 
 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
 
@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:
 
 | Table | Row Granularity | Description |
 |-------|-----------------|-------------|
-| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
+| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
 | `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
 | `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
 | `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
 """)
 ```
 
+### Filter by specimen preparation
+
+The `sm_index` includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E) — use `array_to_string()` with `LIKE` or `list_contains()` to filter.
+
+```python
+# Find H&E-stained slides in a collection
+client.fetch_index("sm_index")
+client.sql_query("""
+SELECT
+    i.PatientID,
+    s.staining_usingSubstance_CodeMeaning as staining,
+    s.embeddingMedium_CodeMeaning as embedding,
+    s.tissueFixative_CodeMeaning as fixative
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+    AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
+LIMIT 10
+""")
+```
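
The two filtering options named in the paragraph above behave differently: `array_to_string()` with `LIKE` is a substring match over the joined text, while `list_contains()` tests for an exact element. The following is a pure-Python analogy only (it makes no idc-index or DuckDB calls, and both helper names are hypothetical), using the H&E example values from the text.

```python
stains = ["hematoxylin stain", "water soluble eosin stain"]


def like_match(values, needle):
    """Analogue of: array_to_string(values, ', ') LIKE '%needle%'."""
    return needle in ", ".join(values)


def contains_exact(values, element):
    """Analogue of: list_contains(values, element)."""
    return element in values


substring_hit = like_match(stains, "hematoxylin")            # partial text is enough
exact_partial = contains_exact(stains, "hematoxylin")        # not a full element
exact_full = contains_exact(stains, "hematoxylin stain")     # full element value

print(substring_hit, exact_partial, exact_full)
```

In practice this suggests using the `LIKE` form for exploratory searches and the exact-element form once the precise code meaning is known.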
+
+```python
+# Compare FFPE vs frozen slides across collections
+client.sql_query("""
+SELECT
+    i.collection_id,
+    s.embeddingMedium_CodeMeaning as embedding,
+    COUNT(*) as slide_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+GROUP BY i.collection_id, embedding
+ORDER BY i.collection_id, slide_count DESC
+""")
+```
+
+## Identifying Tumor vs Normal Slides
+
+The `sm_index` table provides two ways to identify tissue type:
+
+| Column | Use Case |
+|--------|----------|
+| `primaryAnatomicStructureModifier_CodeMeaning` | Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
+| `ContainerIdentifier` | Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
+
+### Using structured tissue type metadata
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("sm_index")
+
+# Discover tissue type values across all SM data
+client.sql_query("""
+SELECT
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count
+FROM sm_index s
+WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
+GROUP BY tissue_type
+ORDER BY slide_count DESC
+""")
+```
+
+#### Example: Tumor vs normal slides in TCGA-BRCA
+
+```python
+# Tissue type breakdown for TCGA-BRCA
+client.sql_query("""
+SELECT
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count,
+    COUNT(DISTINCT i.PatientID) as patient_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+GROUP BY tissue_type
+ORDER BY slide_count DESC
+""")
+# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
+```
+
+### Using TCGA barcode (TCGA collections only)
+
+For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
+
+```python
+# Parse sample type from TCGA barcode
+client.sql_query("""
+SELECT
+    SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+GROUP BY sample_type_code, tissue_type
+ORDER BY sample_type_code
+""")
+# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
+```
+
+The barcode approach catches cases where structured metadata is NULL (e.g., `06` = Metastatic slides have `primaryAnatomicStructureModifier_CodeMeaning` = NULL in TCGA-BRCA).
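
The same barcode rule can be applied client-side, for example to a `ContainerIdentifier` column after a query result has been loaded. The sketch below mirrors `SUBSTRING(SPLIT_PART(..., '-', 4), 1, 2)` in plain Python. `tcga_sample_type` is a hypothetical helper, not part of idc-index, and the `"other"` bucket for codes outside `01`-`19` is an assumption made for this illustration.

```python
def tcga_sample_type(barcode: str) -> tuple[str, str]:
    """Return (sample_type_code, tissue_class) for a TCGA slide barcode."""
    fields = barcode.split("-")
    # The fourth dash-separated field holds the sample type code plus a
    # vial letter, e.g. "01A".
    if len(fields) < 4 or not fields[3][:2].isdigit():
        raise ValueError(f"not a TCGA sample-level barcode: {barcode!r}")
    code = fields[3][:2]
    number = int(code)
    if 1 <= number <= 9:
        tissue = "tumor"
    elif 10 <= number <= 19:
        tissue = "normal"
    else:
        tissue = "other"  # assumption: anything outside 01-19 is left unclassified
    return code, tissue


example = tcga_sample_type("TCGA-E9-A3X8-01A-03-TSC")
print(example)
```

Under the `01`-`09` rule, a `06` barcode is classified as tumor even when the structured tissue-type column is NULL.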
+
 ## Annotation Queries (ANN)
 
 DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
 """)
 ```
 
+## Finding Pre-Computed Analysis Results
+
+IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("analysis_results_index")
+
+# Find analysis results that include pathology annotations or segmentations
+client.sql_query("""
+SELECT
+    ar.analysis_result_id,
+    ar.analysis_result_title,
+    ar.Modalities,
+    ar.Subjects,
+    ar.Collections
+FROM analysis_results_index ar
+WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
+ORDER BY ar.Subjects DESC
+""")
+```
+
+### Find analysis results for a specific slide
+
+```python
+# Find all derived data (annotations, segmentations) for TCGA-BRCA slides
+client.fetch_index("ann_index")
+client.sql_query("""
+SELECT
+    i.analysis_result_id,
+    i.PatientID,
+    a.referenced_SeriesInstanceUID as source_slide,
+    g.AnnotationGroupLabel,
+    g.NumberOfAnnotations,
+    g.AlgorithmName
+FROM ann_group_index g
+JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
+JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+LIMIT 10
+""")
+```
+
+Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
+
 ## Filter by AnnotationGroupLabel
 
 `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.