# SQL Query Patterns for IDC

**Tested with:** idc-index 0.11.9 (IDC data version v23)

Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.

## When to Use This Guide

Load this guide when you need quick-reference SQL patterns for:
- Discovering available filter values (modalities, body parts, manufacturers)
- Finding annotations and segmentations across collections
- Querying slide microscopy and annotation data
- Estimating download sizes before download
- Linking imaging data to clinical data

For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`.

## Prerequisites

```bash
pip install --upgrade idc-index
```

```python
from idc_index import IDCClient
client = IDCClient()
```

## Discover Available Filter Values

```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index
    WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index
    WHERE Modality = 'MR'
    GROUP BY Manufacturer
    ORDER BY n DESC
""")
```

## Find Annotations and Segmentations

**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Find TotalSegmentator results with source image context
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")

# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AlgorithmName IS NOT NULL
    LIMIT 10
""")
# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more
```

## Query Slide Microscopy and Annotation Data

Use `sm_index` for slide microscopy metadata and `ann_index`/`ann_group_index` for annotations on slides (DICOM ANN objects). Filter annotation groups by `AnnotationGroupLabel` to find annotations by name.

```python
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")

# Example: find annotation groups by label within a collection
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
    FROM ann_group_index g
    JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
      AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
""")
```

See `references/digital_pathology_guide.md` for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.

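
When the collection and keyword come from variables rather than literals, the label query has to be assembled as a string. The sketch below is one possible way to do that, assuming the query is passed to `client.sql_query` as a plain SQL string as in the examples in this guide. `build_label_query` and `sql_quote` are hypothetical helpers written for illustration; they are not part of idc-index.

```python
def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_label_query(collection_id: str, keyword: str) -> str:
    """Build the annotation-group label query for one collection and keyword.

    Hypothetical helper (not an idc-index API). The keyword is trimmed and
    lowercased so that it matches the LOWER(...) LIKE pattern used above.
    """
    pattern = f"%{keyword.strip().lower()}%"
    return f"""
        SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
        FROM ann_group_index g
        JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
        WHERE i.collection_id = {sql_quote(collection_id)}
          AND LOWER(g.AnnotationGroupLabel) LIKE {sql_quote(pattern)}
    """


# Build the SQL text only; running it requires a client with the
# ann_group_index table already fetched:
#   client.sql_query(build_label_query("your_collection_id", "keyword"))
example_sql = build_label_query("your_collection_id", " Keyword ")
```

Doubling single quotes only guards against values that contain a quote character; `%` and `_` inside the keyword still act as LIKE wildcards.
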
## Estimate Download Size

```python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```

## Link to Clinical Data

```python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")
```

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.

## Troubleshooting

**Issue:** Query returns error "table not found"
- **Cause:** Index not fetched before query
- **Solution:** Call `client.fetch_index("table_name")` before using tables other than the primary `index`

**Issue:** LIKE pattern not matching expected results
- **Cause:** Case sensitivity or whitespace
- **Solution:** Use `LOWER(column)` for case-insensitive matching, `TRIM()` for whitespace

**Issue:** JOIN returns fewer rows than expected
- **Cause:** NULL values in join columns or no matching records
- **Solution:** Use `LEFT JOIN` to include rows without matches, check for NULLs with `IS NOT NULL`

## Resources

- `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references
- `references/clinical_data_guide.md` for clinical data patterns and value mapping
- `references/digital_pathology_guide.md` for pathology-specific queries
- `references/bigquery_guide.md` for advanced queries requiring full DICOM metadata