# BigQuery Guide for IDC **Tested with:** IDC data version v23 For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins. ## Prerequisites **Requirements:** 1. Google account 2. Google Cloud project with billing enabled (first 1 TB/month free) 3. `google-cloud-bigquery` Python package or BigQuery console access **Authentication setup:** ```bash # Install Google Cloud SDK, then: gcloud auth application-default login ``` ## When to Use BigQuery Use BigQuery instead of `idc-index` when you need: - Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index) - Complex joins across clinical data tables - DICOM sequence attributes (nested structures) - Queries on fields not in the idc-index mini-index ## Accessing IDC in BigQuery ### Dataset Structure All IDC tables are in the `bigquery-public-data` BigQuery project. **Current version (recommended for exploration):** - `bigquery-public-data.idc_current.*` - `bigquery-public-data.idc_current_clinical.*` **Versioned datasets (recommended for reproducibility):** - `bigquery-public-data.idc_v{IDC version}.*` - `bigquery-public-data.idc_v{IDC version}_clinical.*` Always use versioned datasets for reproducible research! ## Key Tables ### dicom_all Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from `dicom_metadata` plus collection and administrative metadata. See [dicom_all.sql](https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/sql/dicom_all.sql) for the exact derivation. ```sql SELECT collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, Modality, BodyPartExamined, SeriesDescription, gcs_url, license_short_name FROM `bigquery-public-data.idc_current.dicom_all` WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST' LIMIT 10 ``` ### Derived Tables **segmentations** - DICOM Segmentation objects ```sql SELECT * FROM `bigquery-public-data.idc_current.segmentations` LIMIT 10 ``` **measurement_groups** - SR TID1500 measurement groups **qualitative_measurements** - Coded evaluations **quantitative_measurements** - Numeric measurements ### Collection Metadata **original_collections_metadata** - Collection-level descriptions ```sql SELECT collection_id, CancerTypes, TumorLocations, Subjects, src.source_doi, src.ImageTypes, src.license.license_short_name FROM `bigquery-public-data.idc_current.original_collections_metadata`, UNNEST(Sources) AS src WHERE CancerTypes LIKE '%Lung%' ``` ## Common Query Patterns ### Find Collections by Criteria ```sql SELECT collection_id, COUNT(DISTINCT PatientID) as patient_count, COUNT(DISTINCT SeriesInstanceUID) as series_count, ARRAY_AGG(DISTINCT Modality) as modalities FROM `bigquery-public-data.idc_current.dicom_all` WHERE BodyPartExamined LIKE '%BRAIN%' GROUP BY collection_id HAVING patient_count > 50 ORDER BY patient_count DESC ``` ### Get Download URLs ```sql SELECT SeriesInstanceUID, gcs_url FROM `bigquery-public-data.idc_current.dicom_all` WHERE collection_id = 'rider_pilot' AND Modality = 'CT' ``` ### Find Studies with Multiple Modalities ```sql SELECT StudyInstanceUID, ARRAY_AGG(DISTINCT Modality) as modalities, COUNT(DISTINCT SeriesInstanceUID) as series_count FROM `bigquery-public-data.idc_current.dicom_all` GROUP BY StudyInstanceUID HAVING ARRAY_LENGTH(ARRAY_AGG(DISTINCT Modality)) > 1 LIMIT 100 ``` ### License Filtering ```sql SELECT collection_id, license_short_name, COUNT(*) as instance_count FROM `bigquery-public-data.idc_current.dicom_all` WHERE license_short_name = 'CC BY 4.0' GROUP BY collection_id, license_short_name ``` ### Find Segmentations with Source Images ```sql SELECT src.collection_id, seg.SeriesInstanceUID as seg_series, seg.SegmentedPropertyType, src.SeriesInstanceUID as source_series, src.Modality as source_modality FROM `bigquery-public-data.idc_current.segmentations` seg JOIN `bigquery-public-data.idc_current.dicom_all` src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID WHERE src.collection_id = 'qin_prostate_repeatability' LIMIT 10 ``` ## Using Query Results with idc-index Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads): ```python from google.cloud import bigquery from idc_index import IDCClient # Initialize BigQuery client # Requires: pip install google-cloud-bigquery # Auth: gcloud auth application-default login # Project: needed for billing even on public datasets (free tier applies) bq_client = bigquery.Client(project="your-gcp-project-id") # Query for series with specific criteria query = """ SELECT DISTINCT SeriesInstanceUID FROM `bigquery-public-data.idc_current.dicom_all` WHERE collection_id = 'tcga_luad' AND Modality = 'CT' AND Manufacturer = 'GE MEDICAL SYSTEMS' LIMIT 100 """ df = bq_client.query(query).to_dataframe() print(f"Found {len(df)} GE CT series") # Download with idc-index (no GCP auth required) idc_client = IDCClient() idc_client.download_from_selection( seriesInstanceUID=list(df['SeriesInstanceUID'].values), downloadDir="./tcga_luad_thin_ct" ) ``` ## Cost and Optimization **Pricing:** $5 per TB scanned (first 1 TB/month free). Most users stay within free tier. **Minimize data scanned:** - Select only needed columns (not `SELECT *`) - Filter early with `WHERE` clauses - Use `LIMIT` when testing - Use `dicom_all` instead of `dicom_metadata` when possible (smaller) - Preview queries in BQ console (free, shows bytes to scan) **Check cost before running:** ```python query_job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True)) print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB") ``` **Use materialized tables:** IDC provides both views (`table_name_view`) and materialized tables (`table_name`). Always use the materialized tables (faster, lower cost). ## Clinical Data Clinical data is in separate datasets with collection-specific tables. Not all collections have clinical data (started in IDC v11). **List available clinical tables:** ```sql SELECT table_name FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES` ``` **Query clinical data for a collection:** ```sql -- Example: TCGA-LUAD clinical data SELECT * FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` LIMIT 10 ``` **Join clinical with imaging data:** ```sql SELECT d.PatientID, d.SeriesInstanceUID, d.Modality, c.age_at_diagnosis, c.pathologic_stage FROM `bigquery-public-data.idc_current.dicom_all` d JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c ON d.PatientID = c.dicom_patient_id WHERE d.collection_id = 'tcga_luad' AND d.Modality = 'CT' LIMIT 20 ``` **Note:** Clinical table schemas vary by collection. Check column names with `INFORMATION_SCHEMA.COLUMNS` before querying. ## Important Notes - Tables are read-only (public dataset) - Schema changes between IDC versions - Use versioned datasets for reproducibility - Some DICOM sequences >15 levels deep are not extracted - Very large sequences (>1MB) may be truncated - Always check data license before use ## Common Errors **Issue: Billing must be enabled** - Cause: BigQuery requires a billing-enabled GCP project - Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead **Issue: Query exceeds resource limits** - Cause: Query scans too much data or is too complex - Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries **Issue: Column not found** - Cause: Field name typo or not in selected table - Solution: Check table schema first with `INFORMATION_SCHEMA.COLUMNS` **Issue: Permission denied** - Cause: Not authenticated to Google Cloud - Solution: Run `gcloud auth application-default login` or set GOOGLE_APPLICATION_CREDENTIALS ## Resources - [Understanding the BigQuery DICOM schema](https://docs.cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) - [BigQuery Query Syntax](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax) - [Kaggle Intro to SQL](https://www.kaggle.com/learn/intro-to-sql) - [Sample BigQuery queries of IDC data](https://github.com/ImagingDataCommons/idc-bigquery-cookbook)