Files
2026-01-22 18:26:30 -05:00

8.3 KiB

BigQuery Guide for IDC

Tested with: IDC data version v23

For most queries and downloads, use idc-index (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.

Prerequisites

Requirements:

  1. Google account
  2. Google Cloud project with billing enabled (first 1 TB/month free)
  3. google-cloud-bigquery Python package or BigQuery console access

Authentication setup:

# Install Google Cloud SDK, then:
gcloud auth application-default login

When to Use BigQuery

Use BigQuery instead of idc-index when you need:

  • Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index)
  • Complex joins across clinical data tables
  • DICOM sequence attributes (nested structures)
  • Queries on fields not in the idc-index mini-index

Accessing IDC in BigQuery

Dataset Structure

All IDC tables are in the bigquery-public-data BigQuery project.

Current version (recommended for exploration):

  • bigquery-public-data.idc_current.*
  • bigquery-public-data.idc_current_clinical.*

Versioned datasets (recommended for reproducibility):

  • bigquery-public-data.idc_v{IDC version}.*
  • bigquery-public-data.idc_v{IDC version}_clinical.*

Always use versioned datasets for reproducible research!

Key Tables

dicom_all

Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from dicom_metadata plus collection and administrative metadata. See dicom_all.sql for the exact derivation.

SELECT 
  collection_id,
  PatientID,
  StudyInstanceUID, 
  SeriesInstanceUID,
  Modality,
  BodyPartExamined,
  SeriesDescription,
  gcs_url,
  license_short_name
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
LIMIT 10

Derived Tables

segmentations - DICOM Segmentation objects

SELECT *
FROM `bigquery-public-data.idc_current.segmentations`
LIMIT 10

measurement_groups - SR TID1500 measurement groups qualitative_measurements - Coded evaluations quantitative_measurements - Numeric measurements

Collection Metadata

original_collections_metadata - Collection-level descriptions

SELECT
  collection_id,
  CancerTypes,
  TumorLocations,
  Subjects,
  src.source_doi,
  src.ImageTypes,
  src.license.license_short_name
FROM `bigquery-public-data.idc_current.original_collections_metadata`,
UNNEST(Sources) AS src
WHERE CancerTypes LIKE '%Lung%'

Common Query Patterns

Find Collections by Criteria

SELECT 
  collection_id,
  COUNT(DISTINCT PatientID) as patient_count,
  COUNT(DISTINCT SeriesInstanceUID) as series_count,
  ARRAY_AGG(DISTINCT Modality) as modalities
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE BodyPartExamined LIKE '%BRAIN%'
GROUP BY collection_id
HAVING patient_count > 50
ORDER BY patient_count DESC

Get Download URLs

SELECT
  SeriesInstanceUID,
  gcs_url
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'rider_pilot'
  AND Modality = 'CT'

Find Studies with Multiple Modalities

SELECT
  StudyInstanceUID,
  ARRAY_AGG(DISTINCT Modality) as modalities,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM `bigquery-public-data.idc_current.dicom_all`
GROUP BY StudyInstanceUID
HAVING ARRAY_LENGTH(ARRAY_AGG(DISTINCT Modality)) > 1
LIMIT 100

License Filtering

SELECT
  collection_id,
  license_short_name,
  COUNT(*) as instance_count
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE license_short_name = 'CC BY 4.0'
GROUP BY collection_id, license_short_name

Find Segmentations with Source Images

SELECT
  src.collection_id,
  seg.SeriesInstanceUID as seg_series,
  seg.SegmentedPropertyType,
  src.SeriesInstanceUID as source_series,
  src.Modality as source_modality
FROM `bigquery-public-data.idc_current.segmentations` seg
JOIN `bigquery-public-data.idc_current.dicom_all` src
  ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.collection_id = 'qin_prostate_repeatability'
LIMIT 10

Using Query Results with idc-index

Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads):

from google.cloud import bigquery
from idc_index import IDCClient

# Initialize BigQuery client
# Requires: pip install google-cloud-bigquery
# Auth: gcloud auth application-default login
# Project: needed for billing even on public datasets (free tier applies)
bq_client = bigquery.Client(project="your-gcp-project-id")

# Query for series with specific criteria
query = """
SELECT DISTINCT SeriesInstanceUID
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'tcga_luad'
  AND Modality = 'CT'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
LIMIT 100
"""

df = bq_client.query(query).to_dataframe()
print(f"Found {len(df)} GE CT series")

# Download with idc-index (no GCP auth required)
idc_client = IDCClient()
idc_client.download_from_selection(
    seriesInstanceUID=list(df['SeriesInstanceUID'].values),
    downloadDir="./tcga_luad_thin_ct"
)

Cost and Optimization

Pricing: $5 per TB scanned (first 1 TB/month free). Most users stay within free tier.

Minimize data scanned:

  • Select only needed columns (not SELECT *)
  • Filter early with WHERE clauses
  • Use LIMIT when testing
  • Use dicom_all instead of dicom_metadata when possible (smaller)
  • Preview queries in BQ console (free, shows bytes to scan)

Check cost before running:

query_job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB")

Use materialized tables: IDC provides both views (table_name_view) and materialized tables (table_name). Always use the materialized tables (faster, lower cost).

Clinical Data

Clinical data is in separate datasets with collection-specific tables. Not all collections have clinical data (started in IDC v11).

List available clinical tables:

SELECT table_name
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES`

Query clinical data for a collection:

-- Example: TCGA-LUAD clinical data
SELECT *
FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical`
LIMIT 10

Join clinical with imaging data:

SELECT
  d.PatientID,
  d.SeriesInstanceUID,
  d.Modality,
  c.age_at_diagnosis,
  c.pathologic_stage
FROM `bigquery-public-data.idc_current.dicom_all` d
JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c
  ON d.PatientID = c.dicom_patient_id
WHERE d.collection_id = 'tcga_luad'
  AND d.Modality = 'CT'
LIMIT 20

Note: Clinical table schemas vary by collection. Check column names with INFORMATION_SCHEMA.COLUMNS before querying.

Important Notes

  • Tables are read-only (public dataset)
  • Schema changes between IDC versions
  • Use versioned datasets for reproducibility
  • Some DICOM sequences >15 levels deep are not extracted
  • Very large sequences (>1MB) may be truncated
  • Always check data license before use

Common Errors

Issue: Billing must be enabled

  • Cause: BigQuery requires a billing-enabled GCP project
  • Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead

Issue: Query exceeds resource limits

  • Cause: Query scans too much data or is too complex
  • Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries

Issue: Column not found

  • Cause: Field name typo or not in selected table
  • Solution: Check table schema first with INFORMATION_SCHEMA.COLUMNS

Issue: Permission denied

  • Cause: Not authenticated to Google Cloud
  • Solution: Run gcloud auth application-default login or set GOOGLE_APPLICATION_CREDENTIALS

Resources