BigQuery Guide for IDC

Tested with: IDC data version v23

For most queries and downloads, use idc-index (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.

Prerequisites

Requirements:

  1. Google account
  2. Google Cloud project with billing enabled (first 1 TB/month free)
  3. google-cloud-bigquery Python package or BigQuery console access

Authentication setup:

# Install Google Cloud SDK, then:
gcloud auth application-default login

When to Use BigQuery

Use BigQuery instead of idc-index when you need:

  • Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index)
  • Complex joins across clinical data tables
  • DICOM sequence attributes (nested structures)
  • Queries on fields not in the idc-index mini-index
  • Private DICOM elements (vendor-specific tags in OtherElements column)

Accessing IDC in BigQuery

Dataset Structure

All IDC tables are in the bigquery-public-data BigQuery project.

Current version (recommended for exploration):

  • bigquery-public-data.idc_current.*
  • bigquery-public-data.idc_current_clinical.*

Versioned datasets (recommended for reproducibility):

  • bigquery-public-data.idc_v{IDC version}.*
  • bigquery-public-data.idc_v{IDC version}_clinical.*

Always use versioned datasets for reproducible research!
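One way to make version pinning hard to forget is to build table names through a small helper. This is a minimal sketch: the function name `idc_table` and its parameters are hypothetical, and only the dataset naming pattern is taken from the list above.

```python
from typing import Optional

PROJECT = "bigquery-public-data"


def idc_table(table: str, version: Optional[int] = None, clinical: bool = False) -> str:
    """Return a fully qualified IDC table name.

    version=None targets idc_current (exploration); an integer pins
    idc_v<version> (reproducibility).
    """
    dataset = "idc_current" if version is None else f"idc_v{version}"
    if clinical:
        dataset += "_clinical"
    return f"{PROJECT}.{dataset}.{table}"


# Pin the release this guide was tested with (v23)
query = f"SELECT COUNT(*) AS n FROM `{idc_table('dicom_all', version=23)}`"
```

Recording the version number next to your results lets the same query be re-run later against identical data.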

Key Tables

dicom_all

Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from dicom_metadata plus collection and administrative metadata. See dicom_all.sql for the exact derivation.

SELECT 
  collection_id,
  PatientID,
  StudyInstanceUID, 
  SeriesInstanceUID,
  Modality,
  BodyPartExamined,
  SeriesDescription,
  gcs_url,
  license_short_name
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
LIMIT 10

Derived Tables

segmentations - DICOM Segmentation objects

SELECT *
FROM `bigquery-public-data.idc_current.segmentations`
LIMIT 10

measurement_groups - SR TID1500 measurement groups

qualitative_measurements - Coded evaluations

quantitative_measurements - Numeric measurements

Collection Metadata

original_collections_metadata - Collection-level descriptions

SELECT
  collection_id,
  CancerTypes,
  TumorLocations,
  Subjects,
  src.source_doi,
  src.ImageTypes,
  src.license.license_short_name
FROM `bigquery-public-data.idc_current.original_collections_metadata`,
UNNEST(Sources) AS src
WHERE CancerTypes LIKE '%Lung%'

Common Query Patterns

Find Collections by Criteria

SELECT 
  collection_id,
  COUNT(DISTINCT PatientID) as patient_count,
  COUNT(DISTINCT SeriesInstanceUID) as series_count,
  ARRAY_AGG(DISTINCT Modality) as modalities
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE BodyPartExamined LIKE '%BRAIN%'
GROUP BY collection_id
HAVING patient_count > 50
ORDER BY patient_count DESC

Get Download URLs

SELECT
  SeriesInstanceUID,
  gcs_url
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'rider_pilot'
  AND Modality = 'CT'

Find Studies with Multiple Modalities

SELECT
  StudyInstanceUID,
  ARRAY_AGG(DISTINCT Modality) as modalities,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM `bigquery-public-data.idc_current.dicom_all`
GROUP BY StudyInstanceUID
HAVING COUNT(DISTINCT Modality) > 1
LIMIT 100

License Filtering

SELECT
  collection_id,
  license_short_name,
  COUNT(*) as instance_count
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE license_short_name = 'CC BY 4.0'
GROUP BY collection_id, license_short_name

Find Segmentations with Source Images

SELECT
  src.collection_id,
  seg.SeriesInstanceUID as seg_series,
  seg.SegmentedPropertyType,
  src.SeriesInstanceUID as source_series,
  src.Modality as source_modality
FROM `bigquery-public-data.idc_current.segmentations` seg
JOIN `bigquery-public-data.idc_current.dicom_all` src
  ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.collection_id = 'qin_prostate_repeatability'
LIMIT 10

Private DICOM Elements

Private DICOM elements are vendor-specific attributes not defined in the DICOM standard. They often contain essential acquisition parameters (like diffusion b-values, gradient directions, or scanner-specific settings) that are critical for image interpretation and analysis.

Understanding Private Elements

How private elements work:

  • Private elements use odd-numbered group numbers (e.g., 0019, 0043, 2001)
  • Each vendor reserves blocks of 256 elements using Private Creator identifiers at positions (gggg,0010-00FF)
  • For example, GE uses Private Creator "GEMS_PARM_01" at (0043,0010) to reserve elements (0043,1000-10FF)

Standard vs. private tags: Some parameters exist in both forms:

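| Parameter         | Standard Tag | GE           | Siemens           | Philips                |
|-------------------|--------------|--------------|-------------------|------------------------|
| Diffusion b-value | (0018,9087)  | (0043,1039)  | (0019,100C)       | (2001,1003)            |
| Private Creator   | -            | GEMS_PARM_01 | SIEMENS MR HEADER | Philips Imaging DD 001 |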

Older scanners typically populate only private tags; newer scanners may use standard tags. Always check both.

Challenges with private elements:

  • Require manufacturer DICOM Conformance Statements to interpret
  • Tag meanings can change between software versions
  • May be removed during de-identification for HIPAA compliance
  • Value encoding varies (string vs. numeric, different units)

Accessing Private Elements in BigQuery

Private elements are stored in the OtherElements column of dicom_all as an array of structs with Tag and Data fields.

Tag notation: DICOM notation (0043,1039) becomes BigQuery format Tag_00431039.
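The conversion is mechanical, so it can be scripted. The sketch below uses a hypothetical helper name, `to_bq_tag`. Upper-casing the hex digits is an assumption, so compare against the tags returned by the discovery query before relying on it for tags that contain letters.

```python
import re

# Accepts "(gggg,eeee)" or "gggg,eeee", with optional spaces
_TAG_RE = re.compile(r"^\(?\s*([0-9A-Fa-f]{4})\s*,\s*([0-9A-Fa-f]{4})\s*\)?$")


def to_bq_tag(dicom_tag: str) -> str:
    """Convert DICOM notation '(gggg,eeee)' to the 'Tag_ggggeeee' form."""
    match = _TAG_RE.match(dicom_tag.strip())
    if match is None:
        raise ValueError(f"not a DICOM tag: {dicom_tag!r}")
    group, element = match.groups()
    # Assumption: hex digits are stored upper-case in OtherElements.Tag
    return f"Tag_{group.upper()}{element.upper()}"


ge_b_value = to_bq_tag("(0043,1039)")  # 'Tag_00431039'
```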

Private Element Query Patterns

Discover Available Private Tags

List all non-empty private tags for a collection:

SELECT
  other_elements.Tag,
  COUNT(*) AS instance_count,
  ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS LIMIT 5) AS sample_values
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
  AND Modality = 'MR'
  AND ARRAY_LENGTH(other_elements.Data) > 0
  AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
  AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY other_elements.Tag
ORDER BY instance_count DESC

For a specific series:

SELECT
  other_elements.Tag,
  ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS) AS values
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE SeriesInstanceUID = '1.3.6.1.4.1.14519.5.2.1.7311.5101.206828891270520544417996275680'
  AND ARRAY_LENGTH(other_elements.Data) > 0
  AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
  AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY other_elements.Tag

To identify the Private Creator for a tag, look for the reservation element in the same group. For example, if you find Tag_00431039, the Private Creator is at Tag_00430010 (the tag that reserves block 10xx in group 0043).
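The reservation rule can be written down directly: the first two hex digits of the element number name the block, and the Private Creator sits at element 00xx in the same group. A minimal sketch with a hypothetical function name, `private_creator_tag`:

```python
def private_creator_tag(bq_tag: str) -> str:
    """Return the Private Creator tag reserving the block that contains bq_tag.

    Example: Tag_00431039 lies in block 10xx of group 0043, which is
    reserved by the Private Creator at Tag_00430010.
    """
    if not bq_tag.startswith("Tag_") or len(bq_tag) != 12:
        raise ValueError(f"expected 'Tag_ggggeeee', got {bq_tag!r}")
    group, element = bq_tag[4:8], bq_tag[8:12]
    if int(group, 16) % 2 == 0:
        raise ValueError("even group number: not a private element")
    block = element[:2]  # e.g. '10' for element 1039
    if int(block, 16) < 0x10:
        raise ValueError("element is not inside a reserved private block")
    return f"Tag_{group}00{block}"
```

You can then look up the returned tag in OtherElements to read the creator string, for example with the per-series discovery query above.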

Identify Equipment Manufacturer

Determine what equipment produced the data to find the correct DICOM Conformance Statement:

SELECT DISTINCT Manufacturer, ManufacturerModelName
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'qin_prostate_repeatability'
  AND Modality = 'MR'

Access Private Element Values

Use UNNEST to access individual private elements:

SELECT
  SeriesInstanceUID,
  SeriesDescription,
  other_elements.Data[SAFE_OFFSET(0)] AS b_value
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
  AND other_elements.Tag = 'Tag_00431039'
LIMIT 10
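If you select the whole OtherElements column rather than unnesting it in SQL, the same lookup can be done client-side. This is a minimal sketch, assuming each row's OtherElements arrives in Python as a list of dicts with `Tag` and `Data` keys (the shape described above). The helper name `first_value` is hypothetical and the sample row is made up for illustration, not real IDC data.

```python
from typing import Any, Dict, List, Optional


def first_value(other_elements: List[Dict[str, Any]], tag: str) -> Optional[str]:
    """Client-side analogue of filtering on Tag and reading Data[SAFE_OFFSET(0)]."""
    for element in other_elements:
        if element["Tag"] == tag:
            data = element.get("Data") or []
            # Like SAFE_OFFSET(0): an empty array yields None instead of an error
            return data[0] if data else None
    return None


# Hypothetical row fragment
sample = [
    {"Tag": "Tag_00430010", "Data": ["GEMS_PARM_01"]},
    {"Tag": "Tag_00431039", "Data": ["1400", "8", "0", "0"]},
    {"Tag": "Tag_0043102A", "Data": []},
]
b_value = first_value(sample, "Tag_00431039")
```

One difference from the SQL form: the UNNEST query drops instances that lack the tag, whereas this helper returns None for them.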

Aggregate Values by Series

Collect all unique values across slices in a series:

SELECT
  SeriesInstanceUID,
  ANY_VALUE(SeriesDescription) AS SeriesDescription,
  ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS) AS b_values
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
  AND other_elements.Tag = 'Tag_00431039'
GROUP BY SeriesInstanceUID

Combine Standard and Private Filters

Filter using both standard DICOM attributes and private element values:

SELECT
  PatientID,
  SeriesInstanceUID,
  ANY_VALUE(SeriesDescription) AS SeriesDescription,
  ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)]) AS b_values,
  COUNT(DISTINCT SOPInstanceUID) AS n_slices
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE collection_id = 'qin_prostate_repeatability'
  AND Modality = 'MR'
  AND other_elements.Tag = 'Tag_00431039'
  AND ImageType[SAFE_OFFSET(0)] = 'ORIGINAL'
  AND other_elements.Data[SAFE_OFFSET(0)] = '1400'
GROUP BY PatientID, SeriesInstanceUID
ORDER BY PatientID

Cross-Collection Analysis

Survey usage of a private tag across all IDC collections:

SELECT
  collection_id,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT other_elements.Data[SAFE_OFFSET(0)] IGNORE NULLS), ', ') AS values_found,
  ARRAY_AGG(DISTINCT Manufacturer IGNORE NULLS) AS manufacturers
FROM `bigquery-public-data.idc_current.dicom_all`,
  UNNEST(OtherElements) AS other_elements
WHERE other_elements.Tag = 'Tag_00431039'
  AND other_elements.Data[SAFE_OFFSET(0)] IS NOT NULL
  AND other_elements.Data[SAFE_OFFSET(0)] != ''
GROUP BY collection_id
ORDER BY collection_id

Workflow: Finding and Using Private Tags

  1. Discover available private tags in your collection using the discovery query above
  2. Identify the manufacturer to know which conformance statement to consult
  3. Find the DICOM Conformance Statement from the manufacturer's website (see Resources below)
  4. Search the conformance statement for the parameter you need (e.g., "b_value", "gradient") to understand what each tag contains
  5. Convert tag to BigQuery format: (gggg,eeee) → Tag_ggggeeee
  6. Query and verify results visually in the IDC Viewer

Data Quality Notes

  • Some collections show unrealistic values (e.g., b-value "1000000600") indicating encoding issues or different conventions
  • IDC data is de-identified; private tags containing PHI may have been removed or modified
  • The same tag may have different meanings across software versions
  • Always verify query results visually using the IDC Viewer before large-scale analysis
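Before visual verification, a quick range check can separate values that need a closer look. The sketch below is only a screening aid: the function name `split_plausible` is hypothetical and the 10000 s/mm² ceiling is an assumed threshold, not an IDC or DICOM rule. It does not try to decode suspect values.

```python
from typing import Iterable, List, Tuple


def split_plausible(
    raw_values: Iterable[str], max_b: float = 10000.0
) -> Tuple[List[float], List[str]]:
    """Separate values that parse as b-values in [0, max_b] from everything else."""
    plausible: List[float] = []
    suspect: List[str] = []
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            suspect.append(raw)  # not numeric at all
            continue
        if 0 <= value <= max_b:
            plausible.append(value)
        else:
            suspect.append(raw)  # numeric but outside the assumed range
    return plausible, suspect


ok, flagged = split_plausible(["0", "1400", "1000000600", "abc"])
```

Anything in `flagged` should be checked against the manufacturer's conformance statement rather than silently dropped.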

Private Element Resources

Manufacturer DICOM Conformance Statements:

DICOM Standard:

Community Resources:

Using Query Results with idc-index

Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads):

from google.cloud import bigquery
from idc_index import IDCClient

# Initialize BigQuery client
# Requires: pip install google-cloud-bigquery
# Auth: gcloud auth application-default login
# Project: needed for billing even on public datasets (free tier applies)
bq_client = bigquery.Client(project="your-gcp-project-id")

# Query for series with specific criteria
query = """
SELECT DISTINCT SeriesInstanceUID
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'tcga_luad'
  AND Modality = 'CT'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
LIMIT 100
"""

df = bq_client.query(query).to_dataframe()
print(f"Found {len(df)} GE CT series")

# Download with idc-index (no GCP auth required)
idc_client = IDCClient()
idc_client.download_from_selection(
    seriesInstanceUID=list(df['SeriesInstanceUID'].values),
    downloadDir="./tcga_luad_ge_ct"
)

Cost and Optimization

Pricing: on-demand queries are billed per TiB scanned ($6.25 per TiB in US regions at the time of writing), with the first 1 TiB per month free. Most users stay within the free tier.

Minimize data scanned:

  • Select only needed columns (not SELECT *)
  • Filter early with WHERE clauses
  • Use LIMIT when testing
  • Use dicom_all instead of dicom_metadata when possible (smaller)
  • Preview queries in BQ console (free, shows bytes to scan)

Check cost before running:

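# Dry run: validates the query and reports bytes to scan without executing it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query_job = bq_client.query(query, job_config=job_config)  # bq_client and query as defined above
print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB")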
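To turn the dry-run byte count into a rough dollar figure, a small helper is enough. This is a sketch under stated assumptions: the function name `estimate_cost_usd` is hypothetical, billing is assumed to be per TiB (2^40 bytes), and the rate is deliberately a required argument so that you pass the current value from the BigQuery pricing page. It ignores per-query minimums and rounding.

```python
TIB = 2 ** 40  # bytes per tebibyte


def estimate_cost_usd(
    bytes_scanned: int, usd_per_tib: float, free_bytes_left: int = 0
) -> float:
    """Rough on-demand cost of one query after subtracting unused free quota."""
    billable = max(0, bytes_scanned - free_bytes_left)
    return billable / TIB * usd_per_tib


# Usage with a dry-run job (rate shown is a placeholder variable, not a quoted price):
# cost = estimate_cost_usd(query_job.total_bytes_processed, usd_per_tib=current_rate)
```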

Use materialized tables: IDC provides both views (table_name_view) and materialized tables (table_name). Always use the materialized tables (faster, lower cost).

Clinical Data

Clinical data is in separate datasets with collection-specific tables. All clinical data available via idc-index is also available in BigQuery, with the same content and structure. Use BigQuery when you need complex cross-collection queries or joins that aren't possible with the local idc-index tables.

Datasets:

  • bigquery-public-data.idc_current_clinical - current release (for exploration)
  • bigquery-public-data.idc_v{version}_clinical - versioned datasets (for reproducibility)

Currently there are ~130 clinical tables representing ~70 collections. Not all collections have clinical data; clinical tables were first included in IDC v11.

Clinical Table Naming

Most collections use a single table: <collection_id>_clinical

Exception: ACRIN collections use multiple tables for different data types (e.g., acrin_6698_A0, acrin_6698_A1, etc.).

Metadata Tables

Two metadata tables help navigate clinical data:

table_metadata - Collection-level information:

SELECT
  collection_id,
  table_name,
  table_description
FROM `bigquery-public-data.idc_current_clinical.table_metadata`
WHERE collection_id = 'nlst'

column_metadata - Attribute-level details with value mappings:

SELECT
  collection_id,
  table_name,
  column,
  column_label,
  data_type,
  values
FROM `bigquery-public-data.idc_current_clinical.column_metadata`
WHERE collection_id = 'nlst'
  AND column_label LIKE '%stage%'

The values field contains observed attribute values with their descriptions (same as in idc-index clinical_index).

Common Clinical Queries

List available clinical tables:

SELECT table_name
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES`
WHERE table_name NOT IN ('table_metadata', 'column_metadata')

Find collections with specific clinical attributes:

SELECT DISTINCT collection_id, table_name, column, column_label
FROM `bigquery-public-data.idc_current_clinical.column_metadata`
WHERE LOWER(column_label) LIKE '%chemotherapy%'

Query clinical data for a collection:

-- Example: NLST cancer staging data
SELECT
  dicom_patient_id,
  clinical_stag,
  path_stag,
  de_stag
FROM `bigquery-public-data.idc_current_clinical.nlst_canc`
WHERE clinical_stag IS NOT NULL
LIMIT 10

Join clinical with imaging data:

SELECT
  d.PatientID,
  d.StudyInstanceUID,
  d.Modality,
  c.clinical_stag,
  c.path_stag
FROM `bigquery-public-data.idc_current.dicom_all` d
JOIN `bigquery-public-data.idc_current_clinical.nlst_canc` c
  ON d.PatientID = c.dicom_patient_id
WHERE d.collection_id = 'nlst'
  AND d.Modality = 'CT'
  AND c.clinical_stag = '400'  -- Stage IV
LIMIT 20

Cross-collection clinical search:

-- Find all collections with staging information
SELECT
  cm.collection_id,
  cm.table_name,
  cm.column,
  cm.column_label
FROM `bigquery-public-data.idc_current_clinical.column_metadata` cm
WHERE LOWER(cm.column_label) LIKE '%stage%'
ORDER BY cm.collection_id

Key Column: dicom_patient_id

Every clinical table includes dicom_patient_id, which matches the DICOM PatientID attribute in imaging tables. This is the join key between clinical and imaging data.
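Because dicom_all has one row per DICOM instance, a join on this key repeats each patient's clinical row once per instance. The toy example below reproduces that behavior locally with Python's built-in sqlite3 module. It is only an illustration: the table contents and the `stage` column are invented, and SQLite stands in for BigQuery.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dicom_all (PatientID TEXT, StudyInstanceUID TEXT, SOPInstanceUID TEXT);
CREATE TABLE clinical (dicom_patient_id TEXT, stage TEXT);
INSERT INTO dicom_all VALUES
  ('P1', 'study-1', 'inst-1'),
  ('P1', 'study-1', 'inst-2'),
  ('P2', 'study-2', 'inst-3');
INSERT INTO clinical VALUES ('P1', 'IV'), ('P3', 'II');
""")

join = """
FROM dicom_all d
JOIN clinical c ON d.PatientID = c.dicom_patient_id
"""

# One output row per instance: P1's clinical row appears twice
per_instance = con.execute(
    "SELECT d.PatientID, d.StudyInstanceUID, c.stage" + join
).fetchall()

# DISTINCT collapses the result to one row per study
per_study = con.execute(
    "SELECT DISTINCT d.PatientID, d.StudyInstanceUID, c.stage" + join
).fetchall()
```

P2 (no clinical row) and P3 (no images) both disappear under the inner join. Use SELECT DISTINCT or GROUP BY at the patient, study, or series level when you do not need instance-level rows.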

Note: Clinical table schemas vary significantly by collection. Always check available columns first:

SELECT column_name, data_type
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'nlst_canc'

See references/clinical_data_guide.md for detailed workflows using idc-index, which provides the same clinical data without requiring BigQuery authentication.

Important Notes

  • Tables are read-only (public dataset)
  • Schema changes between IDC versions
  • Use versioned datasets for reproducibility
  • Some DICOM sequences >15 levels deep are not extracted
  • Very large sequences (>1MB) may be truncated
  • Always check data license before use

Common Errors

Issue: Billing must be enabled

  • Cause: BigQuery requires a billing-enabled GCP project
  • Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead

Issue: Query exceeds resource limits

  • Cause: Query scans too much data or is too complex
  • Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries

Issue: Column not found

  • Cause: Field name typo or not in selected table
  • Solution: Check table schema first with INFORMATION_SCHEMA.COLUMNS

Issue: Permission denied

  • Cause: Not authenticated to Google Cloud
  • Solution: Run gcloud auth application-default login or set GOOGLE_APPLICATION_CREDENTIALS
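A quick preflight check can tell you which credentials file, if any, would be picked up before you run a query. This is a sketch only: the function name `find_adc_file` is hypothetical, the default path shown is the Linux/macOS location written by `gcloud auth application-default login`, and other credential sources (for example, a cloud VM's attached service account) are not covered.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(
    env: Mapping[str, str] = os.environ, home: Optional[Path] = None
) -> Optional[Path]:
    """Return the credentials file that would be used, or None if none is found."""
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None
    home = home or Path.home()
    # Assumed default location on Linux/macOS; Windows uses a different directory
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None
```

A None result suggests running `gcloud auth application-default login` or correcting the path in GOOGLE_APPLICATION_CREDENTIALS.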

Resources