Clinical Data Guide for IDC
Tested with: idc-index 0.11.7 (IDC data version v23)
Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using idc-index.
When to Use This Guide
Use this guide when you need to:
- Find what clinical metadata is available for a collection
- Filter patients by clinical criteria (e.g., cancer stage, treatment history)
- Join clinical attributes with imaging data for cohort selection
- Understand and decode coded values in clinical tables
For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns.
Prerequisites
pip install --upgrade idc-index
No BigQuery credentials required - clinical data is packaged with idc-index.
Understanding Clinical Data in IDC
What is Clinical Data?
Clinical data refers to non-imaging information that accompanies medical images:
- Patient demographics (age, sex, race)
- Clinical history (diagnoses, surgeries, therapies)
- Lab tests and pathology results
- Cancer staging (clinical and pathological)
- Treatment outcomes
Data Organization
Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. IDC parses these into queryable tables accessible via idc-index.
Important characteristics:
- Clinical data is not harmonized across collections (terms and formats vary)
- Not all collections have clinical data (check availability first)
- All data is anonymized - dicom_patient_id links to imaging
The clinical_index Table
The clinical_index serves as a dictionary/catalog of all available clinical data:
| Column | Purpose | Use For |
|---|---|---|
| collection_id | Collection identifier | Filtering by collection |
| table_name | Full BigQuery table reference | BigQuery queries (if needed) |
| short_table_name | Short name | get_clinical_table() method |
| column | Column name in table | Selecting data columns |
| column_label | Human-readable description | Searching for concepts |
| values | Observed attribute values for the column | Interpreting coded values |
The values Column
The values column contains an array of observed attribute values for the column defined in the column field. Each entry has:
- option_code: The actual value observed in that column
- option_description: Human-readable description of that value (from data dictionary if available, otherwise None)
For ACRIN collections, value descriptions come from provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
Note: For columns with >20 unique values, the values array is left empty ([]) for simplicity.
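The cases described above (descriptions supplied, descriptions missing, and an empty array for columns with more than 20 unique values) can be told apart before you try to decode anything. The following is a minimal sketch, not part of idc-index: classify_values is a hypothetical helper name, and the sample entries only mirror the entry shape described in this section.

```python
def classify_values(values):
    """Classify a `values` array taken from one clinical_index row.

    Returns one of:
      "empty"      - no entries (the column had more than 20 unique values)
      "described"  - at least one entry has a non-None option_description
      "codes_only" - entries exist, but none carries a description
    """
    entries = list(values)  # works for plain lists and for arrays of dicts
    if not entries:
        return "empty"
    if any(entry.get("option_description") is not None for entry in entries):
        return "described"
    return "codes_only"


# Illustrative entries in the documented shape (not fetched from IDC)
sample = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
]
print(classify_values(sample))
print(classify_values([]))
```

An "empty" result means the clinical table itself has to be loaded to see the observed values; "codes_only" means the codes are the only labels available.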
Core Workflow
Step 1: Fetch Clinical Index
from idc_index import IDCClient
client = IDCClient()
client.fetch_index('clinical_index')
# View available columns
print(client.clinical_index.columns.tolist())
Step 2: Discover Available Clinical Data
# List all collections with clinical data
collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
print(f"{len(collections_with_clinical)} collections have clinical data")
# Find clinical attributes for a specific collection
nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
Step 3: Search for Specific Attributes
# Search by keyword in column_label (case-insensitive)
stage_attrs = client.clinical_index[
    client.clinical_index["column_label"].str.contains("stage", case=False, na=False)
]
stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
Step 4: Load Clinical Table
# Load table using short_table_name
nlst_canc_df = client.get_clinical_table("nlst_canc")
# Examine structure
print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
nlst_canc_df.head()
Step 5: Map Coded Values to Descriptions
Many clinical attributes use coded values. The values column in clinical_index contains an array of observed values with their descriptions (when available).
# Get the clinical_index rows for NLST
nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
# Get observed values for a specific column
# Filter to the row for 'clinical_stag' and extract the values array
clinical_stag_values = nlst_clinical_columns[
    nlst_clinical_columns['column']=='clinical_stag'
]['values'].values[0]
# View the observed values and their descriptions
print(clinical_stag_values)
# Output: array([{'option_code': '.M', 'option_description': 'Missing'},
# {'option_code': '110', 'option_description': 'Stage IA'},
# {'option_code': '120', 'option_description': 'Stage IB'}, ...])
# Create mapping dictionary from codes to descriptions
mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values}
# Apply to DataFrame - convert column to string first for consistent matching
nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict)
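Because option_description can be None when no data dictionary was provided, a plain code-to-description dictionary may map valid codes to None. The sketch below shows one way to fall back to the raw code in that case and to label codes absent from the values array. The helper names build_mapping and decode are hypothetical, and the entry with a None description is invented here for illustration.

```python
def build_mapping(values):
    """Map str(option_code) to its description, or to the code itself if undescribed."""
    mapping = {}
    for entry in values:
        code = str(entry["option_code"])
        description = entry.get("option_description")
        mapping[code] = description if description is not None else code
    return mapping


def decode(codes, mapping, unknown="Unknown/Missing"):
    """Decode an iterable of codes; codes not in the mapping get the `unknown` label."""
    return [mapping.get(str(code), unknown) for code in codes]


# Illustrative values array; the third entry has no description (assumed case)
values = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
    {"option_code": "120", "option_description": None},
]
mapping = build_mapping(values)
print(decode([110, "120", ".M", "999"], mapping))
```

Converting every code with str() keeps integer and string representations of the same code from being treated as different keys. With a DataFrame, the same mapping can be passed to the .map() call shown in Step 5.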
Step 6: Join with Imaging Data
The dicom_patient_id column links clinical data to imaging. It matches the PatientID column in the imaging index.
# Pandas merge approach
import pandas as pd
# Get NLST CT imaging data
nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')]
# Join with clinical data
merged = pd.merge(
    nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(),
    nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']],
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
)
# SQL join approach
query = """
SELECT
index.PatientID,
index.StudyInstanceUID,
index.Modality,
nlst_canc.clinical_stag
FROM index
JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id
WHERE index.collection_id = 'nlst' AND index.Modality = 'CT'
"""
results = client.sql_query(query)
Common Use Cases
Use Case 1: Select Patients by Cancer Stage
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
client.fetch_index('clinical_index')
# Load clinical table
nlst_canc = client.get_clinical_table("nlst_canc")
# Select Stage IV patients (code '400'); compare as strings, as in Step 5
stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id']
# Get CT imaging studies for these patients
stage_iv_studies = pd.merge(
    client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')],
    stage_iv_patients,
    left_on='PatientID',
    right_on='dicom_patient_id',
    how='inner'
)['StudyInstanceUID'].drop_duplicates()
print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients")
Use Case 2: Find Collections with Specific Clinical Attributes
# Find collections with chemotherapy information
chemo_collections = client.clinical_index[
    client.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)
]["collection_id"].unique()
print(f"Collections with chemotherapy data: {list(chemo_collections)}")
Use Case 3: Examine Observed Values for a Clinical Attribute
# Find what values have been observed for a specific attribute
chemotherapy_rows = client.clinical_index[
    (client.clinical_index["collection_id"] == "hcc_tace_seg") &
    (client.clinical_index["column"] == "chemotherapy")
]
# Get the observed values array
values_list = chemotherapy_rows["values"].tolist()
print(values_list)
# Output: [[{'option_code': 'Cisplastin', 'option_description': None},
# {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]]
Use Case 4: Generate Viewer URLs for Selected Patients
# Reuses client and stage_iv_patients from Use Case 1
# Get studies for a sample Stage IV patient
sample_patient = stage_iv_patients.iloc[0]
studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique()
# Generate viewer URL
if len(studies) > 0:
    viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0])
    print(viewer_url)
Key Concepts
column vs column_label
- column: Use for selecting data from tables (programmatic access)
- column_label: Use for searching/understanding what data means (human-readable)
Some collections (like c4kc_kits) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels.
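The split between column and column_label suggests a two-step lookup: search the labels for a concept, then use the matching column names for data access. The sketch below assumes the index rows are available as a list of dictionaries (for example via the pandas DataFrame method to_dict("records")). The helper name find_columns and the sample rows, including the table name demo_clinical, are hypothetical.

```python
def find_columns(index_rows, keyword):
    """Return (short_table_name, column, column_label) for labels containing keyword.

    Matching is case-insensitive; rows whose label is missing or not a string are skipped.
    """
    needle = keyword.lower()
    matches = []
    for row in index_rows:
        label = row.get("column_label")
        if isinstance(label, str) and needle in label.lower():
            matches.append((row["short_table_name"], row["column"], label))
    return matches


# Hypothetical rows: a cryptic column name with a descriptive label, and a missing label
rows = [
    {"short_table_name": "demo_clinical", "column": "cstg",
     "column_label": "Clinical stage at diagnosis"},
    {"short_table_name": "demo_clinical", "column": "age", "column_label": "age"},
    {"short_table_name": "demo_clinical", "column": "x1", "column_label": None},
]
print(find_columns(rows, "Stage"))
```

The returned column value (here cstg) is what you would select from the loaded clinical table; the label is only for finding it.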
option_code vs option_description
The values array contains observed attribute values:
- option_code: The actual value observed in the column (what you filter on)
- option_description: Human-readable description (from data dictionary if available, otherwise None)
dicom_patient_id
Every clinical table includes dicom_patient_id, which matches the PatientID column in the imaging index. This is the key for joining clinical and imaging data.
Troubleshooting
Issue: Clinical table not found
Cause: Using wrong table name or table doesn't exist for collection
Solution: Query clinical_index first to find available tables:
client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique()
Issue: Empty values array
Cause: The values array is left empty when a column has >20 unique values
Solution: Load the clinical table and examine unique values directly:
clinical_df = client.get_clinical_table("table_name")
clinical_df['column_name'].unique()
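When the values array is empty, it can also help to see how often each value occurs, not just which values exist. The sketch below counts values from a plain Python list, such as the one returned by clinical_df['column_name'].tolist(). The helper name observed_values and the sample data are hypothetical; in pandas, Series.value_counts() gives a comparable summary directly.

```python
from collections import Counter


def observed_values(column_values, top=None):
    """Return (value, count) pairs, most frequent first; values are compared as strings."""
    counts = Counter(str(value) for value in column_values)
    return counts.most_common(top)


# Hypothetical column contents, including a missing entry
sample_column = ["A", "B", "A", None]
print(observed_values(sample_column))
print(observed_values(sample_column, top=1))
```

Note that missing entries show up under the string "None" here, which makes them visible in the summary instead of being dropped silently.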
Issue: Coded values not in mapping
Cause: Some values may be missing from the dictionary (e.g., empty strings, special codes like .M for missing)
Solution: Handle unmapped values gracefully:
df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing')
Issue: No matching patients when joining
Cause: Clinical data may include patients without images, or vice versa
Solution: Verify patient overlap before joining:
imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique())
clinical_patients = set(clinical_df['dicom_patient_id'].unique())
overlap = imaging_patients & clinical_patients
print(f"Patients with both imaging and clinical data: {len(overlap)}")
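The overlap count alone does not say which side the unmatched patients are on. As a minimal sketch, the same set logic can be extended to report all three groups; patient_overlap is a hypothetical helper, and the IDs below are made up for illustration.

```python
def patient_overlap(imaging_ids, clinical_ids):
    """Split patient IDs into those with both data types and those with only one."""
    imaging, clinical = set(imaging_ids), set(clinical_ids)
    return {
        "both": imaging & clinical,
        "imaging_only": imaging - clinical,
        "clinical_only": clinical - imaging,
    }


# Made-up IDs; in practice pass the PatientID and dicom_patient_id values
report = patient_overlap(["P001", "P002", "P003"], ["P002", "P003", "P004"])
for group, ids in report.items():
    print(group, len(ids))
```

A large imaging_only or clinical_only group is expected for some collections; an empty both group more likely points to the wrong table or collection.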
Resources
IDC Documentation:
- Clinical data organization - How clinical data is organized in IDC
- Clinical data dashboard - Visual summary of available clinical data
- idc-index clinical_index documentation
Related Guides:
- bigquery_guide.md - Advanced clinical queries via BigQuery
- Main SKILL.md - Core IDC workflows
IDC Tutorials: