Initial commit for imaging-data-commons
This commit is contained in:
289
references/bigquery_guide.md
Normal file
289
references/bigquery_guide.md
Normal file
@@ -0,0 +1,289 @@
|
||||
# BigQuery Guide for IDC
|
||||
|
||||
**Tested with:** IDC data version v23
|
||||
|
||||
For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
**Requirements:**
|
||||
1. Google account
|
||||
2. Google Cloud project with billing enabled (first 1 TB/month free)
|
||||
3. `google-cloud-bigquery` Python package or BigQuery console access
|
||||
|
||||
**Authentication setup:**
|
||||
```bash
|
||||
# Install Google Cloud SDK, then:
|
||||
gcloud auth application-default login
|
||||
```
|
||||
|
||||
## When to Use BigQuery
|
||||
|
||||
Use BigQuery instead of `idc-index` when you need:
|
||||
- Full DICOM metadata (all 4000+ tags, not just the ~50 in idc-index)
|
||||
- Complex joins across clinical data tables
|
||||
- DICOM sequence attributes (nested structures)
|
||||
- Queries on fields not in the idc-index mini-index
|
||||
|
||||
## Accessing IDC in BigQuery
|
||||
|
||||
### Dataset Structure
|
||||
|
||||
All IDC tables are in the `bigquery-public-data` BigQuery project.
|
||||
|
||||
**Current version (recommended for exploration):**
|
||||
- `bigquery-public-data.idc_current.*`
|
||||
- `bigquery-public-data.idc_current_clinical.*`
|
||||
|
||||
**Versioned datasets (recommended for reproducibility):**
|
||||
|
||||
- `bigquery-public-data.idc_v{IDC version}.*`
|
||||
- `bigquery-public-data.idc_v{IDC version}_clinical.*`
|
||||
|
||||
Always use versioned datasets for reproducible research!
|
||||
|
||||
## Key Tables
|
||||
|
||||
### dicom_all
|
||||
Primary table joining complete DICOM metadata with IDC-specific columns (collection_id, gcs_url, license). Contains all DICOM tags from `dicom_metadata` plus collection and administrative metadata. See [dicom_all.sql](https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/sql/dicom_all.sql) for the exact derivation.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
collection_id,
|
||||
PatientID,
|
||||
StudyInstanceUID,
|
||||
SeriesInstanceUID,
|
||||
Modality,
|
||||
BodyPartExamined,
|
||||
SeriesDescription,
|
||||
gcs_url,
|
||||
license_short_name
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
WHERE Modality = 'CT'
|
||||
AND BodyPartExamined = 'CHEST'
|
||||
LIMIT 10
|
||||
```
|
||||
|
||||
### Derived Tables
|
||||
|
||||
**segmentations** - DICOM Segmentation objects
|
||||
```sql
|
||||
SELECT *
|
||||
FROM `bigquery-public-data.idc_current.segmentations`
|
||||
LIMIT 10
|
||||
```
|
||||
|
||||
**measurement_groups** - SR TID1500 measurement groups
|
||||
**qualitative_measurements** - Coded evaluations
|
||||
**quantitative_measurements** - Numeric measurements
|
||||
|
||||
### Collection Metadata
|
||||
|
||||
**original_collections_metadata** - Collection-level descriptions
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
collection_id,
|
||||
CancerTypes,
|
||||
TumorLocations,
|
||||
Subjects,
|
||||
src.source_doi,
|
||||
src.ImageTypes,
|
||||
src.license.license_short_name
|
||||
FROM `bigquery-public-data.idc_current.original_collections_metadata`,
|
||||
UNNEST(Sources) AS src
|
||||
WHERE CancerTypes LIKE '%Lung%'
|
||||
```
|
||||
|
||||
## Common Query Patterns
|
||||
|
||||
### Find Collections by Criteria
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
collection_id,
|
||||
COUNT(DISTINCT PatientID) as patient_count,
|
||||
COUNT(DISTINCT SeriesInstanceUID) as series_count,
|
||||
ARRAY_AGG(DISTINCT Modality) as modalities
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
WHERE BodyPartExamined LIKE '%BRAIN%'
|
||||
GROUP BY collection_id
|
||||
HAVING patient_count > 50
|
||||
ORDER BY patient_count DESC
|
||||
```
|
||||
|
||||
### Get Download URLs
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
SeriesInstanceUID,
|
||||
gcs_url
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
WHERE collection_id = 'rider_pilot'
|
||||
AND Modality = 'CT'
|
||||
```
|
||||
|
||||
### Find Studies with Multiple Modalities
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
StudyInstanceUID,
|
||||
ARRAY_AGG(DISTINCT Modality) as modalities,
|
||||
COUNT(DISTINCT SeriesInstanceUID) as series_count
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
GROUP BY StudyInstanceUID
|
||||
HAVING ARRAY_LENGTH(ARRAY_AGG(DISTINCT Modality)) > 1
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
### License Filtering
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
collection_id,
|
||||
license_short_name,
|
||||
COUNT(*) as instance_count
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
WHERE license_short_name = 'CC BY 4.0'
|
||||
GROUP BY collection_id, license_short_name
|
||||
```
|
||||
|
||||
### Find Segmentations with Source Images
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
src.collection_id,
|
||||
seg.SeriesInstanceUID as seg_series,
|
||||
seg.SegmentedPropertyType,
|
||||
src.SeriesInstanceUID as source_series,
|
||||
src.Modality as source_modality
|
||||
FROM `bigquery-public-data.idc_current.segmentations` seg
|
||||
JOIN `bigquery-public-data.idc_current.dicom_all` src
|
||||
ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
|
||||
WHERE src.collection_id = 'qin_prostate_repeatability'
|
||||
LIMIT 10
|
||||
```
|
||||
|
||||
## Using Query Results with idc-index
|
||||
|
||||
Combine BigQuery for complex queries with idc-index for downloads (no GCP auth needed for downloads):
|
||||
|
||||
```python
|
||||
from google.cloud import bigquery
|
||||
from idc_index import IDCClient
|
||||
|
||||
# Initialize BigQuery client
|
||||
# Requires: pip install google-cloud-bigquery
|
||||
# Auth: gcloud auth application-default login
|
||||
# Project: needed for billing even on public datasets (free tier applies)
|
||||
bq_client = bigquery.Client(project="your-gcp-project-id")
|
||||
|
||||
# Query for series with specific criteria
|
||||
query = """
|
||||
SELECT DISTINCT SeriesInstanceUID
|
||||
FROM `bigquery-public-data.idc_current.dicom_all`
|
||||
WHERE collection_id = 'tcga_luad'
|
||||
AND Modality = 'CT'
|
||||
AND Manufacturer = 'GE MEDICAL SYSTEMS'
|
||||
LIMIT 100
|
||||
"""
|
||||
|
||||
df = bq_client.query(query).to_dataframe()
|
||||
print(f"Found {len(df)} GE CT series")
|
||||
|
||||
# Download with idc-index (no GCP auth required)
|
||||
idc_client = IDCClient()
|
||||
idc_client.download_from_selection(
|
||||
seriesInstanceUID=list(df['SeriesInstanceUID'].values),
|
||||
downloadDir="./tcga_luad_thin_ct"
|
||||
)
|
||||
```
|
||||
|
||||
## Cost and Optimization
|
||||
|
||||
**Pricing:** $5 per TB scanned (first 1 TB/month free). Most users stay within free tier.
|
||||
|
||||
**Minimize data scanned:**
|
||||
- Select only needed columns (not `SELECT *`)
|
||||
- Filter early with `WHERE` clauses
|
||||
- Use `LIMIT` when testing
|
||||
- Use `dicom_all` instead of `dicom_metadata` when possible (smaller)
|
||||
- Preview queries in BQ console (free, shows bytes to scan)
|
||||
|
||||
**Check cost before running:**
|
||||
```python
|
||||
query_job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
|
||||
print(f"Query will scan {query_job.total_bytes_processed / 1e9:.2f} GB")
|
||||
```
|
||||
|
||||
**Use materialized tables:** IDC provides both views (`table_name_view`) and materialized tables (`table_name`). Always use the materialized tables (faster, lower cost).
|
||||
|
||||
## Clinical Data
|
||||
|
||||
Clinical data is in separate datasets with collection-specific tables. Not all collections have clinical data (started in IDC v11).
|
||||
|
||||
**List available clinical tables:**
|
||||
```sql
|
||||
SELECT table_name
|
||||
FROM `bigquery-public-data.idc_current_clinical.INFORMATION_SCHEMA.TABLES`
|
||||
```
|
||||
|
||||
**Query clinical data for a collection:**
|
||||
```sql
|
||||
-- Example: TCGA-LUAD clinical data
|
||||
SELECT *
|
||||
FROM `bigquery-public-data.idc_current_clinical.tcga_luad_clinical`
|
||||
LIMIT 10
|
||||
```
|
||||
|
||||
**Join clinical with imaging data:**
|
||||
```sql
|
||||
SELECT
|
||||
d.PatientID,
|
||||
d.SeriesInstanceUID,
|
||||
d.Modality,
|
||||
c.age_at_diagnosis,
|
||||
c.pathologic_stage
|
||||
FROM `bigquery-public-data.idc_current.dicom_all` d
|
||||
JOIN `bigquery-public-data.idc_current_clinical.tcga_luad_clinical` c
|
||||
ON d.PatientID = c.dicom_patient_id
|
||||
WHERE d.collection_id = 'tcga_luad'
|
||||
AND d.Modality = 'CT'
|
||||
LIMIT 20
|
||||
```
|
||||
|
||||
**Note:** Clinical table schemas vary by collection. Check column names with `INFORMATION_SCHEMA.COLUMNS` before querying.
|
||||
|
||||
## Important Notes
|
||||
|
||||
- Tables are read-only (public dataset)
|
||||
- Schema changes between IDC versions
|
||||
- Use versioned datasets for reproducibility
|
||||
- Some DICOM sequences >15 levels deep are not extracted
|
||||
- Very large sequences (>1MB) may be truncated
|
||||
- Always check data license before use
|
||||
|
||||
## Common Errors
|
||||
|
||||
**Issue: Billing must be enabled**
|
||||
- Cause: BigQuery requires a billing-enabled GCP project
|
||||
- Solution: Enable billing in Google Cloud Console or use idc-index mini-index instead
|
||||
|
||||
**Issue: Query exceeds resource limits**
|
||||
- Cause: Query scans too much data or is too complex
|
||||
- Solution: Add more specific WHERE filters, use LIMIT, break into smaller queries
|
||||
|
||||
**Issue: Column not found**
|
||||
- Cause: Field name typo or not in selected table
|
||||
- Solution: Check table schema first with `INFORMATION_SCHEMA.COLUMNS`
|
||||
|
||||
**Issue: Permission denied**
|
||||
- Cause: Not authenticated to Google Cloud
|
||||
- Solution: Run `gcloud auth application-default login` or set GOOGLE_APPLICATION_CREDENTIALS
|
||||
|
||||
## Resources
|
||||
|
||||
- [Understanding the BigQuery DICOM schema](https://docs.cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema)
|
||||
- [BigQuery Query Syntax](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)
|
||||
- [Kaggle Intro to SQL](https://www.kaggle.com/learn/intro-to-sql)
|
||||
- [Sample BigQuery queries of IDC data](https://github.com/ImagingDataCommons/idc-bigquery-cookbook)
|
||||
Reference in New Issue
Block a user