mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-03-27 07:09:27 +08:00

Files

Andrey Fedorov 63801af8e6 Update imaging-data-commons skill to v1.2.0

see changes in the changelog upstream:

https://github.com/ImagingDataCommons/idc-claude-skill/blob/main/CHANGELOG.md#120---2026-02-04

2026-02-04 14:35:14 -05:00

14 KiB

Raw Blame History

Cloud Storage Guide for IDC

IDC maintains all DICOM files in public cloud storage buckets mirrored between Google Cloud Storage (GCS) and AWS S3. This guide covers bucket organization, file structure, access methods, and versioning.

When to Use Direct Cloud Storage Access

Use direct bucket access when you need:

Maximum download performance with parallel transfers
Integration with cloud-native workflows (e.g., running analysis on cloud VMs)
Programmatic access from tools like s5cmd or gsutil
Access to specific file versions for reproducibility

For most use cases, idc-index is simpler and recommended -— it uses s5cmd internally to download from these same S3 buckets, handling the UUID lookups automatically. Use direct cloud storage when you need raw file access, custom parallelization, or are building cloud-native pipelines.

Storage Buckets

IDC organizes data across multiple buckets based on licensing and content type. All buckets are mirrored between AWS and GCS with identical content and file paths.

Bucket Summary

Purpose	AWS S3 Bucket	GCS Bucket	License	Content
Primary data	`idc-open-data`	`idc-open-data`	No commercial restriction	>90% of IDC data
Head scans	`idc-open-data-two`	`idc-open-idc1`	No commercial restriction	Collections potentially containing head imaging
Commercial-restricted	`idc-open-data-cr`	`idc-open-cr`	Commercial use restricted (CC BY-NC)	~4% of data

Notes:

All AWS buckets are in AWS region us-east-1
Prior to IDC v19, GCS used public-datasets-idc (now superseded by idc-open-data)
The head scans bucket exists for potential future policy changes regarding facial imaging data
Important Use idc-index to get license information - do not rely on bucket name!

Why Multiple Buckets?

Licensing separation: Data with commercial-use restrictions (CC BY-NC) is isolated in idc-open-data-cr / idc-open-cr to prevent accidental commercial use
Head scan handling: Collections labeled by TCIA as potentially containing head scans are in separate buckets (idc-open-data-two / idc-open-idc1) for potential future policy compliance
Historical reasons: The bucket structure evolved as IDC grew and partnered with different cloud programs

File Organization Within Buckets

Files are organized by CRDC UUIDs, not DICOM UIDs. This enables versioning while maintaining consistent paths across cloud providers.

Directory Structure

<bucket>/
└── <crdc_series_uuid>/
    ├── <crdc_instance_uuid_1>.dcm
    ├── <crdc_instance_uuid_2>.dcm
    └── ...

Example path:

s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm

7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9 = series UUID (folder)
0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm = instance UUID (file)

CRDC UUIDs vs DICOM UIDs

Identifier Type	Format	Changes When	Use For
DICOM UID (e.g., SeriesInstanceUID)	Numeric (e.g., `1.3.6.1.4...`)	Never (included in DICOM metadata)	Clinical identification, DICOMweb queries
CRDC UUID (e.g., crdc_series_uuid)	UUID (e.g., `e127d258-37c2-...`)	Content changes	File paths, versioning, reproducibility

Key insight: A single DICOM SeriesInstanceUID may have multiple CRDC series UUIDs across IDC versions if the series content changed (instances added/removed, metadata corrected). The CRDC UUID uniquely identifies a specific version of the data.

Mapping DICOM UIDs to File Paths

Use idc-index to get file URLs from DICOM identifiers:

from idc_index import IDCClient

client = IDCClient()

# Get all file URLs for a series
series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302"
urls = client.get_series_file_URLs(seriesInstanceUID=series_uid)

for url in urls[:3]:
    print(url)
# Returns S3 URLs like: s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm

Or query the index directly for URL columns:

# Get series-level URL (points to folder)
result = client.sql_query("""
    SELECT SeriesInstanceUID, series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 3
""")

print(result[['SeriesInstanceUID', 'series_aws_url']])

Available URL column in index:

series_aws_url: S3 URL to series folder (e.g., s3://idc-open-data/uuid/*)

GCS URLs follow the same path structure—replace s3:// with gs:// (e.g., gs://idc-open-data/uuid/*). When using idc-index download methods, GCS access is handled internally.

Accessing Cloud Storage

All IDC buckets support free egress (no download fees) through partnerships with AWS Open Data and Google Public Data programs. No authentication required.

AWS S3 Access

Using AWS CLI (no account required):

# List bucket contents
aws s3 ls --no-sign-request s3://idc-open-data/

# List files in a series folder
aws s3 ls --no-sign-request s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/

# Download a single file
aws s3 cp --no-sign-request \
    s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm \
    ./local_file.dcm

# Download entire series folder
aws s3 cp --no-sign-request --recursive \
    s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ \
    ./series_folder/

Using s5cmd (faster for bulk downloads):

# Install s5cmd
# macOS: brew install s5cmd
# Linux: download from https://github.com/peak/s5cmd/releases

# Download specific series
s5cmd --no-sign-request cp 's3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*' ./local_folder/

# Download from manifest file
s5cmd --no-sign-request run manifest.txt

s5cmd manifest format: The s5cmd run command expects one s5cmd command per line, not just URLs:

cp s3://idc-open-data/uuid1/instance1.dcm ./local_folder/
cp s3://idc-open-data/uuid1/instance2.dcm ./local_folder/
cp s3://idc-open-data/uuid2/instance3.dcm ./local_folder/

IDC Portal exports manifests in this format. When creating manifests programmatically, use idc-index download methods (which handle this internally) rather than constructing manifests manually.

GCS Access

Using gsutil:

# List bucket contents
gsutil ls gs://idc-open-data/

# Download a series folder
gsutil -m cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/

Using gcloud storage (newer CLI):

gcloud storage cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/

Python Direct Access

import s3fs
import gcsfs
from idc_index import IDCClient

# First, get a file URL from idc-index
client = IDCClient()
result = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
""")
# series_aws_url is like: s3://idc-open-data/<uuid>/*
series_url = result['series_aws_url'].iloc[0]
series_path = series_url.replace('s3://', '').rstrip('/*')  # e.g., "idc-open-data/<uuid>"

# AWS S3 access
s3 = s3fs.S3FileSystem(anon=True)
files = s3.ls(series_path)
with s3.open(files[0], 'rb') as f:
    data = f.read()

# GCS access (same path structure as AWS)
gcs = gcsfs.GCSFileSystem(token='anon')
files = gcs.ls(series_path)
with gcs.open(files[0], 'rb') as f:
    data = f.read()

Versioning and Reproducibility

IDC releases new data versions every 2-4 months. The versioning system ensures reproducibility by preserving all historical data.

How Versioning Works

Snapshots: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
UUID-based: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
Cumulative buckets: All versions coexist in the same buckets—old series folders

Version change scenarios:

Change Type	DICOM UID	CRDC UUID	Effect
New series added	New	New	New folder in bucket
Instance added to series	Same	New series UUID	New folder, instances may be duplicated
Metadata corrected	Same or new	New	New folder with updated files
Series removed	N/A	N/A	Old folder remains, not in current index

Data removal caveat: In rare circumstances (e.g., data owner request, PHI incident), data may be removed from IDC entirely, including from all historical versions.

BigQuery versioned datasets (metadata only, not file storage):

For querying version-specific metadata, BigQuery provides versioned tables. See bigquery_guide.md for details.

bigquery-public-data.idc_current — alias to latest version
bigquery-public-data.idc_v23 — specific version (replace 23 with desired version)

Reproducing a Previous Analysis

The simplest way to ensure reproducibility is to save the crdc_series_uuid values of the data you use at analysis time:

from idc_index import IDCClient
import json

client = IDCClient()

# Select data for your analysis
selection = client.sql_query("""
    SELECT crdc_series_uuid
    FROM index
    WHERE collection_id = 'tcga_luad'
      AND Modality = 'CT'
    LIMIT 10
""")
series_uuids = list(selection['crdc_series_uuid'])

# Download the data
client.download_from_selection(seriesInstanceUID=series_uuids, downloadDir="./data")

# Save a manifest for reproducibility
manifest = {
    "crdc_series_uuids": series_uuids,
    "download_date": "2024-01-15",
    "idc_version": client.get_idc_version(),
    "description": "CT scans for lung cancer analysis"
}
with open("analysis_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Later, reproduce the exact dataset:
with open("analysis_manifest.json") as f:
    manifest = json.load(f)
client.download_from_selection(
    seriesInstanceUID=manifest["crdc_series_uuids"],
    downloadDir="./reproduced_data"
)

Since crdc_series_uuid identifies an immutable version of each series, saving these UUIDs guarantees you can retrieve the exact same files later.

Relationship Between Buckets, Versions, and Other Access Methods

Data Coverage Comparison

Access Method	Buckets Included	Coverage	Versions
Direct bucket access	All 3 buckets	100%	All historical
`idc-index` download	All 3 buckets	100%	Current + prior_versions_index
IDC Portal	All 3 buckets	100%	Current only
DICOMweb public proxy	All 3 buckets	100%	Current only
Google Healthcare DICOM	`idc-open-data` only	~96%	Current only

Important: The Google Healthcare API DICOM store only replicates data from idc-open-data. Data in idc-open-data-two and idc-open-data-cr (approximately 4% of total) is not available via Google Healthcare DICOMweb endpoint.

Best Practices

Use idc-index for discovery: Query metadata first, then access buckets with known UUIDs
Download defaults to AWS buckets: request GCS if needed
Save manifests: Store the series_aws_url or crdc_series_uuid values for reproducibility
Check licenses: Query license_short_name before commercial use; CC-NC data requires non-commercial use
Use current version unless reproducing: The index table has current data; use prior_versions_index only for exact reproducibility

Troubleshooting

Issue: "Access Denied" when accessing buckets

Cause: Using signed requests or wrong bucket name
Solution: Use --no-sign-request flag with AWS CLI, or anon=True with Python libraries

Issue: File not found at expected path

Cause: Using DICOM UID instead of CRDC UUID, or data changed in newer version
Solution: Query idc-index for current series_aws_url, or check prior_versions_index for historical paths

Issue: Downloaded files don't match expected series

Cause: Series was revised in a newer IDC version
Solution: Use prior_versions_index to find the exact version you need; compare crdc_series_uuid values

Issue: Some data missing from Google Healthcare DICOMweb

Cause: Google Healthcare only mirrors idc-open-data bucket (~96% of data)
Solution: Use IDC public proxy for 100% coverage, or access buckets directly

Resources

IDC Documentation:

Files and metadata - Bucket organization details
Data versioning - Versioning scheme explanation
Resolving GUIDs and UUIDs - CRDC UUID documentation
Direct loading from cloud - Python examples for cloud access

AWS Resources:

NCI IDC on AWS Open Data Registry - Bucket ARNs and access info
s5cmd - High-performance S3 client (used internally by idc-index)
AWS CLI S3 commands - Standard AWS command-line interface
Boto3 S3 documentation - AWS SDK for Python

Google Cloud Resources:

gsutil tool - Google Cloud Storage command-line tool
gcloud storage commands - Modern GCS CLI (recommended over gsutil)
Google Cloud Storage Python client - GCS SDK for Python

Related Guides:

dicomweb_guide.md - DICOMweb API access (alternative to direct bucket access)
bigquery_guide.md - Advanced metadata queries including versioned datasets

14 KiB Raw Blame History