# idc-index Command Line Interface Guide The `idc-index` package provides command-line tools for downloading DICOM data from the NCI Imaging Data Commons without writing Python code. ## Installation ```bash pip install --upgrade idc-index ``` After installation, the `idc` command is available in your terminal. ## Available Commands | Command | Purpose | |---------|---------| | `idc download` | General-purpose download with auto-detection of input type | | `idc download-from-manifest` | Download from manifest file with validation and progress tracking | | `idc download-from-selection` | Filter-based download with multiple criteria | --- ## idc download General-purpose download command that intelligently interprets input. It determines whether the input corresponds to a manifest file path or a list of identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid). ### Usage ```bash # Download entire collection idc download rider_pilot --download-dir ./data # Download specific series by UID idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data # Download multiple items (comma-separated) idc download "tcga_luad,tcga_lusc" --download-dir ./data # Download from manifest file (auto-detected by file extension) idc download manifest.txt --download-dir ./data ``` ### Options | Option | Description | |--------|-------------| | `--download-dir` | Destination directory (default: current directory) | | `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) | | `--log-level` | Verbosity: debug, info, warning, error, critical | ### Directory Template Variables Use these variables in `--dir-template` to organize downloads: - `%collection_id` - Collection identifier - `%PatientID` - Patient identifier - `%StudyInstanceUID` - Study UID - `%SeriesInstanceUID` - Series UID - `%Modality` - Imaging modality (CT, MR, PT, etc.) **Examples:** ```bash # Flat structure (all files in one directory) idc download rider_pilot --download-dir ./data --dir-template "" # Simplified hierarchy idc download rider_pilot --download-dir ./data --dir-template "%collection_id/%PatientID/%Modality" ``` --- ## idc download-from-manifest Specialized for downloading from manifest files with built-in validation, progress tracking, and resume capability. ### Usage ```bash # Basic download from manifest idc download-from-manifest --manifest-file cohort.txt --download-dir ./data # With progress bar and validation idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --show-progress-bar # Resume interrupted download with s5cmd sync idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync ``` ### Options | Option | Description | |--------|-------------| | `--manifest-file` | **Required.** Path to manifest file containing S3 URLs | | `--download-dir` | **Required.** Destination directory | | `--validate-manifest` | Validate manifest before download (enabled by default) | | `--show-progress-bar` | Display download progress | | `--use-s5cmd-sync` | Enable resumable downloads - skips already-downloaded files | | `--quiet` | Suppress subprocess output | | `--dir-template` | Directory hierarchy template | | `--log-level` | Logging verbosity | ### Manifest File Format Manifest files contain S3 URLs, one per line: ``` s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/* s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/* ``` **How to get a manifest file:** 1. **IDC Portal**: Export cohort selection as manifest 2. **Python query**: Generate from SQL results ```python from idc_index import IDCClient client = IDCClient() results = client.sql_query(""" SELECT series_aws_url FROM index WHERE collection_id = 'rider_pilot' AND Modality = 'CT' """) with open('ct_manifest.txt', 'w') as f: for url in results['series_aws_url']: f.write(url + '\n') ``` --- ## idc download-from-selection Download data using filter criteria. Filters are applied sequentially. ### Usage ```bash # Download by collection idc download-from-selection --collection-id rider_pilot --download-dir ./data # Download specific series idc download-from-selection --series-instance-uid "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data # Multiple filters idc download-from-selection --collection-id nlst --patient-id "100004" --download-dir ./data # Dry run - see what would be downloaded without actually downloading idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data ``` ### Options | Option | Description | |--------|-------------| | `--download-dir` | **Required.** Destination directory | | `--collection-id` | Filter by collection identifier | | `--patient-id` | Filter by patient identifier | | `--study-instance-uid` | Filter by study UID | | `--series-instance-uid` | Filter by series UID | | `--crdc-series-uuid` | Filter by CRDC UUID | | `--dry-run` | Calculate cohort size without downloading | | `--show-progress-bar` | Display download progress | | `--use-s5cmd-sync` | Enable resumable downloads | | `--dir-template` | Directory hierarchy template | ### Dry Run for Size Estimation Use `--dry-run` to estimate download size before committing: ```bash idc download-from-selection --collection-id nlst --dry-run --download-dir ./data ``` This shows: - Number of series matching filters - Total download size - No files are downloaded --- ## Common Workflows ### 1. Download Small Collection for Testing ```bash # rider_pilot is ~1GB - good for testing idc download rider_pilot --download-dir ./test_data ``` ### 2. Large Dataset with Progress and Resume ```bash # Use s5cmd sync for large downloads - can resume if interrupted idc download-from-selection \ --collection-id nlst \ --download-dir ./nlst_data \ --show-progress-bar \ --use-s5cmd-sync ``` ### 3. Estimate Size Before Download ```bash # Check size first idc download-from-selection --collection-id tcga_luad --dry-run --download-dir ./data # Then download if size is acceptable idc download-from-selection --collection-id tcga_luad --download-dir ./data ``` ### 4. Download Specific Modality via Python + CLI ```python # First, query for series UIDs in Python from idc_index import IDCClient client = IDCClient() results = client.sql_query(""" SELECT SeriesInstanceUID FROM index WHERE collection_id = 'nlst' AND Modality = 'CT' AND BodyPartExamined = 'CHEST' LIMIT 50 """) # Save to manifest results['SeriesInstanceUID'].to_csv('my_series.csv', index=False, header=False) ``` ```bash # Then download via CLI idc download my_series.csv --download-dir ./lung_ct ``` --- ## Built-in Safety Features The CLI includes several safety features: - **Disk space checking**: Verifies sufficient space before starting downloads - **Manifest validation**: Validates manifest file format by default - **Progress tracking**: Optional progress bar for monitoring large downloads - **Resume capability**: Use `--use-s5cmd-sync` to continue interrupted downloads --- ## Troubleshooting ### Download Interrupted Use `--use-s5cmd-sync` to resume: ```bash idc download-from-manifest --manifest-file cohort.txt --download-dir ./data --use-s5cmd-sync ``` ### Connection Timeout For unstable networks, download in smaller batches using Python to generate multiple manifests, then download sequentially. --- ## See Also - [idc-index Documentation](https://idc-index.readthedocs.io/) - [IDC Portal](https://portal.imaging.datacommons.cancer.gov/) - Interactive cohort building - [IDC Tutorials](https://github.com/ImagingDataCommons/IDC-Tutorials)