claude-scientific-skills/scientific-skills/modal/references/gpu.md

# Modal GPU Compute

## Table of Contents

- [Available GPUs](#available-gpus)
- [Requesting GPUs](#requesting-gpus)
- [GPU Selection Guide](#gpu-selection-guide)
- [Multi-GPU](#multi-gpu)
- [GPU Fallback Chains](#gpu-fallback-chains)
- [Auto-Upgrades](#auto-upgrades)
- [Multi-GPU Training](#multi-gpu-training)

## Available GPUs

| GPU | VRAM | Max per Container | Best For |
|-----|------|-------------------|----------|
| T4 | 16 GB | 8 | Budget inference, small models |
| L4 | 24 GB | 8 | Inference, video processing |
| A10 | 24 GB | 4 | Inference, fine-tuning small models |
| L40S | 48 GB | 8 | Inference (best cost/perf), medium models |
| A100-40GB | 40 GB | 8 | Training, large model inference |
| A100-80GB | 80 GB | 8 | Training, large models |
| RTX-PRO-6000 | 48 GB | 8 | Rendering, inference |
| H100 | 80 GB | 8 | Large-scale training, fast inference |
| H200 | 141 GB | 8 | Very large models, training |
| B200 | 192 GB | 8 | Largest models, maximum throughput |
| B200+ | 192 GB | 8 | B200 or B300, B200 pricing |

## Requesting GPUs

### Basic Request

```python
@app.function(gpu="H100")
def train():
    import torch
    assert torch.cuda.is_available()
    print(f"Using: {torch.cuda.get_device_name(0)}")
```

### String Shorthand

```python
gpu="T4"           # Single T4
gpu="A100-80GB"    # Single A100 80GB
gpu="H100:4"       # Four H100s
```

### GPU Object (Advanced)

```python
@app.function(gpu=modal.gpu.H100(count=2))
def multi_gpu():
    ...
```

## GPU Selection Guide

### For Inference

| Model Size | Recommended GPU | Why |
|-----------|----------------|-----|
| < 7B params | T4, L4 | Cost-effective, sufficient VRAM |
| 7B-13B params | L40S | Best cost/performance, 48 GB VRAM |
| 13B-70B params | A100-80GB, H100 | Large VRAM, fast memory bandwidth |
| 70B+ params | H100:2+, H200, B200 | Multi-GPU or very large VRAM |

### For Training

| Task | Recommended GPU |
|------|----------------|
| Fine-tuning (LoRA) | L40S, A100-40GB |
| Full fine-tuning small models | A100-80GB |
| Full fine-tuning large models | H100:4+, H200 |
| Pre-training | H100:8, B200:8 |

### General Recommendation

L40S is the best default for inference workloads — it offers an excellent trade-off of cost and performance with 48 GB of GPU RAM.

## Multi-GPU

Request multiple GPUs by appending `:count`:

```python
@app.function(gpu="H100:4")
def distributed():
    import torch
    print(f"GPUs available: {torch.cuda.device_count()}")
    # All 4 GPUs are on the same physical machine
```

- Up to 8 GPUs for most types (up to 4 for A10)
- All GPUs attach to the same physical machine
- Requesting more than 2 GPUs may result in longer wait times
- Maximum VRAM: 8 x B200 = 1,536 GB

## GPU Fallback Chains

Specify a prioritized list of GPU types:

```python
@app.function(gpu=["H100", "A100-80GB", "L40S"])
def flexible():
    # Modal tries H100 first, then A100-80GB, then L40S
    ...
```

Useful for reducing queue times when a specific GPU isn't available.

## Auto-Upgrades

### H100 → H200

Modal may automatically upgrade H100 requests to H200 at no extra cost. To prevent this:

```python
@app.function(gpu="H100!")  # Exclamation mark prevents auto-upgrade
def must_use_h100():
    ...
```

### A100 → A100-80GB

A100-40GB requests may be upgraded to 80GB at no extra cost.

### B200+

`gpu="B200+"` allows Modal to run on B200 or B300 GPUs at B200 pricing. Requires CUDA 13.0+.

## Multi-GPU Training

Modal supports multi-GPU training on a single node. Multi-node training is in private beta.

### PyTorch DDP Example

```python
@app.function(gpu="H100:4", image=image, timeout=86400)
def train_distributed():
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}")
    # ... training loop with DDP ...
```

### PyTorch Lightning

When using frameworks that re-execute Python entrypoints (like PyTorch Lightning), either:

1. Set strategy to `ddp_spawn` or `ddp_notebook`
2. Or run training as a subprocess

```python
@app.function(gpu="H100:4", image=image)
def train():
    import subprocess
    subprocess.run(["python", "train_script.py"], check=True)
```

### Hugging Face Accelerate

```python
@app.function(gpu="A100-80GB:4", image=image)
def finetune():
    import subprocess
    subprocess.run([
        "accelerate", "launch",
        "--num_processes", "4",
        "train.py"
    ], check=True)
```