
Modal Scaling and Concurrency

Table of Contents

  • Autoscaling
  • Parallel Execution
  • Concurrent Inputs
  • Dynamic Batching
  • Dynamic Autoscaler Updates
  • Limits

Autoscaling

Modal automatically manages a pool of containers for each function:

  • Spins up containers when incoming inputs exceed current capacity
  • Spins down idle containers to save costs
  • Scales from zero (no cost when idle) to thousands of containers

No configuration needed for basic autoscaling — it works out of the box.

Configuration

Fine-tune autoscaling behavior:

@app.function(
    max_containers=100,     # Upper limit on container count
    min_containers=2,       # Keep 2 warm (reduces cold starts)
    buffer_containers=5,    # Reserve 5 extra for burst traffic
    scaledown_window=300,   # Wait 5 min idle before shutting down
)
def handle_request(data):
    ...

| Parameter | Default | Description |
|---|---|---|
| `max_containers` | Unlimited | Hard cap on total containers |
| `min_containers` | 0 | Minimum warm containers (costs money even when idle) |
| `buffer_containers` | 0 | Extra containers kept ready to prevent queuing |
| `scaledown_window` | 60 | Seconds of idle time before shutdown |

Trade-offs

  • Higher min_containers = lower latency, higher cost
  • Higher buffer_containers = less queuing, higher cost
  • Lower scaledown_window = faster cost savings, more cold starts
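To make the `min_containers` trade-off concrete, here is a rough cost sketch. The per-container hourly rate is a hypothetical placeholder for illustration, not a real Modal price:

```python
# Hypothetical rate for illustration only; check Modal's pricing page.
HOURLY_RATE_USD = 0.50

def idle_cost(min_containers: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Worst-case cost of warm containers that never receive traffic."""
    return min_containers * hours * rate

# Two warm containers kept up for a 30-day month:
monthly = idle_cost(min_containers=2, hours=30 * 24)
print(f"${monthly:.2f}")  # → $720.00
```

The same arithmetic applies to `buffer_containers`: each buffered container is paid capacity whether or not a burst arrives.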

Parallel Execution

.map() — Process Many Inputs

@app.function()
def process(item):
    return heavy_computation(item)

@app.local_entrypoint()
def main():
    items = list(range(10_000))
    results = list(process.map(items))

Modal automatically scales containers to handle the workload. Results maintain input order.

.map() Options

# Unordered results (faster)
for result in process.map(items, order_outputs=False):
    handle(result)

# Collect errors instead of raising
results = list(process.map(items, return_exceptions=True))
for r in results:
    if isinstance(r, Exception):
        print(f"Error: {r}")
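The `return_exceptions` behavior can be mimicked in plain Python, which is handy for unit-testing the per-item function locally before deploying. The wrapper below is illustrative and not part of Modal's API:

```python
def map_return_exceptions(fn, items):
    """Apply fn to each item, collecting exceptions instead of raising."""
    results = []
    for item in items:
        try:
            results.append(fn(item))
        except Exception as exc:
            results.append(exc)
    return results

results = map_return_exceptions(lambda x: 10 // x, [5, 0, 2])
# results == [2, ZeroDivisionError(...), 5]
```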

.starmap() — Multi-Argument

@app.function()
def add(x, y):
    return x + y

results = list(add.starmap([(1, 2), (3, 4), (5, 6)]))
# [3, 7, 11]
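Locally, the same unpacking semantics are available through Python's built-in `itertools.starmap`, which makes it easy to test the function without deploying:

```python
from itertools import starmap

def add(x, y):
    return x + y

# Same tuple-unpacking behavior as Modal's .starmap(), run locally:
results = list(starmap(add, [(1, 2), (3, 4), (5, 6)]))
print(results)  # → [3, 7, 11]
```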

.spawn() — Fire-and-Forget

# Returns immediately
call = process.spawn(large_data)

# Check status or get result later
result = call.get()

Up to 1 million pending .spawn() calls.
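The fire-and-forget semantics resemble submitting work to a `concurrent.futures` executor: submission returns a handle immediately, and the result is fetched later. A local analogue, for intuition only:

```python
from concurrent.futures import ThreadPoolExecutor

def process(data):
    return len(data)

with ThreadPoolExecutor() as pool:
    # submit() returns a Future immediately, like .spawn()
    future = pool.submit(process, "large payload")
    # ... do other work while it runs ...
    result = future.result()  # block for the result, like call.get()

print(result)  # → 13
```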

Concurrent Inputs

By default, each container handles one input at a time. Use @modal.concurrent to handle multiple:

@app.function(gpu="L40S")
@modal.concurrent(max_inputs=10)
async def predict(text: str):
    result = await model.predict_async(text)
    return result

This is ideal for I/O-bound workloads or async inference where a single GPU can handle multiple requests.
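Back-of-the-envelope throughput: when each request spends most of its time awaiting I/O, one container can overlap up to `max_inputs` requests. Assuming a hypothetical 250 ms per request:

```python
# Hypothetical numbers for illustration only.
max_inputs = 10           # concurrent inputs per container
request_seconds = 0.25    # time per request, mostly spent awaiting I/O

# One input at a time: 1 / 0.25 = 4 req/s per container.
serial_rps = 1 / request_seconds
# Ten overlapping awaits give roughly a 10x upper bound:
concurrent_rps = max_inputs / request_seconds
print(serial_rps, concurrent_rps)  # → 4.0 40.0
```

The speedup only holds while requests are I/O-bound; CPU- or GPU-bound work serializes and sees little benefit.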

With Web Endpoints

@app.function(gpu="L40S")
@modal.concurrent(max_inputs=20)
@modal.asgi_app()
def web_service():
    return fastapi_app

Dynamic Batching

Collect inputs into batches for efficient GPU utilization:

@app.function(gpu="L40S")
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(texts: list[str]):
    # Called with up to 32 texts at once
    embeddings = model.encode(texts)
    return list(embeddings)
  • max_batch_size — Maximum inputs per batch
  • wait_ms — How long to wait for more inputs before processing
  • The function receives a list and must return a list of the same length
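The list-in, list-out contract can be checked with a plain function before adding the decorator. Here `model.encode` is replaced by a trivial stand-in:

```python
def batch_predict(texts: list[str]) -> list[int]:
    """Stand-in for a batched model call: list in, list of equal length out."""
    return [len(t) for t in texts]

batch = ["hello", "modal", "scaling"]
out = batch_predict(batch)
assert len(out) == len(batch)  # the contract @modal.batched enforces
print(out)  # → [5, 5, 7]
```

If the decorated function returns a list of a different length, Modal cannot route results back to their callers.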

Dynamic Autoscaler Updates

Adjust autoscaling at runtime without redeploying:

@app.function()
def scale_up_for_peak():
    process = modal.Function.from_name("my-app", "process")
    process.update_autoscaler(min_containers=10, buffer_containers=20)

@app.function()
def scale_down_after_peak():
    process = modal.Function.from_name("my-app", "process")
    process.update_autoscaler(min_containers=1, buffer_containers=2)

Settings revert to the decorator values on the next deployment.

Limits

| Resource | Limit |
|---|---|
| Pending inputs (unassigned) | 2,000 |
| Total inputs (running + pending) | 25,000 |
| Pending `.spawn()` inputs | 1,000,000 |
| Concurrent inputs per `.map()` | 1,000 |
| Rate limit (web endpoints) | 200 req/s |

Exceeding these limits triggers Resource Exhausted errors. Implement retry logic for resilience.
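A minimal retry-with-backoff sketch for absorbing transient Resource Exhausted errors; the delays and exception type are illustrative, not prescribed by Modal:

```python
import random
import time

def call_with_retry(fn, *args, retries=5, base_delay=0.5, exc=Exception):
    """Retry fn with exponential backoff and jitter on the given exception."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except exc:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error
            # Exponential backoff: 0.5s, 1s, 2s, ... plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Wrap calls such as `process.remote(...)` or the driver loop around `.map()` in a helper like this when sustained load may push past the limits above.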