
Modal Scaling and Concurrency

Table of Contents

  • Autoscaling
  • Parallel Execution
  • Concurrent Inputs
  • Dynamic Batching
  • Dynamic Autoscaler Updates
  • Limits

Autoscaling

Modal automatically manages a pool of containers for each function:

  • Spins up containers when incoming inputs exceed current capacity
  • Spins down idle containers to save costs
  • Scales from zero (no cost when idle) to thousands of containers

No configuration needed for basic autoscaling — it works out of the box.

Configuration

Fine-tune autoscaling behavior:

@app.function(
    max_containers=100,     # Upper limit on container count
    min_containers=2,       # Keep 2 warm (reduces cold starts)
    buffer_containers=5,    # Reserve 5 extra for burst traffic
    scaledown_window=300,   # Wait 5 min idle before shutting down
)
def handle_request(data):
    ...

| Parameter | Default | Description |
|---|---|---|
| `max_containers` | Unlimited | Hard cap on total containers |
| `min_containers` | 0 | Minimum warm containers (costs money even when idle) |
| `buffer_containers` | 0 | Extra containers kept ready to prevent queuing |
| `scaledown_window` | 60 | Seconds of idle time before shutdown |

Trade-offs

  • Higher min_containers = lower latency, higher cost
  • Higher buffer_containers = less queuing, higher cost
  • Lower scaledown_window = faster cost savings, more cold starts
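To make the `min_containers` trade-off concrete, here is a rough cost sketch. The per-container hourly rate is a hypothetical placeholder for illustration, not a real Modal price:

```python
# Hypothetical rate for illustration only; check Modal's pricing page.
HOURLY_RATE_USD = 0.50

def idle_cost(min_containers: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Worst-case cost of warm containers that never receive traffic."""
    return min_containers * hours * rate

# Two warm containers kept up for a 30-day month:
monthly = idle_cost(min_containers=2, hours=30 * 24)
print(f"${monthly:.2f}")  # → $720.00
```

The same arithmetic applies to `buffer_containers`: each buffered container is paid capacity whether or not a burst arrives.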

Parallel Execution

.map() — Process Many Inputs

@app.function()
def process(item):
    return heavy_computation(item)

@app.local_entrypoint()
def main():
    items = list(range(10_000))
    results = list(process.map(items))

Modal automatically scales containers to handle the workload. Results maintain input order.

.map() Options

# Unordered results (faster)
for result in process.map(items, order_outputs=False):
    handle(result)

# Collect errors instead of raising
results = list(process.map(items, return_exceptions=True))
for r in results:
    if isinstance(r, Exception):
        print(f"Error: {r}")
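The `return_exceptions` behavior can be mimicked in plain Python, which is handy for unit-testing the per-item function locally before deploying. The wrapper below is illustrative and not part of Modal's API:

```python
def map_return_exceptions(fn, items):
    """Apply fn to each item, collecting exceptions instead of raising."""
    results = []
    for item in items:
        try:
            results.append(fn(item))
        except Exception as exc:
            results.append(exc)
    return results

results = map_return_exceptions(lambda x: 10 // x, [5, 0, 2])
# results == [2, ZeroDivisionError(...), 5]
```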

.starmap() — Multi-Argument

@app.function()
def add(x, y):
    return x + y

results = list(add.starmap([(1, 2), (3, 4), (5, 6)]))
# [3, 7, 11]
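Locally, the same unpacking semantics are available through Python's built-in `itertools.starmap`, which makes it easy to test the function without deploying:

```python
from itertools import starmap

def add(x, y):
    return x + y

# Same tuple-unpacking behavior as Modal's .starmap(), run locally:
results = list(starmap(add, [(1, 2), (3, 4), (5, 6)]))
print(results)  # → [3, 7, 11]
```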

.spawn() — Fire-and-Forget

# Returns immediately
call = process.spawn(large_data)

# Check status or get result later
result = call.get()

Up to 1 million pending .spawn() calls.
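The fire-and-forget semantics resemble submitting work to a `concurrent.futures` executor: submission returns a handle immediately, and the result is fetched later. A local analogue, for intuition only:

```python
from concurrent.futures import ThreadPoolExecutor

def process(data):
    return len(data)

with ThreadPoolExecutor() as pool:
    # submit() returns a Future immediately, like .spawn()
    future = pool.submit(process, "large payload")
    # ... do other work while it runs ...
    result = future.result()  # block for the result, like call.get()

print(result)  # → 13
```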

Concurrent Inputs

By default, each container handles one input at a time. Use @modal.concurrent to handle multiple:

@app.function(gpu="L40S")
@modal.concurrent(max_inputs=10)
async def predict(text: str):
    result = await model.predict_async(text)
    return result

This is ideal for I/O-bound workloads or async inference where a single GPU can handle multiple requests.
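Back-of-the-envelope throughput: when each request spends most of its time awaiting I/O, one container can overlap up to `max_inputs` requests. Assuming a hypothetical 250 ms per request:

```python
# Hypothetical numbers for illustration only.
max_inputs = 10           # concurrent inputs per container
request_seconds = 0.25    # time per request, mostly spent awaiting I/O

# One input at a time: 1 / 0.25 = 4 req/s per container.
serial_rps = 1 / request_seconds
# Ten overlapping awaits give roughly a 10x upper bound:
concurrent_rps = max_inputs / request_seconds
print(serial_rps, concurrent_rps)  # → 4.0 40.0
```

The speedup only holds while requests are I/O-bound; CPU- or GPU-bound work serializes and sees little benefit.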

With Web Endpoints

@app.function(gpu="L40S")
@modal.concurrent(max_inputs=20)
@modal.asgi_app()
def web_service():
    return fastapi_app

Dynamic Batching

Collect inputs into batches for efficient GPU utilization:

@app.function(gpu="L40S")
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(texts: list[str]):
    # Called with up to 32 texts at once
    embeddings = model.encode(texts)
    return list(embeddings)
  • max_batch_size — Maximum inputs per batch
  • wait_ms — How long to wait for more inputs before processing
  • The function receives a list and must return a list of the same length
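The list-in, list-out contract can be checked with a plain function before adding the decorator. Here `model.encode` is replaced by a trivial stand-in:

```python
def batch_predict(texts: list[str]) -> list[int]:
    """Stand-in for a batched model call: list in, list of equal length out."""
    return [len(t) for t in texts]

batch = ["hello", "modal", "scaling"]
out = batch_predict(batch)
assert len(out) == len(batch)  # the contract @modal.batched enforces
print(out)  # → [5, 5, 7]
```

If the decorated function returns a list of a different length, Modal cannot route results back to their callers.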

Dynamic Autoscaler Updates

Adjust autoscaling at runtime without redeploying:

@app.function()
def scale_up_for_peak():
    process = modal.Function.from_name("my-app", "process")
    process.update_autoscaler(min_containers=10, buffer_containers=20)

@app.function()
def scale_down_after_peak():
    process = modal.Function.from_name("my-app", "process")
    process.update_autoscaler(min_containers=1, buffer_containers=2)

Settings revert to the decorator values on the next deployment.

Limits

| Resource | Limit |
|---|---|
| Pending inputs (unassigned) | 2,000 |
| Total inputs (running + pending) | 25,000 |
| Pending `.spawn()` inputs | 1,000,000 |
| Concurrent inputs per `.map()` | 1,000 |
| Rate limit (web endpoints) | 200 req/s |

Exceeding these limits triggers Resource Exhausted errors. Implement retry logic for resilience.
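A minimal retry-with-backoff sketch for absorbing transient Resource Exhausted errors; the delays and exception type are illustrative, not prescribed by Modal:

```python
import random
import time

def call_with_retry(fn, *args, retries=5, base_delay=0.5, exc=Exception):
    """Retry fn with exponential backoff and jitter on the given exception."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except exc:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error
            # Exponential backoff: 0.5s, 1s, 2s, ... plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Wrap calls such as `process.remote(...)` or the driver loop around `.map()` in a helper like this when sustained load may push past the limits above.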