Files
claude-scientific-skills/scientific-skills/modal/references/scaling.md
2026-03-23 16:21:31 -07:00

174 lines
4.4 KiB
Markdown

# Modal Scaling and Concurrency
## Table of Contents
- [Autoscaling](#autoscaling)
- [Configuration](#configuration)
- [Parallel Execution](#parallel-execution)
- [Concurrent Inputs](#concurrent-inputs)
- [Dynamic Batching](#dynamic-batching)
- [Dynamic Autoscaler Updates](#dynamic-autoscaler-updates)
- [Limits](#limits)
## Autoscaling
Modal automatically manages a pool of containers for each function:
- Spins up containers when there's no capacity for new inputs
- Spins down idle containers to save costs
- Scales from zero (no cost when idle) to thousands of containers
No configuration needed for basic autoscaling — it works out of the box.
## Configuration
Fine-tune autoscaling behavior:
```python
@app.function(
max_containers=100, # Upper limit on container count
min_containers=2, # Keep 2 warm (reduces cold starts)
buffer_containers=5, # Reserve 5 extra for burst traffic
scaledown_window=300, # Wait 5 min idle before shutting down
)
def handle_request(data):
...
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_containers` | Unlimited | Hard cap on total containers |
| `min_containers` | 0 | Minimum warm containers (costs money even when idle) |
| `buffer_containers` | 0 | Extra containers to prevent queuing |
| `scaledown_window` | 60 | Seconds of idle time before shutdown |
### Trade-offs
- Higher `min_containers` = lower latency, higher cost
- Higher `buffer_containers` = less queuing, higher cost
- Lower `scaledown_window` = faster cost savings, more cold starts
## Parallel Execution
### `.map()` — Process Many Inputs
```python
@app.function()
def process(item):
return heavy_computation(item)
@app.local_entrypoint()
def main():
items = list(range(10_000))
results = list(process.map(items))
```
Modal automatically scales containers to handle the workload. Results maintain input order.
### `.map()` Options
```python
# Unordered results (faster)
for result in process.map(items, order_outputs=False):
handle(result)
# Collect errors instead of raising
results = list(process.map(items, return_exceptions=True))
for r in results:
if isinstance(r, Exception):
print(f"Error: {r}")
```
### `.starmap()` — Multi-Argument
```python
@app.function()
def add(x, y):
return x + y
results = list(add.starmap([(1, 2), (3, 4), (5, 6)]))
# [3, 7, 11]
```
### `.spawn()` — Fire-and-Forget
```python
# Returns immediately
call = process.spawn(large_data)
# Check status or get result later
result = call.get()
```
Up to 1 million pending `.spawn()` calls.
## Concurrent Inputs
By default, each container handles one input at a time. Use `@modal.concurrent` to handle multiple:
```python
@app.function(gpu="L40S")
@modal.concurrent(max_inputs=10)
async def predict(text: str):
result = await model.predict_async(text)
return result
```
This is ideal for I/O-bound workloads or async inference where a single GPU can handle multiple requests.
### With Web Endpoints
```python
@app.function(gpu="L40S")
@modal.concurrent(max_inputs=20)
@modal.asgi_app()
def web_service():
return fastapi_app
```
## Dynamic Batching
Collect inputs into batches for efficient GPU utilization:
```python
@app.function(gpu="L40S")
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(texts: list[str]):
# Called with up to 32 texts at once
embeddings = model.encode(texts)
return list(embeddings)
```
- `max_batch_size` — Maximum inputs per batch
- `wait_ms` — How long to wait for more inputs before processing
- The function receives a list and must return a list of the same length
## Dynamic Autoscaler Updates
Adjust autoscaling at runtime without redeploying:
```python
@app.function()
def scale_up_for_peak():
process = modal.Function.from_name("my-app", "process")
process.update_autoscaler(min_containers=10, buffer_containers=20)
@app.function()
def scale_down_after_peak():
process = modal.Function.from_name("my-app", "process")
process.update_autoscaler(min_containers=1, buffer_containers=2)
```
Settings revert to the decorator values on the next deployment.
## Limits
| Resource | Limit |
|----------|-------|
| Pending inputs (unassigned) | 2,000 |
| Total inputs (running + pending) | 25,000 |
| Pending `.spawn()` inputs | 1,000,000 |
| Concurrent inputs per `.map()` | 1,000 |
| Rate limit (web endpoints) | 200 req/s |
Exceeding these limits triggers `Resource Exhausted` errors. Implement retry logic for resilience.