# Modal Scaling and Concurrency ## Table of Contents - [Autoscaling](#autoscaling) - [Configuration](#configuration) - [Parallel Execution](#parallel-execution) - [Concurrent Inputs](#concurrent-inputs) - [Dynamic Batching](#dynamic-batching) - [Dynamic Autoscaler Updates](#dynamic-autoscaler-updates) - [Limits](#limits) ## Autoscaling Modal automatically manages a pool of containers for each function: - Spins up containers when there's no capacity for new inputs - Spins down idle containers to save costs - Scales from zero (no cost when idle) to thousands of containers No configuration needed for basic autoscaling — it works out of the box. ## Configuration Fine-tune autoscaling behavior: ```python @app.function( max_containers=100, # Upper limit on container count min_containers=2, # Keep 2 warm (reduces cold starts) buffer_containers=5, # Reserve 5 extra for burst traffic scaledown_window=300, # Wait 5 min idle before shutting down ) def handle_request(data): ... ``` | Parameter | Default | Description | |-----------|---------|-------------| | `max_containers` | Unlimited | Hard cap on total containers | | `min_containers` | 0 | Minimum warm containers (costs money even when idle) | | `buffer_containers` | 0 | Extra containers to prevent queuing | | `scaledown_window` | 60 | Seconds of idle time before shutdown | ### Trade-offs - Higher `min_containers` = lower latency, higher cost - Higher `buffer_containers` = less queuing, higher cost - Lower `scaledown_window` = faster cost savings, more cold starts ## Parallel Execution ### `.map()` — Process Many Inputs ```python @app.function() def process(item): return heavy_computation(item) @app.local_entrypoint() def main(): items = list(range(10_000)) results = list(process.map(items)) ``` Modal automatically scales containers to handle the workload. Results maintain input order. ### `.map()` Options ```python # Unordered results (faster) for result in process.map(items, order_outputs=False): handle(result) # Collect errors instead of raising results = list(process.map(items, return_exceptions=True)) for r in results: if isinstance(r, Exception): print(f"Error: {r}") ``` ### `.starmap()` — Multi-Argument ```python @app.function() def add(x, y): return x + y results = list(add.starmap([(1, 2), (3, 4), (5, 6)])) # [3, 7, 11] ``` ### `.spawn()` — Fire-and-Forget ```python # Returns immediately call = process.spawn(large_data) # Check status or get result later result = call.get() ``` Up to 1 million pending `.spawn()` calls. ## Concurrent Inputs By default, each container handles one input at a time. Use `@modal.concurrent` to handle multiple: ```python @app.function(gpu="L40S") @modal.concurrent(max_inputs=10) async def predict(text: str): result = await model.predict_async(text) return result ``` This is ideal for I/O-bound workloads or async inference where a single GPU can handle multiple requests. ### With Web Endpoints ```python @app.function(gpu="L40S") @modal.concurrent(max_inputs=20) @modal.asgi_app() def web_service(): return fastapi_app ``` ## Dynamic Batching Collect inputs into batches for efficient GPU utilization: ```python @app.function(gpu="L40S") @modal.batched(max_batch_size=32, wait_ms=100) async def batch_predict(texts: list[str]): # Called with up to 32 texts at once embeddings = model.encode(texts) return list(embeddings) ``` - `max_batch_size` — Maximum inputs per batch - `wait_ms` — How long to wait for more inputs before processing - The function receives a list and must return a list of the same length ## Dynamic Autoscaler Updates Adjust autoscaling at runtime without redeploying: ```python @app.function() def scale_up_for_peak(): process = modal.Function.from_name("my-app", "process") process.update_autoscaler(min_containers=10, buffer_containers=20) @app.function() def scale_down_after_peak(): process = modal.Function.from_name("my-app", "process") process.update_autoscaler(min_containers=1, buffer_containers=2) ``` Settings revert to the decorator values on the next deployment. ## Limits | Resource | Limit | |----------|-------| | Pending inputs (unassigned) | 2,000 | | Total inputs (running + pending) | 25,000 | | Pending `.spawn()` inputs | 1,000,000 | | Concurrent inputs per `.map()` | 1,000 | | Rate limit (web endpoints) | 200 req/s | Exceeding these limits triggers `Resource Exhausted` errors. Implement retry logic for resilience.