Update Modal skill

This commit is contained in:
Timothy Kassis
2026-03-23 16:21:31 -07:00
parent 71e26ffa6d
commit b75f4e8d08
15 changed files with 2062 additions and 2413 deletions


@@ -1,230 +1,173 @@
# Modal Scaling and Concurrency

## Table of Contents

- [Autoscaling](#autoscaling)
- [Configuration](#configuration)
- [Parallel Execution](#parallel-execution)
- [Concurrent Inputs](#concurrent-inputs)
- [Dynamic Batching](#dynamic-batching)
- [Dynamic Autoscaler Updates](#dynamic-autoscaler-updates)
- [Concurrency Mechanisms](#concurrency-mechanisms)
- [Limits](#limits)
- [Async Usage](#async-usage)
- [Common Gotchas](#common-gotchas)

## Autoscaling

Every Modal Function corresponds to an autoscaling pool of containers. Modal's autoscaler:

- Spins up containers when there is no spare capacity for new inputs
- Spins down idle containers to save costs
- Scales from zero (no cost when idle) up to thousands of containers

Autoscaling decisions are made quickly and frequently. No configuration is needed for basic autoscaling; it works out of the box.
## Configuration

Fine-tune autoscaling behavior with parameters on the function decorator:

```python
@app.function(
    max_containers=100,   # Upper limit on container count
    min_containers=2,     # Keep 2 warm (reduces cold starts)
    buffer_containers=5,  # Reserve 5 extra for burst traffic
    scaledown_window=300, # Wait 5 min idle before shutting down
)
def handle_request(data):
    ...
```
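How these parameters interact can be pictured with a toy sizing rule (a deliberate simplification, not Modal's actual algorithm): the desired container count tracks demand plus `buffer_containers`, clamped to `[min_containers, max_containers]`:

```python
def desired_containers(
    busy: int,                 # containers currently processing inputs
    queued: int,               # inputs waiting for a container
    min_containers: int = 0,
    max_containers: int = 100,
    buffer_containers: int = 0,
) -> int:
    """Toy sizing rule: demand plus buffer, clamped to the configured bounds."""
    demand = busy + queued + buffer_containers
    return max(min_containers, min(max_containers, demand))

print(desired_containers(busy=0, queued=0, min_containers=2))      # 2 (warm-pool floor)
print(desired_containers(busy=90, queued=30, max_containers=100))  # 100 (hard cap)
```

Even with zero traffic, `min_containers` keeps a floor of warm containers; under a burst, `max_containers` caps the pool.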
| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_containers` | Unlimited | Hard cap on total containers |
| `min_containers` | 0 | Minimum warm containers (costs money even when idle) |
| `buffer_containers` | 0 | Extra containers kept ready so bursts of inputs don't queue |
| `scaledown_window` | 60 | Seconds of idle time before shutdown |

### Trade-offs

- Higher `min_containers` = lower latency, higher cost
- Higher `buffer_containers` = less queuing, higher cost
- Lower `scaledown_window` = faster cost savings, more cold starts
## Parallel Execution
### `.map()` — Process Many Inputs
```python
@app.function()
def process(item):
    return heavy_computation(item)

@app.local_entrypoint()
def main():
    items = list(range(10_000))
    results = list(process.map(items))
```
Modal automatically scales containers to handle the workload. Results maintain input order.
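Conceptually, `.map()` is an ordered parallel map. A local stand-in using only the standard library (no Modal involved) shows the same contract: results come back in input order even when they finish out of order:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x  # local stand-in for a Modal-decorated function

xs = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, xs))  # ordered, like Modal's .map()

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```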
### `.map()` Options
```python
# Unordered results (faster)
for result in process.map(items, order_outputs=False):
    handle(result)

# Collect errors instead of raising
results = list(process.map(items, return_exceptions=True))
for r in results:
    if isinstance(r, Exception):
        print(f"Error: {r}")
```

Pass `wrap_returned_exceptions=False` alongside `return_exceptions=True` to receive the original exception objects rather than Modal's wrapped variants.
### `.starmap()` — Multi-Argument
```python
@app.function()
def add(x, y):
    return x + y

results = list(add.starmap([(1, 2), (3, 4), (5, 6)]))
# [3, 7, 11]
```
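The argument-unpacking behavior matches `itertools.starmap` from the standard library, which works as a mental model:

```python
from itertools import starmap

def add(x, y):
    return x + y

# Each tuple is unpacked into positional arguments, as .starmap() does
results = list(starmap(add, [(1, 2), (3, 4), (5, 6)]))
print(results)  # [3, 7, 11]
```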
### `.spawn()` — Fire-and-Forget
```python
# Returns immediately
call = process.spawn(large_data)
# Check status or get result later
result = call.get()
```
Up to 1 million `.spawn()` calls can be pending at once.
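The `.spawn()` / `.get()` pair is the classic future pattern; a stdlib analogue (not Modal code) for readers unfamiliar with it:

```python
from concurrent.futures import ThreadPoolExecutor

def process(data):
    return len(data)

with ThreadPoolExecutor() as pool:
    call = pool.submit(process, "large_data")  # returns immediately, like .spawn()
    # ... do other work here ...
    result = call.result()                     # block for the value, like .get()

print(result)  # 10
```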
## Concurrent Inputs
By default, each container handles one input at a time. Use `@modal.concurrent` to handle multiple:
```python
@app.function(gpu="L40S")
@modal.concurrent(max_inputs=10)
async def predict(text: str):
    result = await model.predict_async(text)
    return result
```
This is ideal for I/O-bound workloads (database queries, external API requests, remote Modal Function calls) and for async inference where a single GPU can serve multiple requests.
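The effect of `max_inputs` is akin to an `asyncio.Semaphore` guarding each container: up to N inputs run at once and the rest wait. A self-contained sketch (plain asyncio, hypothetical numbers) that records the peak concurrency:

```python
import asyncio

MAX_INPUTS = 10  # analogous to @modal.concurrent(max_inputs=10)
active = 0
peak = 0

async def predict(sem: asyncio.Semaphore, text: str) -> str:
    global active, peak
    async with sem:                # at most MAX_INPUTS inputs in flight
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for awaiting model inference
        active -= 1
    return text.upper()

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_INPUTS)
    return await asyncio.gather(*(predict(sem, f"t{i}") for i in range(50)))

results = asyncio.run(main())
print(peak)  # 10: concurrency never exceeds MAX_INPUTS
```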
### With Web Endpoints
```python
@app.function(gpu="L40S")
@modal.concurrent(max_inputs=20)
@modal.asgi_app()
def web_service():
    return fastapi_app
```
## Dynamic Batching
Collect inputs into batches for efficient GPU utilization:
```python
@app.function(gpu="L40S")
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(texts: list[str]):
    # Called with up to 32 texts at once
    embeddings = model.encode(texts)
    return list(embeddings)
```
- `max_batch_size` — Maximum inputs per batch
- `wait_ms` — How long to wait for more inputs before processing
- The function receives a list and must return a list of the same length
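The batching policy above can be pictured as: close a batch as soon as it reaches `max_batch_size`, or once `wait_ms` has elapsed since the batch's first input. A toy, synchronous simulation of that rule (timestamps in milliseconds; this is an illustration, not Modal's implementation):

```python
def form_batches(arrivals, max_batch_size=32, wait_ms=100):
    """arrivals: list of (timestamp_ms, input). Returns a list of batches."""
    batches, current, opened_at = [], [], None
    for t, item in arrivals:
        # Flush if the open batch has already waited longer than wait_ms
        if current and t - opened_at > wait_ms:
            batches.append(current)
            current, opened_at = [], None
        if not current:
            opened_at = t
        current.append(item)
        # Flush as soon as the batch is full
        if len(current) == max_batch_size:
            batches.append(current)
            current, opened_at = [], None
    if current:
        batches.append(current)
    return batches

# Three inputs arrive quickly; a fourth arrives 200 ms later
print(form_batches([(0, "a"), (10, "b"), (20, "c"), (220, "d")]))
# [['a', 'b', 'c'], ['d']]
```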
## Dynamic Autoscaler Updates

Update autoscaler settings at runtime, without redeploying:

```python
f = modal.Function.from_name("my-app", "f")
f.update_autoscaler(max_containers=100)
```

Settings revert to the decorator configuration on the next deploy. Until then, further updates are merged with earlier ones:

```python
f.update_autoscaler(min_containers=2, max_containers=10)
f.update_autoscaler(min_containers=4)  # max_containers=10 still in effect
```

### Time-Based Scaling

Adjust the warm pool based on time of day:

```python
@app.function()
def inference_server():
    ...

@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
def increase_warm_pool():
    inference_server.update_autoscaler(min_containers=4)

@app.function(schedule=modal.Cron("0 22 * * *", timezone="America/New_York"))
def decrease_warm_pool():
    inference_server.update_autoscaler(min_containers=0)
```

### For Classes

Update the autoscaler for a specific parametrized instance:

```python
MyClass = modal.Cls.from_name("my-app", "MyClass")
obj = MyClass(model_version="3.5")
obj.update_autoscaler(buffer_containers=2)  # type: ignore
```
## Concurrency Mechanisms

**Synchronous functions**: concurrent inputs run on separate threads, so the code must be thread-safe:

```python
@app.function()
@modal.concurrent(max_inputs=10)
def sync_function():
    time.sleep(1)  # Must be thread-safe
```

**Async functions**: concurrent inputs run as separate asyncio tasks, so the code must not block the event loop:

```python
@app.function()
@modal.concurrent(max_inputs=10)
async def async_function():
    await asyncio.sleep(1)  # Must not block the event loop
```

### Target vs Max Inputs

```python
@app.function()
@modal.concurrent(
    max_inputs=120,     # Hard limit
    target_inputs=100,  # Autoscaler target
)
def my_function(input: str):
    # Allow ~20% burst above target
    ...
```

The autoscaler aims to keep each container at `target_inputs`, but containers can burst up to `max_inputs` while scale-up catches up.
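With `target_inputs`, the container count the autoscaler steers toward is roughly the ceiling of in-flight inputs divided by the target; a quick illustration of that arithmetic (a simplification of the real behavior):

```python
import math

def steady_state_containers(in_flight: int, target_inputs: int) -> int:
    # Roughly how many containers the autoscaler steers toward at a given load
    return math.ceil(in_flight / target_inputs)

print(steady_state_containers(1000, 100))  # 10
print(steady_state_containers(1050, 100))  # 11
```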
## Limits

Modal enforces per-function scaling limits:

| Resource | Limit |
|----------|-------|
| Pending inputs (not yet assigned to a container) | 2,000 |
| Total inputs (running + pending) | 25,000 |
| Pending `.spawn()` inputs | 1,000,000 |
| Concurrent inputs per `.map()` invocation | 1,000 |
| Web endpoint rate limit | 200 req/s |

Exceeding these limits returns a `Resource Exhausted` error; retry later, or implement retry logic for resilience.
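Since `Resource Exhausted` is retryable, submissions can be wrapped in exponential backoff. A generic sketch (the `ResourceExhausted` class and `submit` callable are placeholders, not Modal APIs):

```python
import random
import time

class ResourceExhausted(Exception):
    """Placeholder for the error raised when input limits are exceeded."""

def submit_with_backoff(submit, *args, retries=5, base_delay=1.0):
    """Call submit(*args), retrying with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return submit(*args)
        except ResourceExhausted:
            if attempt == retries - 1:
                raise
            # Backoff roughly 1s, 2s, 4s, ... scaled by random jitter
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```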
## Async Usage
Use Modal's async APIs (the `.aio` call variants) for arbitrary parallel execution patterns:
```python
@app.function()
async def async_task(x):
    await asyncio.sleep(1)
    return x * 2

@app.local_entrypoint()
async def main():
    tasks = [async_task.remote.aio(i) for i in range(100)]
    results = await asyncio.gather(*tasks)
```
## Common Gotchas
**Incorrect**: Using Python's built-in `map` (runs locally and sequentially)
```python
# DON'T DO THIS
results = map(evaluate_model, inputs)
```
**Incorrect**: Calling the function first, then `.map()` on its result
```python
# DON'T DO THIS
results = evaluate_model(inputs).map()
```
**Correct**: Call .map() on Modal function object
```python
# DO THIS
results = evaluate_model.map(inputs)
```