mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
5.4 KiB
5.4 KiB
Modal Web Endpoints
Table of Contents
- Simple Endpoints
- Deployment
- ASGI Apps
- WSGI Apps
- Custom Web Servers
- WebSockets
- Authentication
- Streaming
- Concurrency
- Limits
Simple Endpoints
The easiest way to create a web endpoint:
import modal
app = modal.App("api-service")
@app.function()
@modal.fastapi_endpoint()
def hello(name: str = "World"):
return {"message": f"Hello, {name}!"}
POST Endpoints
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(data: dict):
result = model.predict(data["text"])
return {"prediction": result}
Query Parameters
Parameters are automatically parsed from query strings:
@app.function()
@modal.fastapi_endpoint()
def search(query: str, limit: int = 10):
return {"results": do_search(query, limit)}
Access via: https://your-app.modal.run?query=hello&limit=5
Deployment
Development Mode
modal serve script.py
- Creates a temporary public URL
- Hot-reloads on file changes
- Perfect for development and testing
- URL expires when you stop the command
Production Deployment
modal deploy script.py
- Creates a permanent URL
- Runs persistently in the cloud
- Autoscales based on traffic
- URL format:
https://<workspace>--<app-name>-<function-name>.modal.run
ASGI Apps (FastAPI, Starlette, FastHTML)
For full framework applications, use @modal.asgi_app:
from fastapi import FastAPI
web_app = FastAPI()
@web_app.get("/")
async def root():
return {"status": "ok"}
@web_app.post("/predict")
async def predict(request: dict):
return {"result": model.run(request["input"])}
@app.function(image=image, gpu="L40S")
@modal.asgi_app()
def fastapi_app():
return web_app
With Class Lifecycle
@app.cls(gpu="L40S", image=image)
class InferenceService:
@modal.enter()
def load_model(self):
self.model = load_model()
@modal.asgi_app()
def serve(self):
from fastapi import FastAPI
app = FastAPI()
@app.post("/generate")
async def generate(request: dict):
return self.model.generate(request["prompt"])
return app
WSGI Apps (Flask, Django)
from flask import Flask
flask_app = Flask(__name__)
@flask_app.route("/")
def index():
return {"status": "ok"}
@app.function(image=image)
@modal.wsgi_app()
def flask_server():
return flask_app
WSGI is synchronous — concurrent inputs run on separate threads.
Custom Web Servers
For non-standard web frameworks (aiohttp, Tornado, TGI):
@app.function(image=image, gpu="H100")
@modal.web_server(port=8000)
def serve():
import subprocess
subprocess.Popen([
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", "meta-llama/Llama-3-70B",
"--host", "0.0.0.0", # Must bind to 0.0.0.0, not localhost
"--port", "8000",
])
The application must bind to 0.0.0.0 (not 127.0.0.1).
WebSockets
Supported with @modal.asgi_app, @modal.wsgi_app, and @modal.web_server:
from fastapi import FastAPI, WebSocket
web_app = FastAPI()
@web_app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
await websocket.accept()
while True:
data = await websocket.receive_text()
result = process(data)
await websocket.send_text(result)
@app.function()
@modal.asgi_app()
def ws_app():
return web_app
- Full WebSocket protocol (RFC 6455)
- Messages up to 2 MiB each
- No RFC 8441 or RFC 7692 support yet
Authentication
Proxy Auth Tokens (Built-in)
Modal provides first-class endpoint protection via proxy auth tokens:
@app.function()
@modal.fastapi_endpoint()
def protected(text: str):
return {"result": process(text)}
Clients include Modal-Key and Modal-Secret headers to authenticate.
Custom Bearer Tokens
from fastapi import Header, HTTPException
@app.function(secrets=[modal.Secret.from_name("auth-secret")])
@modal.fastapi_endpoint(method="POST")
def secure_predict(data: dict, authorization: str = Header(None)):
import os
expected = os.environ["AUTH_TOKEN"]
if authorization != f"Bearer {expected}":
raise HTTPException(status_code=401, detail="Unauthorized")
return {"result": model.predict(data["text"])}
Client IP Access
Available for geolocation, rate limiting, and access control.
Streaming
Server-Sent Events (SSE)
from fastapi.responses import StreamingResponse
@app.function(gpu="H100")
@modal.fastapi_endpoint()
def stream_generate(prompt: str):
def generate():
for token in model.stream(prompt):
yield f"data: {token}\n\n"
return StreamingResponse(generate(), media_type="text/event-stream")
Concurrency
Handle multiple requests per container using @modal.concurrent:
@app.function(gpu="L40S")
@modal.concurrent(max_inputs=10)
@modal.fastapi_endpoint(method="POST")
async def batch_predict(data: dict):
return {"result": await model.predict_async(data["text"])}
Limits
- Request body: up to 4 GiB
- Response body: unlimited
- Rate limit: 200 requests/second (5-second burst for new accounts)
- Cold starts occur when no containers are active (use
min_containersto avoid)