Clayton Young 98670bcf47 feat(skill): add timesfm-forecasting skill for time series forecasting
Add comprehensive TimesFM forecasting skill with mandatory system
preflight checks (RAM/GPU/disk), end-to-end CSV forecasting script,
full API reference, data preparation guide, and hardware requirements
documentation. Supports TimesFM 2.5 (200M), 2.0 (500M), and legacy
v1.0 with automatic batch size recommendations based on hardware.
2026-02-23 07:43:04 -05:00

# Data Preparation for TimesFM
## Input Format
TimesFM accepts a **list of 1-D numpy arrays**. Each array represents one
univariate time series.
```python
import numpy as np

inputs = [
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),                    # Series 1
    np.array([10.0, 20.0, 15.0, 25.0]),                     # Series 2 (different length)
    np.array([100.0, 110.0, 105.0, 115.0, 120.0, 130.0]),   # Series 3
]
```
### Key Properties
- **Variable lengths**: Series in the same batch can have different lengths
- **Float values**: Use `np.float32` or `np.float64`
- **1-D only**: Each array must be 1-dimensional (not 2-D matrix rows)
- **NaN handling**: Leading NaNs are stripped; internal NaNs are linearly interpolated
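
The properties above can be checked before a batch is sent to the model. The following is a minimal sketch; `validate_inputs` is a hypothetical helper written for this guide, not part of the TimesFM API.

```python
import numpy as np


def validate_inputs(inputs):
    """Check that `inputs` is a list of non-empty 1-D float arrays.

    Hypothetical helper for illustration; not part of the TimesFM API.
    """
    if not isinstance(inputs, (list, tuple)):
        raise TypeError("inputs must be a list of arrays, one per series")
    for i, arr in enumerate(inputs):
        if not isinstance(arr, np.ndarray):
            raise TypeError(f"series {i} is {type(arr).__name__}, expected np.ndarray")
        if arr.ndim != 1:
            raise ValueError(f"series {i} has shape {arr.shape}, expected 1-D")
        if arr.size == 0:
            raise ValueError(f"series {i} is empty")
        if not np.issubdtype(arr.dtype, np.floating):
            raise TypeError(f"series {i} has dtype {arr.dtype}, expected a float dtype")
    return inputs


# A 2-D matrix must be split into rows before it passes the check.
matrix = np.ones((3, 10), dtype=np.float32)
validate_inputs([row for row in matrix])
```

Raising early gives a clearer error than a shape mismatch surfacing inside the model call.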
## Loading from Common Formats
### CSV — Single Series (Long Format)
```python
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", parse_dates=["date"])
values = df["value"].values.astype(np.float32)
inputs = [values]
```
### CSV — Multiple Series (Wide Format)
```python
df = pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
inputs = [df[col].dropna().values.astype(np.float32) for col in df.columns]
```
### CSV — Long Format with ID Column
```python
df = pd.read_csv("data.csv", parse_dates=["date"])
inputs = []
for series_id, group in df.groupby("series_id"):
    values = group.sort_values("date")["value"].values.astype(np.float32)
    inputs.append(values)
```
### Pandas DataFrame
```python
# Single column
inputs = [df["temperature"].values.astype(np.float32)]
# Multiple columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```
### Numpy Arrays
```python
# 2-D array (rows = series, cols = time steps)
data = np.load("timeseries.npy") # shape (N, T)
inputs = [data[i] for i in range(data.shape[0])]
# Or from 1-D
inputs = [np.sin(np.linspace(0, 10, 200))]
```
### Excel
```python
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
inputs = [df[col].dropna().values.astype(np.float32) for col in df.select_dtypes(include=[np.number]).columns]
```
### Parquet
```python
df = pd.read_parquet("data.parquet")
inputs = [df[col].dropna().values.astype(np.float32) for col in df.select_dtypes(include=[np.number]).columns]
```
### JSON
```python
import json
with open("data.json") as f:
    data = json.load(f)
# Assumes {"series_name": [values...], ...}
inputs = [np.array(values, dtype=np.float32) for values in data.values()]
```
## NaN Handling
TimesFM handles leading and internal NaN values automatically, but not trailing ones:
### Leading NaNs
Stripped before feeding to the model:
```python
# Input: [NaN, NaN, 1.0, 2.0, 3.0]
# Actual: [1.0, 2.0, 3.0]
```
### Internal NaNs
Linearly interpolated:
```python
# Input: [1.0, NaN, 3.0, NaN, NaN, 6.0]
# Actual: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```
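
To preview what the model will effectively see, the two behaviors described above can be reproduced with plain NumPy. This is a sketch of the described behavior only, not the library's own preprocessing code; `fill_like_timesfm` is a hypothetical name.

```python
import numpy as np


def fill_like_timesfm(arr):
    """Strip leading NaNs, then linearly interpolate internal NaNs.

    Hypothetical helper mirroring the behavior described in this section.
    """
    arr = np.array(arr, dtype=np.float64)  # copy, so the input is untouched
    valid = ~np.isnan(arr)
    if not valid.any():
        return arr[:0]  # nothing is left once all NaNs are stripped
    arr = arr[np.argmax(valid):]  # drop everything before the first valid value
    nan_mask = np.isnan(arr)
    idx = np.arange(len(arr))
    # np.interp fills each gap on a straight line between its valid neighbours.
    # Trailing NaNs would be held at the last valid value, so trim them first.
    arr[nan_mask] = np.interp(idx[nan_mask], idx[~nan_mask], arr[~nan_mask])
    return arr


print(fill_like_timesfm([np.nan, np.nan, 1.0, 2.0, 3.0]))
print(fill_like_timesfm([1.0, np.nan, 3.0, np.nan, np.nan, 6.0]))
```

The two calls reproduce the leading-NaN and internal-NaN examples shown above.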
### Trailing NaNs
**Not handled** — drop them before passing to the model:
```python
values = df["value"].values.astype(np.float32)
# Remove trailing NaNs
while len(values) > 0 and np.isnan(values[-1]):
    values = values[:-1]
inputs = [values]
```
### Best Practice
```python
def clean_series(arr: np.ndarray) -> np.ndarray:
    """Clean a time series for TimesFM input."""
    # np.array copies, so the caller's data is not modified in place
    arr = np.array(arr, dtype=np.float32)
    # Replace inf with NaN first, so a trailing inf is trimmed below
    # and an internal inf is interpolated like any other gap
    arr[np.isinf(arr)] = np.nan
    # Remove trailing NaNs
    while len(arr) > 0 and np.isnan(arr[-1]):
        arr = arr[:-1]
    return arr


inputs = [clean_series(df[col].values) for col in cols]
```
## Context Length Considerations
| Context Length | Use Case | Notes |
| -------------- | -------- | ----- |
| 64–256 | Quick prototyping | Minimal context, fast |
| 256–512 | Daily data, ~1 year | Good balance |
| 512–1024 | Daily data, ~2-3 years | Standard production |
| 1024–4096 | Hourly data, weekly patterns | More context = better |
| 4096–16384 | High-frequency, long patterns | TimesFM 2.5 maximum |
**Rule of thumb**: Provide at least 3–5 full cycles of the dominant pattern
(e.g., for weekly seasonality with daily data, provide at least 21–35 days).
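
The rule of thumb is simple arithmetic: seasonal period times number of cycles. A small sketch follows, using hypothetical helper names (`min_context_length`, `trim_context`) that are not part of TimesFM:

```python
import numpy as np


def min_context_length(period, cycles=3):
    """Smallest context covering `cycles` full cycles of a seasonal `period`."""
    return period * cycles


def trim_context(arr, max_context):
    """Keep only the most recent `max_context` points of a series."""
    if max_context < 1:
        raise ValueError("max_context must be at least 1")
    return arr[-max_context:]


weekly_on_daily = min_context_length(period=7, cycles=3)   # 21 points
daily_on_hourly = min_context_length(period=24, cycles=5)  # 120 points

series = np.arange(1000, dtype=np.float32)
recent = trim_context(series, max_context=512)  # the last 512 observations
```

Trimming from the end keeps the most recent observations, which are the ones the forecast continues from.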
## Covariates (XReg)
TimesFM 2.5 supports exogenous variables through the `forecast_with_covariates()` API.
### Types of Covariates
| Type | Description | Example |
| ---- | ----------- | ------- |
| **Dynamic numerical** | Time-varying numeric features | Temperature, price, promotion spend |
| **Dynamic categorical** | Time-varying categorical features | Day of week, holiday flag |
| **Static categorical** | Fixed per-series features | Store ID, region, product category |
### Preparing Covariates
Each dynamic covariate must have length `context + horizon` for each series; static covariates take a single value per series:
```python
import numpy as np
context_len = 100 # length of historical data
horizon = 24 # forecast horizon
total_len = context_len + horizon
# Dynamic numerical: temperature forecast for each series
temp = [
np.random.randn(total_len).astype(np.float32), # Series 1
np.random.randn(total_len).astype(np.float32), # Series 2
]
# Dynamic categorical: day of week (0-6) for each series
dow = [
np.tile(np.arange(7), total_len // 7 + 1)[:total_len], # Series 1
np.tile(np.arange(7), total_len // 7 + 1)[:total_len], # Series 2
]
# Static categorical: one label per series
regions = ["east", "west"]
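# Forecast with covariates. Assumes `model` is an already loaded TimesFM model
# and values1, values2 are 1-D arrays of length context_len.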
point, quantiles = model.forecast_with_covariates(
inputs=[values1, values2],
dynamic_numerical_covariates={"temperature": temp},
dynamic_categorical_covariates={"day_of_week": dow},
static_categorical_covariates={"region": regions},
xreg_mode="xreg + timesfm",
)
```
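
Length mismatches are an easy mistake when assembling covariates, so it can help to check them before calling the model. The sketch below assumes the dict-of-lists layout used in the example above; `check_covariates` is a hypothetical helper, not a TimesFM function.

```python
import numpy as np


def check_covariates(inputs, horizon, dynamic=None, static=None):
    """Validate covariate shapes against the series they accompany.

    Hypothetical helper: `dynamic` and `static` are dicts mapping a covariate
    name to a list with one entry per series.
    """
    n_series = len(inputs)
    for name, per_series in (dynamic or {}).items():
        if len(per_series) != n_series:
            raise ValueError(f"{name}: expected {n_series} series, got {len(per_series)}")
        for i, (series, cov) in enumerate(zip(inputs, per_series)):
            expected = len(series) + horizon
            if len(cov) != expected:
                raise ValueError(
                    f"{name}[{i}]: length {len(cov)}, expected {expected} "
                    f"(context {len(series)} + horizon {horizon})"
                )
    for name, per_series in (static or {}).items():
        if len(per_series) != n_series:
            raise ValueError(f"{name}: expected {n_series} values, got {len(per_series)}")


values1 = np.zeros(100, dtype=np.float32)
values2 = np.zeros(100, dtype=np.float32)
temp = [np.zeros(124, dtype=np.float32), np.zeros(124, dtype=np.float32)]
check_covariates(
    [values1, values2],
    horizon=24,
    dynamic={"temperature": temp},
    static={"region": ["east", "west"]},
)
```

The check is per series, so it also works when the series in a batch have different context lengths.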
### XReg Modes
| Mode | Description |
| ---- | ----------- |
| `"xreg + timesfm"` | Covariates processed first, then combined with TimesFM forecast |
| `"timesfm + xreg"` | TimesFM forecast first, then adjusted by covariates |
## Common Data Issues
### Issue: Series too short
TimesFM accepts a series with as little as one data point, but forecasts improve with more context.
```python
MIN_LENGTH = 32 # Practical minimum for meaningful forecasts
inputs = [
arr for arr in raw_inputs
if len(arr[~np.isnan(arr)]) >= MIN_LENGTH
]
```
### Issue: Series with constant values
Constant series may produce NaN or zero-width prediction intervals:
```python
for i, arr in enumerate(inputs):
    if np.std(arr[~np.isnan(arr)]) < 1e-10:
        print(f"⚠️ Series {i} is constant: forecast will be flat")
```
### Issue: Extreme outliers
Large outliers can destabilize forecasts even with normalization:
```python
def clip_outliers(arr: np.ndarray, n_sigma: float = 5.0) -> np.ndarray:
    """Clip values beyond n_sigma standard deviations."""
    mu = np.nanmean(arr)
    sigma = np.nanstd(arr)
    if sigma > 0:
        arr = np.clip(arr, mu - n_sigma * sigma, mu + n_sigma * sigma)
    return arr
```
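
One caveat with mean and standard deviation is that the outliers themselves inflate both, so in a short series a single extreme value may never cross the threshold. A possible alternative, shown here as a sketch and not as this guide's recommended method, swaps in the median and the median absolute deviation (MAD); `clip_outliers_mad` is a hypothetical name.

```python
import numpy as np


def clip_outliers_mad(arr, n_mad=5.0):
    """Clip values further than n_mad scaled MADs from the median."""
    arr = np.asarray(arr, dtype=np.float64)
    median = np.nanmedian(arr)
    # 1.4826 scales the MAD so it matches the standard deviation for normal data
    mad = 1.4826 * np.nanmedian(np.abs(arr - median))
    if mad > 0:
        arr = np.clip(arr, median - n_mad * mad, median + n_mad * mad)
    return arr


data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 1000.0, 10.0, np.nan])
clipped = clip_outliers_mad(data)
```

In this sample the value 1000.0 is pulled down to the upper bound, the ordinary values are unchanged, and the NaN is left in place.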
### Issue: Mixed frequencies in batch
TimesFM handles each series independently, so you can mix frequencies:
```python
inputs = [
daily_sales, # 365 points
weekly_revenue, # 52 points
monthly_users, # 24 points
]
# All forecasted in one batch — TimesFM handles different lengths
point, q = model.forecast(horizon=12, inputs=inputs)
```
However, the `horizon` is shared. If you need different horizons per series,
forecast in separate calls.
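
Separate calls can still be batched by grouping series that share a horizon. The sketch below assumes a callable with the `forecast(horizon=..., inputs=...)` shape used above, returning a `(point, quantiles)` pair with one row per series. `forecast_by_horizon` and `naive_forecast` are hypothetical names, and the latter is only a stand-in so the example runs without a model.

```python
from collections import defaultdict

import numpy as np


def forecast_by_horizon(forecast_fn, inputs, horizons):
    """Run one batched forecast call per distinct horizon.

    Returns the point forecasts as a list in the original series order.
    """
    if len(inputs) != len(horizons):
        raise ValueError("need exactly one horizon per series")
    groups = defaultdict(list)
    for idx, h in enumerate(horizons):
        groups[h].append(idx)

    results = [None] * len(inputs)
    for h, indices in groups.items():
        point, _ = forecast_fn(horizon=h, inputs=[inputs[i] for i in indices])
        for row, idx in enumerate(indices):
            results[idx] = point[row]
    return results


def naive_forecast(horizon, inputs):
    """Stand-in for a real model: repeats each series' last value."""
    point = np.stack([np.full(horizon, s[-1]) for s in inputs])
    return point, None


daily_sales = np.arange(365, dtype=np.float32)
weekly_revenue = np.arange(52, dtype=np.float32)
monthly_users = np.arange(24, dtype=np.float32)
forecasts = forecast_by_horizon(
    naive_forecast,
    [daily_sales, weekly_revenue, monthly_users],
    horizons=[30, 12, 12],
)
```

With a loaded model, `model.forecast` would take the place of `naive_forecast`, assuming it keeps the call shape shown earlier in this section.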