mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add comprehensive TimesFM forecasting skill with mandatory system preflight checks (RAM/GPU/disk), end-to-end CSV forecasting script, full API reference, data preparation guide, and hardware requirements documentation. Supports TimesFM 2.5 (200M), 2.0 (500M), and legacy v1.0 with automatic batch size recommendations based on hardware.
# Data Preparation for TimesFM

## Input Format

TimesFM accepts a **list of 1-D numpy arrays**. Each array represents one
univariate time series.

```python
import numpy as np

inputs = [
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),                   # Series 1
    np.array([10.0, 20.0, 15.0, 25.0]),                    # Series 2 (different length)
    np.array([100.0, 110.0, 105.0, 115.0, 120.0, 130.0]),  # Series 3
]
```

### Key Properties

- **Variable lengths**: Series in the same batch can have different lengths
- **Float values**: Use `np.float32` or `np.float64`
- **1-D only**: Each array must be 1-dimensional (not 2-D matrix rows)
- **NaN handling**: Leading NaNs are stripped; internal NaNs are linearly interpolated

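The properties above can be checked up front; `validate_inputs` below is a hypothetical helper for this guide, not part of the TimesFM API:

```python
import numpy as np

def validate_inputs(inputs):
    """Raise if any series violates the input requirements above."""
    for i, arr in enumerate(inputs):
        if arr.ndim != 1:
            raise ValueError(f"Series {i} is {arr.ndim}-D; each series must be 1-D")
        if not np.issubdtype(arr.dtype, np.floating):
            raise TypeError(f"Series {i} has dtype {arr.dtype}; use np.float32/np.float64")
    return inputs

inputs = validate_inputs([
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([10.0, 20.0, 15.0, 25.0]),
])
```

Running it before a forecast call turns a confusing model-side error into an immediate, named failure.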
## Loading from Common Formats

### CSV — Single Series (Long Format)

```python
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", parse_dates=["date"])
values = df["value"].values.astype(np.float32)
inputs = [values]
```

### CSV — Multiple Series (Wide Format)

```python
df = pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
inputs = [df[col].dropna().values.astype(np.float32) for col in df.columns]
```

### CSV — Long Format with ID Column

```python
df = pd.read_csv("data.csv", parse_dates=["date"])
inputs = []
for series_id, group in df.groupby("series_id"):
    values = group.sort_values("date")["value"].values.astype(np.float32)
    inputs.append(values)
```

### Pandas DataFrame

```python
# Single column
inputs = [df["temperature"].values.astype(np.float32)]

# Multiple columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### Numpy Arrays

```python
# 2-D array (rows = series, cols = time steps)
data = np.load("timeseries.npy")  # shape (N, T)
inputs = [data[i] for i in range(data.shape[0])]

# Or from 1-D
inputs = [np.sin(np.linspace(0, 10, 200))]
```

### Excel

```python
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### Parquet

```python
df = pd.read_parquet("data.parquet")
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### JSON

```python
import json

with open("data.json") as f:
    data = json.load(f)

# Assumes {"series_name": [values...], ...}
inputs = [np.array(values, dtype=np.float32) for values in data.values()]
```

## NaN Handling

TimesFM handles NaN values automatically:

### Leading NaNs

Stripped before feeding to the model:

```python
# Input:  [NaN, NaN, 1.0, 2.0, 3.0]
# Actual: [1.0, 2.0, 3.0]
```

### Internal NaNs

Linearly interpolated:

```python
# Input:  [1.0, NaN, 3.0, NaN, NaN, 6.0]
# Actual: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

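If you prefer to interpolate yourself before calling the model (for example, to log how many points were filled), the same linear fill can be reproduced with `np.interp`; this is a sketch of equivalent preprocessing, not TimesFM's internal code:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
nan_mask = np.isnan(x)
# Fill NaN positions by linear interpolation from the known points
x[nan_mask] = np.interp(
    np.flatnonzero(nan_mask),   # indices to fill
    np.flatnonzero(~nan_mask),  # indices of known values
    x[~nan_mask],               # known values
)
# x is now [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```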
### Trailing NaNs

**Not handled** — drop them before passing to the model:

```python
values = df["value"].values.astype(np.float32)
# Remove trailing NaNs
while len(values) > 0 and np.isnan(values[-1]):
    values = values[:-1]
inputs = [values]
```

### Best Practice

```python
def clean_series(arr: np.ndarray) -> np.ndarray:
    """Clean a time series for TimesFM input."""
    arr = np.array(arr, dtype=np.float32)  # copy so the caller's array is untouched
    # Replace inf with NaN first (internal NaNs are interpolated).
    # Doing this after the trailing-NaN strip would leave trailing NaNs behind.
    arr[np.isinf(arr)] = np.nan
    # Remove trailing NaNs
    while len(arr) > 0 and np.isnan(arr[-1]):
        arr = arr[:-1]
    return arr

inputs = [clean_series(df[col].values) for col in cols]
```

## Context Length Considerations

| Context Length | Use Case                      | Notes                 |
| -------------- | ----------------------------- | --------------------- |
| 64–256         | Quick prototyping             | Minimal context, fast |
| 256–512        | Daily data, ~1 year           | Good balance          |
| 512–1024       | Daily data, ~2–3 years        | Standard production   |
| 1024–4096      | Hourly data, weekly patterns  | More context = better |
| 4096–16384     | High-frequency, long patterns | TimesFM 2.5 maximum   |

**Rule of thumb**: Provide at least 3–5 full cycles of the dominant pattern
(e.g., for weekly seasonality with daily data, provide at least 21–35 days).

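When a series is longer than the context budget you chose from the table, only the most recent points matter; a minimal truncation sketch (the 1024 default is this guide's choice, not a TimesFM constant):

```python
import numpy as np

def truncate_context(arr: np.ndarray, max_context: int = 1024) -> np.ndarray:
    """Keep only the most recent max_context points (shorter series pass through)."""
    return arr[-max_context:]

series = np.arange(5000, dtype=np.float32)
truncated = truncate_context(series)
# len(truncated) == 1024, keeping the latest values
```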
## Covariates (XReg)

TimesFM 2.5 supports exogenous variables through the `forecast_with_covariates()` API.

### Types of Covariates

| Type                    | Description                       | Example                             |
| ----------------------- | --------------------------------- | ----------------------------------- |
| **Dynamic numerical**   | Time-varying numeric features     | Temperature, price, promotion spend |
| **Dynamic categorical** | Time-varying categorical features | Day of week, holiday flag           |
| **Static categorical**  | Fixed per-series features         | Store ID, region, product category  |

### Preparing Covariates

Each dynamic covariate must have length `context + horizon` for each series
(static covariates are one value per series):

```python
import numpy as np

context_len = 100  # length of historical data
horizon = 24       # forecast horizon
total_len = context_len + horizon

# Dynamic numerical: temperature forecast for each series
temp = [
    np.random.randn(total_len).astype(np.float32),  # Series 1
    np.random.randn(total_len).astype(np.float32),  # Series 2
]

# Dynamic categorical: day of week (0-6) for each series
dow = [
    np.tile(np.arange(7), total_len // 7 + 1)[:total_len],  # Series 1
    np.tile(np.arange(7), total_len // 7 + 1)[:total_len],  # Series 2
]

# Static categorical: one label per series
regions = ["east", "west"]

# Forecast with covariates
point, quantiles = model.forecast_with_covariates(
    inputs=[values1, values2],
    dynamic_numerical_covariates={"temperature": temp},
    dynamic_categorical_covariates={"day_of_week": dow},
    static_categorical_covariates={"region": regions},
    xreg_mode="xreg + timesfm",
)
```

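A length mismatch is the most common covariate error; a small pre-flight check (a hypothetical helper for this guide, not part of the TimesFM API) catches it before the forecast call:

```python
import numpy as np

def check_covariate_lengths(inputs, dynamic_covariates, horizon):
    """Verify every dynamic covariate spans context + horizon for its series."""
    for name, series_list in dynamic_covariates.items():
        for i, (ctx, cov) in enumerate(zip(inputs, series_list)):
            expected = len(ctx) + horizon
            if len(cov) != expected:
                raise ValueError(
                    f"Covariate {name!r}, series {i}: "
                    f"length {len(cov)}, expected {expected}"
                )

inputs = [np.zeros(100, dtype=np.float32)]
temp = [np.zeros(124, dtype=np.float32)]
check_covariate_lengths(inputs, {"temperature": temp}, horizon=24)  # passes silently
```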
### XReg Modes

| Mode               | Description                                                     |
| ------------------ | --------------------------------------------------------------- |
| `"xreg + timesfm"` | Covariates processed first, then combined with TimesFM forecast |
| `"timesfm + xreg"` | TimesFM forecast first, then adjusted by covariates             |

## Common Data Issues

### Issue: Series too short

TimesFM accepts series as short as a single point, but forecasts improve with
more context, so filter out series too short to carry a usable signal:

```python
MIN_LENGTH = 32  # Practical minimum for meaningful forecasts

inputs = [
    arr for arr in raw_inputs
    if len(arr[~np.isnan(arr)]) >= MIN_LENGTH
]
```

### Issue: Series with constant values

Constant series may produce NaN or zero-width prediction intervals:

```python
for i, arr in enumerate(inputs):
    if np.std(arr[~np.isnan(arr)]) < 1e-10:
        print(f"⚠️ Series {i} is constant — forecast will be flat")
```

### Issue: Extreme outliers

Large outliers can destabilize forecasts even with normalization:

```python
def clip_outliers(arr: np.ndarray, n_sigma: float = 5.0) -> np.ndarray:
    """Clip values beyond n_sigma standard deviations."""
    mu = np.nanmean(arr)
    sigma = np.nanstd(arr)
    if sigma > 0:
        arr = np.clip(arr, mu - n_sigma * sigma, mu + n_sigma * sigma)
    return arr
```

### Issue: Mixed frequencies in batch

TimesFM handles each series independently, so you can mix frequencies:

```python
inputs = [
    daily_sales,     # 365 points
    weekly_revenue,  # 52 points
    monthly_users,   # 24 points
]
# All forecasted in one batch — TimesFM handles different lengths
point, q = model.forecast(horizon=12, inputs=inputs)
```

However, the `horizon` is shared. If you need different horizons per series,
forecast in separate calls.
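One way to organize those separate calls is to group series by their desired horizon, so each group still forecasts as a batch; a sketch in which the per-series horizons and the commented loop are illustrative, not a TimesFM API:

```python
from collections import defaultdict

# Desired horizon per series index (illustrative values)
horizons = {0: 30, 1: 12, 2: 12}

# Group series indices that share a horizon
by_horizon = defaultdict(list)
for idx, h in horizons.items():
    by_horizon[h].append(idx)
# by_horizon now holds {30: [0], 12: [1, 2]}

# Then one batched call per group:
# for h, idxs in by_horizon.items():
#     point, q = model.forecast(horizon=h, inputs=[inputs[i] for i in idxs])
```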