mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add comprehensive TimesFM forecasting skill with mandatory system preflight checks (RAM/GPU/disk), end-to-end CSV forecasting script, full API reference, data preparation guide, and hardware requirements documentation. Supports TimesFM 2.5 (200M), 2.0 (500M), and legacy v1.0 with automatic batch size recommendations based on hardware.
# Data Preparation for TimesFM

## Input Format

TimesFM accepts a **list of 1-D numpy arrays**. Each array represents one
univariate time series.

```python
import numpy as np

inputs = [
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),                   # Series 1
    np.array([10.0, 20.0, 15.0, 25.0]),                    # Series 2 (different length)
    np.array([100.0, 110.0, 105.0, 115.0, 120.0, 130.0]),  # Series 3
]
```

### Key Properties

- **Variable lengths**: Series in the same batch can have different lengths
- **Float values**: Use `np.float32` or `np.float64`
- **1-D only**: Each array must be 1-dimensional (not 2-D matrix rows)
- **NaN handling**: Leading NaNs are stripped; internal NaNs are linearly interpolated

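The properties above can be checked up front; `validate_inputs` below is a hypothetical helper for this guide, not part of the TimesFM API:

```python
import numpy as np

def validate_inputs(inputs):
    """Raise if any series violates the input requirements above."""
    for i, arr in enumerate(inputs):
        if arr.ndim != 1:
            raise ValueError(f"Series {i} is {arr.ndim}-D; each series must be 1-D")
        if not np.issubdtype(arr.dtype, np.floating):
            raise TypeError(f"Series {i} has dtype {arr.dtype}; use np.float32/np.float64")
    return inputs

inputs = validate_inputs([
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([10.0, 20.0, 15.0, 25.0]),
])
```

Running it before a forecast call turns a confusing model-side error into an immediate, named failure.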
## Loading from Common Formats

### CSV — Single Series (Long Format)

```python
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", parse_dates=["date"])
values = df["value"].values.astype(np.float32)
inputs = [values]
```

### CSV — Multiple Series (Wide Format)

```python
df = pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
inputs = [df[col].dropna().values.astype(np.float32) for col in df.columns]
```

### CSV — Long Format with ID Column

```python
df = pd.read_csv("data.csv", parse_dates=["date"])
inputs = []
for series_id, group in df.groupby("series_id"):
    values = group.sort_values("date")["value"].values.astype(np.float32)
    inputs.append(values)
```

### Pandas DataFrame

```python
# Single column
inputs = [df["temperature"].values.astype(np.float32)]

# Multiple columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### Numpy Arrays

```python
# 2-D array (rows = series, cols = time steps)
data = np.load("timeseries.npy")  # shape (N, T)
inputs = [data[i] for i in range(data.shape[0])]

# Or from 1-D
inputs = [np.sin(np.linspace(0, 10, 200))]
```

### Excel

```python
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### Parquet

```python
df = pd.read_parquet("data.parquet")
numeric_cols = df.select_dtypes(include=[np.number]).columns
inputs = [df[col].dropna().values.astype(np.float32) for col in numeric_cols]
```

### JSON

```python
import json

with open("data.json") as f:
    data = json.load(f)

# Assumes {"series_name": [values...], ...}
inputs = [np.array(values, dtype=np.float32) for values in data.values()]
```

## NaN Handling

TimesFM handles NaN values automatically:

### Leading NaNs

Stripped before feeding to the model:

```python
# Input:  [NaN, NaN, 1.0, 2.0, 3.0]
# Actual: [1.0, 2.0, 3.0]
```

### Internal NaNs

Linearly interpolated:

```python
# Input:  [1.0, NaN, 3.0, NaN, NaN, 6.0]
# Actual: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

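If you prefer to interpolate yourself before calling the model (for example, to log how many points were filled), the same linear fill can be reproduced with `np.interp`; this is a sketch of equivalent preprocessing, not TimesFM's internal code:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
nan_mask = np.isnan(x)
# Fill NaN positions by linear interpolation from the known points
x[nan_mask] = np.interp(
    np.flatnonzero(nan_mask),   # indices to fill
    np.flatnonzero(~nan_mask),  # indices of known values
    x[~nan_mask],               # known values
)
# x is now [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```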
### Trailing NaNs

**Not handled** — drop them before passing to the model:

```python
values = df["value"].values.astype(np.float32)
# Remove trailing NaNs
while len(values) > 0 and np.isnan(values[-1]):
    values = values[:-1]
inputs = [values]
```

### Best Practice

```python
def clean_series(arr: np.ndarray) -> np.ndarray:
    """Clean a time series for TimesFM input."""
    arr = np.array(arr, dtype=np.float32)  # copy so the caller's array is untouched
    # Replace inf with NaN first (internal NaNs are interpolated).
    # Doing this after the trailing-NaN strip would leave trailing NaNs behind.
    arr[np.isinf(arr)] = np.nan
    # Remove trailing NaNs
    while len(arr) > 0 and np.isnan(arr[-1]):
        arr = arr[:-1]
    return arr

inputs = [clean_series(df[col].values) for col in cols]
```

## Context Length Considerations

| Context Length | Use Case                      | Notes                 |
| -------------- | ----------------------------- | --------------------- |
| 64–256         | Quick prototyping             | Minimal context, fast |
| 256–512        | Daily data, ~1 year           | Good balance          |
| 512–1024       | Daily data, ~2–3 years        | Standard production   |
| 1024–4096      | Hourly data, weekly patterns  | More context = better |
| 4096–16384     | High-frequency, long patterns | TimesFM 2.5 maximum   |

**Rule of thumb**: Provide at least 3–5 full cycles of the dominant pattern
(e.g., for weekly seasonality with daily data, provide at least 21–35 days).

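When a series is longer than the context budget you chose from the table, only the most recent points matter; a minimal truncation sketch (the 1024 default is this guide's choice, not a TimesFM constant):

```python
import numpy as np

def truncate_context(arr: np.ndarray, max_context: int = 1024) -> np.ndarray:
    """Keep only the most recent max_context points (shorter series pass through)."""
    return arr[-max_context:]

series = np.arange(5000, dtype=np.float32)
truncated = truncate_context(series)
# len(truncated) == 1024, keeping the latest values
```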
## Covariates (XReg)

TimesFM 2.5 supports exogenous variables through the `forecast_with_covariates()` API.

### Types of Covariates

| Type                    | Description                       | Example                             |
| ----------------------- | --------------------------------- | ----------------------------------- |
| **Dynamic numerical**   | Time-varying numeric features     | Temperature, price, promotion spend |
| **Dynamic categorical** | Time-varying categorical features | Day of week, holiday flag           |
| **Static categorical**  | Fixed per-series features         | Store ID, region, product category  |

### Preparing Covariates

Each dynamic covariate must have length `context + horizon` for each series
(static covariates are one value per series):

```python
import numpy as np

context_len = 100  # length of historical data
horizon = 24       # forecast horizon
total_len = context_len + horizon

# Dynamic numerical: temperature forecast for each series
temp = [
    np.random.randn(total_len).astype(np.float32),  # Series 1
    np.random.randn(total_len).astype(np.float32),  # Series 2
]

# Dynamic categorical: day of week (0-6) for each series
dow = [
    np.tile(np.arange(7), total_len // 7 + 1)[:total_len],  # Series 1
    np.tile(np.arange(7), total_len // 7 + 1)[:total_len],  # Series 2
]

# Static categorical: one label per series
regions = ["east", "west"]

# Forecast with covariates
point, quantiles = model.forecast_with_covariates(
    inputs=[values1, values2],
    dynamic_numerical_covariates={"temperature": temp},
    dynamic_categorical_covariates={"day_of_week": dow},
    static_categorical_covariates={"region": regions},
    xreg_mode="xreg + timesfm",
)
```

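A length mismatch is the most common covariate error; a small pre-flight check (a hypothetical helper for this guide, not part of the TimesFM API) catches it before the forecast call:

```python
import numpy as np

def check_covariate_lengths(inputs, dynamic_covariates, horizon):
    """Verify every dynamic covariate spans context + horizon for its series."""
    for name, series_list in dynamic_covariates.items():
        for i, (ctx, cov) in enumerate(zip(inputs, series_list)):
            expected = len(ctx) + horizon
            if len(cov) != expected:
                raise ValueError(
                    f"Covariate {name!r}, series {i}: "
                    f"length {len(cov)}, expected {expected}"
                )

inputs = [np.zeros(100, dtype=np.float32)]
temp = [np.zeros(124, dtype=np.float32)]
check_covariate_lengths(inputs, {"temperature": temp}, horizon=24)  # passes silently
```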
### XReg Modes

| Mode               | Description                                                     |
| ------------------ | --------------------------------------------------------------- |
| `"xreg + timesfm"` | Covariates processed first, then combined with TimesFM forecast |
| `"timesfm + xreg"` | TimesFM forecast first, then adjusted by covariates             |

## Common Data Issues

### Issue: Series too short

TimesFM accepts series as short as a single point, but forecasts improve with
more context, so filter out series too short to carry a usable signal:

```python
MIN_LENGTH = 32  # Practical minimum for meaningful forecasts

inputs = [
    arr for arr in raw_inputs
    if len(arr[~np.isnan(arr)]) >= MIN_LENGTH
]
```

### Issue: Series with constant values

Constant series may produce NaN or zero-width prediction intervals:

```python
for i, arr in enumerate(inputs):
    if np.std(arr[~np.isnan(arr)]) < 1e-10:
        print(f"⚠️ Series {i} is constant — forecast will be flat")
```

### Issue: Extreme outliers

Large outliers can destabilize forecasts even with normalization:

```python
def clip_outliers(arr: np.ndarray, n_sigma: float = 5.0) -> np.ndarray:
    """Clip values beyond n_sigma standard deviations."""
    mu = np.nanmean(arr)
    sigma = np.nanstd(arr)
    if sigma > 0:
        arr = np.clip(arr, mu - n_sigma * sigma, mu + n_sigma * sigma)
    return arr
```

### Issue: Mixed frequencies in batch

TimesFM handles each series independently, so you can mix frequencies:

```python
inputs = [
    daily_sales,     # 365 points
    weekly_revenue,  # 52 points
    monthly_users,   # 24 points
]
# All forecasted in one batch — TimesFM handles different lengths
point, q = model.forecast(horizon=12, inputs=inputs)
```

However, the `horizon` is shared. If you need different horizons per series,
forecast in separate calls.
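One way to organize those separate calls is to group series by their desired horizon, so each group still forecasts as a batch; a sketch in which the per-series horizons and the commented loop are illustrative, not a TimesFM API:

```python
from collections import defaultdict

# Desired horizon per series index (illustrative values)
horizons = {0: 30, 1: 12, 2: 12}

# Group series indices that share a horizon
by_horizon = defaultdict(list)
for idx, h in horizons.items():
    by_horizon[h].append(idx)
# by_horizon now holds {30: [0], 12: [1, 2]}

# Then one batched call per group:
# for h, idxs in by_horizon.items():
#     point, q = model.forecast(horizon=h, inputs=[inputs[i] for i in idxs])
```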