diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index d725c37..4c207dc 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -7,7 +7,7 @@ }, "metadata": { "description": "Claude scientific skills from K-Dense Inc", - "version": "1.22.0" + "version": "1.23.0" }, "plugins": [ { @@ -51,6 +51,7 @@ "./scientific-packages/scikit-bio", "./scientific-packages/scikit-learn", "./scientific-packages/seaborn", + "./scientific-packages/statsmodels", "./scientific-packages/torch_geometric", "./scientific-packages/transformers", "./scientific-packages/umap-learn", diff --git a/README.md b/README.md index 266cb9b..df8716c 100644 --- a/README.md +++ b/README.md @@ -76,6 +76,7 @@ After installing the plugin, you can use the skill by just mentioning it. Additi - **PyMOO** - Multi-objective optimization with evolutionary algorithms - **PyTorch Lightning** - Structured PyTorch training with automatic optimization - **scikit-learn** - Machine learning algorithms, preprocessing, and model selection +- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics) - **Torch Geometric** - Graph Neural Networks for molecular and geometric data - **Transformers** - Hugging Face transformers for NLU, image classification, and generation - **UMAP-learn** - Dimensionality reduction and manifold learning diff --git a/scientific-packages/statsmodels/SKILL.md b/scientific-packages/statsmodels/SKILL.md new file mode 100644 index 0000000..7dd21cc --- /dev/null +++ b/scientific-packages/statsmodels/SKILL.md @@ -0,0 +1,608 @@ +--- +name: statsmodels +description: Comprehensive statistical modeling and econometric analysis toolkit for Python. This skill should be used when you need to fit statistical models, perform hypothesis testing, conduct econometric analysis, or analyze time series data. Specifically, use this skill for: (1) Linear regression analysis including OLS, WLS, GLS, quantile regression, and mixed effects models for continuous outcomes with comprehensive diagnostics, robust standard errors, and influence analysis; (2) Generalized linear models (GLM) for non-normal outcomes including logistic regression for binary data, Poisson/Negative Binomial for count data, and Gamma regression for skewed continuous data; (3) Discrete choice models including logit/probit for binary outcomes, multinomial logit for categorical outcomes, and zero-inflated models for count data with excess zeros; (4) Time series analysis including ARIMA/SARIMAX for univariate forecasting, VAR/VARMAX for multivariate analysis, state space models, and exponential smoothing with stationarity testing and residual diagnostics; (5) Statistical testing and diagnostics including tests for heteroskedasticity, autocorrelation, normality, multicollinearity, influence detection, ANOVA, and hypothesis testing with appropriate corrections; (6) Model comparison and selection using AIC/BIC, likelihood ratio tests, and cross-validation; (7) Causal inference with instrumental variables, difference-in-differences, and regression discontinuity designs. This skill is essential when you need rigorous statistical inference, publication-ready results, detailed model diagnostics, or when working with econometric, biomedical, or social science data requiring proper statistical methodology. Always use this skill instead of basic regression when you need confidence intervals, p-values, diagnostic tests, or when assumptions need to be validated. +--- + +# Statsmodels: Statistical Modeling and Econometrics + +## Overview + +Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. This skill provides comprehensive guidance for conducting rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses. + +## When to Use This Skill + +Activate this skill when: +- Fitting regression models (OLS, WLS, GLS, quantile regression) +- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.) +- Analyzing discrete outcomes (binary, multinomial, count, ordinal) +- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting) +- Running statistical tests and diagnostics +- Testing model assumptions (heteroskedasticity, autocorrelation, normality) +- Detecting outliers and influential observations +- Comparing models (AIC/BIC, likelihood ratio tests) +- Estimating causal effects +- Producing publication-ready statistical tables and inference + +## Quick Start Guide + +### Linear Regression (OLS) + +```python +import statsmodels.api as sm +import numpy as np +import pandas as pd + +# Prepare data - ALWAYS add constant for intercept +X = sm.add_constant(X_data) + +# Fit OLS model +model = sm.OLS(y, X) +results = model.fit() + +# View comprehensive results +print(results.summary()) + +# Key results +print(f"R-squared: {results.rsquared:.4f}") +print(f"Coefficients:\\n{results.params}") +print(f"P-values:\\n{results.pvalues}") + +# Predictions with confidence intervals +predictions = results.get_prediction(X_new) +pred_summary = predictions.summary_frame() +print(pred_summary) # includes mean, CI, prediction intervals + +# Diagnostics +from statsmodels.stats.diagnostic import het_breuschpagan +bp_test = het_breuschpagan(results.resid, X) +print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}") + +# Visualize residuals +import matplotlib.pyplot as plt +plt.scatter(results.fittedvalues, results.resid) +plt.axhline(y=0, color='r', linestyle='--') +plt.xlabel('Fitted values') +plt.ylabel('Residuals') +plt.show() +``` + +### Logistic Regression (Binary Outcomes) + +```python +from statsmodels.discrete.discrete_model import Logit + +# Add constant +X = sm.add_constant(X_data) + +# Fit logit model +model = Logit(y_binary, X) +results = model.fit() + +print(results.summary()) + +# Odds ratios +odds_ratios = np.exp(results.params) +print("Odds ratios:\\n", odds_ratios) + +# Predicted probabilities +probs = results.predict(X) + +# Binary predictions (0.5 threshold) +predictions = (probs > 0.5).astype(int) + +# Model evaluation +from sklearn.metrics import classification_report, roc_auc_score + +print(classification_report(y_binary, predictions)) +print(f"AUC: {roc_auc_score(y_binary, probs):.4f}") + +# Marginal effects +marginal = results.get_margeff() +print(marginal.summary()) +``` + +### Time Series (ARIMA) + +```python +from statsmodels.tsa.arima.model import ARIMA +from statsmodels.graphics.tsaplots import plot_acf, plot_pacf + +# Check stationarity +from statsmodels.tsa.stattools import adfuller + +adf_result = adfuller(y_series) +print(f"ADF p-value: {adf_result[1]:.4f}") + +if adf_result[1] > 0.05: + # Series is non-stationary, difference it + y_diff = y_series.diff().dropna() + +# Plot ACF/PACF to identify p, q +fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8)) +plot_acf(y_diff, lags=40, ax=ax1) +plot_pacf(y_diff, lags=40, ax=ax2) +plt.show() + +# Fit ARIMA(p,d,q) +model = ARIMA(y_series, order=(1, 1, 1)) +results = model.fit() + +print(results.summary()) + +# Forecast +forecast = results.forecast(steps=10) +forecast_obj = results.get_forecast(steps=10) +forecast_df = forecast_obj.summary_frame() + +print(forecast_df) # includes mean and confidence intervals + +# Residual diagnostics +results.plot_diagnostics(figsize=(12, 8)) +plt.show() +``` + +### Generalized Linear Models (GLM) + +```python +import statsmodels.api as sm + +# Poisson regression for count data +X = sm.add_constant(X_data) +model = sm.GLM(y_counts, X, family=sm.families.Poisson()) +results = model.fit() + +print(results.summary()) + +# Rate ratios (for Poisson with log link) +rate_ratios = np.exp(results.params) +print("Rate ratios:\\n", rate_ratios) + +# Check overdispersion +overdispersion = results.pearson_chi2 / results.df_resid +print(f"Overdispersion: {overdispersion:.2f}") + +if overdispersion > 1.5: + # Use Negative Binomial instead + from statsmodels.discrete.count_model import NegativeBinomial + nb_model = NegativeBinomial(y_counts, X) + nb_results = nb_model.fit() + print(nb_results.summary()) +``` + +## Core Statistical Modeling Capabilities + +### 1. Linear Regression Models + +Comprehensive suite of linear models for continuous outcomes with various error structures. + +**Available models:** +- **OLS**: Standard linear regression with i.i.d. errors +- **WLS**: Weighted least squares for heteroskedastic errors +- **GLS**: Generalized least squares for arbitrary covariance structure +- **GLSAR**: GLS with autoregressive errors for time series +- **Quantile Regression**: Conditional quantiles (robust to outliers) +- **Mixed Effects**: Hierarchical/multilevel models with random effects +- **Recursive/Rolling**: Time-varying parameter estimation + +**Key features:** +- Comprehensive diagnostic tests +- Robust standard errors (HC, HAC, cluster-robust) +- Influence statistics (Cook's distance, leverage, DFFITS) +- Hypothesis testing (F-tests, Wald tests) +- Model comparison (AIC, BIC, likelihood ratio tests) +- Prediction with confidence and prediction intervals + +**When to use:** Continuous outcome variable, want inference on coefficients, need diagnostics + +**Reference:** See `references/linear_models.md` for detailed guidance on model selection, diagnostics, and best practices. + +### 2. Generalized Linear Models (GLM) + +Flexible framework extending linear models to non-normal distributions. + +**Distribution families:** +- **Binomial**: Binary outcomes or proportions (logistic regression) +- **Poisson**: Count data +- **Negative Binomial**: Overdispersed counts +- **Gamma**: Positive continuous, right-skewed data +- **Inverse Gaussian**: Positive continuous with specific variance structure +- **Gaussian**: Equivalent to OLS +- **Tweedie**: Flexible family for semi-continuous data + +**Link functions:** +- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power +- Choose based on interpretation needs and model fit + +**Key features:** +- Maximum likelihood estimation via IRLS +- Deviance and Pearson residuals +- Goodness-of-fit statistics +- Pseudo R-squared measures +- Robust standard errors + +**When to use:** Non-normal outcomes, need flexible variance and link specifications + +**Reference:** See `references/glm.md` for family selection, link functions, interpretation, and diagnostics. + +### 3. Discrete Choice Models + +Models for categorical and count outcomes. + +**Binary models:** +- **Logit**: Logistic regression (odds ratios) +- **Probit**: Probit regression (normal distribution) + +**Multinomial models:** +- **MNLogit**: Unordered categories (3+ levels) +- **Conditional Logit**: Choice models with alternative-specific variables +- **Ordered Model**: Ordinal outcomes (ordered categories) + +**Count models:** +- **Poisson**: Standard count model +- **Negative Binomial**: Overdispersed counts +- **Zero-Inflated**: Excess zeros (ZIP, ZINB) +- **Hurdle Models**: Two-stage models for zero-heavy data + +**Key features:** +- Maximum likelihood estimation +- Marginal effects at means or average marginal effects +- Model comparison via AIC/BIC +- Predicted probabilities and classification +- Goodness-of-fit tests + +**When to use:** Binary, categorical, or count outcomes + +**Reference:** See `references/discrete_choice.md` for model selection, interpretation, and evaluation. + +### 4. Time Series Analysis + +Comprehensive time series modeling and forecasting capabilities. + +**Univariate models:** +- **AutoReg (AR)**: Autoregressive models +- **ARIMA**: Autoregressive integrated moving average +- **SARIMAX**: Seasonal ARIMA with exogenous variables +- **Exponential Smoothing**: Simple, Holt, Holt-Winters +- **ETS**: Innovations state space models + +**Multivariate models:** +- **VAR**: Vector autoregression +- **VARMAX**: VAR with MA and exogenous variables +- **Dynamic Factor Models**: Extract common factors +- **VECM**: Vector error correction models (cointegration) + +**Advanced models:** +- **State Space**: Kalman filtering, custom specifications +- **Regime Switching**: Markov switching models +- **ARDL**: Autoregressive distributed lag + +**Key features:** +- ACF/PACF analysis for model identification +- Stationarity tests (ADF, KPSS) +- Forecasting with prediction intervals +- Residual diagnostics (Ljung-Box, heteroskedasticity) +- Granger causality testing +- Impulse response functions (IRF) +- Forecast error variance decomposition (FEVD) + +**When to use:** Time-ordered data, forecasting, understanding temporal dynamics + +**Reference:** See `references/time_series.md` for model selection, diagnostics, and forecasting methods. + +### 5. Statistical Tests and Diagnostics + +Extensive testing and diagnostic capabilities for model validation. + +**Residual diagnostics:** +- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey) +- Heteroskedasticity tests (Breusch-Pagan, White, ARCH) +- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors) +- Specification tests (RESET, Harvey-Collier) + +**Influence and outliers:** +- Leverage (hat values) +- Cook's distance +- DFFITS and DFBETAs +- Studentized residuals +- Influence plots + +**Hypothesis testing:** +- t-tests (one-sample, two-sample, paired) +- Proportion tests +- Chi-square tests +- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis) +- ANOVA (one-way, two-way, repeated measures) + +**Multiple comparisons:** +- Tukey's HSD +- Bonferroni correction +- False Discovery Rate (FDR) + +**Effect sizes and power:** +- Cohen's d, eta-squared +- Power analysis for t-tests, proportions +- Sample size calculations + +**Robust inference:** +- Heteroskedasticity-consistent SEs (HC0-HC3) +- HAC standard errors (Newey-West) +- Cluster-robust standard errors + +**When to use:** Validating assumptions, detecting problems, ensuring robust inference + +**Reference:** See `references/stats_diagnostics.md` for comprehensive testing and diagnostic procedures. + +## Formula API (R-style) + +Statsmodels supports R-style formulas for intuitive model specification: + +```python +import statsmodels.formula.api as smf + +# OLS with formula +results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit() + +# Categorical variables (automatic dummy coding) +results = smf.ols('y ~ x1 + C(category)', data=df).fit() + +# Interactions +results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2 + +# Polynomial terms +results = smf.ols('y ~ x + I(x**2)', data=df).fit() + +# Logit +results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit() + +# Poisson +results = smf.poisson('count ~ x1 + x2', data=df).fit() + +# ARIMA (not available via formula, use regular API) +``` + +## Model Selection and Comparison + +### Information Criteria + +```python +# Compare models using AIC/BIC +models = { + 'Model 1': model1_results, + 'Model 2': model2_results, + 'Model 3': model3_results +} + +comparison = pd.DataFrame({ + 'AIC': {name: res.aic for name, res in models.items()}, + 'BIC': {name: res.bic for name, res in models.items()}, + 'Log-Likelihood': {name: res.llf for name, res in models.items()} +}) + +print(comparison.sort_values('AIC')) +# Lower AIC/BIC indicates better model +``` + +### Likelihood Ratio Test (Nested Models) + +```python +# For nested models (one is subset of the other) +from scipy import stats + +lr_stat = 2 * (full_model.llf - reduced_model.llf) +df = full_model.df_model - reduced_model.df_model +p_value = 1 - stats.chi2.cdf(lr_stat, df) + +print(f"LR statistic: {lr_stat:.4f}") +print(f"p-value: {p_value:.4f}") + +if p_value < 0.05: + print("Full model significantly better") +else: + print("Reduced model preferred (parsimony)") +``` + +### Cross-Validation + +```python +from sklearn.model_selection import KFold +from sklearn.metrics import mean_squared_error + +kf = KFold(n_splits=5, shuffle=True, random_state=42) +cv_scores = [] + +for train_idx, val_idx in kf.split(X): + X_train, X_val = X.iloc[train_idx], X.iloc[val_idx] + y_train, y_val = y.iloc[train_idx], y.iloc[val_idx] + + # Fit model + model = sm.OLS(y_train, X_train).fit() + + # Predict + y_pred = model.predict(X_val) + + # Score + rmse = np.sqrt(mean_squared_error(y_val, y_pred)) + cv_scores.append(rmse) + +print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}") +``` + +## Best Practices + +### Data Preparation + +1. **Always add constant**: Use `sm.add_constant()` unless excluding intercept +2. **Check for missing values**: Handle or impute before fitting +3. **Scale if needed**: Improves convergence, interpretation (but not required for tree models) +4. **Encode categoricals**: Use formula API or manual dummy coding + +### Model Building + +1. **Start simple**: Begin with basic model, add complexity as needed +2. **Check assumptions**: Test residuals, heteroskedasticity, autocorrelation +3. **Use appropriate model**: Match model to outcome type (binary→Logit, count→Poisson) +4. **Consider alternatives**: If assumptions violated, use robust methods or different model + +### Inference + +1. **Report effect sizes**: Not just p-values +2. **Use robust SEs**: When heteroskedasticity or clustering present +3. **Multiple comparisons**: Correct when testing many hypotheses +4. **Confidence intervals**: Always report alongside point estimates + +### Model Evaluation + +1. **Check residuals**: Plot residuals vs fitted, Q-Q plot +2. **Influence diagnostics**: Identify and investigate influential observations +3. **Out-of-sample validation**: Test on holdout set or cross-validate +4. **Compare models**: Use AIC/BIC for non-nested, LR test for nested + +### Reporting + +1. **Comprehensive summary**: Use `.summary()` for detailed output +2. **Document decisions**: Note transformations, excluded observations +3. **Interpret carefully**: Account for link functions (e.g., exp(β) for log link) +4. **Visualize**: Plot predictions, confidence intervals, diagnostics + +## Common Workflows + +### Workflow 1: Linear Regression Analysis + +1. Explore data (plots, descriptives) +2. Fit initial OLS model +3. Check residual diagnostics +4. Test for heteroskedasticity, autocorrelation +5. Check for multicollinearity (VIF) +6. Identify influential observations +7. Refit with robust SEs if needed +8. Interpret coefficients and inference +9. Validate on holdout or via CV + +### Workflow 2: Binary Classification + +1. Fit logistic regression (Logit) +2. Check for convergence issues +3. Interpret odds ratios +4. Calculate marginal effects +5. Evaluate classification performance (AUC, confusion matrix) +6. Check for influential observations +7. Compare with alternative models (Probit) +8. Validate predictions on test set + +### Workflow 3: Count Data Analysis + +1. Fit Poisson regression +2. Check for overdispersion +3. If overdispersed, fit Negative Binomial +4. Check for excess zeros (consider ZIP/ZINB) +5. Interpret rate ratios +6. Assess goodness of fit +7. Compare models via AIC +8. Validate predictions + +### Workflow 4: Time Series Forecasting + +1. Plot series, check for trend/seasonality +2. Test for stationarity (ADF, KPSS) +3. Difference if non-stationary +4. Identify p, q from ACF/PACF +5. Fit ARIMA or SARIMAX +6. Check residual diagnostics (Ljung-Box) +7. Generate forecasts with confidence intervals +8. Evaluate forecast accuracy on test set + +## Reference Documentation + +This skill includes comprehensive reference files for detailed guidance: + +### references/linear_models.md +Detailed coverage of linear regression models including: +- OLS, WLS, GLS, GLSAR, Quantile Regression +- Mixed effects models +- Recursive and rolling regression +- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity) +- Influence statistics and outlier detection +- Robust standard errors (HC, HAC, cluster) +- Hypothesis testing and model comparison + +### references/glm.md +Complete guide to generalized linear models: +- All distribution families (Binomial, Poisson, Gamma, etc.) +- Link functions and when to use each +- Model fitting and interpretation +- Pseudo R-squared and goodness of fit +- Diagnostics and residual analysis +- Applications (logistic, Poisson, Gamma regression) + +### references/discrete_choice.md +Comprehensive guide to discrete outcome models: +- Binary models (Logit, Probit) +- Multinomial models (MNLogit, Conditional Logit) +- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle) +- Ordinal models +- Marginal effects and interpretation +- Model diagnostics and comparison + +### references/time_series.md +In-depth time series analysis guidance: +- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing) +- Multivariate models (VAR, VARMAX, Dynamic Factor) +- State space models +- Stationarity testing and diagnostics +- Forecasting methods and evaluation +- Granger causality, IRF, FEVD + +### references/stats_diagnostics.md +Comprehensive statistical testing and diagnostics: +- Residual diagnostics (autocorrelation, heteroskedasticity, normality) +- Influence and outlier detection +- Hypothesis tests (parametric and non-parametric) +- ANOVA and post-hoc tests +- Multiple comparisons correction +- Robust covariance matrices +- Power analysis and effect sizes + +**When to reference:** +- Need detailed parameter explanations +- Choosing between similar models +- Troubleshooting convergence or diagnostic issues +- Understanding specific test statistics +- Looking for code examples for advanced features + +**Search patterns:** +```bash +# Find information about specific models +grep -r "Quantile Regression" references/ + +# Find diagnostic tests +grep -r "Breusch-Pagan" references/stats_diagnostics.md + +# Find time series guidance +grep -r "SARIMAX" references/time_series.md +``` + +## Common Pitfalls to Avoid + +1. **Forgetting constant term**: Always use `sm.add_constant()` unless no intercept desired +2. **Ignoring assumptions**: Check residuals, heteroskedasticity, autocorrelation +3. **Wrong model for outcome type**: Binary→Logit/Probit, Count→Poisson/NB, not OLS +4. **Not checking convergence**: Look for optimization warnings +5. **Misinterpreting coefficients**: Remember link functions (log, logit, etc.) +6. **Using Poisson with overdispersion**: Check dispersion, use Negative Binomial if needed +7. **Not using robust SEs**: When heteroskedasticity or clustering present +8. **Overfitting**: Too many parameters relative to sample size +9. **Data leakage**: Fitting on test data or using future information +10. **Not validating predictions**: Always check out-of-sample performance +11. **Comparing non-nested models**: Use AIC/BIC, not LR test +12. **Ignoring influential observations**: Check Cook's distance and leverage +13. **Multiple testing**: Correct p-values when testing many hypotheses +14. **Not differencing time series**: Fit ARIMA on non-stationary data +15. **Confusing prediction vs confidence intervals**: Prediction intervals are wider + +## Getting Help + +For detailed documentation and examples: +- Official docs: https://www.statsmodels.org/stable/ +- User guide: https://www.statsmodels.org/stable/user-guide.html +- Examples: https://www.statsmodels.org/stable/examples/index.html +- API reference: https://www.statsmodels.org/stable/api.html diff --git a/scientific-packages/statsmodels/references/discrete_choice.md b/scientific-packages/statsmodels/references/discrete_choice.md new file mode 100644 index 0000000..6d51bee --- /dev/null +++ b/scientific-packages/statsmodels/references/discrete_choice.md @@ -0,0 +1,669 @@ +# Discrete Choice Models Reference + +This document provides comprehensive guidance on discrete choice models in statsmodels, including binary, multinomial, count, and ordinal models. + +## Overview + +Discrete choice models handle outcomes that are: +- **Binary**: 0/1, success/failure +- **Multinomial**: Multiple unordered categories +- **Ordinal**: Ordered categories +- **Count**: Non-negative integers + +All models use maximum likelihood estimation and assume i.i.d. errors. + +## Binary Models + +### Logit (Logistic Regression) + +Uses logistic distribution for binary outcomes. + +**When to use:** +- Binary classification (yes/no, success/failure) +- Probability estimation for binary outcomes +- Interpretable odds ratios + +**Model**: P(Y=1|X) = 1 / (1 + exp(-Xβ)) + +```python +import statsmodels.api as sm +from statsmodels.discrete.discrete_model import Logit + +# Prepare data +X = sm.add_constant(X_data) + +# Fit model +model = Logit(y, X) +results = model.fit() + +print(results.summary()) +``` + +**Interpretation:** +```python +import numpy as np + +# Odds ratios +odds_ratios = np.exp(results.params) +print("Odds ratios:", odds_ratios) + +# For 1-unit increase in X, odds multiply by exp(β) +# OR > 1: increases odds of success +# OR < 1: decreases odds of success +# OR = 1: no effect + +# Confidence intervals for odds ratios +odds_ci = np.exp(results.conf_int()) +print("Odds ratio 95% CI:") +print(odds_ci) +``` + +**Marginal effects:** +```python +# Average marginal effects (AME) +marginal_effects = results.get_margeff(at='mean') +print(marginal_effects.summary()) + +# Marginal effects at means (MEM) +marginal_effects_mem = results.get_margeff(at='mean', method='dydx') + +# Marginal effects at representative values +marginal_effects_custom = results.get_margeff(at='mean', + atexog={'x1': 1, 'x2': 5}) +``` + +**Predictions:** +```python +# Predicted probabilities +probs = results.predict(X) + +# Binary predictions (0.5 threshold) +predictions = (probs > 0.5).astype(int) + +# Custom threshold +threshold = 0.3 +predictions_custom = (probs > threshold).astype(int) + +# For new data +X_new = sm.add_constant(X_new_data) +new_probs = results.predict(X_new) +``` + +**Model evaluation:** +```python +from sklearn.metrics import (classification_report, confusion_matrix, + roc_auc_score, roc_curve) + +# Classification report +print(classification_report(y, predictions)) + +# Confusion matrix +print(confusion_matrix(y, predictions)) + +# AUC-ROC +auc = roc_auc_score(y, probs) +print(f"AUC: {auc:.4f}") + +# Pseudo R-squared +print(f"McFadden's Pseudo R²: {results.prsquared:.4f}") +``` + +### Probit + +Uses normal distribution for binary outcomes. + +**When to use:** +- Binary outcomes +- Prefer normal distribution assumption +- Field convention (econometrics often uses probit) + +**Model**: P(Y=1|X) = Φ(Xβ), where Φ is standard normal CDF + +```python +from statsmodels.discrete.discrete_model import Probit + +model = Probit(y, X) +results = model.fit() + +print(results.summary()) +``` + +**Comparison with Logit:** +- Probit and Logit usually give similar results +- Probit: symmetric, based on normal distribution +- Logit: slightly heavier tails, easier interpretation (odds ratios) +- Coefficients not directly comparable (scale difference) + +```python +# Marginal effects are comparable +logit_me = logit_results.get_margeff().margeff +probit_me = probit_results.get_margeff().margeff + +print("Logit marginal effects:", logit_me) +print("Probit marginal effects:", probit_me) +``` + +## Multinomial Models + +### MNLogit (Multinomial Logit) + +For unordered categorical outcomes with 3+ categories. + +**When to use:** +- Multiple unordered categories (e.g., transportation mode, brand choice) +- No natural ordering among categories +- Need probabilities for each category + +**Model**: P(Y=j|X) = exp(Xβⱼ) / Σₖ exp(Xβₖ) + +```python +from statsmodels.discrete.discrete_model import MNLogit + +# y should be integers 0, 1, 2, ... for categories +model = MNLogit(y, X) +results = model.fit() + +print(results.summary()) +``` + +**Interpretation:** +```python +# One category is reference (usually category 0) +# Coefficients represent log-odds relative to reference + +# For category j vs reference: +# exp(β_j) = odds ratio of category j vs reference + +# Predicted probabilities for each category +probs = results.predict(X) # Shape: (n_samples, n_categories) + +# Most likely category +predicted_categories = probs.argmax(axis=1) +``` + +**Relative risk ratios:** +```python +# Exponentiate coefficients for relative risk ratios +import numpy as np +import pandas as pd + +# Get parameter names and values +params_df = pd.DataFrame({ + 'coef': results.params, + 'RRR': np.exp(results.params) +}) +print(params_df) +``` + +### Conditional Logit + +For choice models where alternatives have characteristics. + +**When to use:** +- Alternative-specific regressors (vary across choices) +- Panel data with choices +- Discrete choice experiments + +```python +from statsmodels.discrete.conditional_models import ConditionalLogit + +# Data structure: long format with choice indicator +model = ConditionalLogit(y_choice, X_alternatives, groups=individual_id) +results = model.fit() +``` + +## Count Models + +### Poisson + +Standard model for count data. + +**When to use:** +- Count outcomes (events, occurrences) +- Rare events +- Mean ≈ variance + +**Model**: P(Y=k|X) = exp(-λ) λᵏ / k!, where log(λ) = Xβ + +```python +from statsmodels.discrete.count_model import Poisson + +model = Poisson(y_counts, X) +results = model.fit() + +print(results.summary()) +``` + +**Interpretation:** +```python +# Rate ratios (incident rate ratios) +rate_ratios = np.exp(results.params) +print("Rate ratios:", rate_ratios) + +# For 1-unit increase in X, expected count multiplies by exp(β) +``` + +**Check overdispersion:** +```python +# Mean and variance should be similar for Poisson +print(f"Mean: {y_counts.mean():.2f}") +print(f"Variance: {y_counts.var():.2f}") + +# Formal test +from statsmodels.stats.stattools import durbin_watson + +# Overdispersion if variance >> mean +# Rule of thumb: variance/mean > 1.5 suggests overdispersion +overdispersion_ratio = y_counts.var() / y_counts.mean() +print(f"Variance/Mean: {overdispersion_ratio:.2f}") + +if overdispersion_ratio > 1.5: + print("Consider Negative Binomial model") +``` + +**With offset (for rates):** +```python +# When modeling rates with varying exposure +# log(λ) = log(exposure) + Xβ + +model = Poisson(y_counts, X, offset=np.log(exposure)) +results = model.fit() +``` + +### Negative Binomial + +For overdispersed count data (variance > mean). + +**When to use:** +- Count data with overdispersion +- Excess variance not explained by Poisson +- Heterogeneity in counts + +**Model**: Adds dispersion parameter α to account for overdispersion + +```python +from statsmodels.discrete.count_model import NegativeBinomial + +model = NegativeBinomial(y_counts, X) +results = model.fit() + +print(results.summary()) +print(f"Dispersion parameter alpha: {results.params['alpha']:.4f}") +``` + +**Compare with Poisson:** +```python +# Fit both models +poisson_results = Poisson(y_counts, X).fit() +nb_results = NegativeBinomial(y_counts, X).fit() + +# AIC comparison (lower is better) +print(f"Poisson AIC: {poisson_results.aic:.2f}") +print(f"Negative Binomial AIC: {nb_results.aic:.2f}") + +# Likelihood ratio test (if NB is better) +from scipy import stats +lr_stat = 2 * (nb_results.llf - poisson_results.llf) +lr_pval = 1 - stats.chi2.cdf(lr_stat, df=1) # 1 extra parameter (alpha) +print(f"LR test p-value: {lr_pval:.4f}") + +if lr_pval < 0.05: + print("Negative Binomial significantly better") +``` + +### Zero-Inflated Models + +For count data with excess zeros. + +**When to use:** +- More zeros than expected from Poisson/NB +- Two processes: one for zeros, one for counts +- Examples: number of doctor visits, insurance claims + +**Models:** +- ZeroInflatedPoisson (ZIP) +- ZeroInflatedNegativeBinomialP (ZINB) + +```python +from statsmodels.discrete.count_model import (ZeroInflatedPoisson, + ZeroInflatedNegativeBinomialP) + +# ZIP model +zip_model = ZeroInflatedPoisson(y_counts, X, exog_infl=X_inflation) +zip_results = zip_model.fit() + +# ZINB model (for overdispersion + excess zeros) +zinb_model = ZeroInflatedNegativeBinomialP(y_counts, X, exog_infl=X_inflation) +zinb_results = zinb_model.fit() + +print(zip_results.summary()) +``` + +**Two parts of the model:** +```python +# 1. Inflation model: P(Y=0 due to inflation) +# 2. Count model: distribution of counts + +# Predicted probabilities of inflation +inflation_probs = zip_results.predict(X, which='prob') + +# Predicted counts +predicted_counts = zip_results.predict(X, which='mean') +``` + +### Hurdle Models + +Two-stage model: whether any counts, then how many. + +**When to use:** +- Excess zeros +- Different processes for zero vs positive counts +- Zeros structurally different from positive values + +```python +from statsmodels.discrete.count_model import HurdleCountModel + +# Specify count distribution and zero inflation +model = HurdleCountModel(y_counts, X, + exog_infl=X_hurdle, + dist='poisson') # or 'negbin' +results = model.fit() + +print(results.summary()) +``` + +## Ordinal Models + +### Ordered Logit/Probit + +For ordered categorical outcomes. + +**When to use:** +- Ordered categories (e.g., low/medium/high, ratings 1-5) +- Natural ordering matters +- Want to respect ordinal structure + +**Model**: Cumulative probability model with cutpoints + +```python +from statsmodels.miscmodels.ordinal_model import OrderedModel + +# y should be ordered integers: 0, 1, 2, ... +model = OrderedModel(y_ordered, X, distr='logit') # or 'probit' +results = model.fit(method='bfgs') + +print(results.summary()) +``` + +**Interpretation:** +```python +# Cutpoints (thresholds between categories) +cutpoints = results.params[-n_categories+1:] +print("Cutpoints:", cutpoints) + +# Coefficients +coefficients = results.params[:-n_categories+1] +print("Coefficients:", coefficients) + +# Predicted probabilities for each category +probs = results.predict(X) # Shape: (n_samples, n_categories) + +# Most likely category +predicted_categories = probs.argmax(axis=1) +``` + +**Proportional odds assumption:** +```python +# Test if coefficients are same across cutpoints +# (Brant test - implement manually or check residuals) + +# Check: model each cutpoint separately and compare coefficients +``` + +## Model Diagnostics + +### Goodness of Fit + +```python +# Pseudo R-squared (McFadden) +print(f"Pseudo R²: {results.prsquared:.4f}") + +# AIC/BIC for model comparison +print(f"AIC: {results.aic:.2f}") +print(f"BIC: {results.bic:.2f}") + +# Log-likelihood +print(f"Log-likelihood: {results.llf:.2f}") + +# Likelihood ratio test vs null model +lr_stat = 2 * (results.llf - results.llnull) +from scipy import stats +lr_pval = 1 - stats.chi2.cdf(lr_stat, results.df_model) +print(f"LR test p-value: {lr_pval}") +``` + +### Classification Metrics (Binary) + +```python +from sklearn.metrics import (accuracy_score, precision_score, recall_score, + f1_score, roc_auc_score) + +# Predictions +probs = results.predict(X) +predictions = (probs > 0.5).astype(int) + +# Metrics +print(f"Accuracy: {accuracy_score(y, predictions):.4f}") +print(f"Precision: {precision_score(y, predictions):.4f}") +print(f"Recall: {recall_score(y, predictions):.4f}") +print(f"F1: {f1_score(y, predictions):.4f}") +print(f"AUC: {roc_auc_score(y, probs):.4f}") +``` + +### Classification Metrics (Multinomial) + +```python +from sklearn.metrics import accuracy_score, classification_report, log_loss + +# Predicted categories +probs = results.predict(X) +predictions = probs.argmax(axis=1) + +# Accuracy +accuracy = accuracy_score(y, predictions) +print(f"Accuracy: {accuracy:.4f}") + +# Classification report +print(classification_report(y, predictions)) + +# Log loss +logloss = log_loss(y, probs) +print(f"Log Loss: {logloss:.4f}") +``` + +### Count Model Diagnostics + +```python +# Observed vs predicted frequencies +observed = pd.Series(y_counts).value_counts().sort_index() +predicted = results.predict(X) +predicted_counts = pd.Series(np.round(predicted)).value_counts().sort_index() + +# Compare distributions +import matplotlib.pyplot as plt +fig, ax = plt.subplots() +observed.plot(kind='bar', alpha=0.5, label='Observed', ax=ax) +predicted_counts.plot(kind='bar', alpha=0.5, label='Predicted', ax=ax) +ax.legend() +ax.set_xlabel('Count') +ax.set_ylabel('Frequency') +plt.show() + +# Rootogram (better visualization) +from statsmodels.graphics.agreement import mean_diff_plot +# Custom rootogram implementation needed +``` + +### Influence and Outliers + +```python +# Standardized residuals +std_resid = (y - results.predict(X)) / np.sqrt(results.predict(X)) + +# Check for outliers (|std_resid| > 2) +outliers = np.where(np.abs(std_resid) > 2)[0] +print(f"Number of outliers: {len(outliers)}") + +# Leverage (hat values) - for logit/probit +# from statsmodels.stats.outliers_influence +``` + +## Hypothesis Testing + +```python +# Single parameter test (automatic in summary) + +# Multiple parameters: Wald test +# Test H0: β₁ = β₂ = 0 +R = [[0, 1, 0, 0], [0, 0, 1, 0]] +wald_test = results.wald_test(R) +print(wald_test) + +# Likelihood ratio test for nested models +model_reduced = Logit(y, X_reduced).fit() +model_full = Logit(y, X_full).fit() + +lr_stat = 2 * (model_full.llf - model_reduced.llf) +df = model_full.df_model - model_reduced.df_model +from scipy import stats +lr_pval = 1 - stats.chi2.cdf(lr_stat, df) +print(f"LR test p-value: {lr_pval:.4f}") +``` + +## Model Selection and Comparison + +```python +# Fit multiple models +models = { + 'Logit': Logit(y, X).fit(), + 'Probit': Probit(y, X).fit(), + # Add more models +} + +# Compare AIC/BIC +comparison = pd.DataFrame({ + 'AIC': {name: model.aic for name, model in models.items()}, + 'BIC': {name: model.bic for name, model in models.items()}, + 'Pseudo R²': {name: model.prsquared for name, model in models.items()} +}) +print(comparison.sort_values('AIC')) + +# Cross-validation for predictive performance +from sklearn.model_selection import cross_val_score +from sklearn.linear_model import LogisticRegression + +# Use sklearn wrapper or manual CV +``` + +## Formula API + +Use R-style formulas for easier specification. + +```python +import statsmodels.formula.api as smf + +# Logit with formula +formula = 'y ~ x1 + x2 + C(category) + x1:x2' +results = smf.logit(formula, data=df).fit() + +# MNLogit with formula +results = smf.mnlogit(formula, data=df).fit() + +# Poisson with formula +results = smf.poisson(formula, data=df).fit() + +# Negative Binomial with formula +results = smf.negativebinomial(formula, data=df).fit() +``` + +## Common Applications + +### Binary Classification (Marketing Response) + +```python +# Predict customer purchase probability +X = sm.add_constant(customer_features) +model = Logit(purchased, X) +results = model.fit() + +# Targeting: select top 20% likely to purchase +probs = results.predict(X) +top_20_pct_idx = np.argsort(probs)[-int(0.2*len(probs)):] +``` + +### Multinomial Choice (Transportation Mode) + +```python +# Predict transportation mode choice +model = MNLogit(mode_choice, X) +results = model.fit() + +# Predicted mode for new commuter +new_commuter = sm.add_constant(new_features) +mode_probs = results.predict(new_commuter) +predicted_mode = mode_probs.argmax(axis=1) +``` + +### Count Data (Number of Doctor Visits) + +```python +# Model healthcare utilization +model = NegativeBinomial(num_visits, X) +results = model.fit() + +# Expected visits for new patient +expected_visits = results.predict(new_patient_X) +``` + +### Zero-Inflated (Insurance Claims) + +```python +# Many people have zero claims +# Zero-inflation: some never claim +# Count process: those who might claim + +zip_model = ZeroInflatedPoisson(claims, X_count, exog_infl=X_inflation) +results = zip_model.fit() + +# P(never file claim) +never_claim_prob = results.predict(X, which='prob-zero') + +# Expected claims +expected_claims = results.predict(X, which='mean') +``` + +## Best Practices + +1. **Check data type**: Ensure response matches model (binary, counts, categories) +2. **Add constant**: Always use `sm.add_constant()` unless no intercept desired +3. **Scale continuous predictors**: For better convergence and interpretation +4. **Check convergence**: Look for convergence warnings +5. **Use formula API**: For categorical variables and interactions +6. **Marginal effects**: Report marginal effects, not just coefficients +7. **Model comparison**: Use AIC/BIC and cross-validation +8. **Validate**: Holdout set or cross-validation for predictive models +9. **Check overdispersion**: For count models, test Poisson assumption +10. **Consider alternatives**: Zero-inflation, hurdle models for excess zeros + +## Common Pitfalls + +1. **Forgetting constant**: No intercept term +2. **Perfect separation**: Logit/probit may not converge +3. **Using Poisson with overdispersion**: Check and use Negative Binomial +4. **Misinterpreting coefficients**: Remember they're on log-odds/log scale +5. **Not checking convergence**: Optimization may fail silently +6. **Wrong distribution**: Match model to data type (binary/count/categorical) +7. **Ignoring excess zeros**: Use ZIP/ZINB when appropriate +8. **Not validating predictions**: Always check out-of-sample performance +9. **Comparing non-nested models**: Use AIC/BIC, not likelihood ratio test +10. **Ordinal as nominal**: Use OrderedModel for ordered categories diff --git a/scientific-packages/statsmodels/references/glm.md b/scientific-packages/statsmodels/references/glm.md new file mode 100644 index 0000000..e8ab8e7 --- /dev/null +++ b/scientific-packages/statsmodels/references/glm.md @@ -0,0 +1,619 @@ +# Generalized Linear Models (GLM) Reference + +This document provides comprehensive guidance on generalized linear models in statsmodels, including families, link functions, and applications. + +## Overview + +GLMs extend linear regression to non-normal response distributions through: +1. **Distribution family**: Specifies the conditional distribution of the response +2. **Link function**: Transforms the linear predictor to the scale of the mean +3. **Variance function**: Relates variance to the mean + +**General form**: g(μ) = Xβ, where g is the link function and μ = E(Y|X) + +## When to Use GLM + +- **Binary outcomes**: Logistic regression (Binomial family with logit link) +- **Count data**: Poisson or Negative Binomial regression +- **Positive continuous data**: Gamma or Inverse Gaussian +- **Non-normal distributions**: When OLS assumptions violated +- **Link functions**: Need non-linear relationship between predictors and response scale + +## Distribution Families + +### Binomial Family + +For binary outcomes (0/1) or proportions (k/n). + +**When to use:** +- Binary classification +- Success/failure outcomes +- Proportions or rates + +**Common links:** +- Logit (default): log(μ/(1-μ)) +- Probit: Φ⁻¹(μ) +- Log: log(μ) + +```python +import statsmodels.api as sm +import statsmodels.formula.api as smf + +# Binary logistic regression +model = sm.GLM(y, X, family=sm.families.Binomial()) +results = model.fit() + +# Formula API +results = smf.glm('success ~ x1 + x2', data=df, + family=sm.families.Binomial()).fit() + +# Access predictions (probabilities) +probs = results.predict(X_new) + +# Classification (0.5 threshold) +predictions = (probs > 0.5).astype(int) +``` + +**Interpretation:** +```python +import numpy as np + +# Odds ratios (for logit link) +odds_ratios = np.exp(results.params) +print("Odds ratios:", odds_ratios) + +# For 1-unit increase in x, odds multiply by exp(beta) +``` + +### Poisson Family + +For count data (non-negative integers). + +**When to use:** +- Count outcomes (number of events) +- Rare events +- Rate modeling (with offset) + +**Common links:** +- Log (default): log(μ) +- Identity: μ +- Sqrt: √μ + +```python +# Poisson regression +model = sm.GLM(y, X, family=sm.families.Poisson()) +results = model.fit() + +# With exposure/offset for rates +# If modeling rate = counts/exposure +model = sm.GLM(y, X, family=sm.families.Poisson(), + offset=np.log(exposure)) +results = model.fit() + +# Interpretation: exp(beta) = multiplicative effect on expected count +import numpy as np +rate_ratios = np.exp(results.params) +print("Rate ratios:", rate_ratios) +``` + +**Overdispersion check:** +```python +# Deviance / df should be ~1 for Poisson +overdispersion = results.deviance / results.df_resid +print(f"Overdispersion: {overdispersion}") + +# If >> 1, consider Negative Binomial +if overdispersion > 1.5: + print("Consider Negative Binomial model for overdispersion") +``` + +### Negative Binomial Family + +For overdispersed count data. + +**When to use:** +- Count data with variance > mean +- Excess zeros or large variance +- Poisson model shows overdispersion + +```python +# Negative Binomial GLM +model = sm.GLM(y, X, family=sm.families.NegativeBinomial()) +results = model.fit() + +# Alternative: use discrete choice model with alpha estimation +from statsmodels.discrete.discrete_model import NegativeBinomial +nb_model = NegativeBinomial(y, X) +nb_results = nb_model.fit() + +print(f"Dispersion parameter alpha: {nb_results.params[-1]}") +``` + +### Gaussian Family + +Equivalent to OLS but fit via IRLS (Iteratively Reweighted Least Squares). + +**When to use:** +- Want GLM framework for consistency +- Need robust standard errors +- Comparing with other GLMs + +**Common links:** +- Identity (default): μ +- Log: log(μ) +- Inverse: 1/μ + +```python +# Gaussian GLM (equivalent to OLS) +model = sm.GLM(y, X, family=sm.families.Gaussian()) +results = model.fit() + +# Verify equivalence with OLS +ols_results = sm.OLS(y, X).fit() +print("Parameters close:", np.allclose(results.params, ols_results.params)) +``` + +### Gamma Family + +For positive continuous data, often right-skewed. + +**When to use:** +- Positive outcomes (insurance claims, survival times) +- Right-skewed distributions +- Variance proportional to mean² + +**Common links:** +- Inverse (default): 1/μ +- Log: log(μ) +- Identity: μ + +```python +# Gamma regression (common for cost data) +model = sm.GLM(y, X, family=sm.families.Gamma()) +results = model.fit() + +# Log link often preferred for interpretation +model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())) +results = model.fit() + +# With log link, exp(beta) = multiplicative effect +import numpy as np +effects = np.exp(results.params) +``` + +### Inverse Gaussian Family + +For positive continuous data with specific variance structure. + +**When to use:** +- Positive skewed outcomes +- Variance proportional to mean³ +- Alternative to Gamma + +**Common links:** +- Inverse squared (default): 1/μ² +- Log: log(μ) + +```python +model = sm.GLM(y, X, family=sm.families.InverseGaussian()) +results = model.fit() +``` + +### Tweedie Family + +Flexible family covering multiple distributions. + +**When to use:** +- Insurance claims (mixture of zeros and continuous) +- Semi-continuous data +- Need flexible variance function + +**Special cases (power parameter p):** +- p=0: Normal +- p=1: Poisson +- p=2: Gamma +- p=3: Inverse Gaussian +- 11 suggests overdispersion, <1 underdispersion +``` + +### Residual Analysis + +```python +import matplotlib.pyplot as plt + +# Deviance residuals vs fitted +plt.figure(figsize=(10, 6)) +plt.scatter(results.fittedvalues, results.resid_deviance, alpha=0.5) +plt.xlabel('Fitted values') +plt.ylabel('Deviance residuals') +plt.axhline(y=0, color='r', linestyle='--') +plt.title('Deviance Residuals vs Fitted') +plt.show() + +# Q-Q plot of deviance residuals +from statsmodels.graphics.gofplots import qqplot +qqplot(results.resid_deviance, line='s') +plt.title('Q-Q Plot of Deviance Residuals') +plt.show() + +# For binary outcomes: binned residual plot +if isinstance(results.model.family, sm.families.Binomial): + from statsmodels.graphics.gofplots import qqplot + # Group predictions and compute average residuals + # (custom implementation needed) + pass +``` + +### Influence and Outliers + +```python +from statsmodels.stats.outliers_influence import GLMInfluence + +influence = GLMInfluence(results) + +# Leverage +leverage = influence.hat_matrix_diag + +# Cook's distance +cooks_d = influence.cooks_distance[0] + +# DFFITS +dffits = influence.dffits[0] + +# Find influential observations +influential = np.where(cooks_d > 4/len(y))[0] +print(f"Influential observations: {influential}") +``` + +## Hypothesis Testing + +```python +# Wald test for single parameter (automatically in summary) + +# Likelihood ratio test for nested models +# Fit reduced model +model_reduced = sm.GLM(y, X_reduced, family=family).fit() +model_full = sm.GLM(y, X_full, family=family).fit() + +# LR statistic +lr_stat = 2 * (model_full.llf - model_reduced.llf) +df = model_full.df_model - model_reduced.df_model + +from scipy import stats +lr_pval = 1 - stats.chi2.cdf(lr_stat, df) +print(f"LR test p-value: {lr_pval}") + +# Wald test for multiple parameters +# Test beta_1 = beta_2 = 0 +R = [[0, 1, 0, 0], [0, 0, 1, 0]] +wald_test = results.wald_test(R) +print(wald_test) +``` + +## Robust Standard Errors + +```python +# Heteroscedasticity-robust (sandwich estimator) +results_robust = results.get_robustcov_results(cov_type='HC0') + +# Cluster-robust +results_cluster = results.get_robustcov_results(cov_type='cluster', + groups=cluster_ids) + +# Compare standard errors +print("Regular SE:", results.bse) +print("Robust SE:", results_robust.bse) +``` + +## Model Comparison + +```python +# AIC/BIC for non-nested models +models = [model1_results, model2_results, model3_results] +for i, res in enumerate(models, 1): + print(f"Model {i}: AIC={res.aic:.2f}, BIC={res.bic:.2f}") + +# Likelihood ratio test for nested models (as shown above) + +# Cross-validation for predictive performance +from sklearn.model_selection import KFold +from sklearn.metrics import log_loss + +kf = KFold(n_splits=5, shuffle=True, random_state=42) +cv_scores = [] + +for train_idx, val_idx in kf.split(X): + X_train, X_val = X[train_idx], X[val_idx] + y_train, y_val = y[train_idx], y[val_idx] + + model_cv = sm.GLM(y_train, X_train, family=family).fit() + pred_probs = model_cv.predict(X_val) + + score = log_loss(y_val, pred_probs) + cv_scores.append(score) + +print(f"CV Log Loss: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}") +``` + +## Prediction + +```python +# Point predictions +predictions = results.predict(X_new) + +# For classification: get probabilities and convert +if isinstance(family, sm.families.Binomial): + probs = predictions + class_predictions = (probs > 0.5).astype(int) + +# For counts: predictions are expected counts +if isinstance(family, sm.families.Poisson): + expected_counts = predictions + +# Prediction intervals via bootstrap +n_boot = 1000 +boot_preds = np.zeros((n_boot, len(X_new))) + +for i in range(n_boot): + # Bootstrap resample + boot_idx = np.random.choice(len(y), size=len(y), replace=True) + X_boot, y_boot = X[boot_idx], y[boot_idx] + + # Fit and predict + boot_model = sm.GLM(y_boot, X_boot, family=family).fit() + boot_preds[i] = boot_model.predict(X_new) + +# 95% prediction intervals +pred_lower = np.percentile(boot_preds, 2.5, axis=0) +pred_upper = np.percentile(boot_preds, 97.5, axis=0) +``` + +## Common Applications + +### Logistic Regression (Binary Classification) + +```python +import statsmodels.api as sm + +# Fit logistic regression +X = sm.add_constant(X_data) +model = sm.GLM(y, X, family=sm.families.Binomial()) +results = model.fit() + +# Odds ratios +odds_ratios = np.exp(results.params) +odds_ci = np.exp(results.conf_int()) + +# Classification metrics +from sklearn.metrics import classification_report, roc_auc_score + +probs = results.predict(X) +predictions = (probs > 0.5).astype(int) + +print(classification_report(y, predictions)) +print(f"AUC: {roc_auc_score(y, probs):.4f}") + +# ROC curve +from sklearn.metrics import roc_curve +import matplotlib.pyplot as plt + +fpr, tpr, thresholds = roc_curve(y, probs) +plt.plot(fpr, tpr) +plt.plot([0, 1], [0, 1], 'k--') +plt.xlabel('False Positive Rate') +plt.ylabel('True Positive Rate') +plt.title('ROC Curve') +plt.show() +``` + +### Poisson Regression (Count Data) + +```python +# Fit Poisson model +X = sm.add_constant(X_data) +model = sm.GLM(y_counts, X, family=sm.families.Poisson()) +results = model.fit() + +# Rate ratios +rate_ratios = np.exp(results.params) +print("Rate ratios:", rate_ratios) + +# Check overdispersion +dispersion = results.pearson_chi2 / results.df_resid +if dispersion > 1.5: + print(f"Overdispersion detected ({dispersion:.2f}). Consider Negative Binomial.") +``` + +### Gamma Regression (Cost/Duration Data) + +```python +# Fit Gamma model with log link +X = sm.add_constant(X_data) +model = sm.GLM(y_cost, X, + family=sm.families.Gamma(link=sm.families.links.Log())) +results = model.fit() + +# Multiplicative effects +effects = np.exp(results.params) +print("Multiplicative effects on mean:", effects) +``` + +## Best Practices + +1. **Check distribution assumptions**: Plot histograms and Q-Q plots of response +2. **Verify link function**: Use canonical links unless there's a reason not to +3. **Examine residuals**: Deviance residuals should be approximately normal +4. **Test for overdispersion**: Especially for Poisson models +5. **Use offsets appropriately**: For rate modeling with varying exposure +6. **Consider robust SEs**: When variance assumptions questionable +7. **Compare models**: Use AIC/BIC for non-nested, LR test for nested +8. **Interpret on original scale**: Transform coefficients (e.g., exp for log link) +9. **Check influential observations**: Use Cook's distance +10. **Validate predictions**: Use cross-validation or holdout set + +## Common Pitfalls + +1. **Forgetting to add constant**: No intercept term +2. **Using wrong family**: Check distribution of response +3. **Ignoring overdispersion**: Use Negative Binomial instead of Poisson +4. **Misinterpreting coefficients**: Remember link function transformation +5. **Not checking convergence**: IRLS may not converge; check warnings +6. **Complete separation in logistic**: Some categories perfectly predict outcome +7. **Using identity link with bounded outcomes**: May predict outside valid range +8. **Comparing models with different samples**: Use same observations +9. **Forgetting offset in rate models**: Must use log(exposure) as offset +10. **Not considering alternatives**: Mixed models, zero-inflation for complex data diff --git a/scientific-packages/statsmodels/references/linear_models.md b/scientific-packages/statsmodels/references/linear_models.md new file mode 100644 index 0000000..cf42340 --- /dev/null +++ b/scientific-packages/statsmodels/references/linear_models.md @@ -0,0 +1,447 @@ +# Linear Regression Models Reference + +This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants. + +## Core Model Classes + +### OLS (Ordinary Least Squares) + +Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors. + +**When to use:** +- Standard regression analysis +- Errors are independent and have constant variance +- No autocorrelation or heteroscedasticity +- Most common starting point + +**Basic usage:** +```python +import statsmodels.api as sm +import numpy as np + +# Prepare data - ALWAYS add constant for intercept +X = sm.add_constant(X_data) # Adds column of 1s for intercept + +# Fit model +model = sm.OLS(y, X) +results = model.fit() + +# View results +print(results.summary()) +``` + +**Key results attributes:** +```python +results.params # Coefficients +results.bse # Standard errors +results.tvalues # T-statistics +results.pvalues # P-values +results.rsquared # R-squared +results.rsquared_adj # Adjusted R-squared +results.fittedvalues # Fitted values (predictions on training data) +results.resid # Residuals +results.conf_int() # Confidence intervals for parameters +``` + +**Prediction with confidence/prediction intervals:** +```python +# For in-sample predictions +pred = results.get_prediction(X) +pred_summary = pred.summary_frame() +print(pred_summary) # Contains mean, std, confidence intervals + +# For out-of-sample predictions +X_new = sm.add_constant(X_new_data) +pred_new = results.get_prediction(X_new) +pred_summary = pred_new.summary_frame() + +# Access intervals +mean_ci_lower = pred_summary["mean_ci_lower"] +mean_ci_upper = pred_summary["mean_ci_upper"] +obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals +obs_ci_upper = pred_summary["obs_ci_upper"] +``` + +**Formula API (R-style):** +```python +import statsmodels.formula.api as smf + +# Automatic handling of categorical variables and interactions +formula = 'y ~ x1 + x2 + C(category) + x1:x2' +results = smf.ols(formula, data=df).fit() +``` + +### WLS (Weighted Least Squares) + +Handles heteroscedastic errors (diagonal Σ) where variance differs across observations. + +**When to use:** +- Known heteroscedasticity (non-constant error variance) +- Different observations have different reliability +- Weights are known or can be estimated + +**Usage:** +```python +# If you know the weights (inverse variance) +weights = 1 / error_variance +model = sm.WLS(y, X, weights=weights) +results = model.fit() + +# Common weight patterns: +# - 1/variance: when variance is known +# - n_i: sample size for grouped data +# - 1/x: when variance proportional to x +``` + +**Feasible WLS (estimating weights):** +```python +# Step 1: Fit OLS +ols_results = sm.OLS(y, X).fit() + +# Step 2: Model squared residuals to estimate variance +abs_resid = np.abs(ols_results.resid) +variance_model = sm.OLS(np.log(abs_resid**2), X).fit() + +# Step 3: Use estimated variance as weights +weights = 1 / np.exp(variance_model.fittedvalues) +wls_results = sm.WLS(y, X, weights=weights).fit() +``` + +### GLS (Generalized Least Squares) + +Handles arbitrary covariance structure (Σ). Superclass for other regression methods. + +**When to use:** +- Known covariance structure +- Correlated errors +- More general than WLS + +**Usage:** +```python +# Specify covariance structure +# Sigma should be (n x n) covariance matrix +model = sm.GLS(y, X, sigma=Sigma) +results = model.fit() +``` + +### GLSAR (GLS with Autoregressive Errors) + +Feasible generalized least squares with AR(p) errors for time series data. + +**When to use:** +- Time series regression with autocorrelated errors +- Need to account for serial correlation +- Violations of error independence + +**Usage:** +```python +# AR(1) errors +model = sm.GLSAR(y, X, rho=1) # rho=1 for AR(1), rho=2 for AR(2), etc. +results = model.iterative_fit() # Iteratively estimates AR parameters + +print(results.summary()) +print(f"Estimated rho: {results.model.rho}") +``` + +### RLS (Recursive Least Squares) + +Sequential parameter estimation, useful for adaptive or online learning. + +**When to use:** +- Parameters change over time +- Online/streaming data +- Want to see parameter evolution + +**Usage:** +```python +from statsmodels.regression.recursive_ls import RecursiveLS + +model = RecursiveLS(y, X) +results = model.fit() + +# Access time-varying parameters +params_over_time = results.recursive_coefficients +cusum = results.cusum # CUSUM statistic for structural breaks +``` + +### Rolling Regressions + +Compute estimates across moving windows for time-varying parameter detection. + +**When to use:** +- Parameters vary over time +- Want to detect structural changes +- Time series with evolving relationships + +**Usage:** +```python +from statsmodels.regression.rolling import RollingOLS, RollingWLS + +# Rolling OLS with 60-period window +rolling_model = RollingOLS(y, X, window=60) +rolling_results = rolling_model.fit() + +# Extract time-varying parameters +rolling_params = rolling_results.params # DataFrame with parameters over time +rolling_rsquared = rolling_results.rsquared + +# Plot parameter evolution +import matplotlib.pyplot as plt +rolling_params.plot() +plt.title('Time-Varying Coefficients') +plt.show() +``` + +### Quantile Regression + +Analyzes conditional quantiles rather than conditional mean. + +**When to use:** +- Interest in quantiles (median, 90th percentile, etc.) +- Robust to outliers (median regression) +- Distributional effects across quantiles +- Heterogeneous effects + +**Usage:** +```python +from statsmodels.regression.quantile_regression import QuantReg + +# Median regression (50th percentile) +model = QuantReg(y, X) +results_median = model.fit(q=0.5) + +# Multiple quantiles +quantiles = [0.1, 0.25, 0.5, 0.75, 0.9] +results_dict = {} +for q in quantiles: + results_dict[q] = model.fit(q=q) + +# Plot quantile-varying effects +import matplotlib.pyplot as plt +coef_dict = {q: res.params for q, res in results_dict.items()} +coef_df = pd.DataFrame(coef_dict).T +coef_df.plot() +plt.xlabel('Quantile') +plt.ylabel('Coefficient') +plt.show() +``` + +## Mixed Effects Models + +For hierarchical/nested data with random effects. + +**When to use:** +- Clustered/grouped data (students in schools, patients in hospitals) +- Repeated measures +- Need random effects to account for grouping + +**Usage:** +```python +from statsmodels.regression.mixed_linear_model import MixedLM + +# Random intercept model +model = MixedLM(y, X, groups=group_ids) +results = model.fit() + +# Random intercept and slope +model = MixedLM(y, X, groups=group_ids, exog_re=X_random) +results = model.fit() + +print(results.summary()) +``` + +## Diagnostics and Model Assessment + +### Residual Analysis + +```python +# Basic residual plots +import matplotlib.pyplot as plt + +# Residuals vs fitted +plt.scatter(results.fittedvalues, results.resid) +plt.xlabel('Fitted values') +plt.ylabel('Residuals') +plt.axhline(y=0, color='r', linestyle='--') +plt.title('Residuals vs Fitted') +plt.show() + +# Q-Q plot for normality +from statsmodels.graphics.gofplots import qqplot +qqplot(results.resid, line='s') +plt.show() + +# Histogram of residuals +plt.hist(results.resid, bins=30, edgecolor='black') +plt.xlabel('Residuals') +plt.ylabel('Frequency') +plt.title('Distribution of Residuals') +plt.show() +``` + +### Specification Tests + +```python +from statsmodels.stats.diagnostic import het_breuschpagan, het_white +from statsmodels.stats.stattools import durbin_watson, jarque_bera + +# Heteroscedasticity tests +lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X) +print(f"Breusch-Pagan test p-value: {lm_pval}") + +# White test +white_test = het_white(results.resid, X) +print(f"White test p-value: {white_test[1]}") + +# Autocorrelation +dw_stat = durbin_watson(results.resid) +print(f"Durbin-Watson statistic: {dw_stat}") +# DW ~ 2 indicates no autocorrelation +# DW < 2 suggests positive autocorrelation +# DW > 2 suggests negative autocorrelation + +# Normality test +jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid) +print(f"Jarque-Bera test p-value: {jb_pval}") +``` + +### Multicollinearity + +```python +from statsmodels.stats.outliers_influence import variance_inflation_factor + +# Calculate VIF for each variable +vif_data = pd.DataFrame() +vif_data["Variable"] = X.columns +vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] + +print(vif_data) +# VIF > 10 indicates problematic multicollinearity +# VIF > 5 suggests moderate multicollinearity + +# Condition number (from summary) +print(f"Condition number: {results.condition_number}") +# Condition number > 20 suggests multicollinearity +# Condition number > 30 indicates serious problems +``` + +### Influence Statistics + +```python +from statsmodels.stats.outliers_influence import OLSInfluence + +influence = results.get_influence() + +# Leverage (hat values) +leverage = influence.hat_matrix_diag +# High leverage: > 2*p/n (p=predictors, n=observations) + +# Cook's distance +cooks_d = influence.cooks_distance[0] +# Influential if Cook's D > 4/n + +# DFFITS +dffits = influence.dffits[0] +# Influential if |DFFITS| > 2*sqrt(p/n) + +# Create influence plot +from statsmodels.graphics.regressionplots import influence_plot +fig, ax = plt.subplots(figsize=(12, 8)) +influence_plot(results, ax=ax) +plt.show() +``` + +### Hypothesis Testing + +```python +# Test single coefficient +# H0: beta_i = 0 (automatically in summary) + +# Test multiple restrictions using F-test +# Example: Test beta_1 = beta_2 = 0 +R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix +f_test = results.f_test(R) +print(f_test) + +# Formula-based hypothesis testing +f_test = results.f_test("x1 = x2 = 0") +print(f_test) + +# Test linear combination: beta_1 + beta_2 = 1 +r_matrix = [[0, 1, 1, 0]] +q_matrix = [1] # RHS value +f_test = results.f_test((r_matrix, q_matrix)) +print(f_test) + +# Wald test (equivalent to F-test for linear restrictions) +wald_test = results.wald_test(R) +print(wald_test) +``` + +## Model Comparison + +```python +# Compare nested models using likelihood ratio test (if using MLE) +from statsmodels.stats.anova import anova_lm + +# Fit restricted and unrestricted models +model_restricted = sm.OLS(y, X_restricted).fit() +model_full = sm.OLS(y, X_full).fit() + +# ANOVA table for model comparison +anova_results = anova_lm(model_restricted, model_full) +print(anova_results) + +# AIC/BIC for non-nested model comparison +print(f"Model 1 AIC: {model1.aic}, BIC: {model1.bic}") +print(f"Model 2 AIC: {model2.aic}, BIC: {model2.bic}") +# Lower AIC/BIC indicates better model +``` + +## Robust Standard Errors + +Handle heteroscedasticity or clustering without reweighting. + +```python +# Heteroscedasticity-robust (HC) standard errors +results_hc = results.get_robustcov_results(cov_type='HC0') # White's +results_hc1 = results.get_robustcov_results(cov_type='HC1') +results_hc2 = results.get_robustcov_results(cov_type='HC2') +results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative + +# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent) +results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4) + +# Cluster-robust standard errors +results_cluster = results.get_robustcov_results(cov_type='cluster', + groups=cluster_ids) + +# View robust results +print(results_hc3.summary()) +``` + +## Best Practices + +1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept +2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality) +3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically +4. **Robust standard errors**: Use when heteroscedasticity detected but model specification is correct +5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models +6. **Outliers and influence**: Always check Cook's distance and leverage +7. **Multicollinearity**: Check VIF and condition number before interpretation +8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors +9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors +10. **Quantile regression**: Use for robust estimation or when interested in distributional effects + +## Common Pitfalls + +1. **Forgetting to add constant**: Results in no-intercept model +2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors +3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors +4. **Over-interpreting with multicollinearity**: Check VIF first +5. **Not checking residuals**: Always plot residuals vs fitted values +6. **Using t-SNE/PCA residuals**: Residuals should be from original space +7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider +8. **Not handling categorical variables properly**: Use formula API or manual dummy coding +9. **Comparing models with different sample sizes**: Ensure same observations used +10. **Ignoring influential observations**: Check Cook's distance and DFFITS diff --git a/scientific-packages/statsmodels/references/stats_diagnostics.md b/scientific-packages/statsmodels/references/stats_diagnostics.md new file mode 100644 index 0000000..3d5432e --- /dev/null +++ b/scientific-packages/statsmodels/references/stats_diagnostics.md @@ -0,0 +1,859 @@ +# Statistical Tests and Diagnostics Reference + +This document provides comprehensive guidance on statistical tests, diagnostics, and tools available in statsmodels. + +## Overview + +Statsmodels provides extensive statistical testing capabilities: +- Residual diagnostics and specification tests +- Hypothesis testing (parametric and non-parametric) +- Goodness-of-fit tests +- Multiple comparisons and post-hoc tests +- Power and sample size calculations +- Robust covariance matrices +- Influence and outlier detection + +## Residual Diagnostics + +### Autocorrelation Tests + +**Ljung-Box Test**: Tests for autocorrelation in residuals + +```python +from statsmodels.stats.diagnostic import acorr_ljungbox + +# Test residuals for autocorrelation +lb_test = acorr_ljungbox(residuals, lags=10, return_df=True) +print(lb_test) + +# H0: No autocorrelation up to lag k +# If p-value < 0.05, reject H0 (autocorrelation present) +``` + +**Durbin-Watson Test**: Tests for first-order autocorrelation + +```python +from statsmodels.stats.stattools import durbin_watson + +dw_stat = durbin_watson(residuals) +print(f"Durbin-Watson: {dw_stat:.4f}") + +# DW ≈ 2: no autocorrelation +# DW < 2: positive autocorrelation +# DW > 2: negative autocorrelation +# Exact critical values depend on n and k +``` + +**Breusch-Godfrey Test**: More general test for autocorrelation + +```python +from statsmodels.stats.diagnostic import acorr_breusch_godfrey + +bg_test = acorr_breusch_godfrey(results, nlags=5) +lm_stat, lm_pval, f_stat, f_pval = bg_test + +print(f"LM statistic: {lm_stat:.4f}, p-value: {lm_pval:.4f}") +# H0: No autocorrelation up to lag k +``` + +### Heteroskedasticity Tests + +**Breusch-Pagan Test**: Tests for heteroskedasticity + +```python +from statsmodels.stats.diagnostic import het_breuschpagan + +bp_test = het_breuschpagan(residuals, exog) +lm_stat, lm_pval, f_stat, f_pval = bp_test + +print(f"Breusch-Pagan test p-value: {lm_pval:.4f}") +# H0: Homoskedasticity (constant variance) +# If p-value < 0.05, reject H0 (heteroskedasticity present) +``` + +**White Test**: More general test for heteroskedasticity + +```python +from statsmodels.stats.diagnostic import het_white + +white_test = het_white(residuals, exog) +lm_stat, lm_pval, f_stat, f_pval = white_test + +print(f"White test p-value: {lm_pval:.4f}") +# H0: Homoskedasticity +``` + +**ARCH Test**: Tests for autoregressive conditional heteroskedasticity + +```python +from statsmodels.stats.diagnostic import het_arch + +arch_test = het_arch(residuals, nlags=5) +lm_stat, lm_pval, f_stat, f_pval = arch_test + +print(f"ARCH test p-value: {lm_pval:.4f}") +# H0: No ARCH effects +# If significant, consider GARCH model +``` + +### Normality Tests + +**Jarque-Bera Test**: Tests for normality using skewness and kurtosis + +```python +from statsmodels.stats.stattools import jarque_bera + +jb_stat, jb_pval, skew, kurtosis = jarque_bera(residuals) + +print(f"Jarque-Bera statistic: {jb_stat:.4f}") +print(f"p-value: {jb_pval:.4f}") +print(f"Skewness: {skew:.4f}") +print(f"Kurtosis: {kurtosis:.4f}") + +# H0: Residuals are normally distributed +# Normal: skewness ≈ 0, kurtosis ≈ 3 +``` + +**Omnibus Test**: Another normality test (also based on skewness/kurtosis) + +```python +from statsmodels.stats.stattools import omni_normtest + +omni_stat, omni_pval = omni_normtest(residuals) +print(f"Omnibus test p-value: {omni_pval:.4f}") +# H0: Normality +``` + +**Anderson-Darling Test**: Distribution fit test + +```python +from statsmodels.stats.diagnostic import normal_ad + +ad_stat, ad_pval = normal_ad(residuals) +print(f"Anderson-Darling test p-value: {ad_pval:.4f}") +``` + +**Lilliefors Test**: Modified Kolmogorov-Smirnov test + +```python +from statsmodels.stats.diagnostic import lilliefors + +lf_stat, lf_pval = lilliefors(residuals, dist='norm') +print(f"Lilliefors test p-value: {lf_pval:.4f}") +``` + +### Linearity and Specification Tests + +**Ramsey RESET Test**: Tests for functional form misspecification + +```python +from statsmodels.stats.diagnostic import linear_reset + +reset_test = linear_reset(results, power=2) +f_stat, f_pval = reset_test + +print(f"RESET test p-value: {f_pval:.4f}") +# H0: Model is correctly specified (linear) +# If rejected, may need polynomial terms or transformations +``` + +**Harvey-Collier Test**: Tests for linearity + +```python +from statsmodels.stats.diagnostic import linear_harvey_collier + +hc_stat, hc_pval = linear_harvey_collier(results) +print(f"Harvey-Collier test p-value: {hc_pval:.4f}") +# H0: Linear specification is correct +``` + +## Multicollinearity Detection + +**Variance Inflation Factor (VIF)**: + +```python +from statsmodels.stats.outliers_influence import variance_inflation_factor +import pandas as pd + +# Calculate VIF for each variable +vif_data = pd.DataFrame() +vif_data["Variable"] = X.columns +vif_data["VIF"] = [variance_inflation_factor(X.values, i) + for i in range(X.shape[1])] + +print(vif_data.sort_values('VIF', ascending=False)) + +# Interpretation: +# VIF = 1: No correlation with other predictors +# VIF > 5: Moderate multicollinearity +# VIF > 10: Serious multicollinearity problem +# VIF > 20: Severe multicollinearity (consider removing variable) +``` + +**Condition Number**: From regression results + +```python +print(f"Condition number: {results.condition_number:.2f}") + +# Interpretation: +# < 10: No multicollinearity concern +# 10-30: Moderate multicollinearity +# > 30: Strong multicollinearity +# > 100: Severe multicollinearity +``` + +## Influence and Outlier Detection + +### Leverage + +High leverage points have extreme predictor values. + +```python +from statsmodels.stats.outliers_influence import OLSInfluence + +influence = results.get_influence() + +# Hat values (leverage) +leverage = influence.hat_matrix_diag + +# Rule of thumb: leverage > 2*p/n or 3*p/n is high +# p = number of parameters, n = sample size +threshold = 2 * len(results.params) / len(y) +high_leverage = np.where(leverage > threshold)[0] + +print(f"High leverage observations: {high_leverage}") +``` + +### Cook's Distance + +Measures overall influence of each observation. + +```python +# Cook's distance +cooks_d = influence.cooks_distance[0] + +# Rule of thumb: Cook's D > 4/n is influential +threshold = 4 / len(y) +influential = np.where(cooks_d > threshold)[0] + +print(f"Influential observations (Cook's D): {influential}") + +# Plot +import matplotlib.pyplot as plt +plt.stem(range(len(cooks_d)), cooks_d) +plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold (4/n)') +plt.xlabel('Observation') +plt.ylabel("Cook's Distance") +plt.legend() +plt.show() +``` + +### DFFITS + +Measures influence on fitted value. + +```python +# DFFITS +dffits = influence.dffits[0] + +# Rule of thumb: |DFFITS| > 2*sqrt(p/n) is influential +p = len(results.params) +n = len(y) +threshold = 2 * np.sqrt(p / n) + +influential_dffits = np.where(np.abs(dffits) > threshold)[0] +print(f"Influential observations (DFFITS): {influential_dffits}") +``` + +### DFBETAs + +Measures influence on each coefficient. + +```python +# DFBETAs (one for each parameter) +dfbetas = influence.dfbetas + +# Rule of thumb: |DFBETA| > 2/sqrt(n) +threshold = 2 / np.sqrt(n) + +for i, param_name in enumerate(results.params.index): + influential = np.where(np.abs(dfbetas[:, i]) > threshold)[0] + if len(influential) > 0: + print(f"Influential for {param_name}: {influential}") +``` + +### Influence Plot + +```python +from statsmodels.graphics.regressionplots import influence_plot + +fig, ax = plt.subplots(figsize=(12, 8)) +influence_plot(results, ax=ax, criterion='cooks') +plt.show() + +# Combines leverage, residuals, and Cook's distance +# Large bubbles = high Cook's distance +# Far from x=0 = high leverage +# Far from y=0 = large residual +``` + +### Studentized Residuals + +```python +# Studentized residuals (outliers) +student_resid = influence.resid_studentized_internal + +# External studentized residuals (more conservative) +student_resid_external = influence.resid_studentized_external + +# Outliers: |studentized residual| > 3 (or > 2.5) +outliers = np.where(np.abs(student_resid_external) > 3)[0] +print(f"Outliers: {outliers}") +``` + +## Hypothesis Testing + +### t-tests + +**One-sample t-test**: Test if mean equals specific value + +```python +from scipy import stats + +# H0: population mean = mu_0 +t_stat, p_value = stats.ttest_1samp(data, popmean=mu_0) + +print(f"t-statistic: {t_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Two-sample t-test**: Compare means of two groups + +```python +# H0: mean1 = mean2 (equal variances) +t_stat, p_value = stats.ttest_ind(group1, group2) + +# Welch's t-test (unequal variances) +t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False) + +print(f"t-statistic: {t_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Paired t-test**: Compare paired observations + +```python +# H0: mean difference = 0 +t_stat, p_value = stats.ttest_rel(before, after) + +print(f"t-statistic: {t_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +### Proportion Tests + +**One-proportion test**: + +```python +from statsmodels.stats.proportion import proportions_ztest + +# H0: proportion = p0 +count = 45 # successes +nobs = 100 # total observations +p0 = 0.5 # hypothesized proportion + +z_stat, p_value = proportions_ztest(count, nobs, value=p0) + +print(f"z-statistic: {z_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Two-proportion test**: + +```python +# H0: proportion1 = proportion2 +counts = [45, 60] +nobs = [100, 120] + +z_stat, p_value = proportions_ztest(counts, nobs) +print(f"z-statistic: {z_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +### Chi-square Tests + +**Chi-square test of independence**: + +```python +from scipy.stats import chi2_contingency + +# Contingency table +contingency_table = pd.crosstab(variable1, variable2) + +chi2, p_value, dof, expected = chi2_contingency(contingency_table) + +print(f"Chi-square statistic: {chi2:.4f}") +print(f"p-value: {p_value:.4f}") +print(f"Degrees of freedom: {dof}") + +# H0: Variables are independent +``` + +**Chi-square goodness-of-fit**: + +```python +from scipy.stats import chisquare + +# Observed frequencies +observed = [20, 30, 25, 25] + +# Expected frequencies (equal by default) +expected = [25, 25, 25, 25] + +chi2, p_value = chisquare(observed, expected) + +print(f"Chi-square statistic: {chi2:.4f}") +print(f"p-value: {p_value:.4f}") + +# H0: Data follow the expected distribution +``` + +### Non-parametric Tests + +**Mann-Whitney U test** (independent samples): + +```python +from scipy.stats import mannwhitneyu + +# H0: Distributions are equal +u_stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided') + +print(f"U statistic: {u_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Wilcoxon signed-rank test** (paired samples): + +```python +from scipy.stats import wilcoxon + +# H0: Median difference = 0 +w_stat, p_value = wilcoxon(before, after) + +print(f"W statistic: {w_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Kruskal-Wallis H test** (>2 groups): + +```python +from scipy.stats import kruskal + +# H0: All groups have same distribution +h_stat, p_value = kruskal(group1, group2, group3) + +print(f"H statistic: {h_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Sign test**: + +```python +from statsmodels.stats.descriptivestats import sign_test + +# H0: Median = m0 +result = sign_test(data, m0=0) +print(result) +``` + +### ANOVA + +**One-way ANOVA**: + +```python +from scipy.stats import f_oneway + +# H0: All group means are equal +f_stat, p_value = f_oneway(group1, group2, group3) + +print(f"F-statistic: {f_stat:.4f}") +print(f"p-value: {p_value:.4f}") +``` + +**Two-way ANOVA** (with statsmodels): + +```python +from statsmodels.formula.api import ols +from statsmodels.stats.anova import anova_lm + +# Fit model +model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', + data=df).fit() + +# ANOVA table +anova_table = anova_lm(model, typ=2) +print(anova_table) +``` + +**Repeated measures ANOVA**: + +```python +from statsmodels.stats.anova import AnovaRM + +# Requires long-format data +aovrm = AnovaRM(df, depvar='score', subject='subject_id', within=['time']) +results = aovrm.fit() + +print(results.summary()) +``` + +## Multiple Comparisons + +### Post-hoc Tests + +**Tukey's HSD** (Honest Significant Difference): + +```python +from statsmodels.stats.multicomp import pairwise_tukeyhsd + +# Perform Tukey HSD test +tukey = pairwise_tukeyhsd(data, groups, alpha=0.05) + +print(tukey.summary()) + +# Plot confidence intervals +tukey.plot_simultaneous() +plt.show() +``` + +**Bonferroni correction**: + +```python +from statsmodels.stats.multitest import multipletests + +# P-values from multiple tests +p_values = [0.01, 0.03, 0.04, 0.15, 0.001] + +# Apply correction +reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests( + p_values, + alpha=0.05, + method='bonferroni' +) + +print("Rejected:", reject) +print("Corrected p-values:", pvals_corrected) +``` + +**False Discovery Rate (FDR)**: + +```python +# FDR correction (less conservative than Bonferroni) +reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests( + p_values, + alpha=0.05, + method='fdr_bh' # Benjamini-Hochberg +) + +print("Rejected:", reject) +print("Corrected p-values:", pvals_corrected) +``` + +## Robust Covariance Matrices + +### Heteroskedasticity-Consistent (HC) Standard Errors + +```python +# After fitting OLS +results = sm.OLS(y, X).fit() + +# HC0 (White's heteroskedasticity-consistent SEs) +results_hc0 = results.get_robustcov_results(cov_type='HC0') + +# HC1 (degrees of freedom adjustment) +results_hc1 = results.get_robustcov_results(cov_type='HC1') + +# HC2 (leverage adjustment) +results_hc2 = results.get_robustcov_results(cov_type='HC2') + +# HC3 (most conservative, recommended for small samples) +results_hc3 = results.get_robustcov_results(cov_type='HC3') + +print("Standard OLS SEs:", results.bse) +print("Robust HC3 SEs:", results_hc3.bse) +``` + +### HAC (Heteroskedasticity and Autocorrelation Consistent) + +**Newey-West standard errors**: + +```python +# For time series with autocorrelation and heteroskedasticity +results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4) + +print("HAC (Newey-West) SEs:", results_hac.bse) +print(results_hac.summary()) +``` + +### Cluster-Robust Standard Errors + +```python +# For clustered/grouped data +results_cluster = results.get_robustcov_results( + cov_type='cluster', + groups=cluster_ids +) + +print("Cluster-robust SEs:", results_cluster.bse) +``` + +## Descriptive Statistics + +**Basic descriptive statistics**: + +```python +from statsmodels.stats.api import DescrStatsW + +# Comprehensive descriptive stats +desc = DescrStatsW(data) + +print("Mean:", desc.mean) +print("Std Dev:", desc.std) +print("Variance:", desc.var) +print("Confidence interval:", desc.tconfint_mean()) + +# Quantiles +print("Median:", desc.quantile(0.5)) +print("IQR:", desc.quantile([0.25, 0.75])) +``` + +**Weighted statistics**: + +```python +# With weights +desc_weighted = DescrStatsW(data, weights=weights) + +print("Weighted mean:", desc_weighted.mean) +print("Weighted std:", desc_weighted.std) +``` + +**Compare two groups**: + +```python +from statsmodels.stats.weightstats import CompareMeans + +# Create comparison object +cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group2)) + +# t-test +print("t-test:", cm.ttest_ind()) + +# Confidence interval for difference +print("CI for difference:", cm.tconfint_diff()) + +# Test for equal variances +print("Equal variance test:", cm.test_equal_var()) +``` + +## Power Analysis and Sample Size + +**Power for t-test**: + +```python +from statsmodels.stats.power import tt_ind_solve_power + +# Solve for sample size +effect_size = 0.5 # Cohen's d +alpha = 0.05 +power = 0.8 + +n = tt_ind_solve_power(effect_size=effect_size, + alpha=alpha, + power=power, + alternative='two-sided') + +print(f"Required sample size per group: {n:.0f}") + +# Solve for power given n +power = tt_ind_solve_power(effect_size=0.5, + nobs1=50, + alpha=0.05, + alternative='two-sided') + +print(f"Power: {power:.4f}") +``` + +**Power for proportion test**: + +```python +from statsmodels.stats.power import zt_ind_solve_power + +# For proportion tests (z-test) +effect_size = 0.3 # Difference in proportions +alpha = 0.05 +power = 0.8 + +n = zt_ind_solve_power(effect_size=effect_size, + alpha=alpha, + power=power, + alternative='two-sided') + +print(f"Required sample size per group: {n:.0f}") +``` + +**Power curves**: + +```python +from statsmodels.stats.power import TTestIndPower +import matplotlib.pyplot as plt + +# Create power analysis object +analysis = TTestIndPower() + +# Plot power curves for different sample sizes +sample_sizes = range(10, 200, 10) +effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large + +fig, ax = plt.subplots(figsize=(10, 6)) + +for es in effect_sizes: + power = [analysis.solve_power(effect_size=es, nobs1=n, alpha=0.05) + for n in sample_sizes] + ax.plot(sample_sizes, power, label=f'Effect size = {es}') + +ax.axhline(y=0.8, color='r', linestyle='--', label='Power = 0.8') +ax.set_xlabel('Sample size per group') +ax.set_ylabel('Power') +ax.set_title('Power Curves for Two-Sample t-test') +ax.legend() +ax.grid(True, alpha=0.3) +plt.show() +``` + +## Effect Sizes + +**Cohen's d** (standardized mean difference): + +```python +def cohens_d(group1, group2): + \"\"\"Calculate Cohen's d for independent samples\"\"\" + n1, n2 = len(group1), len(group2) + var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1) + + # Pooled standard deviation + pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)) + + # Cohen's d + d = (np.mean(group1) - np.mean(group2)) / pooled_std + + return d + +d = cohens_d(group1, group2) +print(f"Cohen's d: {d:.4f}") + +# Interpretation: +# |d| < 0.2: negligible +# |d| ~ 0.2: small +# |d| ~ 0.5: medium +# |d| ~ 0.8: large +``` + +**Eta-squared** (for ANOVA): + +```python +# From ANOVA table +# η² = SS_between / SS_total + +def eta_squared(anova_table): + return anova_table['sum_sq'][0] / anova_table['sum_sq'].sum() + +# After running ANOVA +eta_sq = eta_squared(anova_table) +print(f"Eta-squared: {eta_sq:.4f}") + +# Interpretation: +# 0.01: small effect +# 0.06: medium effect +# 0.14: large effect +``` + +## Contingency Tables and Association + +**McNemar's test** (paired binary data): + +```python +from statsmodels.stats.contingency_tables import mcnemar + +# 2x2 contingency table +table = [[a, b], + [c, d]] + +result = mcnemar(table, exact=True) # or exact=False for large samples +print(f"p-value: {result.pvalue:.4f}") + +# H0: Marginal probabilities are equal +``` + +**Cochran-Mantel-Haenszel test**: + +```python +from statsmodels.stats.contingency_tables import StratifiedTable + +# For stratified 2x2 tables +strat_table = StratifiedTable(tables_list) +result = strat_table.test_null_odds() + +print(f"p-value: {result.pvalue:.4f}") +``` + +## Treatment Effects and Causal Inference + +**Propensity score matching**: + +```python +from statsmodels.treatment import propensity_score + +# Estimate propensity scores +ps_model = sm.Logit(treatment, X).fit() +propensity_scores = ps_model.predict(X) + +# Use for matching or weighting +# (manual implementation of matching needed) +``` + +**Difference-in-differences**: + +```python +# Did formula: outcome ~ treatment * post +model = ols('outcome ~ treatment + post + treatment:post', data=df).fit() + +# DiD estimate is the interaction coefficient +did_estimate = model.params['treatment:post'] +print(f"DiD estimate: {did_estimate:.4f}") +``` + +## Best Practices + +1. **Always check assumptions**: Test before interpreting results +2. **Report effect sizes**: Not just p-values +3. **Use appropriate tests**: Match test to data type and distribution +4. **Correct for multiple comparisons**: When conducting many tests +5. **Check sample size**: Ensure adequate power +6. **Visual inspection**: Plot data before testing +7. **Report confidence intervals**: Along with point estimates +8. **Consider alternatives**: Non-parametric when assumptions violated +9. **Robust standard errors**: Use when heteroskedasticity/autocorrelation present +10. **Document decisions**: Note which tests used and why + +## Common Pitfalls + +1. **Not checking test assumptions**: May invalidate results +2. **Multiple testing without correction**: Inflated Type I error +3. **Using parametric tests on non-normal data**: Consider non-parametric +4. **Ignoring heteroskedasticity**: Use robust SEs +5. **Confusing statistical and practical significance**: Check effect sizes +6. **Not reporting confidence intervals**: Only p-values insufficient +7. **Using wrong test**: Match test to research question +8. **Insufficient power**: Risk of Type II error (false negatives) +9. **p-hacking**: Testing many specifications until significant +10. **Overinterpreting p-values**: Remember limitations of NHST diff --git a/scientific-packages/statsmodels/references/time_series.md b/scientific-packages/statsmodels/references/time_series.md new file mode 100644 index 0000000..1e7cb2f --- /dev/null +++ b/scientific-packages/statsmodels/references/time_series.md @@ -0,0 +1,716 @@ +# Time Series Analysis Reference + +This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods. + +## Overview + +Statsmodels offers extensive time series capabilities: +- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing +- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models +- **State space framework**: Custom models, Kalman filtering +- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis +- **Forecasting**: Point forecasts and prediction intervals + +## Univariate Time Series Models + +### AutoReg (AR Model) + +Autoregressive model: current value depends on past values. + +**When to use:** +- Univariate time series +- Past values predict future +- Stationary series + +**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ + +```python +from statsmodels.tsa.ar_model import AutoReg +import pandas as pd + +# Fit AR(p) model +model = AutoReg(y, lags=5) # AR(5) +results = model.fit() + +print(results.summary()) +``` + +**With exogenous regressors:** +```python +# AR with exogenous variables (ARX) +model = AutoReg(y, lags=5, exog=X_exog) +results = model.fit() +``` + +**Seasonal AR:** +```python +# Seasonal lags (e.g., monthly data with yearly seasonality) +model = AutoReg(y, lags=12, seasonal=True) +results = model.fit() +``` + +### ARIMA (Autoregressive Integrated Moving Average) + +Combines AR, differencing (I), and MA components. + +**When to use:** +- Non-stationary time series (needs differencing) +- Past values and errors predict future +- Flexible model for many time series + +**Model**: ARIMA(p,d,q) +- p: AR order (lags) +- d: differencing order (to achieve stationarity) +- q: MA order (lagged forecast errors) + +```python +from statsmodels.tsa.arima.model import ARIMA + +# Fit ARIMA(p,d,q) +model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1) +results = model.fit() + +print(results.summary()) +``` + +**Choosing p, d, q:** + +1. **Determine d (differencing order)**: +```python +from statsmodels.tsa.stattools import adfuller + +# ADF test for stationarity +def check_stationarity(series): + result = adfuller(series) + print(f"ADF Statistic: {result[0]:.4f}") + print(f"p-value: {result[1]:.4f}") + if result[1] <= 0.05: + print("Series is stationary") + return True + else: + print("Series is non-stationary, needs differencing") + return False + +# Test original series +if not check_stationarity(y): + # Difference once + y_diff = y.diff().dropna() + if not check_stationarity(y_diff): + # Difference again + y_diff2 = y_diff.diff().dropna() + check_stationarity(y_diff2) +``` + +2. **Determine p and q (ACF/PACF)**: +```python +from statsmodels.graphics.tsaplots import plot_acf, plot_pacf +import matplotlib.pyplot as plt + +# After differencing to stationarity +fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8)) + +# ACF: helps determine q (MA order) +plot_acf(y_stationary, lags=40, ax=ax1) +ax1.set_title('Autocorrelation Function (ACF)') + +# PACF: helps determine p (AR order) +plot_pacf(y_stationary, lags=40, ax=ax2) +ax2.set_title('Partial Autocorrelation Function (PACF)') + +plt.tight_layout() +plt.show() + +# Rules of thumb: +# - PACF cuts off at lag p → AR(p) +# - ACF cuts off at lag q → MA(q) +# - Both decay → ARMA(p,q) +``` + +3. **Model selection (AIC/BIC)**: +```python +# Grid search for best (p,q) given d +import numpy as np + +best_aic = np.inf +best_order = None + +for p in range(5): + for q in range(5): + try: + model = ARIMA(y, order=(p, d, q)) + results = model.fit() + if results.aic < best_aic: + best_aic = results.aic + best_order = (p, d, q) + except: + continue + +print(f"Best order: {best_order} with AIC: {best_aic:.2f}") +``` + +### SARIMAX (Seasonal ARIMA with Exogenous Variables) + +Extends ARIMA with seasonality and exogenous regressors. + +**When to use:** +- Seasonal patterns (monthly, quarterly data) +- External variables influence series +- Most flexible univariate model + +**Model**: SARIMAX(p,d,q)(P,D,Q,s) +- (p,d,q): Non-seasonal ARIMA +- (P,D,Q,s): Seasonal ARIMA with period s + +```python +from statsmodels.tsa.statespace.sarimax import SARIMAX + +# Seasonal ARIMA for monthly data (s=12) +model = SARIMAX(y, + order=(1, 1, 1), # (p,d,q) + seasonal_order=(1, 1, 1, 12)) # (P,D,Q,s) +results = model.fit() + +print(results.summary()) +``` + +**With exogenous variables:** +```python +# SARIMAX with external predictors +model = SARIMAX(y, + exog=X_exog, + order=(1, 1, 1), + seasonal_order=(1, 1, 1, 12)) +results = model.fit() +``` + +**Example: Monthly sales with trend and seasonality** +```python +# Typical for monthly data: (p,d,q)(P,D,Q,12) +# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12) + +model = SARIMAX(monthly_sales, + order=(0, 1, 1), + seasonal_order=(0, 1, 1, 12), + enforce_stationarity=False, + enforce_invertibility=False) +results = model.fit() +``` + +### Exponential Smoothing + +Weighted averages of past observations with exponentially decreasing weights. + +**When to use:** +- Simple, interpretable forecasts +- Trend and/or seasonality present +- No need for explicit model specification + +**Types:** +- Simple Exponential Smoothing: no trend, no seasonality +- Holt's method: with trend +- Holt-Winters: with trend and seasonality + +```python +from statsmodels.tsa.holtwinters import ExponentialSmoothing + +# Simple exponential smoothing +model = ExponentialSmoothing(y, trend=None, seasonal=None) +results = model.fit() + +# Holt's method (with trend) +model = ExponentialSmoothing(y, trend='add', seasonal=None) +results = model.fit() + +# Holt-Winters (trend + seasonality) +model = ExponentialSmoothing(y, + trend='add', # 'add' or 'mul' + seasonal='add', # 'add' or 'mul' + seasonal_periods=12) # e.g., 12 for monthly +results = model.fit() + +print(results.summary()) +``` + +**Additive vs Multiplicative:** +```python +# Additive: constant seasonal variation +# yₜ = Level + Trend + Seasonal + Error + +# Multiplicative: proportional seasonal variation +# yₜ = Level × Trend × Seasonal × Error + +# Choose based on data: +# - Additive: seasonal variation constant over time +# - Multiplicative: seasonal variation increases with level +``` + +**Innovations state space (ETS):** +```python +from statsmodels.tsa.exponential_smoothing.ets import ETSModel + +# More robust, state space formulation +model = ETSModel(y, + error='add', # 'add' or 'mul' + trend='add', # 'add', 'mul', or None + seasonal='add', # 'add', 'mul', or None + seasonal_periods=12) +results = model.fit() +``` + +## Multivariate Time Series + +### VAR (Vector Autoregression) + +System of equations where each variable depends on past values of all variables. + +**When to use:** +- Multiple interrelated time series +- Bidirectional relationships +- Granger causality testing + +**Model**: Each variable is AR on all variables: +- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ +- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ + +```python +from statsmodels.tsa.api import VAR +import pandas as pd + +# Data should be DataFrame with multiple columns +# Each column is a time series +df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3}) + +# Fit VAR +model = VAR(df_multivariate) + +# Select lag order using AIC/BIC +lag_order_results = model.select_order(maxlags=15) +print(lag_order_results.summary()) + +# Fit with optimal lags +results = model.fit(maxlags=5, ic='aic') +print(results.summary()) +``` + +**Granger causality testing:** +```python +# Test if series1 Granger-causes series2 +from statsmodels.tsa.stattools import grangercausalitytests + +# Requires 2D array [series2, series1] +test_data = df_multivariate[['series2', 'series1']] + +# Test up to max_lag +max_lag = 5 +results = grangercausalitytests(test_data, max_lag, verbose=True) + +# P-values for each lag +for lag in range(1, max_lag + 1): + p_value = results[lag][0]['ssr_ftest'][1] + print(f"Lag {lag}: p-value = {p_value:.4f}") +``` + +**Impulse Response Functions (IRF):** +```python +# Trace effect of shock through system +irf = results.irf(10) # 10 periods ahead + +# Plot IRFs +irf.plot(orth=True) # Orthogonalized (Cholesky decomposition) +plt.show() + +# Cumulative effects +irf.plot_cum_effects(orth=True) +plt.show() +``` + +**Forecast Error Variance Decomposition:** +```python +# Contribution of each variable to forecast error variance +fevd = results.fevd(10) # 10 periods ahead +fevd.plot() +plt.show() +``` + +### VARMAX (VAR with Moving Average and Exogenous Variables) + +Extends VAR with MA component and external regressors. + +**When to use:** +- VAR inadequate (MA component needed) +- External variables affect system +- More flexible multivariate model + +```python +from statsmodels.tsa.statespace.varmax import VARMAX + +# VARMAX(p, q) with exogenous variables +model = VARMAX(df_multivariate, + order=(1, 1), # (p, q) + exog=X_exog) +results = model.fit() + +print(results.summary()) +``` + +## State Space Models + +Flexible framework for custom time series models. + +**When to use:** +- Custom model specification +- Unobserved components +- Kalman filtering/smoothing +- Missing data + +```python +from statsmodels.tsa.statespace.mlemodel import MLEModel + +# Extend MLEModel for custom state space models +# Example: Local level model (random walk + noise) +``` + +**Dynamic Factor Models:** +```python +from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor + +# Extract common factors from multiple time series +model = DynamicFactor(df_multivariate, + k_factors=2, # Number of factors + factor_order=2) # AR order of factors +results = model.fit() + +# Estimated factors +factors = results.factors.filtered +``` + +## Forecasting + +### Point Forecasts + +```python +# ARIMA forecasting +model = ARIMA(y, order=(1, 1, 1)) +results = model.fit() + +# Forecast h steps ahead +h = 10 +forecast = results.forecast(steps=h) + +# With exogenous variables (SARIMAX) +model = SARIMAX(y, exog=X, order=(1, 1, 1)) +results = model.fit() + +# Need future exogenous values +forecast = results.forecast(steps=h, exog=X_future) +``` + +### Prediction Intervals + +```python +# Get forecast with confidence intervals +forecast_obj = results.get_forecast(steps=h) +forecast_df = forecast_obj.summary_frame() + +print(forecast_df) +# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper + +# Extract components +forecast_mean = forecast_df['mean'] +forecast_ci_lower = forecast_df['mean_ci_lower'] +forecast_ci_upper = forecast_df['mean_ci_upper'] + +# Plot +import matplotlib.pyplot as plt + +plt.figure(figsize=(12, 6)) +plt.plot(y.index, y, label='Historical') +plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red') +plt.fill_between(forecast_df.index, + forecast_ci_lower, + forecast_ci_upper, + alpha=0.3, color='red', label='95% CI') +plt.legend() +plt.title('Forecast with Prediction Intervals') +plt.show() +``` + +### Dynamic vs Static Forecasts + +```python +# Static (one-step-ahead, using actual values) +static_forecast = results.get_prediction(start=split_point, end=len(y)-1) + +# Dynamic (multi-step, using predicted values) +dynamic_forecast = results.get_prediction(start=split_point, + end=len(y)-1, + dynamic=True) + +# Plot comparison +fig, ax = plt.subplots(figsize=(12, 6)) +y.plot(ax=ax, label='Actual') +static_forecast.predicted_mean.plot(ax=ax, label='Static forecast') +dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast') +ax.legend() +plt.show() +``` + +## Diagnostic Tests + +### Stationarity Tests + +```python +from statsmodels.tsa.stattools import adfuller, kpss + +# Augmented Dickey-Fuller (ADF) test +# H0: unit root (non-stationary) +adf_result = adfuller(y, autolag='AIC') +print(f"ADF Statistic: {adf_result[0]:.4f}") +print(f"p-value: {adf_result[1]:.4f}") +if adf_result[1] <= 0.05: + print("Reject H0: Series is stationary") +else: + print("Fail to reject H0: Series is non-stationary") + +# KPSS test +# H0: stationary (opposite of ADF) +kpss_result = kpss(y, regression='c', nlags='auto') +print(f"KPSS Statistic: {kpss_result[0]:.4f}") +print(f"p-value: {kpss_result[1]:.4f}") +if kpss_result[1] <= 0.05: + print("Reject H0: Series is non-stationary") +else: + print("Fail to reject H0: Series is stationary") +``` + +### Residual Diagnostics + +```python +# Ljung-Box test for autocorrelation in residuals +from statsmodels.stats.diagnostic import acorr_ljungbox + +lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True) +print(lb_test) +# P-values > 0.05 indicate no significant autocorrelation (good) + +# Plot residual diagnostics +results.plot_diagnostics(figsize=(12, 8)) +plt.show() + +# Components: +# 1. Standardized residuals over time +# 2. Histogram + KDE of residuals +# 3. Q-Q plot for normality +# 4. Correlogram (ACF of residuals) +``` + +### Heteroskedasticity Tests + +```python +from statsmodels.stats.diagnostic import het_arch + +# ARCH test for heteroskedasticity +arch_test = het_arch(results.resid, nlags=10) +print(f"ARCH test statistic: {arch_test[0]:.4f}") +print(f"p-value: {arch_test[1]:.4f}") + +# If significant, consider GARCH model +``` + +## Seasonal Decomposition + +```python +from statsmodels.tsa.seasonal import seasonal_decompose + +# Decompose into trend, seasonal, residual +decomposition = seasonal_decompose(y, + model='additive', # or 'multiplicative' + period=12) # seasonal period + +# Plot components +fig = decomposition.plot() +fig.set_size_inches(12, 8) +plt.show() + +# Access components +trend = decomposition.trend +seasonal = decomposition.seasonal +residual = decomposition.resid + +# STL decomposition (more robust) +from statsmodels.tsa.seasonal import STL + +stl = STL(y, seasonal=13) # seasonal must be odd +stl_result = stl.fit() + +fig = stl_result.plot() +plt.show() +``` + +## Model Evaluation + +### In-Sample Metrics + +```python +# From results object +print(f"AIC: {results.aic:.2f}") +print(f"BIC: {results.bic:.2f}") +print(f"Log-likelihood: {results.llf:.2f}") + +# MSE on training data +from sklearn.metrics import mean_squared_error + +mse = mean_squared_error(y, results.fittedvalues) +rmse = np.sqrt(mse) +print(f"RMSE: {rmse:.4f}") + +# MAE +from sklearn.metrics import mean_absolute_error +mae = mean_absolute_error(y, results.fittedvalues) +print(f"MAE: {mae:.4f}") +``` + +### Out-of-Sample Evaluation + +```python +# Train-test split for time series (no shuffle!) +train_size = int(0.8 * len(y)) +y_train = y[:train_size] +y_test = y[train_size:] + +# Fit on training data +model = ARIMA(y_train, order=(1, 1, 1)) +results = model.fit() + +# Forecast test period +forecast = results.forecast(steps=len(y_test)) + +# Metrics +from sklearn.metrics import mean_squared_error, mean_absolute_error + +rmse = np.sqrt(mean_squared_error(y_test, forecast)) +mae = mean_absolute_error(y_test, forecast) +mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100 + +print(f"Test RMSE: {rmse:.4f}") +print(f"Test MAE: {mae:.4f}") +print(f"Test MAPE: {mape:.2f}%") +``` + +### Rolling Forecast + +```python +# More realistic evaluation: rolling one-step-ahead forecasts +forecasts = [] + +for t in range(len(y_test)): + # Refit or update with new observation + y_current = y[:train_size + t] + model = ARIMA(y_current, order=(1, 1, 1)) + fit = model.fit() + + # One-step forecast + fc = fit.forecast(steps=1)[0] + forecasts.append(fc) + +forecasts = np.array(forecasts) + +rmse = np.sqrt(mean_squared_error(y_test, forecasts)) +print(f"Rolling forecast RMSE: {rmse:.4f}") +``` + +### Cross-Validation + +```python +# Time series cross-validation (expanding window) +from sklearn.model_selection import TimeSeriesSplit + +tscv = TimeSeriesSplit(n_splits=5) +rmse_scores = [] + +for train_idx, test_idx in tscv.split(y): + y_train_cv = y.iloc[train_idx] + y_test_cv = y.iloc[test_idx] + + model = ARIMA(y_train_cv, order=(1, 1, 1)) + results = model.fit() + + forecast = results.forecast(steps=len(test_idx)) + rmse = np.sqrt(mean_squared_error(y_test_cv, forecast)) + rmse_scores.append(rmse) + +print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}") +``` + +## Advanced Topics + +### ARDL (Autoregressive Distributed Lag) + +Bridges univariate and multivariate time series. + +```python +from statsmodels.tsa.ardl import ARDL + +# ARDL(p, q) model +# y depends on its own lags and lags of X +model = ARDL(y, lags=2, exog=X, exog_lags=2) +results = model.fit() +``` + +### Error Correction Models + +For cointegrated series. + +```python +from statsmodels.tsa.vector_ar.vecm import coint_johansen + +# Test for cointegration +johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1) + +# Fit VECM if cointegrated +from statsmodels.tsa.vector_ar.vecm import VECM + +model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1) +results = model.fit() +``` + +### Regime Switching Models + +For structural breaks and regime changes. + +```python +from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression + +# Markov switching model +model = MarkovRegression(y, k_regimes=2, order=1) +results = model.fit() + +# Smoothed probabilities of regimes +regime_probs = results.smoothed_marginal_probabilities +``` + +## Best Practices + +1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests +2. **Plot data**: Always visualize before modeling +3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters) +4. **Model selection**: Use AIC/BIC and out-of-sample validation +5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity +6. **Forecast evaluation**: Use rolling forecasts and proper time series CV +7. **Avoid overfitting**: Prefer simpler models, use information criteria +8. **Document assumptions**: Note any data transformations (log, differencing) +9. **Prediction intervals**: Always provide uncertainty estimates +10. **Refit regularly**: Update models as new data arrives + +## Common Pitfalls + +1. **Not checking stationarity**: Fit ARIMA on non-stationary data +2. **Data leakage**: Using future data in transformations +3. **Wrong seasonal period**: S=4 for quarterly, S=12 for monthly +4. **Overfitting**: Too many parameters relative to data +5. **Ignoring residual autocorrelation**: Model inadequate +6. **Using inappropriate metrics**: MAPE fails with zeros or negatives +7. **Not handling missing data**: Affects model estimation +8. **Extrapolating exogenous variables**: Need future X values for SARIMAX +9. **Confusing static vs dynamic forecasts**: Dynamic more realistic for multi-step +10. **Not validating forecasts**: Always check out-of-sample performance