Model Evaluation and Selection in scikit-learn
Overview
Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:
- Estimator score methods: Built-in evaluation (accuracy for classifiers, R² for regressors)
- Scoring parameter: Used in cross-validation and hyperparameter tuning
- Metric functions: Specialized evaluation functions in sklearn.metrics
Cross-Validation
Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."
Basic Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
Cross-Validation Strategies
For i.i.d. Data
KFold: Standard k-fold cross-validation
- Splits data into k equal folds
- Each fold used once as test set
- n_splits: Number of folds (typically 5 or 10)
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
RepeatedKFold: Repeats KFold with different randomization
- More robust estimation
- Computationally expensive
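A minimal sketch, reusing model, X, and y from above; n_repeats controls how many times the whole k-fold procedure is re-run with a different shuffle:
from sklearn.model_selection import RepeatedKFold, cross_val_score
# 5 folds repeated 3 times with different shuffles -> 15 scores in total
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)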
LeaveOneOut (LOO): Each sample is a test set
- Maximum training data usage
- Very computationally expensive
- High variance in estimates
- Use only for small datasets (<1000 samples)
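A minimal sketch; LeaveOneOut takes no arguments and yields one fold per sample, so the number of fits equals the dataset size:
from sklearn.model_selection import LeaveOneOut, cross_val_score
# One model fit per sample; only practical for small datasets
scores = cross_val_score(model, X, y, cv=LeaveOneOut())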
ShuffleSplit: Random train/test splits
- Flexible train/test sizes
- Can control number of iterations
- Good for quick evaluation
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
For Imbalanced Classes
StratifiedKFold: Preserves class proportions in each fold
- Essential for imbalanced datasets
- Default for classification in cross_val_score()
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
StratifiedShuffleSplit: Stratified random splits
For Grouped Data
Use when samples are not independent (e.g., multiple measurements from same subject).
GroupKFold: Groups don't appear in both train and test
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
StratifiedGroupKFold: Combines stratification with group separation
LeaveOneGroupOut: Each group becomes a test set
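Both follow the same pattern as GroupKFold and accept the groups array; a short sketch:
from sklearn.model_selection import StratifiedGroupKFold, LeaveOneGroupOut, cross_val_score
# Stratified folds whose groups never span train and test
cv = StratifiedGroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
# One fold per unique group
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())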
For Time Series
TimeSeriesSplit: Expanding window approach
- Successive training sets are supersets
- Respects temporal ordering
- No data leakage from future to past
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
    # Train on indices 0 to t, test on t+1 to t+k
    pass
Cross-Validation Functions
cross_val_score: Returns array of scores
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
cross_validate: Returns multiple metrics and timing
results = cross_validate(
model, X, y, cv=5,
scoring=['accuracy', 'f1_weighted', 'roc_auc'],
return_train_score=True,
return_estimator=True # Returns fitted estimators
)
print(results['test_accuracy'])
print(results['fit_time'])
cross_val_predict: Returns predictions for model blending/visualization
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
# Use for confusion matrix, error analysis, etc.
Hyperparameter Tuning
GridSearchCV
Exhaustively searches all parameter combinations.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1, # Use all CPU cores
verbose=2
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
# Use best model
best_model = grid_search.best_estimator_
When to use:
- Small parameter spaces
- When computational resources allow
- When exhaustive search is desired
RandomizedSearchCV
Samples parameter combinations from distributions.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(100, 1000),
'max_depth': randint(5, 50),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=100, # Number of parameter settings sampled
cv=5,
scoring='f1_weighted',
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
When to use:
- Large parameter spaces
- When budget is limited
- Often finds good parameters faster than GridSearchCV
Advantage: "Budget can be chosen independent of the number of parameters and possible values"
Successive Halving
HalvingGridSearchCV and HalvingRandomSearchCV: Tournament-style selection
How it works:
- Start with many candidates, minimal resources
- Eliminate poor performers
- Increase resources for remaining candidates
- Repeat until best candidates found
When to use:
- Large parameter spaces
- Expensive model training
- When many parameter combinations are clearly inferior
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
halving_search = HalvingGridSearchCV(
estimator,
param_grid,
factor=3,  # Each round keeps roughly 1/factor of the candidates and gives them factor times more resources
cv=5
)
Classification Metrics
Accuracy-Based Metrics
Accuracy: Proportion of correct predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
When to use: Balanced datasets only
When NOT to use: Imbalanced datasets (misleading)
Balanced Accuracy: Average recall per class
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_true, y_pred)
When to use: Imbalanced datasets, ensures all classes matter equally
Precision, Recall, F-Score
Precision: Of predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- Answers: "How reliable are positive predictions?"
Recall (Sensitivity): Of actual positives, how many are predicted positive
- Formula: TP / (TP + FN)
- Answers: "How complete is positive detection?"
F1-Score: Harmonic mean of precision and recall
- Formula: 2 * (precision * recall) / (precision + recall)
- Balanced measure when both precision and recall are important
from sklearn.metrics import precision_recall_fscore_support, f1_score
precision, recall, f1, support = precision_recall_fscore_support(
y_true, y_pred, average='weighted'
)
# Or individually
f1 = f1_score(y_true, y_pred, average='weighted')
Averaging strategies for multiclass:
- binary: Binary classification only
- micro: Calculate globally (total TP, FP, FN)
- macro: Calculate per class, unweighted mean (all classes equal)
- weighted: Calculate per class, weighted by support (class frequency)
- samples: For multilabel classification
When to use:
- macro: When all classes are equally important (even rare ones)
- weighted: When class frequency matters
- micro: When overall performance across all samples matters
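The choice of averaging can change the reported number substantially on imbalanced data; a quick comparison on the same predictions:
from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
print(f1_score(y_true, y_pred, average='micro'))     # computed from global TP/FP/FN counts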
Confusion Matrix
Shows true positives, false positives, true negatives, false negatives.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
ROC Curve and AUC
ROC (Receiver Operating Characteristic): Plot of true positive rate vs false positive rate at different thresholds
AUC (Area Under Curve): Measures overall ability to discriminate between classes
- 1.0 = perfect classifier
- 0.5 = random classifier
- <0.5 = worse than random
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Requires probability predictions
y_proba = model.predict_proba(X_test)[:, 1] # Probabilities for positive class
auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
Multiclass ROC: Use multi_class='ovr' (one-vs-rest) or 'ovo' (one-vs-one)
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
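Note that in the multiclass case roc_auc_score expects the full probability matrix from predict_proba (one column per class), not just the positive-class column used above:
y_proba_all = model.predict_proba(X_test)  # shape (n_samples, n_classes)
auc = roc_auc_score(y_true, y_proba_all, multi_class='ovr', average='macro')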
Log Loss
Measures probability calibration quality.
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_proba)
When to use: When probability quality matters, not just class predictions
Lower is better: Perfect predictions have a log loss of 0
Classification Report
Comprehensive summary of precision, recall, f1-score per class.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
Regression Metrics
Mean Squared Error (MSE)
Average squared difference between predictions and true values.
from sklearn.metrics import mean_squared_error, root_mean_squared_error
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # Root MSE; older scikit-learn versions use mean_squared_error(..., squared=False)
Characteristics:
- Penalizes large errors heavily (squared term)
- Same units as target² (use RMSE for same units as target)
- Lower is better
Mean Absolute Error (MAE)
Average absolute difference between predictions and true values.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
Characteristics:
- More robust to outliers than MSE
- Same units as target
- More interpretable
- Lower is better
MSE vs MAE: Use MAE when outliers shouldn't dominate the metric
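A small worked example with made-up numbers, showing how a single large error dominates MSE but not MAE:
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.1, 5.2, 6.8, 19.0]  # last prediction is off by 10
print(mean_squared_error(y_true, y_pred))   # ~25.0, dominated by the outlier
print(mean_absolute_error(y_true, y_pred))  # ~2.6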
R² Score (Coefficient of Determination)
Proportion of variance explained by the model.
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
Interpretation:
- 1.0 = perfect predictions
- 0.0 = model as good as mean
- <0.0 = model worse than mean (possible!)
- Higher is better
Note: Can be negative for models that perform worse than predicting the mean.
Mean Absolute Percentage Error (MAPE)
Percentage-based error metric.
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
When to use: When relative errors matter more than absolute errors
Warning: Undefined when true values are zero
Median Absolute Error
Median of absolute errors (robust to outliers).
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
Max Error
Maximum residual error.
from sklearn.metrics import max_error
max_err = max_error(y_true, y_pred)
When to use: When worst-case performance matters
Custom Scoring Functions
Create custom scorers for GridSearchCV and cross_val_score:
from sklearn.metrics import make_scorer, fbeta_score
# F2 score (weights recall higher than precision)
f2_scorer = make_scorer(fbeta_score, beta=2)
import numpy as np

# Custom function: must return a single float score
def custom_metric(y_true, y_pred):
    # Example logic (plain accuracy); replace with your own computation
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))
custom_scorer = make_scorer(custom_metric, greater_is_better=True)
# Use in cross-validation or grid search
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
Scoring Parameter Options
Common scoring strings for scoring parameter:
Classification:
- 'accuracy', 'balanced_accuracy'
- 'precision', 'recall', 'f1' (add _macro, _micro, or _weighted suffixes for multiclass)
- 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo'
- 'neg_log_loss' (log loss negated so that higher is better)
- 'jaccard' (Jaccard similarity)
Regression:
- 'r2'
- 'neg_mean_squared_error', 'neg_root_mean_squared_error'
- 'neg_mean_absolute_error'
- 'neg_mean_absolute_percentage_error'
- 'neg_median_absolute_error'
Note: Many metrics are negated (neg_*) so GridSearchCV can maximize them.
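A short illustration of the sign convention, assuming X and y here are regression data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')
mse_scores = -scores  # scorer returns negative MSE; flip the sign to get the usual error
print(mse_scores.mean())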
Validation Strategies
Train-Test Split
Simple single split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # For classification with imbalanced classes
)
When to use: Large datasets, quick evaluation
Parameters:
- test_size: Proportion held out for testing (typically 0.2-0.3)
- stratify: Preserves class proportions
- random_state: Reproducibility
Train-Validation-Test Split
Three-way split for hyperparameter tuning.
# First split: train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval, test_size=0.2, random_state=42
)
# Or use GridSearchCV with train+val, then evaluate on test
When to use: Model selection and final evaluation
Strategy:
- Train: Model training
- Validation: Hyperparameter tuning
- Test: Final, unbiased evaluation (touch only once!)
Learning Curves
Diagnose bias vs variance issues.
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt
train_sizes, train_scores, val_scores = learning_curve(
model, X, y,
cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy',
n_jobs=-1
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
Interpretation:
- Large gap between train and validation: Overfitting (high variance)
- Both scores low: Underfitting (high bias)
- Scores converging but low: Need better features or more complex model
- Validation score still improving: More data would help
Best Practices
Metric Selection Guidelines
Classification - Balanced classes:
- Accuracy or F1-score
Classification - Imbalanced classes:
- Balanced accuracy
- F1-score (weighted or macro)
- ROC-AUC
- Precision-Recall curve
Classification - Cost-sensitive:
- Custom scorer with cost matrix
- Adjust threshold on probabilities (both options are sketched after these guidelines)
Regression - Typical use:
- RMSE (sensitive to outliers)
- R² (proportion of variance explained)
Regression - Outliers present:
- MAE (robust to outliers)
- Median absolute error
Regression - Percentage errors matter:
- MAPE
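For the cost-sensitive case above, a minimal sketch: a custom scorer built from an illustrative cost matrix, plus threshold adjustment on probabilities (the costs and the 0.3 threshold are hypothetical, not library defaults):
from sklearn.metrics import confusion_matrix, make_scorer

def total_cost(y_true, y_pred, fp_cost=1.0, fn_cost=10.0):
    # Hypothetical costs: a missed positive (FN) is 10x worse than a false alarm (FP)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * fp_cost + fn * fn_cost

cost_scorer = make_scorer(total_cost, greater_is_better=False)  # lower cost is better

# Threshold adjustment: lowering the threshold trades precision for recall
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_adjusted = (y_proba >= 0.3).astype(int)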
Cross-Validation Guidelines
Number of folds:
- 5-10 folds typical
- More folds mean more computation; training sets are larger (less pessimistic bias), but score estimates can have higher variance
- LeaveOneOut only for small datasets
Stratification:
- Always use for classification with imbalanced classes
- Use StratifiedKFold by default for classification
Grouping:
- Always use when samples are not independent
- Time series: Always use TimeSeriesSplit
Nested cross-validation:
- For unbiased performance estimate when doing hyperparameter tuning
- Outer loop: Performance estimation
- Inner loop: Hyperparameter selection
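A minimal nested cross-validation sketch: the inner GridSearchCV tunes hyperparameters, while the outer cross_val_score estimates how well the whole tuning procedure generalizes:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {'n_estimators': [100, 300], 'max_depth': [10, None]}
inner = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)  # inner loop: tuning
nested_scores = cross_val_score(inner, X, y, cv=5)  # outer loop: performance estimation
print(nested_scores.mean())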
Avoiding Common Pitfalls
- Data leakage: Fit preprocessors only on training data within each CV fold by using a Pipeline (see the sketch after this list)
- Test set leakage: Never use test set for model selection
- Improper metric: Use metrics appropriate for problem (balanced_accuracy for imbalanced data)
- Multiple testing: More models evaluated = higher chance of random good results
- Temporal leakage: For time series, use TimeSeriesSplit, not random splits
- Target leakage: Features shouldn't contain information not available at prediction time
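For the data leakage point above, a minimal sketch: wrapping preprocessing and the model in a Pipeline means the scaler is refit on each fold's training portion only, so no test-fold statistics leak into training:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),   # fit on the training fold only, at every split
    ('clf', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)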