# Model Evaluation and Selection in scikit-learn

## Overview

Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:

1. **Estimator score methods**: Built-in evaluation (accuracy for classifiers, R² for regressors)
2. **Scoring parameter**: Used in cross-validation and hyperparameter tuning
3. **Metric functions**: Specialized evaluation in `sklearn.metrics`

## Cross-Validation

Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Cross-Validation Strategies

#### For i.i.d. Data

**KFold**: Standard k-fold cross-validation
- Splits data into k equal folds
- Each fold is used once as the test set
- `n_splits`: Number of folds (typically 5 or 10)

```python
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```

**RepeatedKFold**: Repeats KFold with different randomization
- More robust estimation
- Computationally expensive

**LeaveOneOut (LOO)**: Each sample is its own test set
- Maximum training data usage
- Very computationally expensive
- High variance in estimates
- Use only for small datasets (<1000 samples)
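
LeaveOneOut follows the same `cv` interface as the other splitters; a minimal sketch on the bundled iris dataset (dataset and model choice are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
cv = LeaveOneOut()  # one split per sample: 150 fits for iris
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
# each score is 0 or 1, since a single held-out sample is either right or wrong
loo_accuracy = scores.mean()
```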

**ShuffleSplit**: Random train/test splits
- Flexible train/test sizes
- Can control number of iterations
- Good for quick evaluation

```python
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```

#### For Imbalanced Classes

**StratifiedKFold**: Preserves class proportions in each fold
- Essential for imbalanced datasets
- Default for classifiers in `cross_val_score`

```python
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

**StratifiedShuffleSplit**: Stratified random splits
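
A minimal sketch showing that each random split preserves the class ratio (the 80/20 toy labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 80 + [1] * 20)  # imbalanced labels: 80% class 0, 20% class 1
X = np.zeros((100, 1))             # placeholder features; the split depends only on y

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in cv.split(X, y)]
# every 20-sample test fold keeps the 80/20 ratio: 16 zeros and 4 ones
```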

#### For Grouped Data

Use when samples are not independent (e.g., multiple measurements from the same subject).

**GroupKFold**: Groups don't appear in both train and test

```python
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
```

**StratifiedGroupKFold**: Combines stratification with group separation

**LeaveOneGroupOut**: Each group becomes a test set
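
A minimal sketch of LeaveOneGroupOut with hypothetical subject IDs as groups:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. three subjects, two samples each

cv = LeaveOneGroupOut()  # yields one split per distinct group
for train_idx, test_idx in cv.split(X, y, groups):
    # the held-out fold contains exactly one group, absent from training
    assert set(groups[test_idx]).isdisjoint(groups[train_idx])
```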

#### For Time Series

**TimeSeriesSplit**: Expanding window approach
- Successive training sets are supersets of earlier ones
- Respects temporal ordering
- No data leakage from future to past

```python
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
    # Train on indices 0 to t, test on t+1 to t+k
    pass
```

### Cross-Validation Functions

**cross_val_score**: Returns an array of scores
```python
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
```

**cross_validate**: Returns multiple metrics and timing information
```python
from sklearn.model_selection import cross_validate

results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'f1_weighted', 'roc_auc'],
    return_train_score=True,
    return_estimator=True  # Also return the fitted estimators
)
print(results['test_accuracy'])
print(results['fit_time'])
```

**cross_val_predict**: Returns predictions for model blending/visualization
```python
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
# Use for confusion matrix, error analysis, etc.
```

## Hyperparameter Tuning

### GridSearchCV
Exhaustively searches all parameter combinations.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # Use all CPU cores
    verbose=2
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Use best model
best_model = grid_search.best_estimator_
```

**When to use**:
- Small parameter spaces
- When computational resources allow
- When an exhaustive search is desired

### RandomizedSearchCV
Samples parameter combinations from distributions.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # Number of parameter settings sampled
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```

**When to use**:
- Large parameter spaces
- When the computational budget is limited
- Often finds good parameters faster than GridSearchCV

**Advantage**: "Budget can be chosen independent of the number of parameters and possible values"

### Successive Halving

**HalvingGridSearchCV** and **HalvingRandomSearchCV**: Tournament-style selection

**How it works**:
1. Start with many candidates and minimal resources
2. Eliminate poor performers
3. Increase resources for the remaining candidates
4. Repeat until the best candidates are found

**When to use**:
- Large parameter spaces
- Expensive model training
- When many parameter combinations are clearly inferior

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: enables the experimental API
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(
    estimator,
    param_grid,
    factor=3,  # Only 1/factor of the candidates survive each round
    cv=5
)
```

## Classification Metrics

### Accuracy-Based Metrics

**Accuracy**: Proportion of correct predictions
```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```

**When to use**: Balanced datasets only
**When NOT to use**: Imbalanced datasets (misleading)

**Balanced Accuracy**: Unweighted average of recall across classes
```python
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_true, y_pred)
```

**When to use**: Imbalanced datasets; ensures all classes matter equally

### Precision, Recall, F-Score

**Precision**: Of predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- Answers: "How reliable are positive predictions?"

**Recall** (Sensitivity): Of actual positives, how many are predicted positive
- Formula: TP / (TP + FN)
- Answers: "How complete is positive detection?"

**F1-Score**: Harmonic mean of precision and recall
- Formula: 2 * (precision * recall) / (precision + recall)
- A balanced measure when both precision and recall matter

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='weighted'
)

# Or individually
f1 = f1_score(y_true, y_pred, average='weighted')
```

**Averaging strategies for multiclass**:
- `binary`: Binary classification only
- `micro`: Calculate globally (total TP, FP, FN)
- `macro`: Calculate per class, unweighted mean (all classes equal)
- `weighted`: Calculate per class, weighted by support (class frequency)
- `samples`: For multilabel classification

**When to use**:
- `macro`: When all classes are equally important (even rare ones)
- `weighted`: When class frequency matters
- `micro`: When overall performance across all samples matters
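
The averaging strategies can give noticeably different numbers for the same predictions. A small worked example (the toy labels are chosen purely for illustration):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Per class: class 0 has P=1.0, R=2/3 (F1=0.8); class 1 has P=0.5, R=1.0 (F1=2/3)
macro = f1_score(y_true, y_pred, average='macro')        # (0.8 + 2/3) / 2      ≈ 0.733
weighted = f1_score(y_true, y_pred, average='weighted')  # 0.75*0.8 + 0.25*2/3  ≈ 0.767
micro = f1_score(y_true, y_pred, average='micro')        # TP=3, FP=1, FN=1  -> 0.75
```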

### Confusion Matrix

Shows true positives, false positives, true negatives, and false negatives.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
```

### ROC Curve and AUC

**ROC (Receiver Operating Characteristic)**: Plot of the true positive rate vs the false positive rate at different thresholds

**AUC (Area Under the Curve)**: Measures the overall ability to discriminate between classes
- 1.0 = perfect classifier
- 0.5 = random classifier
- <0.5 = worse than random

```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Requires probability predictions
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```

**Multiclass ROC**: Use `multi_class='ovr'` (one-vs-rest) or `'ovo'` (one-vs-one)

```python
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
```

### Log Loss

Measures probability calibration quality.

```python
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_proba)
```

**When to use**: When probability quality matters, not just class predictions
**Lower is better**: Perfect predictions have a log loss of 0

### Classification Report

Comprehensive per-class summary of precision, recall, and F1-score.

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```

## Regression Metrics

### Mean Squared Error (MSE)
Average squared difference between predictions and true values.

```python
from sklearn.metrics import mean_squared_error, root_mean_squared_error

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # scikit-learn >= 1.4
# On older versions: mean_squared_error(y_true, y_pred, squared=False)
```

**Characteristics**:
- Penalizes large errors heavily (squared term)
- Units are the square of the target's units (use RMSE for the target's units)
- Lower is better

### Mean Absolute Error (MAE)
Average absolute difference between predictions and true values.

```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```

**Characteristics**:
- More robust to outliers than MSE
- Same units as the target
- More interpretable
- Lower is better

**MSE vs MAE**: Use MAE when outliers shouldn't dominate the metric

### R² Score (Coefficient of Determination)
Proportion of variance explained by the model.

```python
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
```

**Interpretation**:
- 1.0 = perfect predictions
- 0.0 = model no better than predicting the mean
- <0.0 = model worse than predicting the mean (possible!)
- Higher is better

**Note**: Can be negative for models that perform worse than predicting the mean.

### Mean Absolute Percentage Error (MAPE)
Percentage-based error metric.

```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```

**When to use**: When relative errors matter more than absolute errors
**Warning**: Undefined when true values are zero

### Median Absolute Error
Median of the absolute errors (robust to outliers).

```python
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
```

### Max Error
Maximum residual error.

```python
from sklearn.metrics import max_error
max_err = max_error(y_true, y_pred)
```

**When to use**: When worst-case performance matters

## Custom Scoring Functions

Create custom scorers for GridSearchCV and cross_val_score:

```python
from sklearn.metrics import make_scorer, fbeta_score

# F2 score (weights recall higher than precision)
f2_scorer = make_scorer(fbeta_score, beta=2)

# Custom function: takes y_true and y_pred, returns a single float
def custom_metric(y_true, y_pred):
    # Your custom logic
    return score

custom_scorer = make_scorer(custom_metric, greater_is_better=True)

# Use in cross-validation or grid search
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
```

## Scoring Parameter Options

Common scoring strings for the `scoring` parameter:

**Classification**:
- `'accuracy'`, `'balanced_accuracy'`
- `'precision'`, `'recall'`, `'f1'` (add `_macro`, `_micro`, `_weighted` for multiclass)
- `'roc_auc'`, `'roc_auc_ovr'`, `'roc_auc_ovo'`
- `'neg_log_loss'` (log loss is negated because lower is better)
- `'jaccard'` (Jaccard similarity)

**Regression**:
- `'r2'`
- `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`
- `'neg_mean_absolute_error'`
- `'neg_mean_absolute_percentage_error'`
- `'neg_median_absolute_error'`

**Note**: Error metrics are negated (`neg_*`) so that grid search and cross-validation can always maximize the score.
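
For example, a sketch of recovering a plain MAE from its negated scorer (the synthetic regression data is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_absolute_error')
mae = -scores.mean()  # negate to get the error back on its natural scale
```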

## Validation Strategies

### Train-Test Split
Simple single split.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # For classification with imbalanced classes
)
```

**When to use**: Large datasets, quick evaluation
**Parameters**:
- `test_size`: Proportion held out for testing (typically 0.2-0.3)
- `stratify`: Preserves class proportions
- `random_state`: Reproducibility

### Train-Validation-Test Split
Three-way split for hyperparameter tuning.

```python
# First split: train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42
)

# Or use GridSearchCV on train+val, then evaluate on test
```

**When to use**: Model selection and final evaluation
**Strategy**:
1. Train: Model training
2. Validation: Hyperparameter tuning
3. Test: Final, unbiased evaluation (touch only once!)

### Learning Curves

Diagnose bias vs variance issues.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
```

**Interpretation**:
- Large gap between train and validation: **Overfitting** (high variance)
- Both scores low: **Underfitting** (high bias)
- Scores converging but low: Need better features or a more complex model
- Validation score still improving: More data would help

## Best Practices

### Metric Selection Guidelines

**Classification - Balanced classes**:
- Accuracy or F1-score

**Classification - Imbalanced classes**:
- Balanced accuracy
- F1-score (weighted or macro)
- ROC-AUC
- Precision-Recall curve
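
A sketch of computing a precision-recall curve and its area on an imbalanced synthetic problem (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_proba = clf.predict_proba(X_te)[:, 1]  # scores for the positive (rare) class

precision, recall, thresholds = precision_recall_curve(y_te, y_proba)
pr_auc = auc(recall, precision)  # area under the precision-recall curve
```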

**Classification - Cost-sensitive**:
- Custom scorer with a cost matrix
- Adjust the decision threshold on predicted probabilities
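
A sketch of threshold adjustment: lowering the threshold below the 0.5 default trades precision for recall, which helps when false negatives are the costly error (the data and threshold value are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
y_proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

y_pred_default = (y_proba >= 0.5).astype(int)
y_pred_lowered = (y_proba >= 0.2).astype(int)  # more samples flagged positive
# recall can only stay the same or rise as the threshold drops
```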

**Regression - Typical use**:
- RMSE (sensitive to outliers)
- R² (proportion of variance explained)

**Regression - Outliers present**:
- MAE (robust to outliers)
- Median absolute error

**Regression - Percentage errors matter**:
- MAPE

### Cross-Validation Guidelines

**Number of folds**:
- 5-10 folds is typical
- More folds = more computation, less variance in the estimate
- LeaveOneOut only for small datasets

**Stratification**:
- Always use for classification with imbalanced classes
- Use StratifiedKFold by default for classification

**Grouping**:
- Always use when samples are not independent
- Time series: Always use TimeSeriesSplit

**Nested cross-validation**:
- For an unbiased performance estimate when doing hyperparameter tuning
- Outer loop: Performance estimation
- Inner loop: Hyperparameter selection
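
The two loops can be sketched by wrapping a search object in `cross_val_score` (the estimator and grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: select C on each outer-training portion
inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=3)
# Outer loop: estimate the performance of the whole tuning procedure
nested_scores = cross_val_score(inner, X, y, cv=5)
```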

### Avoiding Common Pitfalls

1. **Data leakage**: Fit preprocessors only on training data within each CV fold (use a Pipeline!)
2. **Test set leakage**: Never use the test set for model selection
3. **Improper metric**: Use metrics appropriate for the problem (e.g., balanced_accuracy for imbalanced data)
4. **Multiple testing**: The more models evaluated, the higher the chance of spuriously good results
5. **Temporal leakage**: For time series, use TimeSeriesSplit, not random splits
6. **Target leakage**: Features shouldn't contain information not available at prediction time
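
For pitfall 1, a minimal sketch of the Pipeline approach (the synthetic data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)

# The scaler is re-fit inside each training fold, so test-fold
# statistics never leak into preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```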