# Model Evaluation and Selection in scikit-learn

## Overview

Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:

1. **Estimator score methods**: Built-in evaluation (accuracy for classifiers, R² for regressors)
2. **Scoring parameter**: Used in cross-validation and hyperparameter tuning
3. **Metric functions**: Specialized evaluation in `sklearn.metrics`

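As a quick illustration, here are the three APIs applied to the same fitted model (the iris dataset and logistic regression are chosen purely for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# 1. Estimator score method: accuracy for classifiers
print(model.score(X, y))

# 2. Scoring parameter: consumed by cross-validation and tuning utilities
print(cross_val_score(model, X, y, cv=5, scoring='f1_macro').mean())

# 3. Metric function: computed directly from predictions
print(f1_score(y, model.predict(X), average='macro'))
```
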
## Cross-Validation

Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Cross-Validation Strategies

#### For i.i.d. Data

**KFold**: Standard k-fold cross-validation
- Splits data into k equal folds
- Each fold used once as test set
- `n_splits`: Number of folds (typically 5 or 10)

```python
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```

**RepeatedKFold**: Repeats KFold with different randomization
- More robust estimation
- Computationally expensive

**LeaveOneOut (LOO)**: Each sample is a test set
- Maximum training data usage
- Very computationally expensive
- High variance in estimates
- Use only for small datasets (<1000 samples)

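A brief usage sketch of both strategies (model, X, y as in the earlier examples):

```python
from sklearn.model_selection import RepeatedKFold, LeaveOneOut, cross_val_score

# RepeatedKFold: 5 folds repeated 3 times -> 15 scores
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

# LeaveOneOut: one score per sample; only practical for small datasets
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
```
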
**ShuffleSplit**: Random train/test splits
- Flexible train/test sizes
- Can control number of iterations
- Good for quick evaluation

```python
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```

#### For Imbalanced Classes

**StratifiedKFold**: Preserves class proportions in each fold
- Essential for imbalanced datasets
- Default for classification in cross_val_score()

```python
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

**StratifiedShuffleSplit**: Stratified random splits

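Usage mirrors ShuffleSplit above:

```python
from sklearn.model_selection import StratifiedShuffleSplit
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```
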
#### For Grouped Data

Use when samples are not independent (e.g., multiple measurements from same subject).

**GroupKFold**: Groups don't appear in both train and test

```python
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
```

**StratifiedGroupKFold**: Combines stratification with group separation

**LeaveOneGroupOut**: Each group becomes a test set

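Usage mirrors GroupKFold, with `groups` as the same per-sample group labels:

```python
from sklearn.model_selection import LeaveOneGroupOut
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
```
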
#### For Time Series

**TimeSeriesSplit**: Expanding window approach
- Successive training sets are supersets
- Respects temporal ordering
- No data leakage from future to past

```python
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
    # Train on indices 0 to t, test on t+1 to t+k
    pass
```

### Cross-Validation Functions

**cross_val_score**: Returns array of scores
```python
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
```

**cross_validate**: Returns multiple metrics and timing
```python
from sklearn.model_selection import cross_validate

results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'f1_weighted', 'roc_auc'],
    return_train_score=True,
    return_estimator=True  # Returns fitted estimators
)
print(results['test_accuracy'])
print(results['fit_time'])
```

**cross_val_predict**: Returns predictions for model blending/visualization
```python
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
# Use for confusion matrix, error analysis, etc.
```

## Hyperparameter Tuning

### GridSearchCV
Exhaustively searches all parameter combinations.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # Use all CPU cores
    verbose=2
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Use best model
best_model = grid_search.best_estimator_
```

**When to use**:
- Small parameter spaces
- When computational resources allow
- When exhaustive search is desired

### RandomizedSearchCV
Samples parameter combinations from distributions.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # uniform(loc, scale) spans [loc, loc+scale] = [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # Number of parameter settings sampled
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```

**When to use**:
- Large parameter spaces
- When budget is limited
- Often finds good parameters faster than GridSearchCV

**Advantage**: "Budget can be chosen independent of the number of parameters and possible values"

### Successive Halving

**HalvingGridSearchCV** and **HalvingRandomSearchCV**: Tournament-style selection

**How it works**:
1. Start with many candidates, minimal resources
2. Eliminate poor performers
3. Increase resources for remaining candidates
4. Repeat until best candidates found

**When to use**:
- Large parameter spaces
- Expensive model training
- When many parameter combinations are clearly inferior

```python
from sklearn.experimental import enable_halving_search_cv  # enables the experimental halving estimators
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(
    estimator,
    param_grid,
    factor=3,  # Only 1/factor of candidates survive each round; resources grow by factor
    cv=5
)
```

## Classification Metrics

### Accuracy-Based Metrics

**Accuracy**: Proportion of correct predictions
```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```

**When to use**: Balanced datasets only
**When NOT to use**: Imbalanced datasets (misleading)

**Balanced Accuracy**: Average recall per class
```python
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_true, y_pred)
```

**When to use**: Imbalanced datasets, ensures all classes matter equally

### Precision, Recall, F-Score

**Precision**: Of predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- Answers: "How reliable are positive predictions?"

**Recall** (Sensitivity): Of actual positives, how many are predicted positive
- Formula: TP / (TP + FN)
- Answers: "How complete is positive detection?"

**F1-Score**: Harmonic mean of precision and recall
- Formula: 2 * (precision * recall) / (precision + recall)
- Balanced measure when both precision and recall are important

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='weighted'
)

# Or individually
f1 = f1_score(y_true, y_pred, average='weighted')
```

**Averaging strategies for multiclass**:
- `binary`: Binary classification only
- `micro`: Calculate globally (total TP, FP, FN)
- `macro`: Calculate per class, unweighted mean (all classes equal)
- `weighted`: Calculate per class, weighted by support (class frequency)
- `samples`: For multilabel classification

**When to use**:
- `macro`: When all classes equally important (even rare ones)
- `weighted`: When class frequency matters
- `micro`: When overall performance across all samples matters

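The averaging strategies are easiest to tell apart on a small imbalanced example (labels invented for illustration):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # ~0.67: the rare class weighs as much as the majority
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.75: per-class scores weighted by class frequency
print(f1_score(y_true, y_pred, average='micro'))     # 0.75: TP/FP/FN pooled over all samples
```
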
### Confusion Matrix

Shows true positives, false positives, true negatives, false negatives.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
```

### ROC Curve and AUC

**ROC (Receiver Operating Characteristic)**: Plot of true positive rate vs false positive rate at different thresholds

**AUC (Area Under Curve)**: Measures overall ability to discriminate between classes
- 1.0 = perfect classifier
- 0.5 = random classifier
- <0.5 = worse than random

```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Requires probability predictions
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for positive class

auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```

**Multiclass ROC**: Use `multi_class='ovr'` (one-vs-rest) or `'ovo'` (one-vs-one)

```python
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
```

### Log Loss

Measures probability calibration quality.

```python
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_proba)
```

**When to use**: When probability quality matters, not just class predictions
**Lower is better**: Perfect predictions have log loss of 0

### Classification Report

Comprehensive summary of precision, recall, f1-score per class.

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```

## Regression Metrics

### Mean Squared Error (MSE)
Average squared difference between predictions and true values.

```python
from sklearn.metrics import mean_squared_error, root_mean_squared_error

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # Root MSE (sklearn >= 1.4; older versions: mean_squared_error(..., squared=False))
```

**Characteristics**:
- Penalizes large errors heavily (squared term)
- Same units as target² (use RMSE for same units as target)
- Lower is better

### Mean Absolute Error (MAE)
Average absolute difference between predictions and true values.

```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```

**Characteristics**:
- More robust to outliers than MSE
- Same units as target
- More interpretable
- Lower is better

**MSE vs MAE**: Use MAE when outliers shouldn't dominate the metric

### R² Score (Coefficient of Determination)
Proportion of variance explained by the model.

```python
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
```

**Interpretation**:
- 1.0 = perfect predictions
- 0.0 = model only as good as predicting the mean
- <0.0 = model worse than predicting the mean (possible!)
- Higher is better

**Note**: Can be negative for models that perform worse than predicting the mean.

### Mean Absolute Percentage Error (MAPE)
Percentage-based error metric.

```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```

**When to use**: When relative errors matter more than absolute errors
**Warning**: Undefined when true values are zero

### Median Absolute Error
Median of absolute errors (robust to outliers).

```python
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
```

### Max Error
Maximum residual error.

```python
from sklearn.metrics import max_error
max_err = max_error(y_true, y_pred)
```

**When to use**: When worst-case performance matters

## Custom Scoring Functions

Create custom scorers for GridSearchCV and cross_val_score:

```python
from sklearn.metrics import make_scorer, fbeta_score

# F2 score (weights recall higher than precision)
f2_scorer = make_scorer(fbeta_score, beta=2)

# Custom function: takes (y_true, y_pred) and returns a single float
def custom_metric(y_true, y_pred):
    # Your custom logic; plain accuracy shown as a placeholder
    return (y_true == y_pred).mean()

custom_scorer = make_scorer(custom_metric, greater_is_better=True)

# Use in cross-validation or grid search
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
```

## Scoring Parameter Options

Common scoring strings for `scoring` parameter:

**Classification**:
- `'accuracy'`, `'balanced_accuracy'`
- `'precision'`, `'recall'`, `'f1'` (add `_macro`, `_micro`, `_weighted` for multiclass)
- `'roc_auc'`, `'roc_auc_ovr'`, `'roc_auc_ovo'`
- `'neg_log_loss'` (log loss negated so that higher is better)
- `'jaccard'` (Jaccard similarity)

**Regression**:
- `'r2'`
- `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`
- `'neg_mean_absolute_error'`
- `'neg_mean_absolute_percentage_error'`
- `'neg_median_absolute_error'`

**Note**: Error metrics are negated (`neg_*`) so GridSearchCV can always maximize the score.

## Validation Strategies

### Train-Test Split
Simple single split.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # For classification with imbalanced classes
)
```

**When to use**: Large datasets, quick evaluation
**Parameters**:
- `test_size`: Proportion for test (typically 0.2-0.3)
- `stratify`: Preserves class proportions
- `random_state`: Reproducibility

### Train-Validation-Test Split
Three-way split for hyperparameter tuning.

```python
# First split: train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42
)

# Or use GridSearchCV with train+val, then evaluate on test
```

**When to use**: Model selection and final evaluation
**Strategy**:
1. Train: Model training
2. Validation: Hyperparameter tuning
3. Test: Final, unbiased evaluation (touch only once!)

### Learning Curves

Diagnose bias vs variance issues.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
```

**Interpretation**:
- Large gap between train and validation: **Overfitting** (high variance)
- Both scores low: **Underfitting** (high bias)
- Scores converging but low: Need better features or a more complex model
- Validation score still improving: More data would help

## Best Practices

### Metric Selection Guidelines

**Classification - Balanced classes**:
- Accuracy or F1-score

**Classification - Imbalanced classes**:
- Balanced accuracy
- F1-score (weighted or macro)
- ROC-AUC
- Precision-Recall curve

**Classification - Cost-sensitive**:
- Custom scorer with cost matrix (see the sketch after these lists)
- Adjust threshold on probabilities

**Regression - Typical use**:
- RMSE (sensitive to outliers)
- R² (proportion of variance explained)

**Regression - Outliers present**:
- MAE (robust to outliers)
- Median absolute error

**Regression - Percentage errors matter**:
- MAPE

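A minimal sketch of the cost-matrix idea above, assuming a binary problem where a false negative costs 5 and a false positive costs 1 (the costs and function name are invented for illustration):

```python
from sklearn.metrics import confusion_matrix, make_scorer

def negative_cost(y_true, y_pred, fn_cost=5, fp_cost=1):
    # Total misclassification cost, negated so that higher is better
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(fn * fn_cost + fp * fp_cost)

cost_scorer = make_scorer(negative_cost)

# Usable anywhere a scoring parameter is accepted, e.g.:
# GridSearchCV(model, param_grid, scoring=cost_scorer, cv=5)
```
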
### Cross-Validation Guidelines

**Number of folds**:
- 5-10 folds typical
- More folds = more computation, less variance in estimate
- LeaveOneOut only for small datasets

**Stratification**:
- Always use for classification with imbalanced classes
- Use StratifiedKFold by default for classification

**Grouping**:
- Always use when samples are not independent
- Time series: Always use TimeSeriesSplit

**Nested cross-validation** (see the sketch after this list):
- For unbiased performance estimate when doing hyperparameter tuning
- Outer loop: Performance estimation
- Inner loop: Hyperparameter selection

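A compact sketch of nested cross-validation: the grid search (inner loop) is itself evaluated by an outer cross-validation, so the reported score never uses data seen during tuning. The estimator and grid are placeholders:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}

inner_search = GridSearchCV(SVC(), param_grid, cv=5)      # inner loop: hyperparameter selection
outer_scores = cross_val_score(inner_search, X, y, cv=5)  # outer loop: performance estimation
print(outer_scores.mean())
```
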
### Avoiding Common Pitfalls

1. **Data leakage**: Fit preprocessors only on training data within each CV fold (use Pipeline!)
2. **Test set leakage**: Never use the test set for model selection
3. **Improper metric**: Use metrics appropriate for the problem (balanced_accuracy for imbalanced data)
4. **Multiple testing**: More models evaluated = higher chance of random good results
5. **Temporal leakage**: For time series, use TimeSeriesSplit, not random splits
6. **Target leakage**: Features shouldn't contain information not available at prediction time

# Pipelines and Composite Estimators in scikit-learn

## Overview
Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."

## Pipeline Basics

### Creating Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Method 2: Using make_pipeline (auto-generates names)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)
```

### Using Pipelines

```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Access steps
pipeline.named_steps['scaler']
pipeline.steps[0]   # Returns ('scaler', StandardScaler(...))
pipeline[0]         # Returns StandardScaler(...) object
pipeline['scaler']  # Returns StandardScaler(...) object

# Get final estimator
pipeline[-1]  # Returns LogisticRegression(...) object
```

### Pipeline Rules

**All steps except the last must be transformers** (have `fit()` and `transform()` methods).

**The final step** can be (see the sketch below):
- Predictor (classifier/regressor) with `fit()` and `predict()`
- Transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`

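A short sketch of the second case, a pipeline whose final step is itself a transformer; the whole pipeline then acts as a transformer (`fit_transform`) rather than a predictor:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Every step is a transformer, so the pipeline is a transformer too
transform_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10))
])
X_reduced = transform_pipe.fit_transform(X)
```
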
### Pipeline Benefits

1. **Convenience**: Single `fit()` and `predict()` call
2. **Prevents data leakage**: Ensures proper fit/transform on train/test
3. **Joint parameter selection**: Tune all steps together with GridSearchCV
4. **Reproducibility**: Encapsulates entire workflow

## Accessing and Setting Parameters

### Nested Parameters

Access step parameters using `stepname__parameter` syntax:

```python
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2 penalties
])

# Grid search over pipeline parameters
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

### Setting Parameters

```python
# Set parameters
pipeline.set_params(clf__C=10.0, scaler__with_std=False)

# Get parameters
params = pipeline.get_params()
```

## Caching Intermediate Results

Cache fitted transformers to avoid recomputation:

```python
from tempfile import mkdtemp
from shutil import rmtree

# Create cache directory
cachedir = mkdtemp()

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', LogisticRegression())
], memory=cachedir)

# During grid search, identical scaler/PCA fits are reused instead of recomputed
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Clean up cache
rmtree(cachedir)

# Or use joblib for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```

**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search over final estimator parameters only
- Multiple experiments with same preprocessing

## ColumnTransformer

Apply different transformations to different columns (essential for heterogeneous data).

### Basic Usage

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define which transformations for which columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'  # What to do with remaining columns
)

X_transformed = preprocessor.fit_transform(X)
```

### Column Selection Methods

```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])

# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])

# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])

# Method 4: Slice
('num', StandardScaler(), slice(0, 3))

# Method 5: make_column_selector (by dtype or pattern)
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), selector(dtype_include='number')),
    ('cat', OneHotEncoder(), selector(dtype_include='object'))
])

# Select by pattern
selector(pattern='.*_score$')  # All columns ending with '_score'
```

### Remainder Parameter

Controls what happens to columns not specified:

```python
# Drop remaining columns (default)
remainder='drop'

# Pass through remaining columns unchanged
remainder='passthrough'

# Apply transformer to remaining columns
remainder=StandardScaler()
```

### Full Pipeline with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Separate preprocessing for numeric and categorical
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Grid search over preprocessing and model parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__onehot__max_categories': [10, 20, None],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

## FeatureUnion

Combine multiple transformer outputs by concatenating features side-by-side.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA and feature selection
combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)

# In a pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression())
])
```

### FeatureUnion with Transformers on Different Data

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def get_numeric_data(X):
    return X[:, :3]  # First 3 columns

def get_text_data(X):
    return X[:, 3]  # 4th column (text)

combined = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(get_numeric_data)),
        ('scaler', StandardScaler())
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(get_text_data)),
        ('tfidf', TfidfVectorizer())
    ]))
])
```

**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.

## Common Pipeline Patterns

### Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', SVC(kernel='rbf'))
])
```

### Regression Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])
```

### Text Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', MultinomialNB())
])

# Works directly with text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```

### Image Processing Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```

### Dimensionality Reduction + Clustering

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('kmeans', KMeans(n_clusters=5))
])

labels = pipeline.fit_predict(X)
```

## Custom Transformers

### Using FunctionTransformer

```python
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
import numpy as np

# Log transformation
log_transformer = FunctionTransformer(np.log1p)

# Custom function: takes an array, returns the transformed array
def custom_transform(X):
    # Your transformation logic; clipping negatives shown as a placeholder
    return np.clip(X, 0, None)

custom_transformer = FunctionTransformer(custom_transform)

# In pipeline
pipeline = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
```

### Creating Custom Transformer Class

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X
        self.learned_param_ = X.mean()  # Example
        return self

    def transform(self, X):
        # Transform X using learned parameters
        return X * self.parameter - self.learned_param_

    # Optional: for pipelines that need inverse transform
    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

# Use in pipeline
pipeline = Pipeline([
    ('custom', CustomTransformer(parameter=2.0)),
    ('model', LinearRegression())
])
```

**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use trailing underscore for learned attributes (`learned_param_`)
- Constructor parameters should be stored as attributes

### Transformer for Pandas DataFrames

```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            if self.columns:
                return X[self.columns].values
            return X.values
        return X
```

## Visualization

### Display Pipeline in Jupyter

```python
from sklearn import set_config

# Enable HTML display
set_config(display='diagram')

# Now displaying the pipeline shows an interactive diagram
pipeline
```

### Print Pipeline Structure

```python
from sklearn.utils import estimator_html_repr

# Get HTML representation
html = estimator_html_repr(pipeline)

# Or just print
print(pipeline)
```

## Advanced Patterns

### Conditional Transformations

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def conditional_scale(X, scale=True):
    # Note: this refits the scaler on whatever data it receives -- fine for
    # illustration, but prefer a proper transformer step in real workflows
    if scale:
        return StandardScaler().fit_transform(X)
    return X

pipeline = Pipeline([
    ('conditional_scaler', FunctionTransformer(
        conditional_scale,
        kw_args={'scale': True}
    )),
    ('model', LogisticRegression())
])
```

### Multiple Preprocessing Paths

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
    # Numeric: impute + scale
    ('num_standard', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'income']),

    # Numeric: impute + log + scale
    ('num_skewed', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log', FunctionTransformer(np.log1p)),
        ('scaler', StandardScaler())
    ]), ['price', 'revenue']),

    # Categorical: impute + one-hot
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), ['category', 'region']),

    # Text: TF-IDF (note: a single column name, not a list, for text columns)
    ('text', TfidfVectorizer(), 'description')
])
```

### Feature Engineering Pipeline

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Expects a pandas DataFrame
        X = X.copy()
        # Add engineered features
        X['age_income_ratio'] = X['age'] / (X['income'] + 1)
        X['total_score'] = X['score1'] + X['score2'] + X['score3']
        return X

# 'preprocessor' is the ColumnTransformer defined above
pipeline = Pipeline([
    ('engineer', FeatureEngineer()),
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
```

## Best Practices

### Always Use Pipelines When

1. **Preprocessing is needed**: Scaling, encoding, imputation
2. **Cross-validation**: Ensures proper fit/transform split
3. **Hyperparameter tuning**: Joint optimization of preprocessing and model
4. **Production deployment**: Single object to serialize
5. **Multiple steps**: Any workflow with >1 step

### Pipeline Do's

- ✅ Fit pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug issues
- ✅ Use remainder='passthrough' when appropriate

### Pipeline Don'ts

- ❌ Fit preprocessing on the full dataset before splitting (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip using pipelines for "just one preprocessing step"

### Data Leakage Prevention

```python
# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X)  # Fit on all data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ CORRECT - No leakage with pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)     # Scaler fits only on train
y_pred = pipeline.predict(X_test)  # Train-fitted scaler transforms test data

# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on train folds, transforms the test fold
```

### Debugging Pipelines

```python
# Examine intermediate outputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
])

# Fit pipeline
pipeline.fit(X_train, y_train)

# Get output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)

# Get output after PCA
X_pca = pipeline[:-1].transform(X_train)  # All steps except last

# Or build a partial pipeline from the already-fitted steps
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```

### Saving and Loading Pipelines

```python
import joblib

# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Load pipeline
pipeline = joblib.load('model_pipeline.pkl')

# Use loaded pipeline
y_pred = pipeline.predict(X_new)
```

## Common Errors and Solutions

**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features not encoded
- **Solution**: Add OneHotEncoder or OrdinalEncoder to pipeline

**Error**: `All intermediate steps should be transformers`
- **Cause**: Non-transformer in non-final position
- **Solution**: Ensure only the last step is a predictor

**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Ensure consistent column handling, use `handle_unknown='ignore'` in OneHotEncoder

**Symptom**: Different results in cross-validation vs train-test split
- **Cause**: Data leakage (fitting preprocessing on all data)
- **Solution**: Always use Pipeline for preprocessing

**Symptom**: Pipeline too slow during grid search
- **Solution**: Use caching with the `memory` parameter

# Data Preprocessing in scikit-learn

## Overview
Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.

## Standardization and Scaling

### StandardScaler
Removes mean and scales to unit variance (z-score normalization).

**Formula**: `z = (x - μ) / σ`

**Use cases**:
- Most ML algorithms (especially SVM, neural networks, PCA)
- When features have different units or scales
- When assuming Gaussian-like distribution

**Important**: Fit only on training data, then transform both train and test sets.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters
```

### MinMaxScaler
Scales features to a specified range, typically [0, 1].

**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`

**Use cases**:
- When a bounded range is needed
- Neural networks (often prefer [0, 1] range)
- When the distribution is not Gaussian
- Image pixel values

**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)

**Warning**: Sensitive to outliers since it uses min/max.

### MaxAbsScaler
Scales to [-1, 1] by dividing by the maximum absolute value.

**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When the sign of values is meaningful

**Advantage**: Doesn't shift/center the data, preserves zero entries.

### RobustScaler
Uses median and interquartile range (IQR) instead of mean and standard deviation.

**Formula**: `X_scaled = (X - median) / IQR`

**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- When robust statistics are preferred

**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)

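All three scalers above share the same fit/transform API as StandardScaler; a minimal side-by-side sketch:

```python
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)
X_maxabs = MaxAbsScaler().fit_transform(X_train)
X_robust = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X_train)
```
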
## Normalization

### normalize() function and Normalizer
Scales individual samples (rows) to unit norm, not features (columns).

**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight

**Norms**:
- `l1`: Manhattan norm (sum of absolute values = 1)
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
- `max`: Maximum absolute value = 1

**Key difference from scalers**: Operates on rows (samples), not columns (features).

```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2')
X_normalized = normalizer.fit_transform(X)  # Stateless: fit learns nothing, but fit_transform keeps the API uniform
```

## Encoding Categorical Features

### OrdinalEncoder
Converts categories to integers (0 to n_categories - 1).

**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integers)

**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data

```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
```

### OneHotEncoder
Creates binary columns for each category.

**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When category relationships shouldn't be assumed

**Parameters**:
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit number of categories

**High cardinality handling**:
```python
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
# Groups categories appearing < 100 times into an 'infrequent' category
```

**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.

### TargetEncoder
Uses target statistics to encode categories.

**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When linear relationships with target are expected
- Often improves performance over one-hot encoding

**How it works**:
- Replaces each category with the mean of the target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories

**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy

**Warning**: Only for supervised learning. Requires the target variable.

```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```

### LabelEncoder
Encodes target labels into integers 0 to n_classes - 1.

**Use cases**: Encoding the target variable for classification (not features!)

**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.

### Binarizer
Converts numeric values to binary (0 or 1) based on a threshold.

**Use cases**: Creating binary features from continuous values

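A short sketch of both (outputs shown as comments):

```python
from sklearn.preprocessing import LabelEncoder, Binarizer

le = LabelEncoder()
y_encoded = le.fit_transform(['cat', 'dog', 'cat', 'bird'])  # [1, 2, 1, 0] (classes sorted alphabetically)

binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform([[0.2], [0.7], [0.5]])    # [[0.], [1.], [0.]] (only values > threshold become 1)
```
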
## Non-linear Transformations

### QuantileTransformer
Maps features to a uniform or normal distribution using rank transformation.

**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When a normal distribution is desired

**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))

**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.

### PowerTransformer
Applies a parametric monotonic transformation to make data more Gaussian.

**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Only positive values

**Use cases**:
- Skewed distributions
- When the Gaussian assumption is important
- Variance stabilization

**Advantage**: Less radical than QuantileTransformer, preserves more of the original relationships.

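A minimal sketch of both transformers:

```python
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_quantile = qt.fit_transform(X_train)

pt = PowerTransformer(method='yeo-johnson')  # handles negative values, unlike box-cox
X_power = pt.fit_transform(X_train)
```
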
## Discretization

### KBinsDiscretizer
Bins continuous features into discrete intervals.

**Strategies**:
- `uniform`: Equal-width bins
- `quantile`: Equal-frequency bins
- `kmeans`: K-means clustering to determine bins

**Encoding**:
- `ordinal`: Integer encoding (0 to n_bins - 1)
- `onehot`: One-hot encoding (sparse)
- `onehot-dense`: Dense one-hot encoding

**Use cases**:
- Making linear models handle non-linear relationships
- Reducing noise in features
- Making features more interpretable

```python
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = disc.fit_transform(X)
```

## Feature Generation

### PolynomialFeatures
Generates polynomial and interaction features.

**Parameters**:
- `degree`: Polynomial degree
- `interaction_only`: Only multiplicative interactions (no x²)
- `include_bias`: Include constant feature

**Use cases**:
- Adding non-linearity to linear models
- Feature engineering
- Polynomial regression

**Warning**: The number of output features grows combinatorially: for n input features and degree d there are (n+d)!/(d!·n!) features (including the bias term).

```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```

### SplineTransformer
Generates B-spline basis functions.

**Use cases**:
- Smooth non-linear transformations
- Alternative to PolynomialFeatures (less oscillation at the boundaries)
- Generalized additive models (GAMs)

**Parameters**:
- `n_knots`: Number of knots
- `degree`: Spline degree
- `knots`: Knot positions ('uniform', 'quantile', or array)

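A minimal usage sketch:

```python
from sklearn.preprocessing import SplineTransformer

spline = SplineTransformer(n_knots=5, degree=3, knots='quantile')
X_splines = spline.fit_transform(X_train)
```
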
## Missing Value Handling

### SimpleImputer
Imputes missing values with various strategies.

**Strategies**:
- `mean`: Mean of column (numeric only)
- `median`: Median of column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with a constant value

**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing (np.nan, None, specific value)

```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
```

### KNNImputer
Imputes using k-nearest neighbors.

**Use cases**: When relationships between features should inform imputation

**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'

### IterativeImputer
Models each feature with missing values as a function of the other features.

**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher quality imputation (but slower)

**Parameters**:
- `estimator`: Estimator for regression (default: BayesianRidge)
- `max_iter`: Maximum iterations

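A minimal sketch of both imputers; note that IterativeImputer is experimental and must be enabled explicitly:

```python
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # exposes IterativeImputer
from sklearn.impute import IterativeImputer

X_knn = KNNImputer(n_neighbors=5, weights='distance').fit_transform(X)
X_iter = IterativeImputer(max_iter=10, random_state=42).fit_transform(X)
```
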
## Function Transformers

### FunctionTransformer
Applies a custom function to the data.

**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.transform(X)
```

## Best Practices

### Feature Scaling Guidelines

**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/Logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms

**Don't need to scale**:
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes

### Pipeline Integration

Always use preprocessing within pipelines to prevent data leakage:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)     # Scaler fit only on train data
y_pred = pipeline.predict(X_test)  # Train-fitted scaler transforms test data
```

### Common Transformations by Data Type

**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)

**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation

**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder

**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with the min_frequency parameter

**Categorical - Ordinal**:
- OrdinalEncoder

**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization

### Data Leakage Prevention

1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
4. **Target encoding**: Use the cv parameter in TargetEncoder for cross-fitting

```python
# WRONG - data leakage
scaler = StandardScaler().fit(X_full)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# CORRECT - no leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```


## Preprocessing Checklist

Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline (see the sketch below)
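
A sketch tying the checklist together (the column names are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric_features = ['age', 'income']   # hypothetical columns
categorical_features = ['country']     # hypothetical column

preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),    # step 1
                      ('scale', RobustScaler())]),                     # steps 3-4
     numeric_features),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]),  # step 2
     categorical_features),
])

# Step 7: one Pipeline, so fit/transform stay leak-free (step 6)
model = Pipeline([('prep', preprocessor), ('clf', LogisticRegression(max_iter=1000))])
```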

# Scikit-learn Quick Reference

## Essential Imports

```python
# Core
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer

# Models - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor
)

# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Dimensionality Reduction
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.manifold import TSNE

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, r2_score, mean_absolute_error
)
```

## Basic Workflow Template

### Classification

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (optional for Random Forests; kept for a generic template)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
```

### Regression

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
# (sklearn < 1.4: mean_squared_error(y_test, y_pred, squared=False))
y_pred = model.predict(X_test_scaled)
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```

### With Pipeline (Recommended)

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Test accuracy: {score:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

## Common Preprocessing Patterns

### Numeric Data

```python
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
```

### Categorical Data

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
```

### Mixed Data with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
```

## Model Selection Cheat Sheet

### Quick Decision Tree

```
Is it supervised?
├─ Yes
│  ├─ Predicting categories? → Classification
│  │  ├─ Start with: LogisticRegression (baseline)
│  │  ├─ Then try: RandomForestClassifier
│  │  └─ Best performance: HistGradientBoostingClassifier
│  └─ Predicting numbers? → Regression
│     ├─ Start with: LinearRegression/Ridge (baseline)
│     ├─ Then try: RandomForestRegressor
│     └─ Best performance: HistGradientBoostingRegressor
└─ No
   ├─ Grouping similar items? → Clustering
   │  ├─ Know # clusters: KMeans
   │  └─ Unknown # clusters: DBSCAN or HDBSCAN
   ├─ Reducing dimensions?
   │  ├─ For preprocessing: PCA
   │  └─ For visualization: t-SNE or UMAP
   └─ Finding outliers? → IsolationForest or LocalOutlierFactor
```

### Algorithm Selection by Data Size

- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC

### When to Scale Features

**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm

**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes

## Hyperparameter Tuning

### GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
```

### RandomizedSearchCV (Faster)

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # Number of combinations to try
    cv=5,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```

### Pipeline with GridSearchCV

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# The step-name prefix ('svm__') targets parameters of that pipeline step
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto']
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
```

## Cross-Validation

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Multiple Metrics

```python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Custom CV Strategies

```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# For time series
cv = TimeSeriesSplit(n_splits=5)

scores = cross_val_score(model, X, y, cv=cv)
```

## Common Metrics

### Classification

```python
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score
)

# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# Comprehensive report
print(classification_report(y_true, y_pred))

# ROC AUC (requires probabilities)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_true, y_proba)
```

### Regression

```python
from sklearn.metrics import (
    mean_squared_error,
    root_mean_squared_error,  # sklearn >= 1.4
    mean_absolute_error,
    r2_score
)

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # older sklearn: mean_squared_error(..., squared=False)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
```

## Feature Engineering

### Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```

### Feature Selection

```python
from sklearn.feature_selection import (
    SelectKBest, f_classif,
    RFE,
    SelectFromModel
)

# Univariate selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive feature elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)

# Model-based selection
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'
)
X_selected = selector.fit_transform(X, y)
```

### Feature Importance

```python
import numpy as np
import matplotlib.pyplot as plt

# Tree-based models
model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_

# Visualize (feature_names should be an array of column names)
indices = np.argsort(importances)[::-1]
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.show()

# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
```

## Clustering

### K-Means

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for k-means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Evaluate
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")
```

### Elbow Method

```python
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
```

### DBSCAN

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# -1 indicates noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```

## Dimensionality Reduction

### PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify n_components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Or specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Components needed: {pca.n_components_}")
```

### t-SNE (Visualization Only)

```python
from sklearn.manifold import TSNE

# Reduce to 50 dimensions with PCA first (recommended)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.show()
```

## Saving and Loading Models

```python
import joblib

# Save model
joblib.dump(model, 'model.pkl')

# Save pipeline
joblib.dump(pipeline, 'pipeline.pkl')

# Load
model = joblib.load('model.pkl')
pipeline = joblib.load('pipeline.pkl')

# Use loaded model
y_pred = model.predict(X_new)
```

## Common Pitfalls and Solutions

### Data Leakage
❌ **Wrong**: Fit on all data before split
```python
scaler = StandardScaler().fit(X)
X_train, X_test = train_test_split(scaler.transform(X))
```

✅ **Correct**: Use pipeline or fit only on train
```python
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
pipeline.fit(X_train, y_train)
```

### Not Scaling
❌ **Wrong**: Using SVM without scaling
```python
svm = SVC()
svm.fit(X_train, y_train)
```

✅ **Correct**: Scale for SVM
```python
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
pipeline.fit(X_train, y_train)
```

### Wrong Metric for Imbalanced Data
❌ **Wrong**: Using accuracy for 99:1 imbalance
```python
accuracy = accuracy_score(y_true, y_pred)  # Can be misleading
```

✅ **Correct**: Use appropriate metrics
```python
f1 = f1_score(y_true, y_pred, average='weighted')
balanced_acc = balanced_accuracy_score(y_true, y_pred)
```

### Not Using Stratification
❌ **Wrong**: Random split for imbalanced data
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

✅ **Correct**: Stratify for imbalanced classes
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)
```

## Performance Tips

1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
2. **Use HistGradientBoosting** for large datasets (>10K samples)
3. **Use MiniBatchKMeans** for large clustering tasks
4. **Use IncrementalPCA** for data that doesn't fit in memory
5. **Use sparse matrices** for high-dimensional sparse data (text)
6. **Cache transformers** in pipelines during grid search (see the sketch below)
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
8. **Reduce dimensionality** with PCA before applying expensive algorithms
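
A sketch of tip 6 — `Pipeline`'s `memory` parameter caches fitted transformers so grid search doesn't refit the preprocessor for every parameter combination (`memory` accepts a directory path or a `joblib.Memory` object):

```python
from tempfile import mkdtemp
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cache_dir = mkdtemp()  # fitted transformers are cached here and reused
pipeline = Pipeline(
    [('scaler', StandardScaler()), ('svm', SVC())],
    memory=cache_dir,
)

param_grid = {'svm__C': [0.1, 1, 10]}  # only the SVM step varies between fits
grid = GridSearchCV(pipeline, param_grid, cv=5)
```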

# Supervised Learning in scikit-learn

## Overview
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.

## Linear Models

### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination

### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training

**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected

**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
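
A quick sketch of the regularization knob on synthetic data (the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives some coefficients to exactly zero
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))          # zeros show the implicit feature selection
```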

## Support Vector Machines (SVM)

- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection

**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists

**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)

**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets when a linear kernel suffices.

## Decision Trees

- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree

**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types

**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)

**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
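
A minimal pruning sketch: `cost_complexity_pruning_path` enumerates candidate `ccp_alpha` values, and larger alphas yield smaller trees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate pruning strengths computed from the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::10]:  # sample a few alphas
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test acc={tree.score(X_test, y_test):.3f}")
```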

## Ensemble Methods

### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant

**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships

**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)

### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets

**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data

**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving

**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
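
A minimal HistGradientBoosting sketch with early stopping (it bins features internally, so no scaling is needed, and it handles NaN values natively):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=500,           # boosting stages (n_estimators equivalent)
    early_stopping=True,    # holds out a validation fraction internally
    random_state=42,
)
clf.fit(X_train, y_train)
print(f"stages used: {clf.n_iter_}, test acc: {clf.score(X_test, y_test):.3f}")
```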

### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting

**Use cases**: Boosting weak learners, less prone to overfitting than other methods

**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier

### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator

**Use cases**: Reducing variance of unstable models, parallel ensemble creation

**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether to use replacement

### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions

**Use cases**: Combining diverse models, leveraging different model strengths
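
A minimal sketch combining diverse base models (soft voting averages predicted probabilities; stacking fits a meta-learner on cross-validated base predictions):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

base = [('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('nb', GaussianNB())]

voter = VotingClassifier(estimators=base, voting='soft')  # averages predict_proba
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression())

# Both expose the usual API:
# voter.fit(X_train, y_train); stacker.fit(X_train, y_train)
```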

## Neural Networks

- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor

**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning

**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving

**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
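
A minimal sketch that bakes the required scaling into a pipeline:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mlp = Pipeline([
    ('scaler', StandardScaler()),  # critical: unscaled inputs stall gradient descent
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50),
                          alpha=1e-4,
                          early_stopping=True,
                          random_state=42)),
])
# mlp.fit(X_train, y_train); mlp.score(X_test, y_test)
```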

## Nearest Neighbors

- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids

**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical

**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'

## Naive Bayes

- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets

**Use cases**: Text classification, fast baseline, when features are independent, small training sets

**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities

## Linear/Quadratic Discriminant Analysis

- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary

**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold

## Gaussian Processes

- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates

**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation

**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level

**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)

## Stochastic Gradient Descent

- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD

**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory

**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
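
A minimal out-of-core sketch: `partial_fit` updates the model one mini-batch at a time, so the full dataset never has to sit in memory (the batch generator here is a stand-in for reading from disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', alpha=1e-4, random_state=42)
classes = np.array([0, 1])  # must be declared up front for partial_fit

rng = np.random.default_rng(0)
for _ in range(100):  # stand-in for streaming batches
    X_batch = rng.normal(size=(256, 20))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```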

## Semi-Supervised Learning

- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)

**Use cases**: When labeled data is scarce but unlabeled data is abundant

## Feature Selection

- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection

**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting

## Probability Calibration

- **CalibratedClassifierCV**: Calibrate classifier probabilities

**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes

**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
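
A minimal calibration sketch wrapping an SVM, whose raw decision scores are not probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

base = LinearSVC()  # has no predict_proba of its own
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
# calibrated.fit(X_train, y_train)
# calibrated.predict_proba(X_test)  # now returns calibrated probabilities
```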

## Multi-Output Methods

- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant

**Use cases**: Predicting multiple related targets simultaneously
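
A minimal sketch; `Y` is 2-D with one column per target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
Y = np.column_stack([(X[:, 0] > 0), (X[:, 1] > 0)]).astype(int)  # two binary targets

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:3]))  # shape (3, 2): one prediction per target
```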

## Specialized Regression

- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals

## Algorithm Selection Guidelines

**Start with**:
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
2. **RandomForestClassifier/Regressor** for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed

**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC

**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel

**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes

# Unsupervised Learning in scikit-learn

## Overview
Unsupervised learning discovers patterns in data without labeled targets. Main tasks include clustering (grouping similar samples), dimensionality reduction (reducing feature count), and anomaly detection (finding outliers).

## Clustering Algorithms

### K-Means

Groups data into k clusters by minimizing within-cluster variance.

**Algorithm**:
1. Initialize k centroids (k-means++ initialization recommended)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # Smart initialization
    n_init=10,  # Number of times to run with different seeds
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```

**Use cases**:
- Customer segmentation
- Image compression
- Data preprocessing (clustering as features)

**Strengths**:
- Fast and scalable
- Simple to understand
- Works well with spherical clusters

**Limitations**:
- Assumes spherical clusters of similar size
- Sensitive to initialization (mitigated by k-means++)
- Must specify k beforehand
- Sensitive to outliers

**Choosing k**: Use elbow method, silhouette score, or domain knowledge

**Variants**:
- **MiniBatchKMeans**: Faster for large datasets, uses mini-batches
- **KMeans with n_init='auto'**: Adaptive number of initializations

### DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Identifies clusters as dense regions separated by sparse areas.

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(
    eps=0.5,  # Maximum distance between neighbors
    min_samples=5,  # Minimum points to form dense region
    metric='euclidean'
)
labels = dbscan.fit_predict(X)
# -1 indicates noise/outliers
```

**Use cases**:
- Arbitrary cluster shapes
- Outlier detection
- When cluster count is unknown
- Geographic/spatial data

**Strengths**:
- Discovers arbitrary-shaped clusters
- Automatically detects outliers
- Doesn't require specifying number of clusters
- Robust to outliers

**Limitations**:
- Struggles with varying densities
- Sensitive to eps and min_samples parameters
- Not deterministic (border points may vary)

**Parameter tuning**:
- `eps`: Plot k-distance graph, look for elbow (see the sketch below)
- `min_samples`: Rule of thumb: 2 * dimensions
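
A minimal k-distance sketch for picking `eps`: sort each point's distance to its k-th nearest neighbor and look for the elbow:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th neighbor, sorted ascending; the elbow suggests eps
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.ylabel(f'distance to neighbor {k}')
plt.show()
```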

### HDBSCAN

Hierarchical DBSCAN that handles variable cluster densities (built into scikit-learn ≥ 1.3).

```python
from sklearn.cluster import HDBSCAN

hdbscan = HDBSCAN(
    min_cluster_size=5,
    min_samples=None,  # Defaults to min_cluster_size
    metric='euclidean'
)
labels = hdbscan.fit_predict(X)
```

**Advantages over DBSCAN**:
- Handles variable density clusters
- More robust parameter selection
- Provides cluster membership probabilities
- Hierarchical structure

**Use cases**: When DBSCAN struggles with varying densities

### Hierarchical Clustering

Builds nested cluster hierarchies using agglomerative (bottom-up) approach.

```python
from sklearn.cluster import AgglomerativeClustering

agg_clust = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',  # 'ward', 'complete', 'average', 'single'
    metric='euclidean'
)
labels = agg_clust.fit_predict(X)

# Visualize with dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage as scipy_linkage
import matplotlib.pyplot as plt

linkage_matrix = scipy_linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
```

**Linkage methods**:
- `ward`: Minimizes variance (only with Euclidean) - **most common**
- `complete`: Maximum distance between clusters
- `average`: Average distance between clusters
- `single`: Minimum distance between clusters

**Use cases**:
- When hierarchical structure is meaningful
- Taxonomy/phylogenetic trees
- When visualization is important (dendrograms)

**Strengths**:
- No need to specify k initially (cut dendrogram at desired level)
- Produces hierarchy of clusters
- Deterministic

**Limitations**:
- Computationally expensive (O(n²) to O(n³))
- Not suitable for large datasets
- Cannot undo previous merges

### Spectral Clustering

Performs dimensionality reduction using affinity matrix before clustering.

```python
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(
    n_clusters=3,
    affinity='rbf',  # 'rbf', 'nearest_neighbors', 'precomputed'
    gamma=1.0,
    n_neighbors=10,
    random_state=42
)
labels = spectral.fit_predict(X)
```

**Use cases**:
- Non-convex clusters
- Image segmentation
- Graph clustering
- When similarity matrix is available

**Strengths**:
- Handles non-convex clusters
- Works with similarity matrices
- Often better than k-means for complex shapes

**Limitations**:
- Computationally expensive
- Requires specifying number of clusters
- Memory intensive

### Mean Shift

Discovers clusters through iterative centroid updates based on density.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)
cluster_centers = mean_shift.cluster_centers_
```

**Use cases**:
- When cluster count is unknown
- Computer vision applications
- Object tracking

**Strengths**:
- Automatically determines number of clusters
- Handles arbitrary shapes
- No assumptions about cluster shape

**Limitations**:
- Computationally expensive
- Very sensitive to bandwidth parameter
- Doesn't scale well

### Affinity Propagation

Uses message-passing between samples to identify exemplars.

```python
from sklearn.cluster import AffinityPropagation

affinity_prop = AffinityPropagation(
    damping=0.5,  # Damping factor (0.5-1.0)
    preference=None,  # Self-preference (controls number of clusters)
    random_state=42
)
labels = affinity_prop.fit_predict(X)
exemplars = affinity_prop.cluster_centers_indices_
```

**Use cases**:
- When number of clusters is unknown
- When exemplars (representative samples) are needed

**Strengths**:
- Automatically determines number of clusters
- Identifies exemplar samples
- No initialization required

**Limitations**:
- Very slow: O(n²t) where t is iterations
- Not suitable for large datasets
- Memory intensive

### Gaussian Mixture Models (GMM)

Probabilistic model assuming data comes from a mixture of Gaussian distributions.

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X)  # Soft clustering
```

**Covariance types**:
- `full`: Each component has its own covariance matrix
- `tied`: All components share same covariance
- `diag`: Diagonal covariance (independent features)
- `spherical`: Spherical covariance (isotropic)

**Use cases**:
- When soft clustering is needed (probabilities)
- When clusters have different shapes/sizes
- Generative modeling
- Density estimation

**Strengths**:
- Provides probabilities (soft clustering)
- Can handle elliptical clusters
- Generative model (can sample new data)
- Model selection with BIC/AIC

**Limitations**:
- Assumes Gaussian distributions
- Sensitive to initialization
- Can converge to local optima

**Model selection**:
```python
from sklearn.mixture import GaussianMixture
import numpy as np

n_components_range = range(2, 10)
bic_scores = []

for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bic_scores.append(gmm.bic(X))

# Lower BIC indicates a better fit-complexity trade-off
optimal_n = n_components_range[np.argmin(bic_scores)]
```

### BIRCH

Builds Clustering Feature Tree for memory-efficient processing of large datasets.

```python
from sklearn.cluster import Birch

birch = Birch(
    n_clusters=3,
    threshold=0.5,
    branching_factor=50
)
labels = birch.fit_predict(X)
```

**Use cases**:
- Very large datasets
- Streaming data
- Memory constraints

**Strengths**:
- Memory efficient
- Single pass over data
- Incremental learning

## Dimensionality Reduction

### Principal Component Analysis (PCA)

Finds orthogonal components that explain maximum variance.

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Specify number of components
pca = PCA(n_components=2, random_state=42)
X_transformed = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())

# Or specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_transformed = pca.fit_transform(X)
print(f"Components needed: {pca.n_components_}")

# Visualize explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```

**Use cases**:
- Visualization (reduce to 2-3 dimensions)
- Remove multicollinearity
- Noise reduction
- Speed up training
- Feature extraction

**Strengths**:
- Fast and efficient
- Reduces multicollinearity
- Works well for linear relationships
- Interpretable components

**Limitations**:
- Only linear transformations
- Sensitive to scaling (always standardize first!)
- Components may be hard to interpret

**Variants**:
- **IncrementalPCA**: For datasets that don't fit in memory
- **KernelPCA**: Non-linear dimensionality reduction
- **SparsePCA**: Sparse loadings for interpretability

### t-SNE

t-Distributed Stochastic Neighbor Embedding for visualization.

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,  # Balance local vs global structure (5-50)
    learning_rate='auto',
    n_iter=1000,  # renamed to max_iter in newer scikit-learn
    random_state=42
)
X_embedded = tsne.fit_transform(X)

# Visualize
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.show()
```

**Use cases**:
- Visualization only (do not use for preprocessing!)
- Exploring high-dimensional data
- Finding clusters visually

**Important notes**:
- **Only for visualization**, not for preprocessing
- Each run produces different results (use random_state for reproducibility)
- Slow for large datasets
- Cannot transform new data (no transform() method)

**Parameter tuning**:
- `perplexity`: 5-50, larger for larger datasets
- Lower perplexity = focus on local structure
- Higher perplexity = focus on global structure

### UMAP

Uniform Manifold Approximation and Projection (requires umap-learn package).

**Advantages over t-SNE**:
- Preserves global structure better
- Faster
- Can transform new data
- Can be used for preprocessing (not just visualization)
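
A minimal sketch, assuming `umap-learn` is installed (`pip install umap-learn`); unlike TSNE, the fitted object can transform unseen data:

```python
import umap  # third-party: umap-learn, not part of scikit-learn

reducer = umap.UMAP(n_components=2, n_neighbors=15, random_state=42)
X_embedded = reducer.fit_transform(X)

# Works on new data too, so it can sit in a preprocessing workflow
# X_new_embedded = reducer.transform(X_new)
```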

### Truncated SVD (LSA)

Similar to PCA but works with sparse matrices (e.g., TF-IDF).

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
```

**Use cases**:
- Text data (after TF-IDF)
- Sparse matrices
- Latent Semantic Analysis (LSA)

### Non-negative Matrix Factorization (NMF)

Factorizes data into non-negative components.

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)  # Document-topic matrix
H = nmf.components_  # Topic-word matrix
```

**Use cases**:
- Topic modeling
- Audio source separation
- Image processing
- When non-negativity is important (e.g., counts)

**Strengths**:
- Interpretable components (additive, non-negative)
- Sparse representations

### Independent Component Analysis (ICA)

Separates multivariate signal into independent components.

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=42)
X_independent = ica.fit_transform(X)
```

**Use cases**:
- Blind source separation
- Signal processing
- Feature extraction when independence is expected

### Factor Analysis

Models observed variables as linear combinations of latent factors plus noise.

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=5, random_state=42)
X_factors = fa.fit_transform(X)
```

**Use cases**:
- When noise is heteroscedastic
- Latent variable modeling
- Psychology/social science research

**Difference from PCA**: Models noise explicitly, assumes features have independent noise

## Anomaly Detection

### One-Class SVM

Learns boundary around normal data.

```python
from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(
    nu=0.1,  # Proportion of outliers expected
    kernel='rbf',
    gamma='auto'
)
oc_svm.fit(X_train)
predictions = oc_svm.predict(X_test)  # 1 for inliers, -1 for outliers
```

**Use cases**:
- Novelty detection
- When only normal data is available for training

### Isolation Forest

Isolates outliers using an ensemble of random partitioning trees: anomalies are isolated in fewer splits than normal points.

```python
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    contamination=0.1,  # Expected proportion of outliers
    random_state=42
)
predictions = iso_forest.fit_predict(X)  # 1 for inliers, -1 for outliers
scores = iso_forest.score_samples(X)  # Anomaly scores
```

**Use cases**:
- General anomaly detection
- Works well with high-dimensional data
- Fast and scalable

**Strengths**:
- Fast
- Effective in high dimensions
- Low memory requirements

### Local Outlier Factor (LOF)

Detects outliers based on local density deviation.

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
predictions = lof.fit_predict(X)  # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_  # Anomaly scores (negative)
```

**Use cases**:
- Finding local outliers
- When global methods fail

## Clustering Evaluation

### With Ground Truth Labels

When true labels are available (for validation):

**Adjusted Rand Index (ARI)**:
```python
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_true, y_pred)
# Range: [-1, 1], 1 = perfect, 0 = random
```

**Normalized Mutual Information (NMI)**:
```python
from sklearn.metrics import normalized_mutual_info_score
nmi = normalized_mutual_info_score(y_true, y_pred)
# Range: [0, 1], 1 = perfect
```

**V-Measure**:
```python
from sklearn.metrics import v_measure_score
v = v_measure_score(y_true, y_pred)
# Range: [0, 1], harmonic mean of homogeneity and completeness
```

### Without Ground Truth Labels

When true labels are unavailable (unsupervised evaluation):

**Silhouette Score**:
Measures how similar objects are to their own cluster vs other clusters.

```python
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt

score = silhouette_score(X, labels)
# Range: [-1, 1], higher is better
# >0.7: Strong structure
# 0.5-0.7: Reasonable structure
# 0.25-0.5: Weak structure
# <0.25: No substantial structure

# Per-sample scores for detailed analysis
sample_scores = silhouette_samples(X, labels)

# Visualize silhouette plot (stack clusters so bars don't overlap)
n_clusters = len(set(labels))
y_lower = 0
for i in range(n_clusters):
    cluster_scores = sample_scores[labels == i]
    cluster_scores.sort()
    plt.barh(range(y_lower, y_lower + len(cluster_scores)), cluster_scores)
    y_lower += len(cluster_scores)
plt.axvline(x=score, color='red', linestyle='--')
plt.show()
```

**Davies-Bouldin Index**:
```python
from sklearn.metrics import davies_bouldin_score
db = davies_bouldin_score(X, labels)
# Lower is better, 0 = perfect
```

**Calinski-Harabasz Index** (Variance Ratio Criterion):
```python
from sklearn.metrics import calinski_harabasz_score
ch = calinski_harabasz_score(X, labels)
# Higher is better
```

**Inertia** (K-Means specific):
```python
inertia = kmeans.inertia_
# Sum of squared distances to nearest cluster center
# Use for elbow method
```

### Elbow Method (K-Means)

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia starts decreasing more slowly
```

## Best Practices

### Clustering Algorithm Selection

**Use K-Means when**:
- Clusters are spherical and similar size
- Speed is important
- Data is not too high-dimensional

**Use DBSCAN when**:
- Arbitrary cluster shapes
- Number of clusters unknown
- Outlier detection needed

**Use Hierarchical when**:
- Hierarchy is meaningful
- Small to medium datasets
- Visualization is important

**Use GMM when**:
- Soft clustering needed
- Clusters have different shapes/sizes
- Probabilistic interpretation needed

**Use Spectral Clustering when**:
- Non-convex clusters
- A similarity matrix is available
- Moderate dataset size

### Preprocessing for Clustering

1. **Always scale features**: Use StandardScaler or MinMaxScaler
2. **Handle outliers**: Remove or use robust algorithms (DBSCAN, HDBSCAN)
3. **Reduce dimensionality if needed**: PCA for speed, careful with interpretation
4. **Check for categorical variables**: Encode appropriately or use specialized algorithms

### Dimensionality Reduction Guidelines

**For preprocessing/feature extraction**:
- PCA (linear relationships)
- TruncatedSVD (sparse data)
- NMF (non-negative data)

**For visualization only**:
- t-SNE (preserves local structure)
- UMAP (preserves both local and global structure)

**Always**:
- Standardize features before PCA
- Use appropriate n_components (elbow plot, explained variance)
- Don't use t-SNE for anything except visualization

### Common Pitfalls

1. **Not scaling data**: Most algorithms sensitive to scale
2. **Using t-SNE for preprocessing**: Only for visualization!
3. **Overfitting cluster count**: Too many clusters = overfitting noise
4. **Ignoring outliers**: Can severely affect centroid-based methods
5. **Wrong metric**: Euclidean assumes all features equally important
6. **Not validating results**: Always check with multiple metrics and domain knowledge
7. **PCA without standardization**: Components dominated by high-variance features