---
name: scikit-learn
description: Comprehensive guide for scikit-learn, Python's machine learning library. This skill should be used when building classification or regression models, performing clustering analysis, reducing dimensionality, preprocessing data (scaling, encoding, imputation), evaluating models with cross-validation and metrics, tuning hyperparameters, creating ML pipelines, detecting anomalies, or implementing any supervised or unsupervised learning tasks. Provides algorithm selection guidance, best practices for preventing data leakage, handling imbalanced data, and working with mixed data types.
---

# Scikit-learn: Machine Learning in Python
|
||||
|
||||
## Overview
|
||||
|
||||
This skill provides comprehensive guidance for using scikit-learn, Python's premier machine learning library. Scikit-learn offers simple, efficient tools for predictive data analysis, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. This skill should be used when implementing machine learning workflows, building predictive models, analyzing datasets using supervised or unsupervised learning, preprocessing data for ML tasks, evaluating model performance, or optimizing hyperparameters.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Activate this skill when:
|
||||
- Building classification models (spam detection, image recognition, medical diagnosis)
|
||||
- Creating regression models (price prediction, forecasting, trend analysis)
|
||||
- Performing clustering analysis (customer segmentation, pattern discovery)
|
||||
- Reducing dimensionality (PCA, t-SNE for visualization)
|
||||
- Preprocessing data (scaling, encoding, imputation)
|
||||
- Evaluating model performance (cross-validation, metrics)
|
||||
- Tuning hyperparameters (grid search, random search)
|
||||
- Creating machine learning pipelines
|
||||
- Detecting anomalies or outliers
|
||||
- Implementing ensemble methods
|
||||
|
||||
## Core Machine Learning Workflow
|
||||
|
||||
### Standard ML Pipeline
|
||||
|
||||
Follow this general workflow for supervised learning tasks:
|
||||
|
||||
1. **Data Preparation**
|
||||
- Load and explore data
|
||||
- Split into train/test sets
|
||||
- Handle missing values
|
||||
- Encode categorical features
|
||||
- Scale/normalize features
|
||||
|
||||
2. **Model Selection**
|
||||
- Start with baseline model
|
||||
- Try more complex models
|
||||
- Use domain knowledge to guide selection
|
||||
|
||||
3. **Model Training**
|
||||
- Fit model on training data
|
||||
- Use pipelines to prevent data leakage
|
||||
- Apply cross-validation
|
||||
|
||||
4. **Model Evaluation**
|
||||
- Evaluate on test set
|
||||
- Use appropriate metrics
|
||||
- Analyze errors
|
||||
|
||||
5. **Model Optimization**
|
||||
- Tune hyperparameters
|
||||
- Feature engineering
|
||||
- Ensemble methods
|
||||
|
||||
6. **Deployment**
|
||||
- Save model using joblib
|
||||
- Create prediction pipeline
|
||||
- Monitor performance
|
||||
|
||||
### Classification Quick Start
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.metrics import classification_report
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
# Create pipeline (prevents data leakage)
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Split data (use stratify for imbalanced classes)
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42, stratify=y
|
||||
)
|
||||
|
||||
# Train
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Evaluate
|
||||
y_pred = pipeline.predict(X_test)
|
||||
print(classification_report(y_test, y_pred))
|
||||
|
||||
# Cross-validation for robust evaluation
|
||||
from sklearn.model_selection import cross_val_score
|
||||
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
|
||||
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Regression Quick Start
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error requires scikit-learn >= 1.4
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# Train
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Evaluate
|
||||
y_pred = pipeline.predict(X_test)
|
||||
rmse = root_mean_squared_error(y_test, y_pred)  # replaces mean_squared_error(..., squared=False), removed in 1.6
|
||||
r2 = r2_score(y_test, y_pred)
|
||||
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")
|
||||
```
|
||||
|
||||
## Algorithm Selection Guide
|
||||
|
||||
### Classification Algorithms
|
||||
|
||||
**Start with baseline**: LogisticRegression
|
||||
- Fast, interpretable, works well for linearly separable data
|
||||
- Good for high-dimensional data (text classification)
|
||||
|
||||
**General-purpose**: RandomForestClassifier
|
||||
- Handles non-linear relationships
|
||||
- Robust to outliers
|
||||
- Provides feature importance
|
||||
- Good default choice
|
||||
|
||||
**Best performance**: HistGradientBoostingClassifier
|
||||
- State-of-the-art for tabular data
|
||||
- Fast on large datasets (>10K samples)
|
||||
- Often wins Kaggle competitions
|
||||
|
||||
**Special cases**:
|
||||
- **Small datasets (<1K)**: SVC with RBF kernel
|
||||
- **Very large datasets (>100K)**: SGDClassifier or LinearSVC
|
||||
- **Interpretability critical**: LogisticRegression or DecisionTreeClassifier
|
||||
- **Probabilistic predictions**: GaussianNB or calibrated models
|
||||
- **Text classification**: LogisticRegression with TfidfVectorizer
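
As a quick illustration of the selection guidance above, here is a minimal sketch comparing the linear baseline against gradient boosting under cross-validation; the synthetic dataset and resulting scores are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real tabular dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    'baseline (logistic regression)': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'gradient boosting': HistGradientBoostingClassifier(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```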
|
||||
|
||||
### Regression Algorithms
|
||||
|
||||
**Start with baseline**: LinearRegression or Ridge
|
||||
- Fast, interpretable
|
||||
- Works well when relationships are linear
|
||||
|
||||
**General-purpose**: RandomForestRegressor
|
||||
- Handles non-linear relationships
|
||||
- Robust to outliers
|
||||
- Good default choice
|
||||
|
||||
**Best performance**: HistGradientBoostingRegressor
|
||||
- State-of-the-art for tabular data
|
||||
- Fast on large datasets
|
||||
|
||||
**Special cases**:
|
||||
- **Regularization needed**: Ridge (L2) or Lasso (L1 + feature selection)
|
||||
- **Very large datasets**: SGDRegressor
|
||||
- **Outliers present**: HuberRegressor or RANSAC
|
||||
|
||||
### Clustering Algorithms
|
||||
|
||||
**Known number of clusters**: KMeans
|
||||
- Fast and scalable
|
||||
- Assumes spherical clusters
|
||||
|
||||
**Unknown number of clusters**: DBSCAN or HDBSCAN
|
||||
- Handles arbitrary shapes
|
||||
- Automatic outlier detection
|
||||
|
||||
**Hierarchical relationships**: AgglomerativeClustering
|
||||
- Creates hierarchy of clusters
|
||||
- Good for visualization (dendrograms)
|
||||
|
||||
**Soft clustering (probabilities)**: GaussianMixture
|
||||
- Provides cluster probabilities
|
||||
- Handles elliptical clusters
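
A minimal sketch of the two non-KMeans options above on synthetic blobs; the `eps` value is illustrative and normally needs tuning per dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN: no need to specify the number of clusters; label -1 marks outliers
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print("DBSCAN clusters found:", len(set(db_labels)) - (1 if -1 in db_labels else 0))

# Gaussian mixture: soft assignments via predict_proba
gmm = GaussianMixture(n_components=4, random_state=42).fit(X_scaled)
probabilities = gmm.predict_proba(X_scaled)  # shape (n_samples, 4)
print("Cluster probabilities for first sample:", probabilities[0].round(2))
```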
|
||||
|
||||
### Dimensionality Reduction
|
||||
|
||||
**Preprocessing/feature extraction**: PCA
|
||||
- Fast and efficient
|
||||
- Linear transformation
|
||||
- ALWAYS standardize first
|
||||
|
||||
**Visualization only**: t-SNE or UMAP
|
||||
- Preserves local structure
|
||||
- Non-linear
|
||||
- DO NOT use for preprocessing
|
||||
|
||||
**Sparse data (text)**: TruncatedSVD
|
||||
- Works with sparse matrices
|
||||
- Latent Semantic Analysis
|
||||
|
||||
**Non-negative data**: NMF
|
||||
- Interpretable components
|
||||
- Topic modeling
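
A minimal LSA-style sketch combining TF-IDF with TruncatedSVD; the tiny corpus is only a placeholder:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

corpus = [
    "machine learning with scikit-learn",
    "clustering and dimensionality reduction",
    "text classification using tf-idf features",
    "principal component analysis of tabular data",
]

# TF-IDF produces a sparse matrix; TruncatedSVD works on it directly (unlike PCA)
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=42))
X_lsa = lsa.fit_transform(corpus)
print(X_lsa.shape)  # (4, 2)
```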
|
||||
|
||||
## Working with Different Data Types
|
||||
|
||||
### Numeric Features
|
||||
|
||||
**Continuous features**:
|
||||
1. Check distribution
|
||||
2. Handle outliers (remove, clip, or use RobustScaler)
|
||||
3. Scale using StandardScaler (most algorithms) or MinMaxScaler (neural networks)
|
||||
|
||||
**Count data**:
|
||||
1. Consider log transformation or sqrt
|
||||
2. Scale after transformation
|
||||
|
||||
**Skewed data**:
|
||||
1. Use PowerTransformer (Yeo-Johnson or Box-Cox)
|
||||
2. Or QuantileTransformer for stronger normalization
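
A minimal sketch of the transformers mentioned above applied to a skewed column with one extreme value; the numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Right-skewed feature with one extreme value
X = np.array([[1.0], [2.0], [2.5], [3.0], [4.0], [250.0]])

# Yeo-Johnson handles zero/negative values; Box-Cox requires strictly positive data
pt = PowerTransformer(method='yeo-johnson')
X_power = pt.fit_transform(X)

# RobustScaler centers on the median and scales by the IQR, limiting outlier influence
rs = RobustScaler()
X_robust = rs.fit_transform(X)

print(X_power.ravel().round(2))
print(X_robust.ravel().round(2))
```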
|
||||
|
||||
### Categorical Features
|
||||
|
||||
**Low cardinality (<10 categories)**:
|
||||
```python
|
||||
from sklearn.preprocessing import OneHotEncoder
|
||||
encoder = OneHotEncoder(drop='first', sparse_output=True)
|
||||
```
|
||||
|
||||
**High cardinality (>10 categories)**:
|
||||
```python
|
||||
from sklearn.preprocessing import TargetEncoder
|
||||
encoder = TargetEncoder()
|
||||
# Uses target statistics, prevents leakage with cross-fitting
|
||||
```
|
||||
|
||||
**Ordinal relationships**:
|
||||
```python
|
||||
from sklearn.preprocessing import OrdinalEncoder
|
||||
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
|
||||
```
|
||||
|
||||
### Text Data
|
||||
|
||||
```python
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
from sklearn.naive_bayes import MultinomialNB
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
text_pipeline = Pipeline([
|
||||
('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
|
||||
('classifier', MultinomialNB())
|
||||
])
|
||||
|
||||
text_pipeline.fit(X_train_text, y_train)
|
||||
```
|
||||
|
||||
### Mixed Data Types
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
# Define feature types
|
||||
numeric_features = ['age', 'income', 'credit_score']
|
||||
categorical_features = ['country', 'occupation']
|
||||
|
||||
# Separate preprocessing pipelines
|
||||
numeric_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
categorical_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
|
||||
])
|
||||
|
||||
# Combine with ColumnTransformer
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Complete pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
pipeline = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier())
|
||||
])
|
||||
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Model Evaluation
|
||||
|
||||
### Classification Metrics
|
||||
|
||||
**Balanced datasets**: Use accuracy or F1-score
|
||||
|
||||
**Imbalanced datasets**: Use balanced_accuracy, F1-weighted, or ROC-AUC
|
||||
```python
|
||||
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
|
||||
|
||||
balanced_acc = balanced_accuracy_score(y_true, y_pred)
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
|
||||
# ROC-AUC requires probabilities
|
||||
y_proba = model.predict_proba(X_test)
|
||||
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
|
||||
```
|
||||
|
||||
**Cost-sensitive**: Define custom scorer or adjust decision threshold
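
A hedged sketch of the threshold-adjustment option, reusing the fitted model from the earlier snippets: lower the cutoff on the positive-class probability when false negatives are costlier than false positives (the 0.3 value is purely illustrative). Recent scikit-learn versions (1.5+) also provide `TunedThresholdClassifierCV` to search this threshold automatically.

```python
import numpy as np

# Positive-class probabilities from any fitted classifier with predict_proba
y_proba = model.predict_proba(X_test)[:, 1]

# Lowering the threshold trades precision for recall
threshold = 0.3  # illustrative; choose based on the cost of each error type
y_pred_custom = (y_proba >= threshold).astype(int)
```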
|
||||
|
||||
**Comprehensive report**:
|
||||
```python
|
||||
from sklearn.metrics import classification_report, confusion_matrix
|
||||
|
||||
print(classification_report(y_true, y_pred))
|
||||
print(confusion_matrix(y_true, y_pred))
|
||||
```
|
||||
|
||||
### Regression Metrics
|
||||
|
||||
**Standard use**: RMSE and R²
|
||||
```python
|
||||
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error requires scikit-learn >= 1.4
|
||||
|
||||
rmse = root_mean_squared_error(y_true, y_pred)
|
||||
r2 = r2_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
**Outliers present**: Use MAE (robust to outliers)
|
||||
```python
|
||||
from sklearn.metrics import mean_absolute_error
|
||||
mae = mean_absolute_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
**Percentage errors matter**: Use MAPE
|
||||
```python
|
||||
from sklearn.metrics import mean_absolute_percentage_error
|
||||
mape = mean_absolute_percentage_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
### Cross-Validation
|
||||
|
||||
**Standard approach** (5-10 folds):
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
**Imbalanced classes** (use stratification):
|
||||
```python
|
||||
from sklearn.model_selection import StratifiedKFold
|
||||
|
||||
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
|
||||
scores = cross_val_score(model, X, y, cv=cv)
|
||||
```
|
||||
|
||||
**Time series** (respect temporal order):
|
||||
```python
|
||||
from sklearn.model_selection import TimeSeriesSplit
|
||||
|
||||
cv = TimeSeriesSplit(n_splits=5)
|
||||
scores = cross_val_score(model, X, y, cv=cv)
|
||||
```
|
||||
|
||||
**Multiple metrics**:
|
||||
```python
|
||||
from sklearn.model_selection import cross_validate
|
||||
|
||||
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
|
||||
results = cross_validate(model, X, y, cv=5, scoring=scoring)
|
||||
|
||||
for metric in scoring:
|
||||
scores = results[f'test_{metric}']
|
||||
print(f"{metric}: {scores.mean():.3f}")
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### Grid Search (Exhaustive)
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [100, 200, 500],
|
||||
'max_depth': [10, 20, 30, None],
|
||||
'min_samples_split': [2, 5, 10]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_grid,
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1, # Use all CPU cores
|
||||
verbose=1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best CV score: {grid_search.best_score_:.3f}")
|
||||
|
||||
# Use best model
|
||||
best_model = grid_search.best_estimator_
|
||||
test_score = best_model.score(X_test, y_test)
|
||||
```
|
||||
|
||||
### Random Search (Faster)
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import RandomizedSearchCV
|
||||
from scipy.stats import randint, uniform
|
||||
|
||||
param_distributions = {
|
||||
'n_estimators': randint(100, 1000),
|
||||
'max_depth': randint(5, 50),
|
||||
'min_samples_split': randint(2, 20),
|
||||
'max_features': uniform(0.1, 0.9)
|
||||
}
|
||||
|
||||
random_search = RandomizedSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_distributions,
|
||||
n_iter=100, # Number of combinations to try
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1,
|
||||
random_state=42
|
||||
)
|
||||
|
||||
random_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Pipeline Hyperparameter Tuning
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import SVC
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('svm', SVC())
|
||||
])
|
||||
|
||||
# Use double underscore for nested parameters
|
||||
param_grid = {
|
||||
'svm__C': [0.1, 1, 10, 100],
|
||||
'svm__kernel': ['rbf', 'linear'],
|
||||
'svm__gamma': ['scale', 'auto', 0.001, 0.01]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
|
||||
grid_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Feature Engineering and Selection
|
||||
|
||||
### Feature Importance
|
||||
|
||||
```python
|
||||
# Tree-based models have built-in feature importance
|
||||
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
importances = model.feature_importances_
|
||||
feature_importance_df = pd.DataFrame({
|
||||
'feature': feature_names,
|
||||
'importance': importances
|
||||
}).sort_values('importance', ascending=False)
|
||||
|
||||
# Permutation importance (works for any model)
|
||||
from sklearn.inspection import permutation_importance
|
||||
|
||||
result = permutation_importance(
|
||||
model, X_test, y_test,
|
||||
n_repeats=10,
|
||||
random_state=42,
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
importance_df = pd.DataFrame({
|
||||
'feature': feature_names,
|
||||
'importance': result.importances_mean,
|
||||
'std': result.importances_std
|
||||
}).sort_values('importance', ascending=False)
|
||||
```
|
||||
|
||||
### Feature Selection Methods
|
||||
|
||||
**Univariate selection**:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectKBest, f_classif
|
||||
|
||||
selector = SelectKBest(f_classif, k=10)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
selected_features = selector.get_support(indices=True)
|
||||
```
|
||||
|
||||
**Recursive Feature Elimination**:
|
||||
```python
|
||||
from sklearn.feature_selection import RFECV
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
selector = RFECV(
|
||||
RandomForestClassifier(n_estimators=100),
|
||||
step=1,
|
||||
cv=5,
|
||||
n_jobs=-1
|
||||
)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
print(f"Optimal features: {selector.n_features_}")
|
||||
```
|
||||
|
||||
**Model-based selection**:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectFromModel
|
||||
|
||||
selector = SelectFromModel(
|
||||
RandomForestClassifier(n_estimators=100),
|
||||
threshold='median' # or '0.5*mean', or specific value
|
||||
)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
```
|
||||
|
||||
### Polynomial Features
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
|
||||
from sklearn.linear_model import Ridge
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
pipeline = Pipeline([
|
||||
('poly', PolynomialFeatures(degree=2, include_bias=False)),
|
||||
('scaler', StandardScaler()),
|
||||
('ridge', Ridge())
|
||||
])
|
||||
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Common Patterns and Best Practices
|
||||
|
||||
### Always Use Pipelines
|
||||
|
||||
Pipelines prevent data leakage and ensure proper workflow:
|
||||
|
||||
✅ **Correct**:
|
||||
```python
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
pipeline.fit(X_train, y_train)
|
||||
y_pred = pipeline.predict(X_test)
|
||||
```
|
||||
|
||||
❌ **Wrong** (data leakage):
|
||||
```python
|
||||
scaler = StandardScaler().fit(X) # Fit on all data!
|
||||
X_train, X_test = train_test_split(scaler.transform(X))
|
||||
```
|
||||
|
||||
### Stratify for Imbalanced Classes
|
||||
|
||||
```python
|
||||
# Always use stratify for classification with imbalanced classes
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=y, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Scale When Necessary
|
||||
|
||||
**Scale for**: SVM, Neural Networks, KNN, Linear Models with regularization, PCA, Gradient Descent
|
||||
|
||||
**Don't scale for**: Tree-based models (Random Forest, Gradient Boosting), Naive Bayes
|
||||
|
||||
### Handle Missing Values
|
||||
|
||||
```python
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# Numeric: use median (robust to outliers)
|
||||
imputer = SimpleImputer(strategy='median')
|
||||
|
||||
# Categorical: use constant value or most_frequent
|
||||
imputer = SimpleImputer(strategy='constant', fill_value='missing')
|
||||
```
|
||||
|
||||
### Use Appropriate Metrics
|
||||
|
||||
- **Balanced classification**: accuracy, F1
|
||||
- **Imbalanced classification**: balanced_accuracy, F1-weighted, ROC-AUC
|
||||
- **Regression with outliers**: MAE instead of RMSE
|
||||
- **Cost-sensitive**: custom scorer
|
||||
|
||||
### Set Random States
|
||||
|
||||
```python
|
||||
# For reproducibility
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Use Parallel Processing
|
||||
|
||||
```python
|
||||
# Use all CPU cores
|
||||
model = RandomForestClassifier(n_jobs=-1)
|
||||
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)
|
||||
```
|
||||
|
||||
## Unsupervised Learning
|
||||
|
||||
### Clustering Workflow
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.cluster import KMeans
|
||||
from sklearn.metrics import silhouette_score
|
||||
|
||||
# Always scale for clustering
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Elbow method to find optimal k
|
||||
inertias = []
|
||||
silhouette_scores = []
|
||||
K_range = range(2, 11)
|
||||
|
||||
for k in K_range:
|
||||
kmeans = KMeans(n_clusters=k, random_state=42)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
inertias.append(kmeans.inertia_)
|
||||
silhouette_scores.append(silhouette_score(X_scaled, labels))
|
||||
|
||||
# Plot and choose k
|
||||
import matplotlib.pyplot as plt
|
||||
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
|
||||
ax1.plot(K_range, inertias, 'bo-')
|
||||
ax1.set_xlabel('k')
|
||||
ax1.set_ylabel('Inertia')
|
||||
ax2.plot(K_range, silhouette_scores, 'ro-')
|
||||
ax2.set_xlabel('k')
|
||||
ax2.set_ylabel('Silhouette Score')
|
||||
plt.show()
|
||||
|
||||
# Fit final model
|
||||
optimal_k = 5 # Based on elbow/silhouette
|
||||
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
```
|
||||
|
||||
### Dimensionality Reduction
|
||||
|
||||
```python
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# ALWAYS scale before PCA
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Specify variance to retain
|
||||
pca = PCA(n_components=0.95) # Keep 95% of variance
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
print(f"Original features: {X.shape[1]}")
|
||||
print(f"Reduced features: {pca.n_components_}")
|
||||
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
|
||||
|
||||
# Visualize explained variance
|
||||
import numpy as np
import matplotlib.pyplot as plt
|
||||
plt.plot(np.cumsum(pca.explained_variance_ratio_))
|
||||
plt.xlabel('Number of components')
|
||||
plt.ylabel('Cumulative explained variance')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Visualization with t-SNE
|
||||
|
||||
```python
|
||||
from sklearn.manifold import TSNE
|
||||
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
|
||||
|
||||
# Reduce to 50 dimensions with PCA first (faster)
|
||||
pca = PCA(n_components=min(50, X.shape[1]))
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
# Apply t-SNE (only for visualization!)
|
||||
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
|
||||
X_tsne = tsne.fit_transform(X_pca)
|
||||
|
||||
# Visualize
|
||||
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.6)
|
||||
plt.colorbar()
|
||||
plt.title('t-SNE Visualization')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Saving and Loading Models
|
||||
|
||||
```python
|
||||
import joblib
|
||||
|
||||
# Save model or pipeline
|
||||
joblib.dump(model, 'model.pkl')
|
||||
joblib.dump(pipeline, 'pipeline.pkl')
|
||||
|
||||
# Load
|
||||
loaded_model = joblib.load('model.pkl')
|
||||
loaded_pipeline = joblib.load('pipeline.pkl')
|
||||
|
||||
# Use loaded model
|
||||
predictions = loaded_model.predict(X_new)
|
||||
```
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
This skill includes comprehensive reference files:
|
||||
|
||||
- **`references/supervised_learning.md`**: Detailed coverage of all classification and regression algorithms, parameters, use cases, and selection guidelines
|
||||
- **`references/preprocessing.md`**: Complete guide to data preprocessing including scaling, encoding, imputation, transformations, and best practices
|
||||
- **`references/model_evaluation.md`**: In-depth coverage of cross-validation strategies, metrics, hyperparameter tuning, and validation techniques
|
||||
- **`references/unsupervised_learning.md`**: Comprehensive guide to clustering, dimensionality reduction, anomaly detection, and evaluation methods
|
||||
- **`references/pipelines_and_composition.md`**: Complete guide to Pipeline, ColumnTransformer, FeatureUnion, custom transformers, and composition patterns
|
||||
- **`references/quick_reference.md`**: Quick lookup guide with code snippets, common patterns, and decision trees for algorithm selection
|
||||
|
||||
Read these files when:
|
||||
- Need detailed parameter explanations for specific algorithms
|
||||
- Comparing multiple algorithms for a task
|
||||
- Understanding evaluation metrics in depth
|
||||
- Building complex preprocessing workflows
|
||||
- Troubleshooting common issues
|
||||
|
||||
Example search patterns:
|
||||
```bash
|
||||
# To find information about specific algorithms
|
||||
grep -r "GradientBoosting" references/
|
||||
|
||||
# To find preprocessing techniques
|
||||
grep -r "OneHotEncoder" references/preprocessing.md
|
||||
|
||||
# To find evaluation metrics
|
||||
grep -r "f1_score" references/model_evaluation.md
|
||||
```
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
1. **Data leakage**: Always use pipelines, fit only on training data
|
||||
2. **Not scaling**: Scale for distance-based algorithms (SVM, KNN, Neural Networks)
|
||||
3. **Wrong metrics**: Use appropriate metrics for imbalanced data
|
||||
4. **Not using cross-validation**: Single train-test split can be misleading
|
||||
5. **Forgetting stratification**: Stratify for imbalanced classification
|
||||
6. **Using t-SNE for preprocessing**: t-SNE is for visualization only!
|
||||
7. **Not setting random_state**: Results won't be reproducible
|
||||
8. **Ignoring class imbalance**: Use stratification, appropriate metrics, or resampling
|
||||
9. **PCA without scaling**: Components will be dominated by high-variance features
|
||||
10. **Testing on training data**: Always evaluate on held-out test set
|
||||
|
||||
# Model Evaluation and Selection in scikit-learn
|
||||
|
||||
## Overview
|
||||
Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:
|
||||
1. **Estimator score methods**: Built-in evaluation (accuracy for classifiers, R² for regressors)
|
||||
2. **Scoring parameter**: Used in cross-validation and hyperparameter tuning
|
||||
3. **Metric functions**: Specialized evaluation in `sklearn.metrics`
|
||||
|
||||
## Cross-Validation
|
||||
|
||||
Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."
|
||||
|
||||
### Basic Cross-Validation
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
model = LogisticRegression()
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Cross-Validation Strategies
|
||||
|
||||
#### For i.i.d. Data
|
||||
|
||||
**KFold**: Standard k-fold cross-validation
|
||||
- Splits data into k equal folds
|
||||
- Each fold used once as test set
|
||||
- `n_splits`: Number of folds (typically 5 or 10)
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import KFold
|
||||
cv = KFold(n_splits=5, shuffle=True, random_state=42)
|
||||
```
|
||||
|
||||
**RepeatedKFold**: Repeats KFold with different randomization
|
||||
- More robust estimation
|
||||
- Computationally expensive
|
||||
|
||||
**LeaveOneOut (LOO)**: Each sample is a test set
|
||||
- Maximum training data usage
|
||||
- Very computationally expensive
|
||||
- High variance in estimates
|
||||
- Use only for small datasets (<1000 samples)
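
Minimal usage sketches for the two strategies just described (`model`, `X`, `y` as in the other snippets):

```python
from sklearn.model_selection import RepeatedKFold, LeaveOneOut, cross_val_score

# 5-fold CV repeated 3 times with different shuffles -> 15 scores
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

# Leave-one-out: one fold per sample; only practical for small datasets
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
```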
|
||||
|
||||
**ShuffleSplit**: Random train/test splits
|
||||
- Flexible train/test sizes
|
||||
- Can control number of iterations
|
||||
- Good for quick evaluation
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import ShuffleSplit
|
||||
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
|
||||
```
|
||||
|
||||
#### For Imbalanced Classes
|
||||
|
||||
**StratifiedKFold**: Preserves class proportions in each fold
|
||||
- Essential for imbalanced datasets
|
||||
- Default for classification in cross_val_score()
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import StratifiedKFold
|
||||
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
|
||||
```
|
||||
|
||||
**StratifiedShuffleSplit**: Stratified random splits
|
||||
|
||||
#### For Grouped Data
|
||||
|
||||
Use when samples are not independent (e.g., multiple measurements from same subject).
|
||||
|
||||
**GroupKFold**: Groups don't appear in both train and test
|
||||
```python
|
||||
from sklearn.model_selection import GroupKFold
|
||||
cv = GroupKFold(n_splits=5)
|
||||
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
|
||||
```
|
||||
|
||||
**StratifiedGroupKFold**: Combines stratification with group separation
|
||||
|
||||
**LeaveOneGroupOut**: Each group becomes a test set
|
||||
|
||||
#### For Time Series
|
||||
|
||||
**TimeSeriesSplit**: Expanding window approach
|
||||
- Successive training sets are supersets
|
||||
- Respects temporal ordering
|
||||
- No data leakage from future to past
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import TimeSeriesSplit
|
||||
cv = TimeSeriesSplit(n_splits=5)
|
||||
for train_idx, test_idx in cv.split(X):
|
||||
# Train on indices 0 to t, test on t+1 to t+k
|
||||
pass
|
||||
```
|
||||
|
||||
### Cross-Validation Functions
|
||||
|
||||
**cross_val_score**: Returns array of scores
|
||||
```python
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
|
||||
```
|
||||
|
||||
**cross_validate**: Returns multiple metrics and timing
|
||||
```python
|
||||
results = cross_validate(
|
||||
model, X, y, cv=5,
|
||||
scoring=['accuracy', 'f1_weighted', 'roc_auc'],
|
||||
return_train_score=True,
|
||||
return_estimator=True # Returns fitted estimators
|
||||
)
|
||||
print(results['test_accuracy'])
|
||||
print(results['fit_time'])
|
||||
```
|
||||
|
||||
**cross_val_predict**: Returns predictions for model blending/visualization
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_predict
|
||||
y_pred = cross_val_predict(model, X, y, cv=5)
|
||||
# Use for confusion matrix, error analysis, etc.
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### GridSearchCV
|
||||
Exhaustively searches all parameter combinations.
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [100, 200, 500],
|
||||
'max_depth': [10, 20, 30, None],
|
||||
'min_samples_split': [2, 5, 10],
|
||||
'min_samples_leaf': [1, 2, 4]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_grid,
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1, # Use all CPU cores
|
||||
verbose=2
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
print("Best parameters:", grid_search.best_params_)
|
||||
print("Best score:", grid_search.best_score_)
|
||||
|
||||
# Use best model
|
||||
best_model = grid_search.best_estimator_
|
||||
```
|
||||
|
||||
**When to use**:
|
||||
- Small parameter spaces
|
||||
- When computational resources allow
|
||||
- When exhaustive search is desired
|
||||
|
||||
### RandomizedSearchCV
|
||||
Samples parameter combinations from distributions.
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import RandomizedSearchCV
|
||||
from scipy.stats import randint, uniform
|
||||
|
||||
param_distributions = {
|
||||
'n_estimators': randint(100, 1000),
|
||||
'max_depth': randint(5, 50),
|
||||
'min_samples_split': randint(2, 20),
|
||||
'min_samples_leaf': randint(1, 10),
|
||||
'max_features': uniform(0.1, 0.9)
|
||||
}
|
||||
|
||||
random_search = RandomizedSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_distributions,
|
||||
n_iter=100, # Number of parameter settings sampled
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1,
|
||||
random_state=42
|
||||
)
|
||||
|
||||
random_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**When to use**:
|
||||
- Large parameter spaces
|
||||
- When budget is limited
|
||||
- Often finds good parameters faster than GridSearchCV
|
||||
|
||||
**Advantage**: "Budget can be chosen independent of the number of parameters and possible values"
|
||||
|
||||
### Successive Halving
|
||||
|
||||
**HalvingGridSearchCV** and **HalvingRandomSearchCV**: Tournament-style selection
|
||||
|
||||
**How it works**:
|
||||
1. Start with many candidates, minimal resources
|
||||
2. Eliminate poor performers
|
||||
3. Increase resources for remaining candidates
|
||||
4. Repeat until best candidates found
|
||||
|
||||
**When to use**:
|
||||
- Large parameter spaces
|
||||
- Expensive model training
|
||||
- When many parameter combinations are clearly inferior
|
||||
|
||||
```python
|
||||
from sklearn.experimental import enable_halving_search_cv
|
||||
from sklearn.model_selection import HalvingGridSearchCV
|
||||
|
||||
halving_search = HalvingGridSearchCV(
|
||||
estimator,
|
||||
param_grid,
|
||||
factor=3, # Proportion of candidates eliminated each round
|
||||
cv=5
|
||||
)
|
||||
```
|
||||
|
||||
## Classification Metrics
|
||||
|
||||
### Accuracy-Based Metrics
|
||||
|
||||
**Accuracy**: Proportion of correct predictions
|
||||
```python
|
||||
from sklearn.metrics import accuracy_score
|
||||
accuracy = accuracy_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
**When to use**: Balanced datasets only
|
||||
**When NOT to use**: Imbalanced datasets (misleading)
|
||||
|
||||
**Balanced Accuracy**: Average recall per class
|
||||
```python
|
||||
from sklearn.metrics import balanced_accuracy_score
|
||||
bal_acc = balanced_accuracy_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
**When to use**: Imbalanced datasets, ensures all classes matter equally
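
A small hand-made example of the contrast between the two metrics on imbalanced labels:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 9 negatives, 1 positive; the "model" always predicts the majority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- per-class recalls are 1.0 and 0.0
```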
|
||||
|
||||
### Precision, Recall, F-Score
|
||||
|
||||
**Precision**: Of predicted positives, how many are actually positive
|
||||
- Formula: TP / (TP + FP)
|
||||
- Answers: "How reliable are positive predictions?"
|
||||
|
||||
**Recall** (Sensitivity): Of actual positives, how many are predicted positive
|
||||
- Formula: TP / (TP + FN)
|
||||
- Answers: "How complete is positive detection?"
|
||||
|
||||
**F1-Score**: Harmonic mean of precision and recall
|
||||
- Formula: 2 * (precision * recall) / (precision + recall)
|
||||
- Balanced measure when both precision and recall are important
|
||||
|
||||
```python
|
||||
from sklearn.metrics import precision_recall_fscore_support, f1_score
|
||||
|
||||
precision, recall, f1, support = precision_recall_fscore_support(
|
||||
y_true, y_pred, average='weighted'
|
||||
)
|
||||
|
||||
# Or individually
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
```
|
||||
|
||||
**Averaging strategies for multiclass**:
|
||||
- `binary`: Binary classification only
|
||||
- `micro`: Calculate globally (total TP, FP, FN)
|
||||
- `macro`: Calculate per class, unweighted mean (all classes equal)
|
||||
- `weighted`: Calculate per class, weighted by support (class frequency)
|
||||
- `samples`: For multilabel classification
|
||||
|
||||
**When to use**:
|
||||
- `macro`: When all classes equally important (even rare ones)
|
||||
- `weighted`: When class frequency matters
|
||||
- `micro`: When overall performance across all samples matters
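
A small hand-made example showing how the averaging choice changes the reported F1 when one class is rare and always missed (scikit-learn warns that precision is ill-defined for the rare class here):

```python
from sklearn.metrics import f1_score

# Class 0 is frequent and mostly correct; class 1 is rare and never predicted
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(f1_score(y_true, y_pred, average='macro'))     # ~0.44: the missed rare class drags the unweighted mean down
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.71: dominated by the frequent class
```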
|
||||
|
||||
### Confusion Matrix
|
||||
|
||||
Shows true positives, false positives, true negatives, false negatives.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
cm = confusion_matrix(y_true, y_pred)
|
||||
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
|
||||
disp.plot()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### ROC Curve and AUC
|
||||
|
||||
**ROC (Receiver Operating Characteristic)**: Plot of true positive rate vs false positive rate at different thresholds
|
||||
|
||||
**AUC (Area Under Curve)**: Measures overall ability to discriminate between classes
|
||||
- 1.0 = perfect classifier
|
||||
- 0.5 = random classifier
|
||||
- <0.5 = worse than random
|
||||
|
||||
```python
|
||||
from sklearn.metrics import roc_auc_score, roc_curve
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Requires probability predictions
|
||||
y_proba = model.predict_proba(X_test)[:, 1] # Probabilities for positive class
|
||||
|
||||
auc = roc_auc_score(y_true, y_proba)
|
||||
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
|
||||
|
||||
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
|
||||
plt.xlabel('False Positive Rate')
|
||||
plt.ylabel('True Positive Rate')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
**Multiclass ROC**: Use `multi_class='ovr'` (one-vs-rest) or `'ovo'` (one-vs-one)
|
||||
|
||||
```python
|
||||
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
|
||||
```
|
||||
|
||||
### Log Loss
|
||||
|
||||
Measures probability calibration quality.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import log_loss
|
||||
loss = log_loss(y_true, y_proba)
|
||||
```
|
||||
|
||||
**When to use**: When probability quality matters, not just class predictions
|
||||
**Lower is better**: Perfect predictions have log loss of 0
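
A small hand-made illustration: log loss penalizes a single confidently wrong probability far more heavily than mildly confident correct ones:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1]

# Reasonably confident, correct probabilities for the positive class
print(log_loss(y_true, [0.9, 0.1, 0.8]))   # ~0.14

# The last prediction is now confidently wrong
print(log_loss(y_true, [0.9, 0.1, 0.01]))  # ~1.61
```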
|
||||
|
||||
### Classification Report
|
||||
|
||||
Comprehensive summary of precision, recall, f1-score per class.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import classification_report
|
||||
|
||||
print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
|
||||
```
|
||||
|
||||
## Regression Metrics
|
||||
|
||||
### Mean Squared Error (MSE)
|
||||
Average squared difference between predictions and true values.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import mean_squared_error, root_mean_squared_error
|
||||
mse = mean_squared_error(y_true, y_pred)
|
||||
rmse = root_mean_squared_error(y_true, y_pred)  # Root MSE; requires scikit-learn >= 1.4 (replaces squared=False)
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- Penalizes large errors heavily (squared term)
|
||||
- Same units as target² (use RMSE for same units as target)
|
||||
- Lower is better
|
||||
|
||||
### Mean Absolute Error (MAE)
|
||||
Average absolute difference between predictions and true values.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import mean_absolute_error
|
||||
mae = mean_absolute_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- More robust to outliers than MSE
|
||||
- Same units as target
|
||||
- More interpretable
|
||||
- Lower is better
|
||||
|
||||
**MSE vs MAE**: Use MAE when outliers shouldn't dominate the metric
|
||||
|
||||
### R² Score (Coefficient of Determination)
|
||||
Proportion of variance explained by the model.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import r2_score
|
||||
r2 = r2_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
**Interpretation**:
|
||||
- 1.0 = perfect predictions
|
||||
- 0.0 = model as good as mean
|
||||
- <0.0 = model worse than mean (possible!)
|
||||
- Higher is better
|
||||
|
||||
**Note**: Can be negative for models that perform worse than predicting the mean.
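
A tiny numeric illustration of that note:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))      # 0.0: equivalent to always predicting the mean
print(r2_score(y_true, [10.0, 10.0, 10.0, 10.0]))  # -45.0: far worse than predicting the mean
```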
|
||||
|
||||
### Mean Absolute Percentage Error (MAPE)
|
||||
Percentage-based error metric.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import mean_absolute_percentage_error
|
||||
mape = mean_absolute_percentage_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
**When to use**: When relative errors matter more than absolute errors
|
||||
**Warning**: Undefined when true values are zero
|
||||
|
||||
### Median Absolute Error
|
||||
Median of absolute errors (robust to outliers).
|
||||
|
||||
```python
|
||||
from sklearn.metrics import median_absolute_error
|
||||
med_ae = median_absolute_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
### Max Error
|
||||
Maximum residual error.
|
||||
|
||||
```python
|
||||
from sklearn.metrics import max_error
|
||||
max_err = max_error(y_true, y_pred)
|
||||
```
|
||||
|
||||
**When to use**: When worst-case performance matters
|
||||
|
||||
## Custom Scoring Functions
|
||||
|
||||
Create custom scorers for GridSearchCV and cross_val_score:
|
||||
|
||||
```python
|
||||
from sklearn.metrics import make_scorer, fbeta_score
|
||||
|
||||
# F2 score (weights recall higher than precision)
|
||||
f2_scorer = make_scorer(fbeta_score, beta=2)
|
||||
|
||||
# Custom function
|
||||
def custom_metric(y_true, y_pred):
|
||||
# Your custom logic
|
||||
return score
|
||||
|
||||
custom_scorer = make_scorer(custom_metric, greater_is_better=True)
|
||||
|
||||
# Use in cross-validation or grid search
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
|
||||
```
|
||||
|
||||
## Scoring Parameter Options
|
||||
|
||||
Common scoring strings for `scoring` parameter:
|
||||
|
||||
**Classification**:
|
||||
- `'accuracy'`, `'balanced_accuracy'`
|
||||
- `'precision'`, `'recall'`, `'f1'` (add `_macro`, `_micro`, `_weighted` for multiclass)
|
||||
- `'roc_auc'`, `'roc_auc_ovr'`, `'roc_auc_ovo'`
|
||||
- `'log_loss'` (lower is better, negate for maximization)
|
||||
- `'jaccard'` (Jaccard similarity)
|
||||
|
||||
**Regression**:
|
||||
- `'r2'`
|
||||
- `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`
|
||||
- `'neg_mean_absolute_error'`
|
||||
- `'neg_mean_absolute_percentage_error'`
|
||||
- `'neg_median_absolute_error'`
|
||||
|
||||
**Note**: Many metrics are negated (neg_*) so GridSearchCV can maximize them.
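
A short sketch of working with this negation convention (`model`, `X`, `y` as elsewhere in this guide):

```python
from sklearn.model_selection import cross_val_score

# Scores come back negated so that "higher is better" holds for the optimizer
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores  # flip the sign to report the usual positive RMSE
print(f"RMSE: {rmse_per_fold.mean():.3f} (+/- {rmse_per_fold.std():.3f})")
```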
|
||||
|
||||
## Validation Strategies
|
||||
|
||||
### Train-Test Split
|
||||
Simple single split.
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y,
|
||||
test_size=0.2,
|
||||
random_state=42,
|
||||
stratify=y # For classification with imbalanced classes
|
||||
)
|
||||
```
|
||||
|
||||
**When to use**: Large datasets, quick evaluation
|
||||
**Parameters**:
|
||||
- `test_size`: Proportion for test (typically 0.2-0.3)
|
||||
- `stratify`: Preserves class proportions
|
||||
- `random_state`: Reproducibility
|
||||
|
||||
### Train-Validation-Test Split
|
||||
Three-way split for hyperparameter tuning.
|
||||
|
||||
```python
|
||||
# First split: train+val and test
|
||||
X_trainval, X_test, y_trainval, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# Second split: train and validation
|
||||
X_train, X_val, y_train, y_val = train_test_split(
|
||||
X_trainval, y_trainval, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# Or use GridSearchCV with train+val, then evaluate on test
|
||||
```
|
||||
|
||||
**When to use**: Model selection and final evaluation
|
||||
**Strategy**:
|
||||
1. Train: Model training
|
||||
2. Validation: Hyperparameter tuning
|
||||
3. Test: Final, unbiased evaluation (touch only once!)
|
||||
|
||||
### Learning Curves
|
||||
|
||||
Diagnose bias vs variance issues.
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import learning_curve
|
||||
import numpy as np
import matplotlib.pyplot as plt
|
||||
|
||||
train_sizes, train_scores, val_scores = learning_curve(
|
||||
model, X, y,
|
||||
cv=5,
|
||||
train_sizes=np.linspace(0.1, 1.0, 10),
|
||||
scoring='accuracy',
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
|
||||
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
|
||||
plt.xlabel('Training set size')
|
||||
plt.ylabel('Score')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
**Interpretation**:
|
||||
- Large gap between train and validation: **Overfitting** (high variance)
|
||||
- Both scores low: **Underfitting** (high bias)
|
||||
- Scores converging but low: Need better features or more complex model
|
||||
- Validation score still improving: More data would help
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Metric Selection Guidelines
|
||||
|
||||
**Classification - Balanced classes**:
|
||||
- Accuracy or F1-score
|
||||
|
||||
**Classification - Imbalanced classes**:
|
||||
- Balanced accuracy
|
||||
- F1-score (weighted or macro)
|
||||
- ROC-AUC
|
||||
- Precision-Recall curve
|
||||
|
||||
**Classification - Cost-sensitive**:
|
||||
- Custom scorer with cost matrix
|
||||
- Adjust threshold on probabilities
|
||||
|
||||
**Regression - Typical use**:
|
||||
- RMSE (sensitive to outliers)
|
||||
- R² (proportion of variance explained)
|
||||
|
||||
**Regression - Outliers present**:
|
||||
- MAE (robust to outliers)
|
||||
- Median absolute error
|
||||
|
||||
**Regression - Percentage errors matter**:
|
||||
- MAPE
|
||||
|
||||
### Cross-Validation Guidelines
|
||||
|
||||
**Number of folds**:
|
||||
- 5-10 folds typical
|
||||
- More folds = more computation, less variance in estimate
|
||||
- LeaveOneOut only for small datasets
|
||||
|
||||
**Stratification**:
|
||||
- Always use for classification with imbalanced classes
|
||||
- Use StratifiedKFold by default for classification
|
||||
|
||||
**Grouping**:
|
||||
- Always use when samples are not independent
|
||||
- Time series: Always use TimeSeriesSplit
|
||||
|
||||
**Nested cross-validation**:
|
||||
- For unbiased performance estimate when doing hyperparameter tuning
|
||||
- Outer loop: Performance estimation
|
||||
- Inner loop: Hyperparameter selection
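
A minimal nested cross-validation sketch, assuming a generic estimator and grid (SVC is just an example): the inner GridSearchCV selects hyperparameters, the outer loop estimates how well the whole tuning procedure generalizes.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}

# Inner loop: hyperparameter selection
inner_search = GridSearchCV(SVC(), param_grid, cv=3)

# Outer loop: unbiased estimate of the tuned model's performance
nested_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"Nested CV accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```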
|
||||
|
||||
### Avoiding Common Pitfalls
|
||||
|
||||
1. **Data leakage**: Fit preprocessors only on training data within each CV fold (use Pipeline!)
|
||||
2. **Test set leakage**: Never use test set for model selection
|
||||
3. **Improper metric**: Use metrics appropriate for problem (balanced_accuracy for imbalanced data)
|
||||
4. **Multiple testing**: More models evaluated = higher chance of random good results
|
||||
5. **Temporal leakage**: For time series, use TimeSeriesSplit, not random splits
|
||||
6. **Target leakage**: Features shouldn't contain information not available at prediction time
|
||||
|
||||
# Pipelines and Composite Estimators in scikit-learn
|
||||
|
||||
## Overview
|
||||
Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."
|
||||
|
||||
## Pipeline Basics
|
||||
|
||||
### Creating Pipelines
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
# Method 1: List of (name, estimator) tuples
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
# Method 2: Using make_pipeline (auto-generates names)
|
||||
from sklearn.pipeline import make_pipeline
|
||||
pipeline = make_pipeline(
|
||||
StandardScaler(),
|
||||
PCA(n_components=10),
|
||||
LogisticRegression()
|
||||
)
|
||||
```
|
||||
|
||||
### Using Pipelines
|
||||
|
||||
```python
|
||||
# Fit and predict like any estimator
|
||||
pipeline.fit(X_train, y_train)
|
||||
y_pred = pipeline.predict(X_test)
|
||||
score = pipeline.score(X_test, y_test)
|
||||
|
||||
# Access steps
|
||||
pipeline.named_steps['scaler']
|
||||
pipeline.steps[0] # Returns ('scaler', StandardScaler(...))
|
||||
pipeline[0] # Returns StandardScaler(...) object
|
||||
pipeline['scaler'] # Returns StandardScaler(...) object
|
||||
|
||||
# Get final estimator
|
||||
pipeline[-1] # Returns LogisticRegression(...) object
|
||||
```
|
||||
|
||||
### Pipeline Rules
|
||||
|
||||
**All steps except the last must be transformers** (have `fit()` and `transform()` methods).
|
||||
|
||||
**The final step** can be:
|
||||
- Predictor (classifier/regressor) with `fit()` and `predict()`
|
||||
- Transformer with `fit()` and `transform()`
|
||||
- Any estimator with at least `fit()`
|
||||
|
||||
### Pipeline Benefits
|
||||
|
||||
1. **Convenience**: Single `fit()` and `predict()` call
|
||||
2. **Prevents data leakage**: Ensures proper fit/transform on train/test
|
||||
3. **Joint parameter selection**: Tune all steps together with GridSearchCV
|
||||
4. **Reproducibility**: Encapsulates entire workflow
|
||||
|
||||
## Accessing and Setting Parameters
|
||||
|
||||
### Nested Parameters
|
||||
|
||||
Access step parameters using `stepname__parameter` syntax:
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both the 'l1' and 'l2' penalties searched below
|
||||
])
|
||||
|
||||
# Grid search over pipeline parameters
|
||||
param_grid = {
|
||||
'scaler__with_mean': [True, False],
|
||||
'clf__C': [0.1, 1.0, 10.0],
|
||||
'clf__penalty': ['l1', 'l2']
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
|
||||
grid_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Setting Parameters
|
||||
|
||||
```python
|
||||
# Set parameters
|
||||
pipeline.set_params(clf__C=10.0, scaler__with_std=False)
|
||||
|
||||
# Get parameters
|
||||
params = pipeline.get_params()
|
||||
```
|
||||
|
||||
## Caching Intermediate Results
|
||||
|
||||
Cache fitted transformers to avoid recomputation:
|
||||
|
||||
```python
|
||||
from tempfile import mkdtemp
|
||||
from shutil import rmtree
|
||||
|
||||
# Create cache directory
|
||||
cachedir = mkdtemp()
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('clf', LogisticRegression())
|
||||
], memory=cachedir)
|
||||
|
||||
# When doing grid search, scaler and PCA only fit once per fold
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
# Clean up cache
|
||||
rmtree(cachedir)
|
||||
|
||||
# Or use joblib for persistent caching
|
||||
from joblib import Memory
|
||||
memory = Memory(location='./cache', verbose=0)
|
||||
pipeline = Pipeline([...], memory=memory)
|
||||
```
|
||||
|
||||
**When to use caching**:
|
||||
- Expensive transformations (PCA, feature selection)
|
||||
- Grid search over final estimator parameters only
|
||||
- Multiple experiments with same preprocessing
|
||||
|
||||
## ColumnTransformer
|
||||
|
||||
Apply different transformations to different columns (essential for heterogeneous data).
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
|
||||
# Define which transformations for which columns
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', StandardScaler(), ['age', 'income', 'credit_score']),
|
||||
('cat', OneHotEncoder(), ['country', 'occupation'])
|
||||
],
|
||||
remainder='drop' # What to do with remaining columns
|
||||
)
|
||||
|
||||
X_transformed = preprocessor.fit_transform(X)
|
||||
```
|
||||
|
||||
### Column Selection Methods
|
||||
|
||||
```python
|
||||
# Method 1: Column names (list of strings)
|
||||
('num', StandardScaler(), ['age', 'income'])
|
||||
|
||||
# Method 2: Column indices (list of integers)
|
||||
('num', StandardScaler(), [0, 1, 2])
|
||||
|
||||
# Method 3: Boolean mask
|
||||
('num', StandardScaler(), [True, True, False, True, False])
|
||||
|
||||
# Method 4: Slice
|
||||
('num', StandardScaler(), slice(0, 3))
|
||||
|
||||
# Method 5: make_column_selector (by dtype or pattern)
|
||||
from sklearn.compose import make_column_selector as selector
|
||||
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), selector(dtype_include='number')),
|
||||
('cat', OneHotEncoder(), selector(dtype_include='object'))
|
||||
])
|
||||
|
||||
# Select by pattern
|
||||
selector(pattern='.*_score$') # All columns ending with '_score'
|
||||
```
|
||||
|
||||
### Remainder Parameter
|
||||
|
||||
Controls what happens to columns not specified:
|
||||
|
||||
```python
|
||||
# Drop remaining columns (default)
|
||||
remainder='drop'
|
||||
|
||||
# Pass through remaining columns unchanged
|
||||
remainder='passthrough'
|
||||
|
||||
# Apply transformer to remaining columns
|
||||
remainder=StandardScaler()
|
||||
```
|
||||
|
||||
### Full Pipeline with ColumnTransformer
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Separate preprocessing for numeric and categorical
|
||||
numeric_features = ['age', 'income', 'credit_score']
|
||||
categorical_features = ['country', 'occupation', 'education']
|
||||
|
||||
numeric_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
categorical_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
])
|
||||
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Complete pipeline
|
||||
clf = Pipeline(steps=[
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier())
|
||||
])
|
||||
|
||||
clf.fit(X_train, y_train)
|
||||
y_pred = clf.predict(X_test)
|
||||
|
||||
# Grid search over preprocessing and model parameters
|
||||
param_grid = {
|
||||
'preprocessor__num__imputer__strategy': ['mean', 'median'],
|
||||
'preprocessor__cat__onehot__max_categories': [10, 20, None],
|
||||
'classifier__n_estimators': [100, 200],
|
||||
'classifier__max_depth': [10, 20, None]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(clf, param_grid, cv=5)
|
||||
grid_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## FeatureUnion
|
||||
|
||||
Combine multiple transformer outputs by concatenating features side-by-side.
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import FeatureUnion
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.feature_selection import SelectKBest
|
||||
|
||||
# Combine PCA and feature selection
|
||||
combined_features = FeatureUnion([
|
||||
('pca', PCA(n_components=10)),
|
||||
('univ_select', SelectKBest(k=5))
|
||||
])
|
||||
|
||||
X_features = combined_features.fit_transform(X, y)
|
||||
# Result: 15 features (10 from PCA + 5 from SelectKBest)
|
||||
|
||||
# In a pipeline
|
||||
pipeline = Pipeline([
|
||||
('features', combined_features),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
```
|
||||
|
||||
### FeatureUnion with Transformers on Different Data
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import FeatureUnion
|
||||
from sklearn.preprocessing import FunctionTransformer
|
||||
import numpy as np
|
||||
|
||||
def get_numeric_data(X):
|
||||
return X[:, :3] # First 3 columns
|
||||
|
||||
def get_text_data(X):
|
||||
return X[:, 3] # 4th column (text)
|
||||
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
|
||||
combined = FeatureUnion([
|
||||
('numeric_features', Pipeline([
|
||||
('selector', FunctionTransformer(get_numeric_data)),
|
||||
('scaler', StandardScaler())
|
||||
])),
|
||||
('text_features', Pipeline([
|
||||
('selector', FunctionTransformer(get_text_data)),
|
||||
('tfidf', TfidfVectorizer())
|
||||
]))
|
||||
])
|
||||
```
|
||||
|
||||
**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.
|
||||
|
||||
## Common Pipeline Patterns
|
||||
|
||||
### Classification Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.feature_selection import SelectKBest, f_classif
|
||||
from sklearn.svm import SVC
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('feature_selection', SelectKBest(f_classif, k=10)),
|
||||
('classifier', SVC(kernel='rbf'))
|
||||
])
|
||||
```
|
||||
|
||||
### Regression Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
|
||||
from sklearn.linear_model import Ridge
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('poly', PolynomialFeatures(degree=2)),
|
||||
('ridge', Ridge(alpha=1.0))
|
||||
])
|
||||
```
|
||||
|
||||
### Text Classification Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
from sklearn.naive_bayes import MultinomialNB
|
||||
|
||||
pipeline = Pipeline([
|
||||
('tfidf', TfidfVectorizer(max_features=1000)),
|
||||
('classifier', MultinomialNB())
|
||||
])
|
||||
|
||||
# Works directly with text
|
||||
pipeline.fit(X_train_text, y_train)
|
||||
y_pred = pipeline.predict(X_test_text)
|
||||
```
|
||||
|
||||
### Image Processing Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.neural_network import MLPClassifier
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=100)),
|
||||
('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
|
||||
])
|
||||
```
|
||||
|
||||
### Dimensionality Reduction + Clustering
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.cluster import KMeans
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('kmeans', KMeans(n_clusters=5))
|
||||
])
|
||||
|
||||
labels = pipeline.fit_predict(X)
|
||||
```
|
||||
|
||||
## Custom Transformers
|
||||
|
||||
### Using FunctionTransformer
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import FunctionTransformer
|
||||
import numpy as np
|
||||
|
||||
# Log transformation
|
||||
log_transformer = FunctionTransformer(np.log1p)
|
||||
|
||||
# Custom function
|
||||
def custom_transform(X):
    # Example transformation logic: clip negatives to zero, then take the square root
    return np.sqrt(np.clip(X, 0, None))
|
||||
|
||||
custom_transformer = FunctionTransformer(custom_transform)
|
||||
|
||||
# In pipeline
|
||||
pipeline = Pipeline([
|
||||
('log', log_transformer),
|
||||
('scaler', StandardScaler()),
|
||||
('model', LinearRegression())
|
||||
])
|
||||
```
|
||||
|
||||
### Creating Custom Transformer Class
|
||||
|
||||
```python
|
||||
from sklearn.base import BaseEstimator, TransformerMixin
|
||||
|
||||
class CustomTransformer(BaseEstimator, TransformerMixin):
|
||||
def __init__(self, parameter=1.0):
|
||||
self.parameter = parameter
|
||||
|
||||
def fit(self, X, y=None):
|
||||
# Learn parameters from X
|
||||
self.learned_param_ = X.mean() # Example
|
||||
return self
|
||||
|
||||
def transform(self, X):
|
||||
# Transform X using learned parameters
|
||||
return X * self.parameter - self.learned_param_
|
||||
|
||||
# Optional: for pipelines that need inverse transform
|
||||
def inverse_transform(self, X):
|
||||
return (X + self.learned_param_) / self.parameter
|
||||
|
||||
# Use in pipeline
|
||||
pipeline = Pipeline([
|
||||
('custom', CustomTransformer(parameter=2.0)),
|
||||
('model', LinearRegression())
|
||||
])
|
||||
```
|
||||
|
||||
**Key requirements**:
|
||||
- Inherit from `BaseEstimator` and `TransformerMixin`
|
||||
- Implement `fit()` and `transform()` methods
|
||||
- `fit()` must return `self`
|
||||
- Use trailing underscore for learned attributes (`learned_param_`)
|
||||
- Constructor parameters should be stored as attributes
|
||||
|
||||
### Transformer for Pandas DataFrames
|
||||
|
||||
```python
|
||||
from sklearn.base import BaseEstimator, TransformerMixin
|
||||
import pandas as pd
|
||||
|
||||
class DataFrameTransformer(BaseEstimator, TransformerMixin):
|
||||
def __init__(self, columns=None):
|
||||
self.columns = columns
|
||||
|
||||
def fit(self, X, y=None):
|
||||
return self
|
||||
|
||||
def transform(self, X):
|
||||
if isinstance(X, pd.DataFrame):
|
||||
if self.columns:
|
||||
return X[self.columns].values
|
||||
return X.values
|
||||
return X
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
### Display Pipeline in Jupyter
|
||||
|
||||
```python
|
||||
from sklearn import set_config
|
||||
|
||||
# Enable HTML display
|
||||
set_config(display='diagram')
|
||||
|
||||
# Now displaying the pipeline shows interactive diagram
|
||||
pipeline
|
||||
```
|
||||
|
||||
### Print Pipeline Structure
|
||||
|
||||
```python
|
||||
from sklearn.utils import estimator_html_repr
|
||||
|
||||
# Get HTML representation
|
||||
html = estimator_html_repr(pipeline)
|
||||
|
||||
# Or just print
|
||||
print(pipeline)
|
||||
```
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Conditional Transformations
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler, FunctionTransformer
|
||||
|
||||
def conditional_scale(X, scale=True):
    # Note: this re-fits the scaler on whatever data is passed in each call,
    # so reserve this pattern for quick experiments, not train/test workflows
    # where the fitted scaling parameters must be reused.
    if scale:
        return StandardScaler().fit_transform(X)
    return X
|
||||
|
||||
pipeline = Pipeline([
|
||||
('conditional_scaler', FunctionTransformer(
|
||||
conditional_scale,
|
||||
kw_args={'scale': True}
|
||||
)),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
```
|
||||
|
||||
### Multiple Preprocessing Paths
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
# Different preprocessing for different feature types
|
||||
preprocessor = ColumnTransformer([
|
||||
# Numeric: impute + scale
|
||||
('num_standard', Pipeline([
|
||||
('imputer', SimpleImputer(strategy='mean')),
|
||||
('scaler', StandardScaler())
|
||||
]), ['age', 'income']),
|
||||
|
||||
# Numeric: impute + log + scale
|
||||
('num_skewed', Pipeline([
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('log', FunctionTransformer(np.log1p)),
|
||||
('scaler', StandardScaler())
|
||||
]), ['price', 'revenue']),
|
||||
|
||||
# Categorical: impute + one-hot
|
||||
('cat', Pipeline([
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
]), ['category', 'region']),
|
||||
|
||||
# Text: TF-IDF
|
||||
('text', TfidfVectorizer(), 'description')
|
||||
])
|
||||
```
|
||||
|
||||
### Feature Engineering Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.base import BaseEstimator, TransformerMixin
|
||||
|
||||
class FeatureEngineer(BaseEstimator, TransformerMixin):
|
||||
def fit(self, X, y=None):
|
||||
return self
|
||||
|
||||
def transform(self, X):
|
||||
X = X.copy()
|
||||
# Add engineered features
|
||||
X['age_income_ratio'] = X['age'] / (X['income'] + 1)
|
||||
X['total_score'] = X['score1'] + X['score2'] + X['score3']
|
||||
return X
|
||||
|
||||
pipeline = Pipeline([
|
||||
('engineer', FeatureEngineer()),
|
||||
('preprocessor', preprocessor),
|
||||
('model', RandomForestClassifier())
|
||||
])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Always Use Pipelines When
|
||||
|
||||
1. **Preprocessing is needed**: Scaling, encoding, imputation
|
||||
2. **Cross-validation**: Ensures proper fit/transform split
|
||||
3. **Hyperparameter tuning**: Joint optimization of preprocessing and model
|
||||
4. **Production deployment**: Single object to serialize
|
||||
5. **Multiple steps**: Any workflow with >1 step
|
||||
|
||||
### Pipeline Do's
|
||||
|
||||
- ✅ Fit pipeline only on training data
|
||||
- ✅ Use ColumnTransformer for heterogeneous data
|
||||
- ✅ Cache expensive transformations during grid search
|
||||
- ✅ Use make_pipeline for simple cases
|
||||
- ✅ Set verbose=True to debug issues
|
||||
- ✅ Use remainder='passthrough' when appropriate
|
||||
|
||||
### Pipeline Don'ts
|
||||
|
||||
- ❌ Fit preprocessing on full dataset before split (data leakage!)
|
||||
- ❌ Manually transform test data (use pipeline.predict())
|
||||
- ❌ Forget to handle missing values before scaling
|
||||
- ❌ Mix pandas DataFrames and arrays inconsistently
|
||||
- ❌ Skip using pipelines for "just one preprocessing step"
|
||||
|
||||
### Data Leakage Prevention
|
||||
|
||||
```python
|
||||
# ❌ WRONG - Data leakage
|
||||
scaler = StandardScaler().fit(X) # Fit on all data
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y)
|
||||
X_train_scaled = scaler.transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# ✅ CORRECT - No leakage with pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y)
|
||||
pipeline.fit(X_train, y_train) # Scaler fits only on train
|
||||
y_pred = pipeline.predict(X_test)  # Scaler transforms test data using statistics learned from train
|
||||
|
||||
# ✅ CORRECT - No leakage in cross-validation
|
||||
scores = cross_val_score(pipeline, X, y, cv=5)
|
||||
# Each fold: scaler fits on train folds, transforms on test fold
|
||||
```
|
||||
|
||||
### Debugging Pipelines
|
||||
|
||||
```python
|
||||
# Examine intermediate outputs
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
|
||||
# Fit pipeline
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Get output after scaling
|
||||
X_scaled = pipeline.named_steps['scaler'].transform(X_train)
|
||||
|
||||
# Get output after PCA
|
||||
X_pca = pipeline[:-1].transform(X_train) # All steps except last
|
||||
|
||||
# Or build partial pipeline
|
||||
partial_pipeline = Pipeline(pipeline.steps[:-1])
|
||||
X_transformed = partial_pipeline.transform(X_train)
|
||||
```
|
||||
|
||||
### Saving and Loading Pipelines
|
||||
|
||||
```python
|
||||
import joblib
|
||||
|
||||
# Save pipeline
|
||||
joblib.dump(pipeline, 'model_pipeline.pkl')
|
||||
|
||||
# Load pipeline
|
||||
pipeline = joblib.load('model_pipeline.pkl')
|
||||
|
||||
# Use loaded pipeline
|
||||
y_pred = pipeline.predict(X_new)
|
||||
```
|
||||
|
||||
## Common Errors and Solutions
|
||||
|
||||
**Error**: `ValueError: could not convert string to float`
|
||||
- **Cause**: Categorical features not encoded
|
||||
- **Solution**: Add OneHotEncoder or OrdinalEncoder to pipeline
|
||||
|
||||
**Error**: `All intermediate steps should be transformers`
|
||||
- **Cause**: Non-transformer in non-final position
|
||||
- **Solution**: Ensure only last step is predictor
|
||||
|
||||
**Error**: `X has different number of features than during fitting`
|
||||
- **Cause**: Different columns in train and test
|
||||
- **Solution**: Ensure consistent column handling, use `handle_unknown='ignore'` in OneHotEncoder
|
||||
|
||||
**Error**: Different results in cross-validation vs train-test split
|
||||
- **Cause**: Data leakage (fitting preprocessing on all data)
|
||||
- **Solution**: Always use Pipeline for preprocessing
|
||||
|
||||
**Error**: Pipeline too slow during grid search
|
||||
- **Solution**: Use caching with `memory` parameter
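
A minimal sketch of transformer caching with the `memory` parameter (the temporary cache directory here is illustrative):

```python
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

cache_dir = mkdtemp()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
], memory=cache_dir)  # fitted transformers are cached and reused across grid-search candidates

# ... run GridSearchCV(pipeline, ...) as usual ...

rmtree(cache_dir)  # remove the cache when finished
```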
|
||||
scientific-packages/scikit-learn/references/preprocessing.md
|
||||
# Data Preprocessing in scikit-learn
|
||||
|
||||
## Overview
|
||||
Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.
|
||||
|
||||
## Standardization and Scaling
|
||||
|
||||
### StandardScaler
|
||||
Removes mean and scales to unit variance (z-score normalization).
|
||||
|
||||
**Formula**: `z = (x - μ) / σ`
|
||||
|
||||
**Use cases**:
|
||||
- Most ML algorithms (especially SVM, neural networks, PCA)
|
||||
- When features have different units or scales
|
||||
- When assuming Gaussian-like distribution
|
||||
|
||||
**Important**: Fit only on training data, then transform both train and test sets.
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test) # Use same parameters
|
||||
```
|
||||
|
||||
### MinMaxScaler
|
||||
Scales features to a specified range, typically [0, 1].
|
||||
|
||||
**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`
|
||||
|
||||
**Use cases**:
|
||||
- When bounded range is needed
|
||||
- Neural networks (often prefer [0, 1] range)
|
||||
- When distribution is not Gaussian
|
||||
- Image pixel values
|
||||
|
||||
**Parameters**:
|
||||
- `feature_range`: Tuple (min, max), default (0, 1)
|
||||
|
||||
**Warning**: Sensitive to outliers since it uses min/max.
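
A minimal sketch, assuming `X_train` and `X_test` are already defined:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature min/max from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max
```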
|
||||
|
||||
### MaxAbsScaler
|
||||
Scales to [-1, 1] by dividing by maximum absolute value.
|
||||
|
||||
**Use cases**:
|
||||
- Sparse data (preserves sparsity)
|
||||
- Data already centered at zero
|
||||
- When sign of values is meaningful
|
||||
|
||||
**Advantage**: Doesn't shift/center the data, preserves zero entries.
|
||||
|
||||
### RobustScaler
|
||||
Uses median and interquartile range (IQR) instead of mean and standard deviation.
|
||||
|
||||
**Formula**: `X_scaled = (X - median) / IQR`
|
||||
|
||||
**Use cases**:
|
||||
- When outliers are present
|
||||
- When StandardScaler produces skewed results
|
||||
- Robust statistics preferred
|
||||
|
||||
**Parameters**:
|
||||
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)
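
A minimal sketch, assuming `X_train` and `X_test` are already defined:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_train_scaled = scaler.fit_transform(X_train)  # centers on the median, scales by the IQR
X_test_scaled = scaler.transform(X_test)
```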
|
||||
|
||||
## Normalization
|
||||
|
||||
### normalize() function and Normalizer
|
||||
Scales individual samples (rows) to unit norm, not features (columns).
|
||||
|
||||
**Use cases**:
|
||||
- Text classification (TF-IDF vectors)
|
||||
- When similarity metrics (dot product, cosine) are used
|
||||
- When each sample should have equal weight
|
||||
|
||||
**Norms**:
|
||||
- `l1`: Manhattan norm (sum of absolutes = 1)
|
||||
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
|
||||
- `max`: Maximum absolute value = 1
|
||||
|
||||
**Key difference from scalers**: Operates on rows (samples), not columns (features).
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import Normalizer
|
||||
normalizer = Normalizer(norm='l2')
|
||||
X_normalized = normalizer.transform(X)
|
||||
```
|
||||
|
||||
## Encoding Categorical Features
|
||||
|
||||
### OrdinalEncoder
|
||||
Converts categories to integers (0 to n_categories - 1).
|
||||
|
||||
**Use cases**:
|
||||
- Ordinal relationships exist (small < medium < large)
|
||||
- Preprocessing before other transformations
|
||||
- Tree-based algorithms (which can handle integers)
|
||||
|
||||
**Parameters**:
|
||||
- `handle_unknown`: 'error' or 'use_encoded_value'
|
||||
- `unknown_value`: Value for unknown categories
|
||||
- `encoded_missing_value`: Value for missing data
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import OrdinalEncoder
|
||||
encoder = OrdinalEncoder()
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
```
|
||||
|
||||
### OneHotEncoder
|
||||
Creates binary columns for each category.
|
||||
|
||||
**Use cases**:
|
||||
- Nominal categories (no order)
|
||||
- Linear models, neural networks
|
||||
- When category relationships shouldn't be assumed
|
||||
|
||||
**Parameters**:
|
||||
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
|
||||
- `sparse_output`: True (default, memory efficient) or False
|
||||
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
|
||||
- `min_frequency`: Group infrequent categories
|
||||
- `max_categories`: Limit number of categories
|
||||
|
||||
**High cardinality handling**:
|
||||
```python
|
||||
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
|
||||
# Groups categories appearing < 100 times into 'infrequent' category
|
||||
```
|
||||
|
||||
**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.
|
||||
|
||||
### TargetEncoder
|
||||
Uses target statistics to encode categories.
|
||||
|
||||
**Use cases**:
|
||||
- High-cardinality categorical features (zip codes, user IDs)
|
||||
- When linear relationships with target are expected
|
||||
- Often improves performance over one-hot encoding
|
||||
|
||||
**How it works**:
|
||||
- Replaces category with mean of target for that category
|
||||
- Uses cross-fitting during fit_transform() to prevent target leakage
|
||||
- Applies smoothing to handle rare categories
|
||||
|
||||
**Parameters**:
|
||||
- `smooth`: Smoothing parameter for rare categories
|
||||
- `cv`: Cross-validation strategy
|
||||
|
||||
**Warning**: Only for supervised learning. Requires target variable.
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import TargetEncoder
|
||||
encoder = TargetEncoder()
|
||||
X_encoded = encoder.fit_transform(X_categorical, y)
|
||||
```
|
||||
|
||||
### LabelEncoder
|
||||
Encodes target labels into integers 0 to n_classes - 1.
|
||||
|
||||
**Use cases**: Encoding target variable for classification (not features!)
|
||||
|
||||
**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.
|
||||
|
||||
### Binarizer
|
||||
Converts numeric values to binary (0 or 1) based on threshold.
|
||||
|
||||
**Use cases**: Creating binary features from continuous values
|
||||
|
||||
## Non-linear Transformations
|
||||
|
||||
### QuantileTransformer
|
||||
Maps features to uniform or normal distribution using rank transformation.
|
||||
|
||||
**Use cases**:
|
||||
- Unusual distributions (bimodal, heavy tails)
|
||||
- Reducing outlier impact
|
||||
- When normal distribution is desired
|
||||
|
||||
**Parameters**:
|
||||
- `output_distribution`: 'uniform' (default) or 'normal'
|
||||
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))
|
||||
|
||||
**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.
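
A minimal sketch, assuming `X` is already defined:

```python
from sklearn.preprocessing import QuantileTransformer

qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_gaussian = qt.fit_transform(X)  # rank-based mapping; n_quantiles should not exceed n_samples
```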
|
||||
|
||||
### PowerTransformer
|
||||
Applies parametric monotonic transformation to make data more Gaussian.
|
||||
|
||||
**Methods**:
|
||||
- `yeo-johnson`: Works with positive and negative values (default)
|
||||
- `box-cox`: Only positive values
|
||||
|
||||
**Use cases**:
|
||||
- Skewed distributions
|
||||
- When Gaussian assumption is important
|
||||
- Variance stabilization
|
||||
|
||||
**Advantage**: Less radical than QuantileTransformer, preserves more of original relationships.
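
A minimal sketch, assuming `X` is already defined:

```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)  # learns one lambda per feature, then centers and scales
```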
|
||||
|
||||
## Discretization
|
||||
|
||||
### KBinsDiscretizer
|
||||
Bins continuous features into discrete intervals.
|
||||
|
||||
**Strategies**:
|
||||
- `uniform`: Equal-width bins
|
||||
- `quantile`: Equal-frequency bins
|
||||
- `kmeans`: K-means clustering to determine bins
|
||||
|
||||
**Encoding**:
|
||||
- `ordinal`: Integer encoding (0 to n_bins - 1)
|
||||
- `onehot`: One-hot encoding
|
||||
- `onehot-dense`: Dense one-hot encoding
|
||||
|
||||
**Use cases**:
|
||||
- Making linear models handle non-linear relationships
|
||||
- Reducing noise in features
|
||||
- Making features more interpretable
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import KBinsDiscretizer
|
||||
disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
|
||||
X_binned = disc.fit_transform(X)
|
||||
```
|
||||
|
||||
## Feature Generation
|
||||
|
||||
### PolynomialFeatures
|
||||
Generates polynomial and interaction features.
|
||||
|
||||
**Parameters**:
|
||||
- `degree`: Polynomial degree
|
||||
- `interaction_only`: Only multiplicative interactions (no x²)
|
||||
- `include_bias`: Include constant feature
|
||||
|
||||
**Use cases**:
|
||||
- Adding non-linearity to linear models
|
||||
- Feature engineering
|
||||
- Polynomial regression
|
||||
|
||||
**Warning**: The number of generated features grows combinatorially: roughly C(n + d, d) = (n + d)! / (d! · n!) for n input features and degree d.
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
|
||||
```
|
||||
|
||||
### SplineTransformer
|
||||
Generates B-spline basis functions.
|
||||
|
||||
**Use cases**:
|
||||
- Smooth non-linear transformations
|
||||
- Alternative to PolynomialFeatures (less oscillation at boundaries)
|
||||
- Generalized additive models (GAMs)
|
||||
|
||||
**Parameters**:
|
||||
- `n_knots`: Number of knots
|
||||
- `degree`: Spline degree
|
||||
- `knots`: Knot positions ('uniform', 'quantile', or array)
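
A minimal sketch of a spline basis expansion feeding a linear model (assumes `X_train` and `y_train` exist):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

# Spline basis expansion followed by a linear model (a simple GAM-like setup)
model = make_pipeline(
    SplineTransformer(n_knots=5, degree=3, knots='uniform'),
    Ridge(alpha=1.0)
)
model.fit(X_train, y_train)
```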
|
||||
|
||||
## Missing Value Handling
|
||||
|
||||
### SimpleImputer
|
||||
Imputes missing values with various strategies.
|
||||
|
||||
**Strategies**:
|
||||
- `mean`: Mean of column (numeric only)
|
||||
- `median`: Median of column (numeric only)
|
||||
- `most_frequent`: Mode (numeric or categorical)
|
||||
- `constant`: Fill with constant value
|
||||
|
||||
**Parameters**:
|
||||
- `strategy`: Imputation strategy
|
||||
- `fill_value`: Value when strategy='constant'
|
||||
- `missing_values`: What represents missing (np.nan, None, specific value)
|
||||
|
||||
```python
|
||||
from sklearn.impute import SimpleImputer
|
||||
imputer = SimpleImputer(strategy='median')
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
### KNNImputer
|
||||
Imputes using k-nearest neighbors.
|
||||
|
||||
**Use cases**: When relationships between features should inform imputation
|
||||
|
||||
**Parameters**:
|
||||
- `n_neighbors`: Number of neighbors
|
||||
- `weights`: 'uniform' or 'distance'
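
A minimal sketch, assuming `X` contains NaN entries:

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5, weights='distance')
X_imputed = imputer.fit_transform(X)  # missing entries filled from the 5 most similar rows
```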
|
||||
|
||||
### IterativeImputer
|
||||
Models each feature with missing values as function of other features.
|
||||
|
||||
**Use cases**:
|
||||
- Complex relationships between features
|
||||
- When multiple features have missing values
|
||||
- Higher quality imputation (but slower)
|
||||
|
||||
**Parameters**:
|
||||
- `estimator`: Estimator for regression (default: BayesianRidge)
|
||||
- `max_iter`: Maximum iterations
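
A minimal sketch; note that IterativeImputer is still experimental and must be enabled explicitly:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — required to expose IterativeImputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```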
|
||||
|
||||
## Function Transformers
|
||||
|
||||
### FunctionTransformer
|
||||
Applies custom function to data.
|
||||
|
||||
**Use cases**:
|
||||
- Custom transformations in pipelines
|
||||
- Log transformation, square root, etc.
|
||||
- Domain-specific preprocessing
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import FunctionTransformer
|
||||
import numpy as np
|
||||
|
||||
log_transformer = FunctionTransformer(np.log1p, validate=True)
|
||||
X_log = log_transformer.transform(X)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Feature Scaling Guidelines
|
||||
|
||||
**Always scale**:
|
||||
- SVM, neural networks
|
||||
- K-nearest neighbors
|
||||
- Linear/Logistic regression with regularization
|
||||
- PCA, LDA
|
||||
- Gradient descent-based algorithms
|
||||
|
||||
**Don't need to scale**:
|
||||
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
|
||||
- Naive Bayes
|
||||
|
||||
### Pipeline Integration
|
||||
|
||||
Always use preprocessing within pipelines to prevent data leakage:
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
pipeline.fit(X_train, y_train) # Scaler fit only on train data
|
||||
y_pred = pipeline.predict(X_test) # Scaler transform only on test data
|
||||
```
|
||||
|
||||
### Common Transformations by Data Type
|
||||
|
||||
**Numeric - Continuous**:
|
||||
- StandardScaler (most common)
|
||||
- MinMaxScaler (neural networks)
|
||||
- RobustScaler (outliers present)
|
||||
- PowerTransformer (skewed data)
|
||||
|
||||
**Numeric - Count Data**:
|
||||
- sqrt or log transformation
|
||||
- QuantileTransformer
|
||||
- StandardScaler after transformation
|
||||
|
||||
**Categorical - Low Cardinality (<10 categories)**:
|
||||
- OneHotEncoder
|
||||
|
||||
**Categorical - High Cardinality (>10 categories)**:
|
||||
- TargetEncoder (supervised)
|
||||
- Frequency encoding
|
||||
- OneHotEncoder with min_frequency parameter
|
||||
|
||||
**Categorical - Ordinal**:
|
||||
- OrdinalEncoder
|
||||
|
||||
**Text**:
|
||||
- CountVectorizer or TfidfVectorizer
|
||||
- Normalizer after vectorization
|
||||
|
||||
### Data Leakage Prevention
|
||||
|
||||
1. **Fit only on training data**: Never include test data when fitting preprocessors
|
||||
2. **Use pipelines**: Ensures proper fit/transform separation
|
||||
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
|
||||
4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
|
||||
|
||||
```python
|
||||
# WRONG - data leakage
|
||||
scaler = StandardScaler().fit(X_full)
|
||||
X_train_scaled = scaler.transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# CORRECT - no leakage
|
||||
scaler = StandardScaler().fit(X_train)
|
||||
X_train_scaled = scaler.transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
```
|
||||
|
||||
## Preprocessing Checklist
|
||||
|
||||
Before modeling:
|
||||
1. Handle missing values (imputation or removal)
|
||||
2. Encode categorical variables appropriately
|
||||
3. Scale/normalize numeric features (if needed for algorithm)
|
||||
4. Handle outliers (RobustScaler, clipping, removal)
|
||||
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
|
||||
6. Check for data leakage in preprocessing steps
|
||||
7. Wrap everything in a Pipeline
|
||||
scientific-packages/scikit-learn/references/quick_reference.md
|
||||
# Scikit-learn Quick Reference
|
||||
|
||||
## Essential Imports
|
||||
|
||||
```python
|
||||
# Core
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
|
||||
from sklearn.pipeline import Pipeline, make_pipeline
|
||||
from sklearn.compose import ColumnTransformer
|
||||
|
||||
# Preprocessing
|
||||
from sklearn.preprocessing import (
|
||||
StandardScaler, MinMaxScaler, RobustScaler,
|
||||
OneHotEncoder, OrdinalEncoder, LabelEncoder,
|
||||
PolynomialFeatures
|
||||
)
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# Models - Classification
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
from sklearn.tree import DecisionTreeClassifier
|
||||
from sklearn.ensemble import (
|
||||
RandomForestClassifier,
|
||||
GradientBoostingClassifier,
|
||||
HistGradientBoostingClassifier
|
||||
)
|
||||
from sklearn.svm import SVC
|
||||
from sklearn.neighbors import KNeighborsClassifier
|
||||
|
||||
# Models - Regression
|
||||
from sklearn.linear_model import LinearRegression, Ridge, Lasso
|
||||
from sklearn.ensemble import (
|
||||
RandomForestRegressor,
|
||||
GradientBoostingRegressor,
|
||||
HistGradientBoostingRegressor
|
||||
)
|
||||
|
||||
# Clustering
|
||||
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
|
||||
from sklearn.mixture import GaussianMixture
|
||||
|
||||
# Dimensionality Reduction
|
||||
from sklearn.decomposition import PCA, NMF, TruncatedSVD
|
||||
from sklearn.manifold import TSNE
|
||||
|
||||
# Metrics
|
||||
from sklearn.metrics import (
|
||||
accuracy_score, precision_score, recall_score, f1_score,
|
||||
confusion_matrix, classification_report,
|
||||
mean_squared_error, r2_score, mean_absolute_error
|
||||
)
|
||||
```
|
||||
|
||||
## Basic Workflow Template
|
||||
|
||||
### Classification
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.metrics import classification_report
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42, stratify=y
|
||||
)
|
||||
|
||||
# Scale features
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# Train model
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
model.fit(X_train_scaled, y_train)
|
||||
|
||||
# Predict and evaluate
|
||||
y_pred = model.predict(X_test_scaled)
|
||||
print(classification_report(y_test, y_pred))
|
||||
```
|
||||
|
||||
### Regression
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
from sklearn.metrics import mean_squared_error, r2_score
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# Scale features
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# Train model
|
||||
model = RandomForestRegressor(n_estimators=100, random_state=42)
|
||||
model.fit(X_train_scaled, y_train)
|
||||
|
||||
# Predict and evaluate
|
||||
y_pred = model.predict(X_test_scaled)
|
||||
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.3f}")
|
||||
print(f"R²: {r2_score(y_test, y_pred):.3f}")
|
||||
```
|
||||
|
||||
### With Pipeline (Recommended)
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.model_selection import train_test_split, cross_val_score
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Split and train
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Evaluate
|
||||
score = pipeline.score(X_test, y_test)
|
||||
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
|
||||
print(f"Test accuracy: {score:.3f}")
|
||||
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
|
||||
```
|
||||
|
||||
## Common Preprocessing Patterns
|
||||
|
||||
### Numeric Data
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
numeric_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
```
|
||||
|
||||
### Categorical Data
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
categorical_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
])
|
||||
```
|
||||
|
||||
### Mixed Data with ColumnTransformer
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
|
||||
numeric_features = ['age', 'income', 'credit_score']
|
||||
categorical_features = ['country', 'occupation']
|
||||
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Complete pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
pipeline = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier())
|
||||
])
|
||||
```
|
||||
|
||||
## Model Selection Cheat Sheet
|
||||
|
||||
### Quick Decision Tree
|
||||
|
||||
```
|
||||
Is it supervised?
|
||||
├─ Yes
|
||||
│ ├─ Predicting categories? → Classification
|
||||
│ │ ├─ Start with: LogisticRegression (baseline)
|
||||
│ │ ├─ Then try: RandomForestClassifier
|
||||
│ │ └─ Best performance: HistGradientBoostingClassifier
|
||||
│ └─ Predicting numbers? → Regression
|
||||
│ ├─ Start with: LinearRegression/Ridge (baseline)
|
||||
│ ├─ Then try: RandomForestRegressor
|
||||
│ └─ Best performance: HistGradientBoostingRegressor
|
||||
└─ No
|
||||
├─ Grouping similar items? → Clustering
|
||||
│ ├─ Know # clusters: KMeans
|
||||
│ └─ Unknown # clusters: DBSCAN or HDBSCAN
|
||||
├─ Reducing dimensions?
|
||||
│ ├─ For preprocessing: PCA
|
||||
│ └─ For visualization: t-SNE or UMAP
|
||||
└─ Finding outliers? → IsolationForest or LocalOutlierFactor
|
||||
```
|
||||
|
||||
### Algorithm Selection by Data Size
|
||||
|
||||
- **Small (<1K samples)**: Any algorithm
|
||||
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
|
||||
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC
|
||||
|
||||
### When to Scale Features
|
||||
|
||||
**Always scale**:
|
||||
- SVM, Neural Networks
|
||||
- K-Nearest Neighbors
|
||||
- Linear/Logistic Regression (with regularization)
|
||||
- PCA, LDA
|
||||
- Any gradient descent algorithm
|
||||
|
||||
**Don't need to scale**:
|
||||
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
|
||||
- Naive Bayes
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### GridSearchCV
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [100, 200, 500],
|
||||
'max_depth': [10, 20, None],
|
||||
'min_samples_split': [2, 5, 10]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_grid,
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
best_model = grid_search.best_estimator_
|
||||
print(f"Best params: {grid_search.best_params_}")
|
||||
```
|
||||
|
||||
### RandomizedSearchCV (Faster)
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import RandomizedSearchCV
|
||||
from scipy.stats import randint, uniform
|
||||
|
||||
param_distributions = {
|
||||
'n_estimators': randint(100, 1000),
|
||||
'max_depth': randint(5, 50),
|
||||
'min_samples_split': randint(2, 20)
|
||||
}
|
||||
|
||||
random_search = RandomizedSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_distributions,
|
||||
n_iter=50, # Number of combinations to try
|
||||
cv=5,
|
||||
n_jobs=-1,
|
||||
random_state=42
|
||||
)
|
||||
|
||||
random_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Pipeline with GridSearchCV
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import SVC
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('svm', SVC())
|
||||
])
|
||||
|
||||
param_grid = {
|
||||
'svm__C': [0.1, 1, 10],
|
||||
'svm__kernel': ['rbf', 'linear'],
|
||||
'svm__gamma': ['scale', 'auto']
|
||||
}
|
||||
|
||||
grid = GridSearchCV(pipeline, param_grid, cv=5)
|
||||
grid.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Cross-Validation
|
||||
|
||||
### Basic Cross-Validation
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Multiple Metrics
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_validate
|
||||
|
||||
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
|
||||
results = cross_validate(model, X, y, cv=5, scoring=scoring)
|
||||
|
||||
for metric in scoring:
|
||||
scores = results[f'test_{metric}']
|
||||
print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Custom CV Strategies
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
|
||||
|
||||
# For imbalanced classification
|
||||
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
|
||||
|
||||
# For time series
|
||||
cv = TimeSeriesSplit(n_splits=5)
|
||||
|
||||
scores = cross_val_score(model, X, y, cv=cv)
|
||||
```
|
||||
|
||||
## Common Metrics
|
||||
|
||||
### Classification
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
accuracy_score, balanced_accuracy_score,
|
||||
precision_score, recall_score, f1_score,
|
||||
confusion_matrix, classification_report,
|
||||
roc_auc_score
|
||||
)
|
||||
|
||||
# Basic metrics
|
||||
accuracy = accuracy_score(y_true, y_pred)
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
|
||||
# Comprehensive report
|
||||
print(classification_report(y_true, y_pred))
|
||||
|
||||
# ROC AUC (requires probabilities)
|
||||
y_proba = model.predict_proba(X_test)[:, 1]
|
||||
auc = roc_auc_score(y_true, y_proba)
|
||||
```
|
||||
|
||||
### Regression
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
mean_squared_error,
|
||||
mean_absolute_error,
|
||||
r2_score
|
||||
)
|
||||
|
||||
mse = mean_squared_error(y_true, y_pred)
|
||||
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # avoids the squared=False argument, deprecated in recent releases
|
||||
mae = mean_absolute_error(y_true, y_pred)
|
||||
r2 = r2_score(y_true, y_pred)
|
||||
|
||||
print(f"RMSE: {rmse:.3f}")
|
||||
print(f"MAE: {mae:.3f}")
|
||||
print(f"R²: {r2:.3f}")
|
||||
```
|
||||
|
||||
## Feature Engineering
|
||||
|
||||
### Polynomial Features
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
|
||||
```
|
||||
|
||||
### Feature Selection
|
||||
|
||||
```python
|
||||
from sklearn.feature_selection import (
|
||||
SelectKBest, f_classif,
|
||||
RFE,
|
||||
SelectFromModel
|
||||
)
|
||||
|
||||
# Univariate selection
|
||||
selector = SelectKBest(f_classif, k=10)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
|
||||
# Recursive feature elimination
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
|
||||
X_selected = rfe.fit_transform(X, y)
|
||||
|
||||
# Model-based selection
|
||||
selector = SelectFromModel(
|
||||
RandomForestClassifier(n_estimators=100),
|
||||
threshold='median'
|
||||
)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
```
|
||||
|
||||
### Feature Importance
|
||||
|
||||
```python
|
||||
# Tree-based models
|
||||
model = RandomForestClassifier()
|
||||
model.fit(X_train, y_train)
|
||||
importances = model.feature_importances_
|
||||
|
||||
# Visualize
|
||||
import matplotlib.pyplot as plt
|
||||
indices = np.argsort(importances)[::-1]
|
||||
plt.bar(range(X.shape[1]), importances[indices])
|
||||
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
|
||||
plt.show()
|
||||
|
||||
# Permutation importance (works for any model)
|
||||
from sklearn.inspection import permutation_importance
|
||||
result = permutation_importance(model, X_test, y_test, n_repeats=10)
|
||||
importances = result.importances_mean
|
||||
```
|
||||
|
||||
## Clustering
|
||||
|
||||
### K-Means
|
||||
|
||||
```python
|
||||
from sklearn.cluster import KMeans
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# Always scale for k-means
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Fit k-means
|
||||
kmeans = KMeans(n_clusters=3, random_state=42)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
|
||||
# Evaluate
|
||||
from sklearn.metrics import silhouette_score
|
||||
score = silhouette_score(X_scaled, labels)
|
||||
print(f"Silhouette score: {score:.3f}")
|
||||
```
|
||||
|
||||
### Elbow Method
|
||||
|
||||
```python
|
||||
inertias = []
|
||||
K_range = range(2, 11)
|
||||
|
||||
for k in K_range:
|
||||
kmeans = KMeans(n_clusters=k, random_state=42)
|
||||
kmeans.fit(X_scaled)
|
||||
inertias.append(kmeans.inertia_)
|
||||
|
||||
plt.plot(K_range, inertias, 'bo-')
|
||||
plt.xlabel('k')
|
||||
plt.ylabel('Inertia')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### DBSCAN
|
||||
|
||||
```python
|
||||
from sklearn.cluster import DBSCAN
|
||||
|
||||
dbscan = DBSCAN(eps=0.5, min_samples=5)
|
||||
labels = dbscan.fit_predict(X_scaled)
|
||||
|
||||
# -1 indicates noise/outliers
|
||||
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
|
||||
n_noise = list(labels).count(-1)
|
||||
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
|
||||
```
|
||||
|
||||
## Dimensionality Reduction
|
||||
|
||||
### PCA
|
||||
|
||||
```python
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# Always scale before PCA
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Specify n_components
|
||||
pca = PCA(n_components=2)
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
# Or specify variance to retain
|
||||
pca = PCA(n_components=0.95) # Keep 95% variance
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
print(f"Explained variance: {pca.explained_variance_ratio_}")
|
||||
print(f"Components needed: {pca.n_components_}")
|
||||
```
|
||||
|
||||
### t-SNE (Visualization Only)
|
||||
|
||||
```python
|
||||
from sklearn.manifold import TSNE
|
||||
|
||||
# Reduce to 50 dimensions with PCA first (recommended)
|
||||
pca = PCA(n_components=50)
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
# Apply t-SNE
|
||||
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
|
||||
X_tsne = tsne.fit_transform(X_pca)
|
||||
|
||||
# Visualize
|
||||
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
|
||||
plt.colorbar()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Saving and Loading Models
|
||||
|
||||
```python
|
||||
import joblib
|
||||
|
||||
# Save model
|
||||
joblib.dump(model, 'model.pkl')
|
||||
|
||||
# Save pipeline
|
||||
joblib.dump(pipeline, 'pipeline.pkl')
|
||||
|
||||
# Load
|
||||
model = joblib.load('model.pkl')
|
||||
pipeline = joblib.load('pipeline.pkl')
|
||||
|
||||
# Use loaded model
|
||||
y_pred = model.predict(X_new)
|
||||
```
|
||||
|
||||
## Common Pitfalls and Solutions
|
||||
|
||||
### Data Leakage
|
||||
❌ **Wrong**: Fit on all data before split
|
||||
```python
|
||||
scaler = StandardScaler().fit(X)
|
||||
X_train, X_test = train_test_split(scaler.transform(X))
|
||||
```
|
||||
|
||||
✅ **Correct**: Use pipeline or fit only on train
|
||||
```python
|
||||
X_train, X_test = train_test_split(X)
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Not Scaling
|
||||
❌ **Wrong**: Using SVM without scaling
|
||||
```python
|
||||
svm = SVC()
|
||||
svm.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
✅ **Correct**: Scale for SVM
|
||||
```python
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Wrong Metric for Imbalanced Data
|
||||
❌ **Wrong**: Using accuracy for 99:1 imbalance
|
||||
```python
|
||||
accuracy = accuracy_score(y_true, y_pred) # Can be misleading
|
||||
```
|
||||
|
||||
✅ **Correct**: Use appropriate metrics
|
||||
```python
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
balanced_acc = balanced_accuracy_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
### Not Using Stratification
|
||||
❌ **Wrong**: Random split for imbalanced data
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
|
||||
```
|
||||
|
||||
✅ **Correct**: Stratify for imbalanced classes
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=y
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
|
||||
2. **Use HistGradientBoosting** for large datasets (>10K samples)
|
||||
3. **Use MiniBatchKMeans** for large clustering tasks
|
||||
4. **Use IncrementalPCA** for data that doesn't fit in memory
|
||||
5. **Use sparse matrices** for high-dimensional sparse data (text)
|
||||
6. **Cache transformers** in pipelines during grid search
|
||||
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
|
||||
8. **Reduce dimensionality** with PCA before applying expensive algorithms
|
||||
|
||||
# Supervised Learning in scikit-learn
|
||||
|
||||
## Overview
|
||||
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
|
||||
|
||||
## Linear Models
|
||||
|
||||
### Regression
|
||||
- **LinearRegression**: Ordinary least squares regression
|
||||
- **Ridge**: L2-regularized regression, good for multicollinearity
|
||||
- **Lasso**: L1-regularized regression, performs feature selection
|
||||
- **ElasticNet**: Combined L1/L2 regularization
|
||||
- **LassoLars**: Lasso using Least Angle Regression algorithm
|
||||
- **BayesianRidge**: Bayesian ridge regression with priors over the coefficients (see ARDRegression for automatic relevance determination)
|
||||
|
||||
### Classification
|
||||
- **LogisticRegression**: Binary and multiclass classification
|
||||
- **RidgeClassifier**: Ridge regression for classification
|
||||
- **SGDClassifier**: Linear classifiers with SGD training
|
||||
|
||||
**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
|
||||
|
||||
**Key parameters**:
|
||||
- `alpha`: Regularization strength (higher = more regularization)
|
||||
- `fit_intercept`: Whether to calculate intercept
|
||||
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
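
A minimal sketch of the two most common baselines; note that LogisticRegression exposes regularization through `C`, the inverse of `alpha`:

```python
from sklearn.linear_model import Ridge, LogisticRegression

reg = Ridge(alpha=1.0, fit_intercept=True)                      # regression baseline
clf = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000)  # classification baseline; C = 1 / regularization strength
# fit with reg.fit(X_train, y_train) / clf.fit(X_train, y_train) as usual
```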
|
||||
|
||||
## Support Vector Machines (SVM)
|
||||
|
||||
- **SVC**: Support Vector Classification
|
||||
- **SVR**: Support Vector Regression
|
||||
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
|
||||
- **OneClassSVM**: Unsupervised outlier detection
|
||||
|
||||
**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
|
||||
|
||||
**Key parameters**:
|
||||
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
|
||||
- `C`: Regularization parameter (lower = more regularization)
|
||||
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
|
||||
- `degree`: Polynomial degree (for poly kernel)
|
||||
|
||||
**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets with linear kernel.
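
A minimal sketch; scaling is included because SVMs are sensitive to feature scale (assumes `X_train` and `y_train` exist):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
clf.fit(X_train, y_train)
```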
|
||||
|
||||
## Decision Trees
|
||||
|
||||
- **DecisionTreeClassifier**: Classification tree
|
||||
- **DecisionTreeRegressor**: Regression tree
|
||||
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
|
||||
|
||||
**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
|
||||
|
||||
**Key parameters**:
|
||||
- `max_depth`: Maximum tree depth (controls overfitting)
|
||||
- `min_samples_split`: Minimum samples to split a node
|
||||
- `min_samples_leaf`: Minimum samples in leaf node
|
||||
- `max_features`: Number of features to consider for splits
|
||||
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
|
||||
|
||||
**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
|
||||
|
||||
## Ensemble Methods
|
||||
|
||||
### Random Forests
|
||||
- **RandomForestClassifier**: Ensemble of decision trees
|
||||
- **RandomForestRegressor**: Regression variant
|
||||
|
||||
**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
|
||||
|
||||
**Key parameters**:
|
||||
- `n_estimators`: Number of trees (higher = better but slower)
|
||||
- `max_depth`: Maximum tree depth
|
||||
- `max_features`: Features per split ('sqrt', 'log2', int, float)
|
||||
- `bootstrap`: Whether to use bootstrap samples
|
||||
- `n_jobs`: Parallel processing (-1 uses all cores)
|
||||
|
||||
### Gradient Boosting
|
||||
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
|
||||
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets
|
||||
|
||||
**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
|
||||
|
||||
**Key parameters**:
|
||||
- `n_estimators`: Number of boosting stages
|
||||
- `learning_rate`: Shrinks contribution of each tree
|
||||
- `max_depth`: Maximum tree depth (typically 3-8)
|
||||
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
|
||||
- `early_stopping`: Stop when validation score stops improving
|
||||
|
||||
**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
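
A minimal sketch; note that the histogram-based estimators use `max_iter` rather than `n_estimators` for the number of boosting iterations:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=200,          # number of boosting iterations
    early_stopping=True,   # holds out a validation fraction and stops when it plateaus
    random_state=42
)
clf.fit(X_train, y_train)
```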
|
||||
|
||||
### AdaBoost
|
||||
- **AdaBoostClassifier/Regressor**: Adaptive boosting
|
||||
|
||||
**Use cases**: Boosting weak learners, less prone to overfitting than other methods
|
||||
|
||||
**Key parameters**:
|
||||
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
|
||||
- `n_estimators`: Number of boosting iterations
|
||||
- `learning_rate`: Weight applied to each classifier
|
||||
|
||||
### Bagging
|
||||
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
|
||||
|
||||
**Use cases**: Reducing variance of unstable models, parallel ensemble creation
|
||||
|
||||
**Key parameters**:
|
||||
- `estimator`: Base estimator to fit
|
||||
- `n_estimators`: Number of estimators
|
||||
- `max_samples`: Samples to draw per estimator
|
||||
- `bootstrap`: Whether to use replacement
|
||||
|
||||
### Voting & Stacking
|
||||
- **VotingClassifier/Regressor**: Combines different model types
|
||||
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
|
||||
|
||||
**Use cases**: Combining diverse models, leveraging different model strengths
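
A minimal soft-voting sketch combining three different model families (assumes `X_train` and `y_train` exist):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('nb', GaussianNB())
    ],
    voting='soft'  # averages predicted probabilities; 'hard' uses majority vote
)
voting.fit(X_train, y_train)
```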
|
||||
|
||||
## Neural Networks
|
||||
|
||||
- **MLPClassifier**: Multi-layer perceptron classifier
|
||||
- **MLPRegressor**: Multi-layer perceptron regressor
|
||||
|
||||
**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
|
||||
|
||||
**Key parameters**:
|
||||
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
|
||||
- `activation`: 'relu', 'tanh', 'logistic'
|
||||
- `solver`: 'adam', 'lbfgs', 'sgd'
|
||||
- `alpha`: L2 regularization term
|
||||
- `learning_rate`: Learning rate schedule
|
||||
- `early_stopping`: Stop when validation score stops improving
|
||||
|
||||
**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
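
A minimal sketch with scaling built into the pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

clf = make_pipeline(
    StandardScaler(),  # scaling is critical for MLP convergence
    MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu', solver='adam',
                  alpha=1e-4, early_stopping=True, random_state=42)
)
clf.fit(X_train, y_train)
```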
|
||||
|
||||
## Nearest Neighbors
|
||||
|
||||
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
|
||||
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
|
||||
- **NearestCentroid**: Classification using class centroids
|
||||
|
||||
**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
|
||||
|
||||
**Key parameters**:
|
||||
- `n_neighbors`: Number of neighbors (typically 3-11)
|
||||
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
|
||||
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
|
||||
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
|
||||
|
||||
## Naive Bayes
|
||||
|
||||
- **GaussianNB**: Assumes Gaussian distribution of features
|
||||
- **MultinomialNB**: For discrete counts (text classification)
|
||||
- **BernoulliNB**: For binary/boolean features
|
||||
- **CategoricalNB**: For categorical features
|
||||
- **ComplementNB**: Adapted for imbalanced datasets
|
||||
|
||||
**Use cases**: Text classification, fast baseline, when features are independent, small training sets
|
||||
|
||||
**Key parameters**:
|
||||
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
|
||||
- `fit_prior`: Whether to learn class prior probabilities
|
||||
|
||||
## Linear/Quadratic Discriminant Analysis
|
||||
|
||||
- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
|
||||
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
|
||||
|
||||
**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
|
||||
|
||||
## Gaussian Processes
|
||||
|
||||
- **GaussianProcessClassifier**: Probabilistic classification
|
||||
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
|
||||
|
||||
**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
|
||||
|
||||
**Key parameters**:
|
||||
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
|
||||
- `alpha`: Noise level
|
||||
|
||||
**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
|
||||
|
||||
## Stochastic Gradient Descent
|
||||
|
||||
- **SGDClassifier**: Linear classifiers with SGD
|
||||
- **SGDRegressor**: Linear regressors with SGD
|
||||
|
||||
**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
|
||||
|
||||
**Key parameters**:
|
||||
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
|
||||
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
|
||||
- `alpha`: Regularization strength
|
||||
- `learning_rate`: Learning rate schedule
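
A minimal out-of-core sketch using `partial_fit`; the `batches` iterable is hypothetical, and `loss='log_loss'` assumes scikit-learn ≥ 1.1 naming:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4)

classes = np.unique(y_train)        # all class labels must be declared on the first call
for X_batch, y_batch in batches:    # `batches`: any iterable of (X, y) chunks (hypothetical)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```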
|
||||
|
||||
## Semi-Supervised Learning
|
||||
|
||||
- **SelfTrainingClassifier**: Self-training with any base classifier
|
||||
- **LabelPropagation**: Label propagation through graph
|
||||
- **LabelSpreading**: Label spreading (modified label propagation)
|
||||
|
||||
**Use cases**: When labeled data is scarce but unlabeled data is abundant
|
||||
|
||||
## Feature Selection
|
||||
|
||||
- **VarianceThreshold**: Remove low-variance features
|
||||
- **SelectKBest**: Select K highest scoring features
|
||||
- **SelectPercentile**: Select top percentile of features
|
||||
- **RFE**: Recursive feature elimination
|
||||
- **RFECV**: RFE with cross-validation
|
||||
- **SelectFromModel**: Select features based on importance
|
||||
- **SequentialFeatureSelector**: Forward/backward feature selection
|
||||
|
||||
**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting

## Probability Calibration

- **CalibratedClassifierCV**: Calibrate classifier probabilities

**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes

**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
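
A minimal sketch that wraps an uncalibrated margin classifier so it exposes usable probabilities; the choice of LinearSVC and `cv=5` is illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

calibrated = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)  # calibrated class probabilities
```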

## Multi-Output Methods

- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant

**Use cases**: Predicting multiple related targets simultaneously
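
A minimal multi-target regression sketch; the three synthetic targets are for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X, Y = make_regression(n_samples=500, n_features=10, n_targets=3, random_state=42)

multi = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))
multi.fit(X, Y)
print(multi.predict(X[:2]).shape)  # (2, 3): one prediction per target
```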

## Specialized Regression

- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
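
A minimal sketch of a rough 80% prediction interval from two quantile fits; the synthetic data and `alpha=0.0` (no regularization) are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Fit the 10th and 90th percentiles; together they bound roughly 80% of the targets
lower = QuantileRegressor(quantile=0.1, alpha=0.0).fit(X, y)
upper = QuantileRegressor(quantile=0.9, alpha=0.0).fit(X, y)
interval = np.c_[lower.predict(X[:5]), upper.predict(X[:5])]
```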

## Algorithm Selection Guidelines

**Start with**:
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
2. **RandomForestClassifier/Regressor** for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed

**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC

**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel

**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes

@@ -0,0 +1,728 @@

# Unsupervised Learning in scikit-learn

## Overview
Unsupervised learning discovers patterns in data without labeled targets. Main tasks include clustering (grouping similar samples), dimensionality reduction (reducing feature count), and anomaly detection (finding outliers).

## Clustering Algorithms

### K-Means

Groups data into k clusters by minimizing within-cluster variance.

**Algorithm**:
1. Initialize k centroids (k-means++ initialization recommended)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # Smart initialization
    n_init=10,         # Number of times to run with different seeds
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```

**Use cases**:
- Customer segmentation
- Image compression
- Data preprocessing (clustering as features)

**Strengths**:
- Fast and scalable
- Simple to understand
- Works well with spherical clusters

**Limitations**:
- Assumes spherical clusters of similar size
- Sensitive to initialization (mitigated by k-means++)
- Must specify k beforehand
- Sensitive to outliers

**Choosing k**: Use elbow method, silhouette score, or domain knowledge

**Variants**:
- **MiniBatchKMeans**: Faster for large datasets, uses mini-batches
- **KMeans with n_init='auto'**: Adaptive number of initializations
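
When the full algorithm is too slow, MiniBatchKMeans trades a little accuracy for speed; a minimal sketch (the batch size and synthetic data are illustrative):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=42)
labels = mbk.fit_predict(X_large)
```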

### DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Identifies clusters as dense regions separated by sparse areas.

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(
    eps=0.5,        # Maximum distance between neighbors
    min_samples=5,  # Minimum points to form dense region
    metric='euclidean'
)
labels = dbscan.fit_predict(X)
# -1 indicates noise/outliers
```

**Use cases**:
- Arbitrary cluster shapes
- Outlier detection
- When cluster count is unknown
- Geographic/spatial data

**Strengths**:
- Discovers arbitrary-shaped clusters
- Automatically detects outliers
- Doesn't require specifying number of clusters
- Robust to outliers

**Limitations**:
- Struggles with varying densities
- Sensitive to eps and min_samples parameters
- Not deterministic (border points may vary)

**Parameter tuning**:
- `eps`: Plot k-distance graph, look for elbow
- `min_samples`: Rule of thumb: 2 * dimensions
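
A common way to pick `eps` is the k-distance plot: sort every point's distance to its k-th nearest neighbor (k = `min_samples`) and read off the elbow. A minimal sketch, assuming `X` is the feature matrix used above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5  # match min_samples
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'Distance to {k}th nearest neighbor')
plt.show()  # the elbow of this curve is a reasonable eps
```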

### HDBSCAN

Hierarchical DBSCAN that handles variable cluster densities.

```python
from sklearn.cluster import HDBSCAN

hdbscan = HDBSCAN(
    min_cluster_size=5,
    min_samples=None,  # Defaults to min_cluster_size
    metric='euclidean'
)
labels = hdbscan.fit_predict(X)
```

**Advantages over DBSCAN**:
- Handles variable density clusters
- More robust parameter selection
- Provides cluster membership probabilities
- Hierarchical structure

**Use cases**: When DBSCAN struggles with varying densities

### Hierarchical Clustering

Builds nested cluster hierarchies using an agglomerative (bottom-up) approach.

```python
from sklearn.cluster import AgglomerativeClustering

agg_clust = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',  # 'ward', 'complete', 'average', 'single'
    metric='euclidean'
)
labels = agg_clust.fit_predict(X)

# Visualize with dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage as scipy_linkage
import matplotlib.pyplot as plt

linkage_matrix = scipy_linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
```

**Linkage methods**:
- `ward`: Minimizes variance (only with Euclidean) - **most common**
- `complete`: Maximum distance between clusters
- `average`: Average distance between clusters
- `single`: Minimum distance between clusters

**Use cases**:
- When hierarchical structure is meaningful
- Taxonomy/phylogenetic trees
- When visualization is important (dendrograms)

**Strengths**:
- No need to specify k initially (cut dendrogram at desired level)
- Produces hierarchy of clusters
- Deterministic

**Limitations**:
- Computationally expensive (O(n²) to O(n³))
- Not suitable for large datasets
- Cannot undo previous merges

### Spectral Clustering

Performs dimensionality reduction using affinity matrix before clustering.

```python
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(
    n_clusters=3,
    affinity='rbf',  # 'rbf', 'nearest_neighbors', 'precomputed'
    gamma=1.0,
    n_neighbors=10,
    random_state=42
)
labels = spectral.fit_predict(X)
```

**Use cases**:
- Non-convex clusters
- Image segmentation
- Graph clustering
- When similarity matrix is available

**Strengths**:
- Handles non-convex clusters
- Works with similarity matrices
- Often better than k-means for complex shapes

**Limitations**:
- Computationally expensive
- Requires specifying number of clusters
- Memory intensive

### Mean Shift

Discovers clusters through iterative centroid updates based on density.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)
cluster_centers = mean_shift.cluster_centers_
```

**Use cases**:
- When cluster count is unknown
- Computer vision applications
- Object tracking

**Strengths**:
- Automatically determines number of clusters
- Handles arbitrary shapes
- No assumptions about cluster shape

**Limitations**:
- Computationally expensive
- Very sensitive to bandwidth parameter
- Doesn't scale well

### Affinity Propagation

Uses message-passing between samples to identify exemplars.

```python
from sklearn.cluster import AffinityPropagation

affinity_prop = AffinityPropagation(
    damping=0.5,      # Damping factor (0.5-1.0)
    preference=None,  # Self-preference (controls number of clusters)
    random_state=42
)
labels = affinity_prop.fit_predict(X)
exemplars = affinity_prop.cluster_centers_indices_
```

**Use cases**:
- When number of clusters is unknown
- When exemplars (representative samples) are needed

**Strengths**:
- Automatically determines number of clusters
- Identifies exemplar samples
- No initialization required

**Limitations**:
- Very slow: O(n²t) where t is iterations
- Not suitable for large datasets
- Memory intensive

### Gaussian Mixture Models (GMM)

Probabilistic model assuming data comes from a mixture of Gaussian distributions.

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X)  # Soft clustering
```

**Covariance types**:
- `full`: Each component has its own covariance matrix
- `tied`: All components share same covariance
- `diag`: Diagonal covariance (independent features)
- `spherical`: Spherical covariance (isotropic)

**Use cases**:
- When soft clustering is needed (probabilities)
- When clusters have different shapes/sizes
- Generative modeling
- Density estimation

**Strengths**:
- Provides probabilities (soft clustering)
- Can handle elliptical clusters
- Generative model (can sample new data)
- Model selection with BIC/AIC

**Limitations**:
- Assumes Gaussian distributions
- Sensitive to initialization
- Can converge to local optima

**Model selection**:
```python
from sklearn.mixture import GaussianMixture
import numpy as np

n_components_range = range(2, 10)
bic_scores = []

for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bic_scores.append(gmm.bic(X))

optimal_n = n_components_range[np.argmin(bic_scores)]
```

### BIRCH

Builds a Clustering Feature (CF) tree for memory-efficient processing of large datasets.

```python
from sklearn.cluster import Birch

birch = Birch(
    n_clusters=3,
    threshold=0.5,
    branching_factor=50
)
labels = birch.fit_predict(X)
```

**Use cases**:
- Very large datasets
- Streaming data
- Memory constraints

**Strengths**:
- Memory efficient
- Single pass over data
- Incremental learning

## Dimensionality Reduction

### Principal Component Analysis (PCA)

Finds orthogonal components that explain maximum variance.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Specify number of components
pca = PCA(n_components=2, random_state=42)
X_transformed = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())

# Or specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_transformed = pca.fit_transform(X)
print(f"Components needed: {pca.n_components_}")

# Visualize explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```

**Use cases**:
- Visualization (reduce to 2-3 dimensions)
- Remove multicollinearity
- Noise reduction
- Speed up training
- Feature extraction

**Strengths**:
- Fast and efficient
- Reduces multicollinearity
- Works well for linear relationships
- Interpretable components

**Limitations**:
- Only linear transformations
- Sensitive to scaling (always standardize first!)
- Components may be hard to interpret

**Variants**:
- **IncrementalPCA**: For datasets that don't fit in memory
- **KernelPCA**: Non-linear dimensionality reduction
- **SparsePCA**: Sparse loadings for interpretability

### t-SNE

t-Distributed Stochastic Neighbor Embedding for visualization.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,  # Balance local vs global structure (5-50)
    learning_rate='auto',
    n_iter=1000,    # renamed to max_iter in recent scikit-learn releases
    random_state=42
)
X_embedded = tsne.fit_transform(X)

# Visualize (y: known labels, used only for coloring)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.show()
```

**Use cases**:
- Visualization only (do not use for preprocessing!)
- Exploring high-dimensional data
- Finding clusters visually

**Important notes**:
- **Only for visualization**, not for preprocessing
- Each run produces different results (use random_state for reproducibility)
- Slow for large datasets
- Cannot transform new data (no transform() method)

**Parameter tuning**:
- `perplexity`: 5-50, larger for larger datasets
- Lower perplexity = focus on local structure
- Higher perplexity = focus on global structure

### UMAP

Uniform Manifold Approximation and Projection (requires umap-learn package).

**Advantages over t-SNE**:
- Preserves global structure better
- Faster
- Can transform new data
- Can be used for preprocessing (not just visualization)
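
UMAP is not part of scikit-learn; a minimal sketch assuming the separate `umap-learn` package is installed and that `X` / `X_new` are feature matrices:

```python
# pip install umap-learn
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_embedded = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can project new data
X_new_embedded = reducer.transform(X_new)
```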

### Truncated SVD (LSA)

Similar to PCA but works with sparse matrices (e.g., TF-IDF).

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
```

**Use cases**:
- Text data (after TF-IDF)
- Sparse matrices
- Latent Semantic Analysis (LSA)

### Non-negative Matrix Factorization (NMF)

Factorizes data into non-negative components.

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)  # Document-topic matrix
H = nmf.components_       # Topic-word matrix
```

**Use cases**:
- Topic modeling
- Audio source separation
- Image processing
- When non-negativity is important (e.g., counts)

**Strengths**:
- Interpretable components (additive, non-negative)
- Sparse representations

### Independent Component Analysis (ICA)

Separates multivariate signal into independent components.

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=42)
X_independent = ica.fit_transform(X)
```

**Use cases**:
- Blind source separation
- Signal processing
- Feature extraction when independence is expected

### Factor Analysis

Models observed variables as linear combinations of latent factors plus noise.

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=5, random_state=42)
X_factors = fa.fit_transform(X)
```

**Use cases**:
- When noise is heteroscedastic
- Latent variable modeling
- Psychology/social science research

**Difference from PCA**: Models noise explicitly, assumes features have independent noise

## Anomaly Detection

### One-Class SVM

Learns a boundary around normal data.

```python
from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(
    nu=0.1,  # Proportion of outliers expected
    kernel='rbf',
    gamma='auto'
)
oc_svm.fit(X_train)
predictions = oc_svm.predict(X_test)  # 1 for inliers, -1 for outliers
```

**Use cases**:
- Novelty detection
- When only normal data is available for training

### Isolation Forest

Isolates outliers using an ensemble of random isolation trees.

```python
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    contamination=0.1,  # Expected proportion of outliers
    random_state=42
)
predictions = iso_forest.fit_predict(X)  # 1 for inliers, -1 for outliers
scores = iso_forest.score_samples(X)     # Anomaly scores
```

**Use cases**:
- General anomaly detection
- Works well with high-dimensional data
- Fast and scalable

**Strengths**:
- Fast
- Effective in high dimensions
- Low memory requirements

### Local Outlier Factor (LOF)

Detects outliers based on local density deviation.

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
predictions = lof.fit_predict(X)        # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_   # Anomaly scores (negative)
```

**Use cases**:
- Finding local outliers
- When global methods fail

## Clustering Evaluation

### With Ground Truth Labels

When true labels are available (for validation):

**Adjusted Rand Index (ARI)**:
```python
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_true, y_pred)
# Range: [-1, 1], 1 = perfect, 0 = random
```

**Normalized Mutual Information (NMI)**:
```python
from sklearn.metrics import normalized_mutual_info_score
nmi = normalized_mutual_info_score(y_true, y_pred)
# Range: [0, 1], 1 = perfect
```

**V-Measure**:
```python
from sklearn.metrics import v_measure_score
v = v_measure_score(y_true, y_pred)
# Range: [0, 1], harmonic mean of homogeneity and completeness
```

### Without Ground Truth Labels

When true labels are unavailable (unsupervised evaluation):

**Silhouette Score**:
Measures how similar objects are to their own cluster vs other clusters.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

score = silhouette_score(X, labels)
# Range: [-1, 1], higher is better
# >0.7: Strong structure
# 0.5-0.7: Reasonable structure
# 0.25-0.5: Weak structure
# <0.25: No substantial structure

# Per-sample scores for detailed analysis
sample_scores = silhouette_samples(X, labels)

# Visualize silhouette plot (stack one horizontal band per cluster)
n_clusters = len(np.unique(labels))
y_lower = 0
for i in range(n_clusters):
    cluster_scores = sample_scores[labels == i]
    cluster_scores.sort()
    plt.barh(range(y_lower, y_lower + len(cluster_scores)), cluster_scores)
    y_lower += len(cluster_scores)
plt.axvline(x=score, color='red', linestyle='--')
plt.show()
```

**Davies-Bouldin Index**:
```python
from sklearn.metrics import davies_bouldin_score
db = davies_bouldin_score(X, labels)
# Lower is better, 0 = perfect
```

**Calinski-Harabasz Index** (Variance Ratio Criterion):
```python
from sklearn.metrics import calinski_harabasz_score
ch = calinski_harabasz_score(X, labels)
# Higher is better
```

**Inertia** (K-Means specific):
```python
inertia = kmeans.inertia_
# Sum of squared distances to nearest cluster center
# Use for elbow method
```

### Elbow Method (K-Means)

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia starts decreasing more slowly
```

## Best Practices

### Clustering Algorithm Selection

**Use K-Means when**:
- Clusters are spherical and similar size
- Speed is important
- Data is not too high-dimensional

**Use DBSCAN when**:
- Arbitrary cluster shapes
- Number of clusters unknown
- Outlier detection needed

**Use Hierarchical when**:
- Hierarchy is meaningful
- Small to medium datasets
- Visualization is important

**Use GMM when**:
- Soft clustering needed
- Clusters have different shapes/sizes
- Probabilistic interpretation needed

**Use Spectral Clustering when**:
- Non-convex clusters
- Have similarity matrix
- Moderate dataset size

### Preprocessing for Clustering

1. **Always scale features**: Use StandardScaler or MinMaxScaler
2. **Handle outliers**: Remove or use robust algorithms (DBSCAN, HDBSCAN)
3. **Reduce dimensionality if needed**: PCA for speed, careful with interpretation
4. **Check for categorical variables**: Encode appropriately or use specialized algorithms

### Dimensionality Reduction Guidelines

**For preprocessing/feature extraction**:
- PCA (linear relationships)
- TruncatedSVD (sparse data)
- NMF (non-negative data)

**For visualization only**:
- t-SNE (preserves local structure)
- UMAP (preserves both local and global structure)

**Always**:
- Standardize features before PCA
- Use appropriate n_components (elbow plot, explained variance)
- Don't use t-SNE for anything except visualization

### Common Pitfalls

1. **Not scaling data**: Most algorithms sensitive to scale
2. **Using t-SNE for preprocessing**: Only for visualization!
3. **Overfitting cluster count**: Too many clusters = overfitting noise
4. **Ignoring outliers**: Can severely affect centroid-based methods
5. **Wrong metric**: Euclidean assumes all features equally important
6. **Not validating results**: Always check with multiple metrics and domain knowledge
7. **PCA without standardization**: Components dominated by high-variance features

@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
Complete classification pipeline with preprocessing, training, evaluation, and hyperparameter tuning.
Demonstrates best practices for scikit-learn workflows.
"""

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import joblib


def create_preprocessing_pipeline(numeric_features, categorical_features):
    """
    Create preprocessing pipeline for mixed data types.

    Args:
        numeric_features: List of numeric column names
        categorical_features: List of categorical column names

    Returns:
        ColumnTransformer with appropriate preprocessing for each data type
    """
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    return preprocessor


def create_full_pipeline(preprocessor, classifier=None):
    """
    Create complete ML pipeline with preprocessing and classification.

    Args:
        preprocessor: Preprocessing ColumnTransformer
        classifier: Classifier instance (default: RandomForestClassifier)

    Returns:
        Complete Pipeline
    """
    if classifier is None:
        classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])

    return pipeline


def evaluate_model(pipeline, X_train, y_train, X_test, y_test, cv=5):
    """
    Evaluate model using cross-validation and test set.

    Args:
        pipeline: Trained pipeline
        X_train, y_train: Training data
        X_test, y_test: Test data
        cv: Number of cross-validation folds

    Returns:
        Dictionary with evaluation results
    """
    # Cross-validation on training set
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')

    # Test set evaluation
    y_pred = pipeline.predict(X_test)
    test_score = pipeline.score(X_test, y_test)

    # Get probabilities if available
    try:
        y_proba = pipeline.predict_proba(X_test)
        if len(np.unique(y_test)) == 2:
            # Binary classification
            auc = roc_auc_score(y_test, y_proba[:, 1])
        else:
            # Multiclass
            auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
    except (AttributeError, ValueError):
        # Estimator has no predict_proba, or AUC is undefined for this target
        auc = None

    results = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_score': test_score,
        'auc': auc,
        'classification_report': classification_report(y_test, y_pred),
        'confusion_matrix': confusion_matrix(y_test, y_pred)
    }

    return results


def tune_hyperparameters(pipeline, X_train, y_train, param_grid, cv=5):
    """
    Perform hyperparameter tuning using GridSearchCV.

    Args:
        pipeline: Pipeline to tune
        X_train, y_train: Training data
        param_grid: Dictionary of parameters to search
        cv: Number of cross-validation folds

    Returns:
        GridSearchCV object with best model
    """
    grid_search = GridSearchCV(
        pipeline,
        param_grid,
        cv=cv,
        scoring='f1_weighted',
        n_jobs=-1,
        verbose=1
    )

    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.3f}")

    return grid_search


def main():
    """
    Example usage of the classification pipeline.
    """
    # Load your data here
    # X, y = load_data()

    # Example with synthetic data
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=1000,
        n_features=20,
        n_informative=15,
        n_redundant=5,
        random_state=42
    )

    # Convert to DataFrame for demonstration
    feature_names = [f'feature_{i}' for i in range(X.shape[1])]
    X = pd.DataFrame(X, columns=feature_names)

    # Split features into numeric and categorical (all numeric in this example)
    numeric_features = feature_names
    categorical_features = []

    # Split data (use stratify for imbalanced classes)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)

    # Create full pipeline
    pipeline = create_full_pipeline(preprocessor)

    # Train model
    print("Training model...")
    pipeline.fit(X_train, y_train)

    # Evaluate model
    print("\nEvaluating model...")
    results = evaluate_model(pipeline, X_train, y_train, X_test, y_test)

    print(f"CV Accuracy: {results['cv_mean']:.3f} (+/- {results['cv_std']:.3f})")
    print(f"Test Accuracy: {results['test_score']:.3f}")
    if results['auc'] is not None:
        print(f"ROC-AUC: {results['auc']:.3f}")
    print("\nClassification Report:")
    print(results['classification_report'])

    # Hyperparameter tuning (optional)
    print("\nTuning hyperparameters...")
    param_grid = {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [10, 20, None],
        'classifier__min_samples_split': [2, 5]
    }

    grid_search = tune_hyperparameters(pipeline, X_train, y_train, param_grid)

    # Evaluate best model
    print("\nEvaluating tuned model...")
    best_pipeline = grid_search.best_estimator_
    y_pred = best_pipeline.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Save model
    print("\nSaving model...")
    joblib.dump(best_pipeline, 'best_model.pkl')
    print("Model saved as 'best_model.pkl'")


if __name__ == "__main__":
    main()
291
scientific-packages/scikit-learn/scripts/clustering_analysis.py
Normal file
@@ -0,0 +1,291 @@
#!/usr/bin/env python3
"""
Clustering analysis script with multiple algorithms and evaluation.
Demonstrates k-means, DBSCAN, and hierarchical clustering with visualization.
"""

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns


def scale_data(X):
    """
    Scale features using StandardScaler.
    ALWAYS scale data before clustering!

    Args:
        X: Feature matrix

    Returns:
        Scaled feature matrix and fitted scaler
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, scaler


def find_optimal_k(X_scaled, k_range=range(2, 11)):
    """
    Find optimal number of clusters using elbow method and silhouette score.

    Args:
        X_scaled: Scaled feature matrix
        k_range: Range of k values to try

    Returns:
        Dictionary with inertias and silhouette scores
    """
    inertias = []
    silhouette_scores = []

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X_scaled, labels))

    return {
        'k_values': list(k_range),
        'inertias': inertias,
        'silhouette_scores': silhouette_scores
    }


def plot_elbow_silhouette(results):
    """
    Plot elbow method and silhouette scores.

    Args:
        results: Dictionary from find_optimal_k
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Elbow plot
    ax1.plot(results['k_values'], results['inertias'], 'bo-')
    ax1.set_xlabel('Number of clusters (k)')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')
    ax1.grid(True, alpha=0.3)

    # Silhouette plot
    ax2.plot(results['k_values'], results['silhouette_scores'], 'ro-')
    ax2.set_xlabel('Number of clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Score vs k')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('elbow_silhouette.png', dpi=300, bbox_inches='tight')
    print("Saved elbow and silhouette plots to 'elbow_silhouette.png'")
    plt.close()


def evaluate_clustering(X_scaled, labels, algorithm_name):
    """
    Evaluate clustering using multiple metrics.

    Args:
        X_scaled: Scaled feature matrix
        labels: Cluster labels
        algorithm_name: Name of clustering algorithm

    Returns:
        Dictionary with evaluation metrics
    """
    # Filter out noise points for DBSCAN (-1 labels)
    mask = labels != -1
    X_filtered = X_scaled[mask]
    labels_filtered = labels[mask]

    n_clusters = len(set(labels_filtered))
    n_noise = list(labels).count(-1)

    results = {
        'algorithm': algorithm_name,
        'n_clusters': n_clusters,
        'n_noise': n_noise
    }

    # Calculate metrics if we have valid clusters
    if n_clusters > 1:
        results['silhouette'] = silhouette_score(X_filtered, labels_filtered)
        results['davies_bouldin'] = davies_bouldin_score(X_filtered, labels_filtered)
        results['calinski_harabasz'] = calinski_harabasz_score(X_filtered, labels_filtered)
    else:
        results['silhouette'] = None
        results['davies_bouldin'] = None
        results['calinski_harabasz'] = None

    return results


def perform_kmeans(X_scaled, n_clusters=3):
    """
    Perform k-means clustering.

    Args:
        X_scaled: Scaled feature matrix
        n_clusters: Number of clusters

    Returns:
        Fitted KMeans model and labels
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    return kmeans, labels


def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """
    Perform DBSCAN clustering.

    Args:
        X_scaled: Scaled feature matrix
        eps: Maximum distance between neighbors
        min_samples: Minimum points to form dense region

    Returns:
        Fitted DBSCAN model and labels
    """
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    labels = dbscan.fit_predict(X_scaled)
    return dbscan, labels


def perform_hierarchical(X_scaled, n_clusters=3, linkage='ward'):
    """
    Perform hierarchical clustering.

    Args:
        X_scaled: Scaled feature matrix
        n_clusters: Number of clusters
        linkage: Linkage criterion ('ward', 'complete', 'average', 'single')

    Returns:
        Fitted AgglomerativeClustering model and labels
    """
    hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    labels = hierarchical.fit_predict(X_scaled)
    return hierarchical, labels


def visualize_clusters_2d(X_scaled, labels, algorithm_name, method='pca'):
    """
    Visualize clusters in 2D using PCA or t-SNE.

    Args:
        X_scaled: Scaled feature matrix
        labels: Cluster labels
        algorithm_name: Name of algorithm for title
        method: 'pca' or 'tsne'
    """
    # Reduce to 2D
    if method == 'pca':
        pca = PCA(n_components=2, random_state=42)
        X_2d = pca.fit_transform(X_scaled)
        variance = pca.explained_variance_ratio_
        xlabel = f'PC1 ({variance[0]:.1%} variance)'
        ylabel = f'PC2 ({variance[1]:.1%} variance)'
    else:
        from sklearn.manifold import TSNE
        # Use PCA first to speed up t-SNE
        pca = PCA(n_components=min(50, X_scaled.shape[1]), random_state=42)
        X_pca = pca.fit_transform(X_scaled)
        tsne = TSNE(n_components=2, random_state=42, perplexity=30)
        X_2d = tsne.fit_transform(X_pca)
        xlabel = 't-SNE 1'
        ylabel = 't-SNE 2'

    # Plot
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6, s=50)
    plt.colorbar(scatter, label='Cluster')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(f'{algorithm_name} Clustering ({method.upper()})')
    plt.grid(True, alpha=0.3)

    filename = f'{algorithm_name.lower().replace(" ", "_")}_{method}.png'
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    print(f"Saved visualization to '{filename}'")
    plt.close()


def main():
    """
    Example clustering analysis workflow.
    """
    # Load your data here
    # X = load_data()

    # Example with synthetic data
    from sklearn.datasets import make_blobs
    X, y_true = make_blobs(
        n_samples=500,
        n_features=10,
        centers=4,
        cluster_std=1.0,
        random_state=42
    )

    print(f"Dataset shape: {X.shape}")

    # Scale data (ALWAYS scale for clustering!)
    print("\nScaling data...")
    X_scaled, scaler = scale_data(X)

    # Find optimal k
    print("\nFinding optimal number of clusters...")
    results = find_optimal_k(X_scaled)
    plot_elbow_silhouette(results)

    # Based on elbow/silhouette, choose optimal k
    optimal_k = 4  # Adjust based on plots

    # Perform k-means
    print(f"\nPerforming k-means with k={optimal_k}...")
    kmeans, kmeans_labels = perform_kmeans(X_scaled, n_clusters=optimal_k)
    kmeans_results = evaluate_clustering(X_scaled, kmeans_labels, 'K-Means')

    # Perform DBSCAN
    print("\nPerforming DBSCAN...")
    dbscan, dbscan_labels = perform_dbscan(X_scaled, eps=0.5, min_samples=5)
    dbscan_results = evaluate_clustering(X_scaled, dbscan_labels, 'DBSCAN')

    # Perform hierarchical clustering
    print("\nPerforming hierarchical clustering...")
    hierarchical, hier_labels = perform_hierarchical(X_scaled, n_clusters=optimal_k)
    hier_results = evaluate_clustering(X_scaled, hier_labels, 'Hierarchical')

    # Print results
    print("\n" + "="*60)
    print("CLUSTERING RESULTS")
    print("="*60)

    for results in [kmeans_results, dbscan_results, hier_results]:
        print(f"\n{results['algorithm']}:")
        print(f"  Clusters: {results['n_clusters']}")
        if results['n_noise'] > 0:
            print(f"  Noise points: {results['n_noise']}")
        if results['silhouette'] is not None:
            print(f"  Silhouette Score: {results['silhouette']:.3f}")
            print(f"  Davies-Bouldin Index: {results['davies_bouldin']:.3f} (lower is better)")
            print(f"  Calinski-Harabasz Index: {results['calinski_harabasz']:.1f} (higher is better)")

    # Visualize clusters
    print("\nCreating visualizations...")
    visualize_clusters_2d(X_scaled, kmeans_labels, 'K-Means', method='pca')
    visualize_clusters_2d(X_scaled, dbscan_labels, 'DBSCAN', method='pca')
    visualize_clusters_2d(X_scaled, hier_labels, 'Hierarchical', method='pca')

    print("\nClustering analysis complete!")


if __name__ == "__main__":
    main()