Improve the scikit-learn skill

Timothy Kassis
2025-11-04 10:11:46 -08:00
parent 63a4293f1a
commit 4ad4f9970f
10 changed files with 3293 additions and 3606 deletions

# Data Preprocessing and Feature Engineering Reference
## Overview
Data preprocessing transforms raw data into a format suitable for machine learning models: scaling, encoding, handling missing values, and feature engineering. Many algorithms require standardized or normalized data to perform well.
## Feature Scaling and Normalization
### StandardScaler
Removes the mean and scales to unit variance (z-score normalization).
**Formula**: `z = (x - μ) / σ`
**Use cases**:
- Most ML algorithms (especially SVM, KNN, neural networks, PCA)
- Linear/logistic regression with regularization
- When features have different units or scales
- When assuming a Gaussian-like distribution
**Important**: Fit only on training data, then transform both train and test sets.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training
# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```
### MinMaxScaler
Scales features to a specified range, typically [0, 1].
**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`
**Use cases**:
- When a bounded range is needed
- Neural networks (often prefer the [0, 1] range)
- When the distribution is not Gaussian
- Image pixel values
**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)
**Warning**: Sensitive to outliers since it uses min/max.
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)
# Custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
```
### RobustScaler
Uses the median and interquartile range (IQR) instead of the mean and standard deviation.
**Formula**: `X_scaled = (X - median) / IQR`
**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- When robust statistics are preferred
**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)
```python
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```
### MaxAbsScaler
Scales to [-1, 1] by dividing by the maximum absolute value.
**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When the sign of values is meaningful
**Advantage**: Doesn't shift/center the data, preserves zero entries.
```python
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)
```
## Normalization
### Normalizer
Scales individual samples (rows) to unit norm, not features (columns).
**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight
**Norms**:
- `l1`: Manhattan norm (sum of absolute values = 1)
- `l2`: Euclidean norm (sum of squares = 1) - most common
- `max`: Maximum absolute value = 1
**Key difference from scalers**: Operates on rows (samples), not columns (features).
```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2')  # Euclidean norm
X_normalized = normalizer.fit_transform(X)
```
## Encoding Categorical Variables
### OneHotEncoder
Creates binary columns for each category.
**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When category relationships shouldn't be assumed
**Parameters**:
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit the number of categories
```python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)
# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])
# Handle unknown categories during transform
X_test_encoded = encoder.transform(X_test_categorical)
```
**High cardinality handling**:
```python
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
# Groups categories appearing < 100 times into an 'infrequent' category
```
**Memory tip**: Use `sparse_output=True` (the default) for high-cardinality features.
### OrdinalEncoder
Converts categories to integers (0 to n_categories - 1).
**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integer codes)
**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data
```python
from sklearn.preprocessing import OrdinalEncoder
# Natural ordering
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
# Custom ordering
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_categorical)
```
### TargetEncoder
Uses target statistics to encode categories.
**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When linear relationships with the target are expected
- Often improves performance over one-hot encoding
**How it works**:
- Replaces each category with the mean of the target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories
**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy
**Warning**: Only for supervised learning. Requires the target variable.
```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```
### LabelEncoder
Encodes target labels as integers 0 to n_classes - 1.
**Use cases**: Encoding the target variable for classification (not features!)
**Important**: Use `LabelEncoder` for targets only. For features, use OrdinalEncoder or OneHotEncoder.
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```
### Binarizer
Converts numeric values to binary (0 or 1) based on a threshold.
**Use cases**: Creating binary features from continuous values (see the Binarization section below for an example).
### Target Encoding (using category_encoders)
```python
# Install: uv pip install category-encoders
from category_encoders import TargetEncoder
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
X_test_encoded = encoder.transform(X_test_categorical)
```
## Non-linear Transformations
### QuantileTransformer
Maps features to a uniform or normal distribution using a rank transformation.
**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When a normal distribution is desired
**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))
**Effect**: A strong transformation that reduces outlier influence and makes data more Gaussian-like.
```python
from sklearn.preprocessing import QuantileTransformer
# Transform to a uniform distribution
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
X_transformed = qt.fit_transform(X)
# Transform to a normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_transformed = qt.fit_transform(X)
```
### PowerTransformer
Applies a parametric, monotonic transformation to make data more Gaussian.
**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Positive values only
**Use cases**:
- Skewed distributions
- When the Gaussian assumption is important
- Variance stabilization
**Advantage**: Less radical than QuantileTransformer, preserves more of the original relationships.
```python
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson (handles negative values)
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)
# Box-Cox (positive values only)
pt = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt.fit_transform(X)
```
### Log Transform
Useful for right-skewed, non-negative data such as counts.
```python
import numpy as np
# log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)
# Or use FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```
## Missing Value Imputation
### SimpleImputer
Imputes missing values with basic strategies.
**Strategies**:
- `mean`: Mean of the column (numeric only)
- `median`: Median of the column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with a constant value
**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing (np.nan, None, a specific value)
```python
from sklearn.impute import SimpleImputer
# For numerical features
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)
# Fill with a constant
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
```
### KNNImputer
Imputes using the k-nearest neighbors.
**Use cases**: When relationships between correlated features should inform imputation.
**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'
```python
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
### IterativeImputer
Models each feature with missing values as a function of the other features.
**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher-quality imputation (but slower)
**Parameters**:
- `estimator`: Estimator for the regression (default: BayesianRidge)
- `max_iter`: Maximum iterations
```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```
## Feature Engineering
### FunctionTransformer
Applies a custom function to the data.
**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing
See the Custom Transformers section below for examples.
### Polynomial Features
**PolynomialFeatures**
- Generates polynomial and interaction features
- Use when: Adding non-linearity to linear models, polynomial regression, feature engineering
- Parameters: `degree` (polynomial degree), `interaction_only` (only multiplicative interactions, no x²), `include_bias` (include a constant feature)
- Warning: The number of features grows rapidly, as (n+d)!/(d!·n!) for degree d
- Example:
```python
from sklearn.preprocessing import PolynomialFeatures
# Degree 2: [x1, x2] → [x1, x2, x1², x1·x2, x2²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get feature names
feature_names = poly.get_feature_names_out(['x1', 'x2'])
# Only interactions (no powers)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
```
### Binning/Discretization
**KBinsDiscretizer**
- Bins continuous features into discrete intervals
- Strategies: 'uniform' (equal-width), 'quantile' (equal-frequency), 'kmeans' (cluster-based bins)
- Encoding: 'onehot', 'ordinal', 'onehot-dense'
- Use cases: Letting linear models capture non-linear relationships, reducing noise, making features more interpretable
- Example:
```python
from sklearn.preprocessing import KBinsDiscretizer
# Equal-width bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# Equal-frequency bins (quantile-based)
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = binner.fit_transform(X)
```
### Binarization
**Binarizer**
- Converts features to binary (0 or 1) based on threshold
- Example:
```python
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)
```
### Spline Features
**SplineTransformer**
- Generates B-spline basis functions for smooth non-linear transformations
- An alternative to PolynomialFeatures with less oscillation at the boundaries; useful for generalized additive models (GAMs)
- Key parameters: `n_knots`, `degree`, `knots` ('uniform', 'quantile', or an array of knot positions)
- Example:
```python
from sklearn.preprocessing import SplineTransformer
spline = SplineTransformer(n_knots=5, degree=3)
X_splines = spline.fit_transform(X)
```
## Text Feature Extraction
### CountVectorizer
**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
- Converts text to token count matrix
- Use for: Bag-of-words representation
- Example:
```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
max_features=5000, # Keep top 5000 features
min_df=2, # Ignore terms appearing in < 2 documents
max_df=0.8, # Ignore terms appearing in > 80% documents
ngram_range=(1, 2) # Unigrams and bigrams
)
X_counts = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```
### TfidfVectorizer
**TfidfVectorizer**
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
- Better than CountVectorizer for most tasks
- Example:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
max_features=5000,
min_df=2,
max_df=0.8,
ngram_range=(1, 2),
stop_words='english' # Remove English stop words
)
X_tfidf = vectorizer.fit_transform(documents)
```
### HashingVectorizer
**HashingVectorizer**
- Uses hashing trick for memory efficiency
- No fit needed, can't reverse transform
- Use when: Very large vocabulary, streaming data
- Example:
```python
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=2**18)
X_hashed = vectorizer.transform(documents) # No fit needed
```
## Feature Selection
### Filter Methods
**Variance Threshold**
- Removes low-variance features
- Example:
```python
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```
**SelectKBest / SelectPercentile**
- Select features based on statistical tests
- Tests: f_classif, chi2, mutual_info_classif
- Example:
```python
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
```
### Wrapper Methods
**Recursive Feature Elimination (RFE)**
- Recursively removes features
- Uses model feature importances
- Example:
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
X_selected = rfe.fit_transform(X_train, y_train)
# Get selected features
selected_features = rfe.support_
feature_ranking = rfe.ranking_
```
**RFECV (with Cross-Validation)**
- RFE with cross-validation to find optimal number of features
- Example:
```python
from sklearn.feature_selection import RFECV
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
X_selected = rfecv.fit_transform(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
```
### Embedded Methods
**SelectFromModel**
- Select features based on model coefficients/importances
- Works with: Linear models (L1), Tree-based models
- Example:
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
# Get selected features
selected_features = selector.get_support()
```
**L1-based Feature Selection**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```
## Handling Outliers
### IQR Method
```python
import numpy as np
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```
### Winsorization
```python
from scipy.stats import mstats
# Clip outliers at 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```
## Custom Transformers
### Using FunctionTransformer
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Using a NumPy function directly
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X_log = log_transformer.fit_transform(X)
# Using a custom function
def log_transform(X):
    return np.log1p(X)
transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
X_transformed = transformer.fit_transform(X)
```
### Creating Custom Transformer
```python
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter
    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self
    def transform(self, X):
        # Transform X
        return X * self.parameter
transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```
## Best Practices
### Fit on Training Data Only
Always fit transformers on the training data only:
```python
# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```
### Feature Scaling Guidelines
**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms
**Don't need to scale**:
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
### Use Pipelines
Always use preprocessing within pipelines to prevent data leakage:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)  # Scaler fit only on train data
y_pred = pipeline.predict(X_test)  # Scaler transform only on test data
```
### Common Transformations by Data Type
**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)
**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation
**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder
**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with min_frequency parameter
**Categorical - Ordinal**:
- OrdinalEncoder
**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
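To make the mapping above concrete, here is a minimal sketch that wires several of these choices into a single ColumnTransformer. The column names (`age`, `visits`, `color`, `size`) are hypothetical placeholders, not from the original text.
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer
)
# Count data: log1p then standardize; continuous: standardize directly
count_pipeline = Pipeline([
    ('log', FunctionTransformer(np.log1p)),
    ('scale', StandardScaler())
])
preprocessor = ColumnTransformer([
    ('continuous', StandardScaler(), ['age']),
    ('counts', count_pipeline, ['visits']),
    ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color']),
    ('ordinal', OrdinalEncoder(categories=[['small', 'medium', 'large']]), ['size'])
])
# X_df is assumed to be a pandas DataFrame containing these columns
# X_prepared = preprocessor.fit_transform(X_df)
```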
### Data Leakage Prevention
1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
4. **Target encoding**: Use the cv parameter in TargetEncoder for cross-fitting
```python
# WRONG - data leakage
scaler = StandardScaler().fit(X_full)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# CORRECT - no leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
### Handle Categorical and Numerical Features Separately
Use ColumnTransformer:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
X_transformed = preprocessor.fit_transform(X)
```
## Preprocessing Checklist
Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for the algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline
### Algorithm-Specific Requirements
**Require Scaling:**
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering
**Don't Require Scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes
**Encoding Requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly

# Scikit-learn Quick Reference
## Common Import Patterns
```python
# Core
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# Pipelines and column handling
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer
# Feature selection
from sklearn.feature_selection import SelectKBest, RFE
# Models - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor
)
# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
# Dimensionality reduction
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.manifold import TSNE
# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)
```
## Installation
```bash
# Using uv (recommended)
uv pip install scikit-learn
# Optional dependencies
uv pip install scikit-learn[plots]  # For plotting utilities
uv pip install pandas numpy matplotlib seaborn  # Common companions
```
## Quick Workflow Templates
### Classification Pipeline
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import classification_report, confusion_matrix
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
X, y, test_size=0.2, stratify=y, random_state=42
)
# Scale features
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Predict and evaluate
# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```
### Regression Pipeline
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Preprocess and train
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
```
### Cross-Validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
### With Pipeline (Recommended)
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
# Evaluate
score = pipeline.score(X_test, y_test)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Test accuracy: {score:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```
### Complete Pipeline with Mixed Data Types
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']
# Create preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
# Full pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
## Model Selection Cheat Sheet
### Quick Decision Tree
```
Is it supervised?
├─ Yes
│ ├─ Predicting categories? → Classification
│ │ ├─ Start with: LogisticRegression (baseline)
│ │ ├─ Then try: RandomForestClassifier
│ │ └─ Best performance: HistGradientBoostingClassifier
│ └─ Predicting numbers? → Regression
│ ├─ Start with: LinearRegression/Ridge (baseline)
│ ├─ Then try: RandomForestRegressor
│ └─ Best performance: HistGradientBoostingRegressor
└─ No
├─ Grouping similar items? → Clustering
│ ├─ Know # clusters: KMeans
│ └─ Unknown # clusters: DBSCAN or HDBSCAN
├─ Reducing dimensions?
│ ├─ For preprocessing: PCA
│ └─ For visualization: t-SNE or UMAP
└─ Finding outliers? → IsolationForest or LocalOutlierFactor
```
### Algorithm Selection by Data Size
- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC
### When to Scale Features
**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm
**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
## Hyperparameter Tuning
### GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}
model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Use the best model
best_model = grid_search.best_estimator_
```
### RandomizedSearchCV (Faster)
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # Number of combinations to try
    cv=5,
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
```
## Common Patterns
### Loading Data
```python
# From scikit-learn datasets
from sklearn.datasets import load_iris, load_digits, make_classification
# Built-in datasets
iris = load_iris()
X, y = iris.data, iris.target
# Synthetic data
X, y = make_classification(
    n_samples=1000, n_features=20, n_classes=2, random_state=42
)
# From pandas
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
```
### Pipeline with GridSearchCV
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
param_grid = {
'svm__C': [0.1, 1, 10],
'svm__kernel': ['rbf', 'linear'],
'svm__gamma': ['scale', 'auto']
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
```
## Cross-Validation
### Basic Cross-Validation
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Multiple Metrics
```python
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Custom CV Strategies
```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# For time series
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```
## Common Metrics
### Classification
```python
from sklearn.metrics import (
accuracy_score, balanced_accuracy_score,
precision_score, recall_score, f1_score,
confusion_matrix, classification_report,
roc_auc_score
)
# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
# Comprehensive report
print(classification_report(y_true, y_pred))
# ROC AUC (requires probabilities)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_true, y_proba)
```
### Regression
```python
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score
)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
```
## Feature Engineering
### Polynomial Features
```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```
### Feature Selection
```python
from sklearn.feature_selection import (
SelectKBest, f_classif,
RFE,
SelectFromModel
)
# Univariate selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Recursive feature elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
# Model-based selection
selector = SelectFromModel(
RandomForestClassifier(n_estimators=100),
threshold='median'
)
X_selected = selector.fit_transform(X, y)
```
### Handling Imbalanced Data
```python
# Use the class_weight parameter
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
# Or use appropriate metrics
from sklearn.metrics import balanced_accuracy_score, f1_score
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
```
### Feature Importance
```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Tree-based models expose feature_importances_
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Rank feature importances
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importances.head(10))
# Visualize
indices = np.argsort(model.feature_importances_)[::-1]
plt.bar(range(X.shape[1]), model.feature_importances_[indices])
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.show()
# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
perm_importances = result.importances_mean
```
## Clustering
### K-Means
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Always scale data first for k-means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)
# Evaluate
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")
```
### Elbow Method
```python
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
```
### DBSCAN
```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# -1 indicates noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```
## Dimensionality Reduction
### PCA
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Specify n_components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
# Or specify the variance to retain
pca_95 = PCA(n_components=0.95)  # Keep 95% of the variance
X_pca_95 = pca_95.fit_transform(X_scaled)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
# Plot the first two components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title(f'PCA (explained variance: {pca.explained_variance_ratio_.sum():.2%})')
plt.show()
```
### t-SNE (Visualization Only)
```python
from sklearn.manifold import TSNE
# Reduce to 50 dimensions with PCA first (recommended)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)
# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.show()
```
## Saving and Loading Models
```python
import joblib
# Save model or pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')
# Load
loaded_model = joblib.load('model.pkl')
pipeline = joblib.load('pipeline.pkl')
# Use the loaded model
predictions = loaded_model.predict(X_new)
```
## Common Pitfalls and Solutions
### Data Leakage
**Wrong**: Fit on all data before splitting
```python
# WRONG: Fitting the scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
```
**Correct**: Fit only on training data, or use a pipeline
```python
# RIGHT: Fit on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# BEST: Use a Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)  # No leakage!
```
### Not Scaling
**Wrong**: Using SVM without scaling
```python
svm = SVC()
svm.fit(X_train, y_train)
```
**Correct**: Scale for SVM
```python
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
pipeline.fit(X_train, y_train)
```
### Wrong Metric for Imbalanced Data
**Wrong**: Using accuracy for 99:1 imbalance
```python
accuracy = accuracy_score(y_true, y_pred) # Can be misleading
```
**Correct**: Use appropriate metrics
```python
f1 = f1_score(y_true, y_pred, average='weighted')
balanced_acc = balanced_accuracy_score(y_true, y_pred)
```
### Not Using Stratification
**Wrong**: Random split for imbalanced data
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
**Correct**: Always stratify classification splits, especially with imbalanced classes
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
### Random State for Reproducibility
```python
# Set random_state for reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=42)
```
### Handling Unknown Categories
```python
# Use handle_unknown='ignore' for OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
```
### Feature Names with Pipelines
```python
# Get feature names after transformation
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
```
## Cheat Sheet: Algorithm Selection
### Classification
| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Binary/Multiclass | Logistic Regression | Fast baseline, interpretability |
| Binary/Multiclass | Random Forest | Good default, robust |
| Binary/Multiclass | Gradient Boosting | Best accuracy, willing to tune |
| Binary/Multiclass | SVM | Small data, complex boundaries |
| Binary/Multiclass | Naive Bayes | Text classification, fast |
| High dimensions | Linear SVM or Logistic | Text, many features |
### Regression
| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Continuous target | Linear Regression | Fast baseline, interpretability |
| Continuous target | Ridge/Lasso | Regularization needed |
| Continuous target | Random Forest | Good default, non-linear |
| Continuous target | Gradient Boosting | Best accuracy |
| Continuous target | SVR | Small data, non-linear |
### Clustering
| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Known K, spherical | K-Means | Fast, simple |
| Unknown K, arbitrary shapes | DBSCAN | Noise/outliers present |
| Hierarchical structure | Agglomerative | Need dendrogram |
| Soft clustering | Gaussian Mixture | Probability estimates |
### Dimensionality Reduction
| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Linear reduction | PCA | Variance explanation |
| Visualization | t-SNE | 2D/3D plots |
| Non-negative data | NMF | Images, text |
| Sparse data | TruncatedSVD | Text, recommender systems |
## Performance Tips
1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
2. **Use HistGradientBoosting** for large datasets (>10K samples)
3. **Use MiniBatchKMeans** for large clustering tasks
4. **Use IncrementalPCA** for data that doesn't fit in memory
5. **Use sparse matrices** for high-dimensional sparse data (text)
6. **Cache transformers** in pipelines during grid search
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
8. **Reduce dimensionality** with PCA before applying expensive algorithms
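Tip 6 above (caching transformers) uses the `memory` argument of `Pipeline`. A minimal sketch, assuming an expensive preprocessing step that is reused while only the classifier's parameters change; the specific steps and parameter grid are illustrative, not from the original text:
```python
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
cache_dir = mkdtemp()  # fitted transformers are cached on disk here
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
], memory=cache_dir)
# Only 'clf__C' varies, so cached transformer fits are reused across settings
grid = GridSearchCV(pipeline, {'clf__C': [0.1, 1, 10]}, cv=5)
# grid.fit(X_train, y_train)  # assumes X_train, y_train are defined
```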
### Speed Up Training
```python
# Use n_jobs=-1 for parallel processing
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# Use warm_start for incremental learning
model = RandomForestClassifier(n_estimators=100, warm_start=True)
model.fit(X, y)
model.n_estimators += 50
model.fit(X, y) # Adds 50 more trees
# Use partial_fit for online learning
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch, classes=np.unique(y))
```
### Memory Efficiency
```python
# Use sparse matrices
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)
# Use MiniBatchKMeans for large data
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
```
## Version Check
```python
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
```
## Useful Resources
- Official Documentation: https://scikit-learn.org/stable/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- API Reference: https://scikit-learn.org/stable/api/index.html
- Examples: https://scikit-learn.org/stable/auto_examples/index.html
- Tutorials: https://scikit-learn.org/stable/tutorial/index.html

# Supervised Learning Reference
## Overview
Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides implementations for both classification and regression tasks.
## Linear Models
### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination
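The last two entries in this list do not get detailed sections below; as a minimal sketch of BayesianRidge (the use of `return_std=True` to obtain per-sample predictive uncertainty is shown as an illustration, assuming `X_train`, `y_train`, and `X_test` are defined):
```python
from sklearn.linear_model import BayesianRidge
model = BayesianRidge()
model.fit(X_train, y_train)
# Predictions plus a per-sample standard deviation of the predictive distribution
y_pred, y_std = model.predict(X_test, return_std=True)
```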
**Linear Regression (`sklearn.linear_model.LinearRegression`)**
- Ordinary least squares regression
- Fast, interpretable, no hyperparameters
- Use when: Linear relationships, interpretability matters
- Example:
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
**Ridge Regression (`sklearn.linear_model.Ridge`)**
- L2 regularization to prevent overfitting
- Key parameter: `alpha` (regularization strength, default=1.0)
- Use when: Multicollinearity present, need regularization
- Example:
```python
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
```
**Lasso (`sklearn.linear_model.Lasso`)**
- L1 regularization with feature selection
- Key parameter: `alpha` (regularization strength)
- Use when: Want sparse models, feature selection
- Can reduce some coefficients to exactly zero
- Example:
```python
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
# Check which features were selected
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
```
**ElasticNet (`sklearn.linear_model.ElasticNet`)**
- Combines L1 and L2 regularization
- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
- Use when: Need both feature selection and regularization
- Example:
```python
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
```
### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression adapted for classification
- **SGDClassifier**: Linear classifiers trained with stochastic gradient descent
**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
**Key parameters** (linear models in general):
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate the intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
- Binary and multiclass classification
- Key parameters: `C` (inverse regularization strength), `penalty` ('l1', 'l2', 'elasticnet')
- Returns probability estimates
- Use when: Need probabilistic predictions, interpretability
- Example:
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```
**Stochastic Gradient Descent (SGD)**
- `SGDClassifier`, `SGDRegressor`
- Efficient for large-scale learning
- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
- Use when: Very large datasets (>10^4 samples)
- Example:
```python
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
model.fit(X_train, y_train)
```
## Support Vector Machines
- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection
**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when a clear margin of separation exists
**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or a float)
- `degree`: Polynomial degree (for the poly kernel)
**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets with a linear kernel.
**SVC (`sklearn.svm.SVC`)**
- Classification with kernel methods
- Use when: Small to medium datasets, complex decision boundaries
- Example:
```python
from sklearn.svm import SVC
# Linear kernel for linearly separable data
model_linear = SVC(kernel='linear', C=1.0)
# RBF kernel for non-linear data
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
model_rbf.fit(X_train, y_train)
```
**SVR (`sklearn.svm.SVR`)**
- Regression with kernel methods
- Similar parameters to SVC
- Additional parameter: `epsilon` (tube width)
- Example:
```python
from sklearn.svm import SVR
model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
```
## Decision Trees
- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
**DecisionTreeClassifier / DecisionTreeRegressor**
- Non-parametric models that learn decision rules from the data
- Use when: Need an interpretable model, non-linear relationships, mixed feature types, feature importance analysis
- Prone to overfitting - use ensembles or pruning
**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in a leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split`/`min_samples_leaf`, or prune with `ccp_alpha` (see the sketch after the example below).
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion='gini'
)
model.fit(X_train, y_train)
# Visualize the tree
from sklearn.tree import plot_tree
plot_tree(model, feature_names=feature_names, class_names=class_names)
```
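A minimal sketch of cost-complexity pruning with `ccp_alpha`; the held-out validation split (`X_val`, `y_val`) and the alpha-selection loop are illustrative assumptions, not from the original text:
```python
from sklearn.tree import DecisionTreeClassifier
# Compute candidate alphas from the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Refit a tree per alpha and keep the one that scores best on validation data
best_alpha, best_score = 0.0, -1.0
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)  # X_val/y_val: assumed held-out split
    if score > best_score:
        best_alpha, best_score = alpha, score
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
```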
## Ensemble Methods
### Random Forests
**RandomForestClassifier / RandomForestRegressor**
- Ensemble of decision trees with bagging
- Robust general-purpose algorithm; reduces overfitting compared to a single tree, handles non-linear relationships
- Provides feature importances
- Use when: High accuracy needed and you can afford the computation
**Key parameters**:
- `n_estimators`: Number of trees (default=100; higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features to consider per split ('sqrt', 'log2', int, float)
- `min_samples_split`, `min_samples_leaf`: Control tree growth
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    n_jobs=-1  # Use all CPU cores
)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
```
### Gradient Boosting

**GradientBoostingClassifier / GradientBoostingRegressor**
- Sequential ensemble that fits each new tree to the residuals of the previous ones
- Key parameters:
  - `n_estimators`: Number of boosting stages
  - `learning_rate`: Shrinks the contribution of each tree
  - `max_depth`: Depth of individual trees (typically 3-8)
  - `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- Use when: High accuracy on structured/tabular data is the goal and training time is affordable; a staple of winning Kaggle solutions
- Best suited to small and medium datasets
- Example:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8
)
model.fit(X_train, y_train)
```

**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
- Histogram-based gradient boosting; orders of magnitude faster on large datasets (>10k samples)
- Native support for missing values and categorical features
- Key parameters: Similar to GradientBoosting (`max_iter` instead of `n_estimators`); also supports `early_stopping`
- Use when: Large datasets, need faster training
- Example:
```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=None,  # No limit by default
    categorical_features='from_dtype'  # Auto-detect categorical columns
)
model.fit(X_train, y_train)
```
### AdaBoost

**AdaBoostClassifier / AdaBoostRegressor**
- Adaptive boosting: reweights training samples so later weak learners focus on the ones misclassified earlier
- Key parameters:
  - `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
  - `n_estimators`: Number of boosting iterations
  - `learning_rate`: Weight applied to each classifier
- Use when: A simple approach to boosting weak learners is needed; often less prone to overfitting than other boosting methods
- Example:
```python
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
```
### Bagging

**BaggingClassifier / BaggingRegressor**
- Bootstrap aggregating with any base estimator
- Key parameters:
  - `estimator`: Base estimator to fit
  - `n_estimators`: Number of estimators
  - `max_samples`: Samples to draw per estimator
  - `bootstrap`: Whether to sample with replacement
- Use when: Reducing the variance of unstable models, building ensembles that can be trained in parallel
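A minimal sketch of bagging decision trees, reusing the `X_train`/`y_train` convention of the other examples; the 50-tree, 80%-sample configuration is illustrative only:
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bag 50 decision trees, each trained on an 80% bootstrap sample
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)
model.fit(X_train, y_train)
```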
### Voting & Stacking

**VotingClassifier / VotingRegressor**
- Combines predictions from multiple different model types
- `voting='hard'` (majority vote) or `voting='soft'` (average of predicted probabilities)
- Use when: Want to ensemble different model types
- Example:
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = VotingClassifier(
estimators=[
('lr', LogisticRegression()),
('dt', DecisionTreeClassifier()),
('svc', SVC(probability=True))
],
voting='soft'
)
model.fit(X_train, y_train)
```
**StackingClassifier / StackingRegressor**
- Trains a meta-learner (`final_estimator`) on the predictions of the base models
- More sophisticated than voting
- Use when: Combining diverse models, leveraging different model strengths
- Example:
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = StackingClassifier(
estimators=[
('dt', DecisionTreeClassifier()),
('svc', SVC())
],
final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)
```
## K-Nearest Neighbors

**KNeighborsClassifier / KNeighborsRegressor**
- Non-parametric method that predicts from the closest training samples by distance
- Key parameters:
  - `n_neighbors`: Number of neighbors (default=5, typically 3-11)
  - `weights`: 'uniform' or 'distance' (distance-weighted voting)
  - `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski', etc.)
  - `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
- Use when: Small dataset, simple baseline, irregular decision boundaries, when interpretability isn't critical
- Slow prediction on large datasets
- Related estimators: **RadiusNeighborsClassifier/Regressor** (radius-based neighbors), **NearestCentroid** (classification using class centroids)
- Example:
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train, y_train)
```
## Naive Bayes
**GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB, ComplementNB**
- Probabilistic classifiers based on Bayes' theorem with a feature-independence assumption
- Fast training and prediction
- GaussianNB: Continuous features (assumes Gaussian distribution)
- MultinomialNB: Count features (text classification)
- BernoulliNB: Binary/boolean features
- CategoricalNB: Categorical features
- ComplementNB: Adapted for imbalanced datasets
- Key parameters: `alpha` (Laplace/Lidstone smoothing), `fit_prior` (whether to learn class prior probabilities)
- Use when: Text classification, fast baseline, probabilistic predictions, small training sets
- Example:
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
model_gaussian = GaussianNB()

# For text/count data
model_multinomial = MultinomialNB(alpha=1.0)  # alpha is the smoothing parameter
model_multinomial.fit(X_train, y_train)
```
## Linear/Quadratic Discriminant Analysis

**LinearDiscriminantAnalysis / QuadraticDiscriminantAnalysis**
- LDA: Linear decision boundary; doubles as supervised dimensionality reduction
- QDA: Quadratic decision boundary (per-class covariance)
- Use when: Classes have roughly Gaussian distributions, covariance assumptions hold, or dimensionality reduction is wanted alongside classification
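A minimal sketch, reusing the `X_train`/`y_train` convention from the examples above and assuming at least three classes so that `n_components=2` is valid:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

model = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = model.fit_transform(X_train, y_train)  # supervised 2-D projection
y_pred = model.predict(X_test)                       # and classification
```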
## Neural Networks

**MLPClassifier / MLPRegressor**
- Multi-layer perceptron (feedforward neural network)
- Key parameters:
  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
  - `activation`: 'relu', 'tanh', 'logistic'
  - `solver`: 'adam', 'sgd', 'lbfgs'
  - `alpha`: L2 regularization parameter
  - `learning_rate`: 'constant', 'invscaling', 'adaptive'
  - `early_stopping`: Stop when the validation score stops improving
- Use when: Complex non-linear patterns, large datasets, when gradient boosting is too slow
- Important: Feature scaling is critical for neural networks; always use StandardScaler or similar
- Example:
```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=1000
)
model.fit(X_train_scaled, y_train)
```
## Gaussian Processes

**GaussianProcessClassifier / GaussianProcessRegressor**
- Probabilistic classification and regression; the regressor returns uncertainty estimates with its predictions
- Key parameters:
  - `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
  - `alpha`: Noise level added to the kernel diagonal
- Use when: Uncertainty quantification is important, small datasets, smooth function approximation
- Limitation: Doesn't scale well to large datasets (O(n³) complexity)
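A minimal regression sketch under the same `X_train`/`y_train` convention; the Constant × RBF kernel and `alpha=1e-2` noise level are illustrative starting points, not prescribed defaults:
```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
model = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
model.fit(X_train, y_train)

# return_std=True yields a per-point uncertainty estimate
y_pred, y_std = model.predict(X_test, return_std=True)
```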
## Stochastic Gradient Descent

**SGDClassifier / SGDRegressor**
- Linear classifiers and regressors trained with stochastic gradient descent
- Key parameters:
  - `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
  - `penalty`: Regularization ('l2', 'l1', 'elasticnet')
  - `alpha`: Regularization strength
  - `learning_rate`: Learning rate schedule
- Use when: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
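A minimal sketch of the online-learning use case via `partial_fit`, assuming `X_train` and `y_train` are NumPy arrays standing in for a data stream; the 10-chunk split is purely illustrative:
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4)

# All class labels must be provided on the first partial_fit call
classes = np.unique(y_train)
for X_chunk, y_chunk in zip(np.array_split(X_train, 10),
                            np.array_split(y_train, 10)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)
```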
## Semi-Supervised Learning
- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)
**Use cases**: When labeled data is scarce but unlabeled data is abundant
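A minimal self-training sketch, assuming hypothetical arrays `X_mixed`/`y_mixed` in which unlabeled samples are marked with `-1` (scikit-learn's convention); the 0.9 confidence threshold is illustrative:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# y_mixed contains -1 for every unlabeled sample
base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X_mixed, y_mixed)
```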
## Feature Selection
- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection
**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
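Two common patterns as a minimal sketch, reusing `X_train`/`y_train` and assuming the data has at least 10 features; `k=10` is arbitrary:
```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Univariate selection: keep the 10 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination with a linear model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train, y_train)
print(rfe.support_)  # boolean mask of the selected features
```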
## Probability Calibration
- **CalibratedClassifierCV**: Calibrate classifier probabilities
**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
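A minimal sketch wrapping a classifier that lacks `predict_proba`; `cv=5` and Platt scaling are illustrative choices:
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# sigmoid = Platt scaling; switch to method='isotonic' when enough data is available
model = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5)
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)
```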
## Multi-Output Methods
- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant
**Use cases**: Predicting multiple related targets simultaneously
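A minimal sketch, assuming a hypothetical 2-D target array `Y_train` of shape (n_samples, n_targets):
```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Fits one logistic regression per target column
model = MultiOutputClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)  # shape (n_samples, n_targets)
```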
## Specialized Regression
- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
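A minimal sketch of prediction intervals with quantile regression; the 0.1/0.9 quantiles (a nominal 80% interval) and `alpha=0` (no L1 penalty) are illustrative assumptions:
```python
from sklearn.linear_model import QuantileRegressor

lower = QuantileRegressor(quantile=0.1, alpha=0.0).fit(X_train, y_train)
upper = QuantileRegressor(quantile=0.9, alpha=0.0).fit(X_train, y_train)

# Per-sample interval [lower, upper] around the prediction
intervals = list(zip(lower.predict(X_test), upper.predict(X_test)))
```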
## Algorithm Selection Guide

**Start with:**
1. **LogisticRegression** (classification) or **LinearRegression/Ridge** (regression) as a fast baseline
2. **RandomForestClassifier/Regressor** as a good default for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** (or GradientBoosting) when best accuracy is needed

**Dataset size:**
- Small (<1k samples): KNN, SVM, Decision Trees, Gaussian Processes
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks, Linear Models
- Large (>100k): SGD, HistGradientBoosting, LinearSVC, Linear Models

**Interpretability:**
- High: Linear Models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance)
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel

**Speed vs accuracy:**
- Fast training: Naive Bayes, Linear Models, Decision Trees, KNN
- Medium training: Random Forests (parallelizable), SVM (small data)
- Slow training: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
- High accuracy: Gradient Boosting, Random Forests, Stacking
- Fast prediction: Linear Models, Naive Bayes
- Slow prediction: KNN (on large datasets), SVM

**Feature types:**
- Continuous: Most algorithms work well
- Categorical: Trees, HistGradientBoosting (native support)
- Mixed: Trees, Gradient Boosting
- Text: Naive Bayes, Linear Models with TF-IDF