# Data Preprocessing and Feature Engineering Reference

## Overview

Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding categorical variables, handling missing values, and feature engineering. Many algorithms require standardized or normalized data to perform well.

## Feature Scaling and Normalization
### StandardScaler

Removes the mean and scales to unit variance (z-score standardization).

**Formula**: `z = (x - μ) / σ`

**Use cases**:
- When features have different units or scales
- When assuming a Gaussian-like distribution
- Required for: SVM, KNN, neural networks, PCA, linear models with regularization

**Important**: Fit only on training data, then transform both train and test sets.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training

# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```
### MinMaxScaler

Scales features to a specified range, typically [0, 1].

**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`

**Use cases**:
- When a bounded range is needed
- Neural networks (often prefer [0, 1] inputs)
- When the distribution is not Gaussian
- Image pixel values

**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)

**Warning**: Sensitive to outliers since it uses the min and max.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)

# Custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
```

### MaxAbsScaler

Scales to [-1, 1] by dividing by the maximum absolute value.

**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When the sign of values is meaningful

**Advantage**: Doesn't shift or center the data, so zero entries stay zero.

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)
```

### RobustScaler

Uses the median and interquartile range (IQR) instead of the mean and standard deviation.

**Formula**: `X_scaled = (X - median) / IQR`

**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- When robust statistics are preferred

**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```

### Normalizer

Scales individual samples (rows) to unit norm, not features (columns).

**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight

**Norms**:
- `l1`: Manhattan norm (sum of absolute values = 1)
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
- `max`: Maximum absolute value = 1

**Key difference from scalers**: Operates on rows (samples), not columns (features).

```python
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')  # Euclidean norm
X_normalized = normalizer.fit_transform(X)
```
## Encoding Categorical Variables

### OrdinalEncoder

Converts categories to integers (0 to n_categories - 1).

**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integer codes directly)

**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data

```python
from sklearn.preprocessing import OrdinalEncoder

# Natural ordering
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Custom ordering
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_categorical)
```

### OneHotEncoder

Creates binary columns for each category.

**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When relationships between categories shouldn't be assumed

**Parameters**:
- `drop`: 'first', 'if_binary', or array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit the number of categories

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)

# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])

# Handle unknown categories during transform
X_test_encoded = encoder.transform(X_test_categorical)
```

**High cardinality handling**:
```python
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
# Groups categories appearing < 100 times into an 'infrequent' category
```

**Memory tip**: Use `sparse_output=True` (the default) for high-cardinality features.

### TargetEncoder

Uses target statistics to encode categories.

**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When a relationship between category and target is expected
- Often improves performance over one-hot encoding

**How it works**:
- Replaces each category with the mean of the target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories

**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy

**Warning**: Only for supervised learning; requires the target variable.

```python
from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```

An alternative implementation is available in the `category_encoders` package:

```python
# Install: uv pip install category-encoders
from category_encoders import TargetEncoder

encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
X_test_encoded = encoder.transform(X_test_categorical)
```

### LabelEncoder

Encodes target labels as integers 0 to n_classes - 1.

**Use cases**: Encoding the target variable for classification (not features!)

**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```
## Non-linear Transformations

### QuantileTransformer

Maps features to a uniform or normal distribution using a rank transformation.

**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When a normal distribution is desired

**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))

**Effect**: A strong transformation that reduces outlier influence and makes data more Gaussian-like.

```python
from sklearn.preprocessing import QuantileTransformer

# Transform to uniform distribution
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
X_transformed = qt.fit_transform(X)

# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_transformed = qt.fit_transform(X)
```

### PowerTransformer

Applies a parametric, monotonic transformation to make data more Gaussian.

**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Positive values only

**Use cases**:
- Skewed distributions
- When a Gaussian assumption is important
- Variance stabilization

**Advantage**: Less radical than QuantileTransformer; preserves more of the original relationships.

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson (handles negative values)
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)

# Box-Cox (positive values only)
pt = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt.fit_transform(X)
```

### Log Transform

```python
import numpy as np

# Log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)

# Or use FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```
## Missing Value Imputation

### SimpleImputer

Imputes missing values with basic strategies.

**Strategies**:
- `mean`: Mean of the column (numeric only)
- `median`: Median of the column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with a constant value

**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing data (np.nan, None, or a specific value)

```python
from sklearn.impute import SimpleImputer

# For numerical features
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)

# Fill with a constant
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
```

### KNNImputer

Imputes using the k-nearest neighbors.

**Use cases**: When relationships between correlated features should inform imputation

**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```

### IterativeImputer

Models each feature with missing values as a function of the other features.

**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher-quality imputation (but slower)

**Parameters**:
- `estimator`: Estimator for the regression (default: BayesianRidge)
- `max_iter`: Maximum number of iterations

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```
## Feature Engineering

### PolynomialFeatures

Generates polynomial and interaction features.

**Parameters**:
- `degree`: Polynomial degree
- `interaction_only`: Only multiplicative interactions (no x²)
- `include_bias`: Include a constant feature

**Use cases**:
- Adding non-linearity to linear models
- Polynomial regression

**Warning**: The number of features grows rapidly: roughly (n+d)! / (d! · n!) for n input features and degree d.

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree 2: [x1, x2] → [x1, x2, x1², x1·x2, x2²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Get feature names
feature_names = poly.get_feature_names_out(['x1', 'x2'])

# Only interactions (no powers)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
```

### Binning / Discretization (KBinsDiscretizer)

Bins continuous features into discrete intervals.

**Strategies**:
- `uniform`: Equal-width bins
- `quantile`: Equal-frequency bins
- `kmeans`: K-means clustering to determine bin edges

**Encoding**:
- `ordinal`: Integer encoding (0 to n_bins - 1)
- `onehot`: Sparse one-hot encoding
- `onehot-dense`: Dense one-hot encoding

**Use cases**:
- Letting linear models capture non-linear relationships
- Reducing noise in features
- Making features more interpretable

```python
from sklearn.preprocessing import KBinsDiscretizer

# Equal-width bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

# Equal-frequency bins (quantile-based)
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = binner.fit_transform(X)
```

### Binarizer

Converts numeric values to binary (0 or 1) based on a threshold.

**Use cases**: Creating binary features from continuous values

```python
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)
```

### SplineTransformer

Generates B-spline basis functions for capturing smooth non-linear relationships.

**Use cases**:
- Smooth non-linear transformations
- Alternative to PolynomialFeatures (less oscillation at the boundaries)
- Generalized additive models (GAMs)

**Parameters**:
- `n_knots`: Number of knots
- `degree`: Spline degree
- `knots`: Knot positions ('uniform', 'quantile', or array)

```python
from sklearn.preprocessing import SplineTransformer

spline = SplineTransformer(n_knots=5, degree=3)
X_splines = spline.fit_transform(X)
```
## Text Feature Extraction

### CountVectorizer

**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
- Converts text to a token count matrix
- Use for: Bag-of-words representation

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=5000,   # Keep top 5000 features
    min_df=2,            # Ignore terms appearing in < 2 documents
    max_df=0.8,          # Ignore terms appearing in > 80% of documents
    ngram_range=(1, 2)   # Unigrams and bigrams
)

X_counts = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```

### TfidfVectorizer

**TfidfVectorizer**
- TF-IDF (Term Frequency-Inverse Document Frequency) weighting
- Better than CountVectorizer for most tasks

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2),
    stop_words='english'  # Remove English stop words
)

X_tfidf = vectorizer.fit_transform(documents)
```

### HashingVectorizer

**HashingVectorizer**
- Uses the hashing trick for memory efficiency
- No fit needed; the transform can't be reversed
- Use when: Very large vocabulary, streaming data

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**18)
X_hashed = vectorizer.transform(documents)  # No fit needed
```
## Feature Selection

### Filter Methods

**Variance Threshold**
- Removes low-variance features

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```

**SelectKBest / SelectPercentile**
- Select features based on univariate statistical tests
- Tests: f_classif, chi2, mutual_info_classif

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Select the top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected feature indices
selected_indices = selector.get_support(indices=True)
```

### Wrapper Methods

**Recursive Feature Elimination (RFE)**
- Recursively removes features using model feature importances

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
X_selected = rfe.fit_transform(X_train, y_train)

# Get selected features
selected_features = rfe.support_
feature_ranking = rfe.ranking_
```

**RFECV (with Cross-Validation)**
- RFE with cross-validation to find the optimal number of features

```python
from sklearn.feature_selection import RFECV

model = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
X_selected = rfecv.fit_transform(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
```

### Embedded Methods

**SelectFromModel**
- Selects features based on model coefficients or importances
- Works with: linear models (L1), tree-based models

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)

# Get selected features
selected_features = selector.get_support()
```

**L1-based Feature Selection**

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```
## Handling Outliers

### IQR Method

```python
import numpy as np

Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```

### Winsorization

```python
from scipy.stats import mstats

# Clip outliers at the 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```
## Custom Transformers

### Using FunctionTransformer

Applies a custom function to the data.

**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Built-in NumPy function with an inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X_log = log_transformer.fit_transform(X)

# Custom function
def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
X_transformed = transformer.fit_transform(X)
```

### Creating a Custom Transformer

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```
## Best Practices

### Fit on Training Data Only

Always fit transformers on the training data only:

```python
# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```

### Feature Scaling Guidelines

**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms

**Don't need to scale**:
- Tree-based algorithms (decision trees, random forests, gradient boosting)
- Naive Bayes

### Use Pipelines

Always use preprocessing within pipelines to prevent data leakage and to combine preprocessing with models:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)      # Scaler fit only on training data
y_pred = pipeline.predict(X_test)   # Scaler transform applied to test data
```
### Common Transformations by Data Type

**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)

**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation

**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder

**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding (see the sketch after this list)
- OneHotEncoder with the min_frequency parameter

**Categorical - Ordinal**:
- OrdinalEncoder

**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
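Frequency encoding has no built-in scikit-learn transformer; a minimal pandas-based sketch (assuming train/test DataFrames and a hypothetical `'city'` column) could look like this:

```python
import pandas as pd

# Map each category to how often it appears in the training data
freq = df_train['city'].value_counts(normalize=True)

df_train['city_freq'] = df_train['city'].map(freq)
# Unseen categories in the test set get a frequency of 0
df_test['city_freq'] = df_test['city'].map(freq).fillna(0.0)
```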
### Data Leakage Prevention

1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use a Pipeline with cross_val_score() for proper evaluation (see the sketch below)
4. **Target encoding**: Use the cv parameter in TargetEncoder for cross-fitting
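A minimal sketch of points 2-3: wrapping the scaler in a Pipeline means cross_val_score refits it inside every fold, so no test fold ever influences the scaling (the model choice here is only illustrative).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Each CV fold fits the scaler on that fold's training portion only
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```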
### Handle Categorical and Numerical Features Separately

Use ColumnTransformer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

X_transformed = preprocessor.fit_transform(X)
```
## Preprocessing Checklist

Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if the algorithm needs it)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline

### Algorithm-Specific Requirements

**Require scaling:**
- SVM, KNN, neural networks
- PCA, linear/logistic regression with regularization
- K-Means clustering

**Don't require scaling:**
- Tree-based models (decision trees, random forest, gradient boosting)
- Naive Bayes

**Encoding requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly (see the sketch after this list)
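A minimal sketch of the encoding guidance above (the feature names are hypothetical): ordinal integer codes feed a tree-based model directly, while a linear model gets one-hot columns and scaled numerics.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

categorical_features = ['color', 'size']   # hypothetical columns
numeric_features = ['price']               # hypothetical column

# Tree-based model: integer codes are fine
tree_pre = ColumnTransformer([
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     categorical_features),
    ('num', 'passthrough', numeric_features)
])
tree_model = Pipeline([('pre', tree_pre),
                       ('clf', HistGradientBoostingClassifier(random_state=42))])

# Linear model: one-hot encode nominal features and scale numerics
linear_pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('num', StandardScaler(), numeric_features)
])
linear_model = Pipeline([('pre', linear_pre),
                         ('clf', LogisticRegression(max_iter=1000))])
```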
# Scikit-learn Quick Reference

## Common Import Patterns

```python
# Core
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer

# Feature selection
from sklearn.feature_selection import SelectKBest, RFE

# Models - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor
)

# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Dimensionality reduction
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.manifold import TSNE

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)

# Pipelines
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
```
## Installation

```bash
# Using uv (recommended)
uv pip install scikit-learn

# Optional dependencies
uv pip install scikit-learn[plots]              # For plotting utilities
uv pip install pandas numpy matplotlib seaborn  # Common companions
```

## Quick Workflow Templates

### Classification Pipeline

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```
### Regression Pipeline

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
```
### With Pipeline (Recommended)

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Test accuracy: {score:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```
## Common Preprocessing Patterns

### Complete Pipeline with Mixed Data Types

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

# Create preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
## Model Selection Cheat Sheet

### Quick Decision Tree

```
Is it supervised?
├─ Yes
│   ├─ Predicting categories? → Classification
│   │   ├─ Start with: LogisticRegression (baseline)
│   │   ├─ Then try: RandomForestClassifier
│   │   └─ Best performance: HistGradientBoostingClassifier
│   └─ Predicting numbers? → Regression
│       ├─ Start with: LinearRegression/Ridge (baseline)
│       ├─ Then try: RandomForestRegressor
│       └─ Best performance: HistGradientBoostingRegressor
└─ No
    ├─ Grouping similar items? → Clustering
    │   ├─ Know # clusters: KMeans
    │   └─ Unknown # clusters: DBSCAN or HDBSCAN
    ├─ Reducing dimensions?
    │   ├─ For preprocessing: PCA
    │   └─ For visualization: t-SNE or UMAP
    └─ Finding outliers? → IsolationForest or LocalOutlierFactor
```
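For the "finding outliers" branch of the tree above, a minimal IsolationForest sketch (the contamination value is an assumption you would tune for your data):

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)   # 1 = inlier, -1 = outlier

X_clean = X[labels == 1]
print(f"Flagged {(labels == -1).sum()} outliers")
```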
### Algorithm Selection by Data Size

- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC (see the sketch below)
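A minimal sketch for the large-data case: a scaled linear model fitted with stochastic gradient descent (parameters are illustrative, and `loss='log_loss'` assumes a recent scikit-learn version).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; scales to very large datasets
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss='log_loss', random_state=42)
)
model.fit(X_train, y_train)
```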
### When to Scale Features

**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm

**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
## Hyperparameter Tuning

### GridSearchCV

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
```
### RandomizedSearchCV (Faster)

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # Number of combinations to try
    cv=5,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```
### Pipeline with GridSearchCV

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto']
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
```

## Common Patterns

### Loading Data

```python
# From scikit-learn datasets
from sklearn.datasets import load_iris, load_digits, make_classification

# Built-in datasets
iris = load_iris()
X, y = iris.data, iris.target

# Synthetic data
X, y = make_classification(
    n_samples=1000, n_features=20, n_classes=2, random_state=42
)

# From pandas
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
```
## Cross-Validation

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Multiple Metrics

```python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Custom CV Strategies

```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# For time series
cv = TimeSeriesSplit(n_splits=5)

scores = cross_val_score(model, X, y, cv=cv)
```
## Common Metrics

### Classification

```python
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score
)

# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# Comprehensive report
print(classification_report(y_true, y_pred))

# ROC AUC (requires probabilities)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_true, y_proba)
```

### Regression

```python
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
```
## Feature Engineering

### Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```

### Feature Selection

```python
from sklearn.feature_selection import (
    SelectKBest, f_classif,
    RFE,
    SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier

# Univariate selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive feature elimination
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)

# Model-based selection
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'
)
X_selected = selector.fit_transform(X, y)
```

### Handling Imbalanced Data

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score

# Use the class_weight parameter
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# And use metrics that reflect the imbalance
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
```

### Feature Importance

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tree-based models expose feature_importances_
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importances = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importances.head(10))

# Visualize
indices = np.argsort(model.feature_importances_)[::-1]
plt.bar(range(X.shape[1]), model.feature_importances_[indices])
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.show()

# Permutation importance (works for any model)
result = permutation_importance(model, X_test, y_test, n_repeats=10)
print(result.importances_mean)
```
## Clustering

### K-Means

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale data first for k-means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Evaluate
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")
```

### Elbow Method

```python
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
```

### DBSCAN

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# -1 indicates noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```
## Dimensionality Reduction

### PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify n_components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Or specify the variance to retain
pca_95 = PCA(n_components=0.95)  # Keep 95% of the variance
X_pca_95 = pca_95.fit_transform(X_scaled)
print(f"Explained variance: {pca_95.explained_variance_ratio_}")
print(f"Components needed: {pca_95.n_components_}")

# Plot the 2D projection
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title(f'PCA (explained variance: {pca.explained_variance_ratio_.sum():.2%})')
plt.show()
```

### t-SNE (Visualization Only)

```python
from sklearn.manifold import TSNE

# Reduce to 50 dimensions with PCA first (recommended)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.show()
```
## Saving and Loading Models

```python
import joblib

# Save a model or a full pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')

# Load and use
loaded_model = joblib.load('model.pkl')
pipeline = joblib.load('pipeline.pkl')
predictions = loaded_model.predict(X_new)
```
## Common Pitfalls and Solutions
|
||||
## Common Gotchas and Solutions
|
||||
|
||||
### Data Leakage
|
||||
❌ **Wrong**: Fit on all data before split
|
||||
```python
|
||||
scaler = StandardScaler().fit(X)
|
||||
X_train, X_test = train_test_split(scaler.transform(X))
|
||||
```
|
||||
# WRONG: Fitting scaler on all data
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
X_train, X_test = train_test_split(X_scaled)
|
||||
|
||||
✅ **Correct**: Use pipeline or fit only on train
|
||||
```python
|
||||
# RIGHT: Fit on training data only
|
||||
X_train, X_test = train_test_split(X)
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
|
||||
pipeline.fit(X_train, y_train)
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# BEST: Use Pipeline
|
||||
from sklearn.pipeline import Pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
pipeline.fit(X_train, y_train) # No leakage!
|
||||
```
|
||||
|
||||
### Not Scaling
|
||||
❌ **Wrong**: Using SVM without scaling
|
||||
```python
|
||||
svm = SVC()
|
||||
svm.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
✅ **Correct**: Scale for SVM
|
||||
```python
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Wrong Metric for Imbalanced Data
❌ **Wrong**: Using accuracy with a 99:1 class imbalance
```python
accuracy = accuracy_score(y_true, y_pred)  # Can be misleading
```

✅ **Correct**: Use metrics that account for imbalance
```python
f1 = f1_score(y_true, y_pred, average='weighted')
balanced_acc = balanced_accuracy_score(y_true, y_pred)
```

### Not Using Stratification
❌ **Wrong**: Random split for imbalanced classes
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

✅ **Correct**: Stratify so class proportions are preserved in both splits
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

### Random State for Reproducibility
```python
# Set random_state for reproducible results
model = RandomForestClassifier(n_estimators=100, random_state=42)
```

### Handling Unknown Categories
```python
# Use handle_unknown='ignore' so OneHotEncoder tolerates categories it never
# saw during fit (see the fuller sketch below)
encoder = OneHotEncoder(handle_unknown='ignore')
```

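A slightly fuller sketch of the behavior above, using illustrative data: with `handle_unknown='ignore'`, an unseen category is encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # 'purple' was never seen at fit time

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The unknown category 'purple' becomes a row of zeros rather than an error
print(encoder.transform(test).toarray())
print(encoder.get_feature_names_out())
```
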
### Feature Names with Pipelines
```python
# Get feature names after transformation
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
```

## Cheat Sheet: Algorithm Selection

### Classification

| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Binary/Multiclass | Logistic Regression | Fast baseline, interpretability |
| Binary/Multiclass | Random Forest | Good default, robust |
| Binary/Multiclass | Gradient Boosting | Best accuracy, willing to tune |
| Binary/Multiclass | SVM | Small data, complex boundaries |
| Binary/Multiclass | Naive Bayes | Text classification, fast |
| High dimensions | Linear SVM or Logistic Regression | Text, many features |

### Regression

| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Continuous target | Linear Regression | Fast baseline, interpretability |
| Continuous target | Ridge/Lasso | Regularization needed |
| Continuous target | Random Forest | Good default, non-linear |
| Continuous target | Gradient Boosting | Best accuracy |
| Continuous target | SVR | Small data, non-linear |

### Clustering

| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Known K, spherical clusters | K-Means | Fast, simple |
| Unknown K, arbitrary shapes | DBSCAN | Noise/outliers present |
| Hierarchical structure | Agglomerative Clustering | Need a dendrogram |
| Soft clustering | Gaussian Mixture | Probability estimates |

### Dimensionality Reduction

| Problem | Algorithm | When to Use |
|---------|-----------|-------------|
| Linear reduction | PCA | Variance explanation |
| Visualization | t-SNE | 2D/3D plots |
| Non-negative data | NMF | Images, text |
| Sparse data | TruncatedSVD | Text, recommender systems |

## Performance Tips

1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
2. **Use HistGradientBoosting** for large datasets (>10K samples)
3. **Use MiniBatchKMeans** for large clustering tasks
4. **Use IncrementalPCA** for data that doesn't fit in memory
5. **Use sparse matrices** for high-dimensional sparse data (text)
6. **Cache transformers** in pipelines during grid search (see the sketch below)
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces (see the sketch below)
8. **Reduce dimensionality** with PCA before applying expensive algorithms

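Tips 6 and 7 can be combined. A minimal sketch, assuming scikit-learn and SciPy are installed; the dataset, the `cache_dir` location, and the sampled parameter range are illustrative choices, not part of the original text.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# memory= caches the fitted transformers (scaler, PCA) so they are not refit
# for every sampled value of clf__C during the search
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory="cache_dir",  # illustrative on-disk cache location
)

# RandomizedSearchCV samples a fixed number of candidates instead of the full grid
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```
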
### Speed Up Training
```python
# Use n_jobs=-1 for parallel processing
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# Use warm_start for incremental learning
model = RandomForestClassifier(n_estimators=100, warm_start=True)
model.fit(X, y)
model.n_estimators += 50
model.fit(X, y)  # Adds 50 more trees

# Use partial_fit for online learning
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch, classes=np.unique(y))
```

### Memory Efficiency
```python
# Use sparse matrices
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

# Use MiniBatchKMeans for large data
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
```

## Version Check

```python
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
```

## Useful Resources

- Official Documentation: https://scikit-learn.org/stable/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- API Reference: https://scikit-learn.org/stable/api/index.html
- Examples: https://scikit-learn.org/stable/auto_examples/index.html
- Tutorials: https://scikit-learn.org/stable/tutorial/index.html

# Supervised Learning Reference

## Overview

Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.

## Linear Models

### Regression

Available estimators include:

- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using the Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination

**Linear Regression (`sklearn.linear_model.LinearRegression`)**
- Ordinary least squares regression
- Fast, interpretable, no hyperparameters
- Use when: Linear relationships, interpretability matters
- Example:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Ridge Regression (`sklearn.linear_model.Ridge`)**
- L2 regularization to prevent overfitting
- Key parameter: `alpha` (regularization strength, default=1.0; higher = more regularization)
- Use when: Multicollinearity present, need regularization
- Example:
```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
```

**Lasso (`sklearn.linear_model.Lasso`)**
- L1 regularization with feature selection
- Key parameter: `alpha` (regularization strength)
- Use when: Want sparse models, feature selection
- Can reduce some coefficients to exactly zero
- Example:
```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
# Check which features were selected
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
```

**ElasticNet (`sklearn.linear_model.ElasticNet`)**
- Combines L1 and L2 regularization
- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
- Use when: Need both feature selection and regularization
- Example:
```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
```

### Classification

Available estimators include:

- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression adapted for classification
- **SGDClassifier**: Linear classifiers trained with stochastic gradient descent

**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected

**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
- Binary and multiclass classification
- Key parameters: `C` (inverse regularization strength), `penalty` ('l1', 'l2', 'elasticnet'), `solver` ('lbfgs', 'saga', 'liblinear')
- Returns probability estimates
- Use when: Need probabilistic predictions, interpretability
- Example:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```

**Stochastic Gradient Descent (`SGDClassifier`, `SGDRegressor`)**
- Linear models trained with stochastic gradient descent; efficient for large-scale learning
- Key parameters: `loss` ('hinge', 'log_loss', 'squared_error', etc.), `penalty` ('l2', 'l1', 'elasticnet'), `alpha`, `learning_rate`
- Use when: Very large datasets (>10^4 samples), online learning, data that doesn't fit in memory
- Example:
```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
model.fit(X_train, y_train)
```

## Support Vector Machines

Available estimators include:

- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection

**SVC (`sklearn.svm.SVC`)**
- Classification with kernel methods
- Key parameters: `C` (lower = more regularization), `kernel` ('linear', 'rbf', 'poly', 'sigmoid'), `gamma` ('scale', 'auto', or float), `degree` (for the poly kernel)
- Use when: Small to medium datasets, complex non-linear decision boundaries, high-dimensional spaces, clear margin of separation
- Note: Does not scale well beyond tens of thousands of samples; use LinearSVC for large datasets with a linear kernel
- Example:
```python
from sklearn.svm import SVC

# Linear kernel for linearly separable data
model_linear = SVC(kernel='linear', C=1.0)

# RBF kernel for non-linear data
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
model_rbf.fit(X_train, y_train)
```

**SVR (`sklearn.svm.SVR`)**
- Regression with kernel methods
- Similar parameters to SVC
- Additional parameter: `epsilon` (width of the no-penalty tube)
- Example:
```python
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
```

## Decision Trees

Available estimators include:

- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree

**DecisionTreeClassifier / DecisionTreeRegressor**
- Non-parametric models that learn decision rules from the data
- Key parameters:
  - `max_depth`: Maximum tree depth (controls overfitting)
  - `min_samples_split`: Minimum samples to split a node
  - `min_samples_leaf`: Minimum samples in a leaf
  - `max_features`: Number of features to consider for splits
  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
- Use when: Need an interpretable model, non-linear relationships, feature importance analysis, mixed feature types
- Prone to overfitting: limit `max_depth`, increase `min_samples_split`/`min_samples_leaf`, prune with `ccp_alpha`, or use ensembles
- Example:
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion='gini'
)
model.fit(X_train, y_train)

# Visualize the tree
from sklearn.tree import plot_tree
plot_tree(model, feature_names=feature_names, class_names=class_names)
```

## Ensemble Methods

### Random Forests

**RandomForestClassifier / RandomForestRegressor**
- Ensemble of decision trees trained with bagging; reduces overfitting relative to a single tree
- Key parameters:
  - `n_estimators`: Number of trees (default=100; higher = better but slower)
  - `max_depth`: Maximum tree depth
  - `max_features`: Features to consider for splits ('sqrt', 'log2', int, or float)
  - `min_samples_split`, `min_samples_leaf`: Control tree growth
  - `bootstrap`: Whether to use bootstrap samples
  - `n_jobs`: Parallel processing (-1 uses all cores)
- Use when: Robust general-purpose model, non-linear relationships, high accuracy needed and computation is affordable
- Provides feature importance
- Example:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    n_jobs=-1  # Use all CPU cores
)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
```

### Gradient Boosting

**GradientBoostingClassifier / GradientBoostingRegressor**
- Sequential ensemble that builds each tree on the residuals of the previous ones
- Key parameters:
  - `n_estimators`: Number of boosting stages
  - `learning_rate`: Shrinks the contribution of each tree
  - `max_depth`: Depth of individual trees (typically 3-5)
  - `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- Use when: Need high accuracy on structured/tabular data and can afford training time
- Often achieves the best performance; for large datasets (>10k samples) prefer HistGradientBoosting, which is orders of magnitude faster
- Example:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8
)
model.fit(X_train, y_train)
```

**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
- Faster, histogram-based gradient boosting
- Native support for missing values and categorical features
- Key parameters: similar to GradientBoosting (`max_iter` plays the role of `n_estimators`)
- Use when: Large datasets, need faster training
- Example:
```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=None,  # No limit by default
    categorical_features='from_dtype'  # Auto-detect categorical columns
)
model.fit(X_train, y_train)
```

### Other Ensemble Methods

**AdaBoost (`AdaBoostClassifier` / `AdaBoostRegressor`)**
- Adaptive boosting that reweights training samples to focus on misclassified ones
- Boosts weak learners; the default base estimator is a depth-1 decision tree
- Key parameters: `n_estimators`, `learning_rate` (weight applied to each classifier), `estimator` (base estimator)
- Use when: A simple boosting approach is enough
- Example:
```python
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
```

**Bagging (`BaggingClassifier` / `BaggingRegressor`)**
- Bootstrap aggregating with any base estimator; reduces the variance of unstable models and builds the ensemble in parallel
- Key parameters: `estimator` (base estimator to fit), `n_estimators`, `max_samples` (samples drawn per estimator), `bootstrap` (whether to sample with replacement)
- Example: see the sketch below

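A minimal BaggingClassifier sketch using the parameters listed above; the synthetic dataset is only for illustration, and on scikit-learn older than 1.2 the first argument is named `base_estimator` instead of `estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=42)

model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # any base estimator works
    n_estimators=50,
    max_samples=0.8,   # fraction of samples drawn for each estimator
    bootstrap=True,    # draw samples with replacement
    n_jobs=-1,
)
model.fit(X_train, y_train)
```
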
**Voting Classifier / Regressor**
- Combines predictions from multiple models of different types
- Types: 'hard' (majority vote) or 'soft' (average of predicted probabilities)
- Use when: Want to ensemble different model types and leverage their different strengths
- Example:
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC(probability=True))
    ],
    voting='soft'
)
model.fit(X_train, y_train)
```

**Stacking Classifier / Regressor**
- Trains a meta-model on the predictions of the base models
- More sophisticated than voting
- Key parameter: `final_estimator` (the meta-learner)
- Example:
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC())
    ],
    final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)
```

## K-Nearest Neighbors

Available estimators include:

- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids

**KNeighborsClassifier / KNeighborsRegressor**
- Non-parametric method that predicts from the nearest training samples
- Key parameters:
  - `n_neighbors`: Number of neighbors (default=5, typically 3-11)
  - `weights`: 'uniform' or 'distance' (distance-weighted voting)
  - `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski', etc.)
  - `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
- Use when: Small datasets, simple baseline, irregular decision boundaries
- Slow prediction on large datasets
- Example:
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train, y_train)
```

## Naive Bayes

Available estimators include:

- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets

**GaussianNB, MultinomialNB, BernoulliNB**
- Probabilistic classifiers based on Bayes' theorem with a feature-independence assumption
- Fast training and prediction
- Key parameters (MultinomialNB/BernoulliNB): `alpha` (Laplace/Lidstone smoothing), `fit_prior` (whether to learn class prior probabilities)
- Use when: Text classification, fast baseline, small training sets, probabilistic predictions
- Example:
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
model_gaussian = GaussianNB()

# For text/count data
model_multinomial = MultinomialNB(alpha=1.0)  # alpha is the smoothing parameter
model_multinomial.fit(X_train, y_train)
```

## Neural Networks

**MLPClassifier / MLPRegressor**
- Multi-layer perceptron (feedforward neural network)
- Key parameters:
  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
  - `activation`: 'relu', 'tanh', 'logistic'
  - `solver`: 'adam', 'sgd', 'lbfgs'
  - `alpha`: L2 regularization parameter
  - `learning_rate`: 'constant', 'adaptive'
  - `early_stopping`: Stop when the validation score stops improving
- Use when: Complex non-linear patterns, large datasets
- Feature scaling is critical: always apply StandardScaler or similar first
- Example:
```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=1000
)
model.fit(X_train_scaled, y_train)
```

## Linear/Quadratic Discriminant Analysis

- **LinearDiscriminantAnalysis**: Linear decision boundary; also usable for supervised dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary

**Use cases**: When classes have roughly Gaussian distributions and the covariance assumptions hold, or for supervised dimensionality reduction (see the sketch below)

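A minimal discriminant-analysis sketch on a built-in dataset; the 2-D projection is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)

# LDA as a classifier and as supervised dimensionality reduction
# (at most n_classes - 1 components)
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)
y_pred = lda.predict(X)

# QDA fits one covariance matrix per class, giving quadratic boundaries
qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)
```
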
## Gaussian Processes

- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates

**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level

**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation (see the sketch below)

**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)

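A minimal GaussianProcessRegressor sketch with an RBF kernel; the 1-D toy data and kernel settings are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D regression problem
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gpr.fit(X, y)

# return_std=True exposes the per-point predictive uncertainty
y_mean, y_std = gpr.predict(X, return_std=True)
```
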
## Semi-Supervised Learning

- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through a graph
- **LabelSpreading**: Label spreading (modified label propagation)

**Use cases**: When labeled data is scarce but unlabeled data is abundant (see the sketch below)

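A minimal self-training sketch; unlabeled samples are marked with -1, and the base classifier and synthetic data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=42)

# Pretend most labels are missing: semi-supervised estimators expect -1 for "unlabeled"
y_partial = y.copy()
y_partial[100:] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print(model.predict(X[:5]))
```
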
## Feature Selection

- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select the K highest-scoring features
- **SelectPercentile**: Select the top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on model importance
- **SequentialFeatureSelector**: Forward/backward feature selection

**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting (see the sketch below)

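A minimal sketch of two of the selectors above (SelectKBest and RFE); the scoring function, estimator, and synthetic data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=42)

# Univariate selection: keep the 5 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=5)
X_kbest = selector.fit_transform(X, y)

# Recursive feature elimination with a linear model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

print(selector.get_support())  # boolean mask of kept features
print(rfe.support_)
```
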
## Probability Calibration

- **CalibratedClassifierCV**: Calibrate classifier probabilities

**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)

**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes (see the sketch below)

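A minimal calibration sketch wrapping LinearSVC (which has no `predict_proba`); the method and cv values are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=42)

# Platt scaling ('sigmoid') fitted with 5-fold cross-validation
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X, y)
probas = calibrated.predict_proba(X)  # calibrated probabilities
```
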
## Multi-Output Methods

- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant

**Use cases**: Predicting multiple related targets simultaneously (see the sketch below)

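A minimal multi-output sketch; the base regressor and the synthetic 3-target data are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Y has three target columns
X, Y = make_regression(n_samples=200, n_features=10, n_targets=3, random_state=42)

# One RandomForestRegressor is fit per target column
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=42))
model.fit(X, Y)
predictions = model.predict(X)  # shape (n_samples, 3)
```
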
## Specialized Regression

- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals (see the sketch below)

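A minimal QuantileRegressor sketch producing a rough 80% prediction interval; the toy data and quantile levels are illustrative, and QuantileRegressor requires scikit-learn >= 1.0.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + rng.normal(scale=1.0, size=200)

# Fit the 10th and 90th percentiles; together they bracket roughly 80% of the targets
lower = QuantileRegressor(quantile=0.1, alpha=0.0).fit(X, y)
upper = QuantileRegressor(quantile=0.9, alpha=0.0).fit(X, y)
interval = np.column_stack([lower.predict(X), upper.predict(X)])
```
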
## Algorithm Selection Guide

**Common starting points:**
1. Logistic Regression (classification) or Linear Regression/Ridge (regression) - fast, interpretable baseline
2. Random Forest - good default for general non-linear problems
3. Gradient Boosting (HistGradientBoosting on large data) - tune for best accuracy

**Dataset size:**
- Small (<1k samples): KNN, SVM, Gaussian Processes, Decision Trees - most algorithms are feasible
- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models, Neural Networks
- Large (>100k): SGD, Linear Models, HistGradientBoosting, LinearSVC

**Interpretability:**
- High: Linear Models, Decision Trees, Naive Bayes
- Medium: Random Forest (feature importance)
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel

**Accuracy vs speed:**
- Fast training: Naive Bayes, Linear Models, KNN, Decision Trees
- Slow training: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
- Fast prediction: Linear Models, Naive Bayes
- Slow prediction: KNN (on large datasets), SVM
- High accuracy: Gradient Boosting, Random Forest, Stacking

**Feature types:**
- Continuous: Most algorithms work well
- Categorical: Trees, HistGradientBoosting (native support)
- Mixed: Trees, Gradient Boosting
- Text: Naive Bayes, Linear Models with TF-IDF
