mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Improve the scikit-learn skill
File diff suppressed because it is too large
@@ -1,345 +1,563 @@
|
||||
# Data Preprocessing in scikit-learn
|
||||
# Data Preprocessing and Feature Engineering Reference
|
||||
|
||||
## Overview
|
||||
Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.
|
||||
|
||||
## Standardization and Scaling
|
||||
Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering.
|
||||
|
||||
## Feature Scaling and Normalization
|
||||
|
||||
### StandardScaler
|
||||
Removes mean and scales to unit variance (z-score normalization).
|
||||
|
||||
**Formula**: `z = (x - μ) / σ`
|
||||
|
||||
**Use cases**:
|
||||
- Most ML algorithms (especially SVM, neural networks, PCA)
|
||||
- When features have different units or scales
|
||||
- When assuming Gaussian-like distribution
|
||||
|
||||
**Important**: Fit only on training data, then transform both train and test sets.
|
||||
|
||||
**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
|
||||
- Standardizes features to zero mean and unit variance
|
||||
- Formula: z = (x - mean) / std
|
||||
- Use when: Features have different scales, algorithm assumes normally distributed data
|
||||
- Recommended for: scale-sensitive algorithms such as SVM, KNN, neural networks, PCA, and linear regression with regularization
|
||||
- Example:
|
||||
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training

# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```
|
||||
|
||||
### MinMaxScaler
|
||||
Scales features to a specified range, typically [0, 1].
|
||||
|
||||
**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`
|
||||
|
||||
**Use cases**:
|
||||
- When bounded range is needed
|
||||
- Neural networks (often prefer [0, 1] range)
|
||||
- When distribution is not Gaussian
|
||||
- Image pixel values
|
||||
|
||||
**Parameters**:
|
||||
- `feature_range`: Tuple (min, max), default (0, 1)
|
||||
|
||||
**Warning**: Sensitive to outliers since it uses min/max.
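The outlier sensitivity noted above can be seen on a tiny, made-up column. The sketch below is illustrative only: the values are hypothetical, and `RobustScaler` is used purely as a point of comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature: five typical values plus one extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

minmax_scaled = MinMaxScaler().fit_transform(X)
robust_scaled = RobustScaler().fit_transform(X)

# Min/max scaling maps the outlier to 1.0, so the five typical values
# are squeezed into a very narrow interval near 0.
typical_spread_minmax = minmax_scaled[:5].max() - minmax_scaled[:5].min()

# RobustScaler centers on the median and divides by the IQR, so the
# typical values keep a usable spread.
typical_spread_robust = robust_scaled[:5].max() - robust_scaled[:5].min()

print(f"MinMaxScaler spread of typical values: {typical_spread_minmax:.4f}")
print(f"RobustScaler spread of typical values: {typical_spread_robust:.4f}")
```

If the typical values end up compressed like this, clipping the outliers first or switching to a median/IQR-based scaler is usually worth considering.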
|
||||
|
||||
### MaxAbsScaler
|
||||
Scales to [-1, 1] by dividing by maximum absolute value.
|
||||
|
||||
**Use cases**:
|
||||
- Sparse data (preserves sparsity)
|
||||
- Data already centered at zero
|
||||
- When sign of values is meaningful
|
||||
|
||||
**Advantage**: Doesn't shift/center the data, preserves zero entries.
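A minimal sketch of the sparsity-preserving behavior described above, assuming SciPy is available and using a small hypothetical matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse input with mostly zero entries.
X_sparse = sparse.csr_matrix(np.array([
    [0.0, -4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0,  8.0, 5.0],
]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

# Each column is divided by its maximum absolute value; nothing is shifted,
# so zeros stay zeros and the result remains a sparse matrix.
print(sparse.issparse(X_scaled))
print(X_scaled.nnz == X_sparse.nnz)
print(X_scaled.toarray())
```

Because the data are only rescaled, the signs of the original values are kept as well.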
|
||||
|
||||
### RobustScaler
|
||||
Uses median and interquartile range (IQR) instead of mean and standard deviation.
|
||||
|
||||
**Formula**: `X_scaled = (X - median) / IQR`
|
||||
|
||||
**Use cases**:
|
||||
- When outliers are present
|
||||
- When StandardScaler produces skewed results
|
||||
- Robust statistics preferred
|
||||
|
||||
**Parameters**:
|
||||
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)
|
||||
|
||||
## Normalization
|
||||
|
||||
### normalize() function and Normalizer
|
||||
Scales individual samples (rows) to unit norm, not features (columns).
|
||||
|
||||
**Use cases**:
|
||||
- Text classification (TF-IDF vectors)
|
||||
- When similarity metrics (dot product, cosine) are used
|
||||
- When each sample should have equal weight
|
||||
|
||||
**Norms**:
|
||||
- `l1`: Manhattan norm (sum of absolutes = 1)
|
||||
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
|
||||
- `max`: Maximum absolute value = 1
|
||||
|
||||
**Key difference from scalers**: Operates on rows (samples), not columns (features).
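To make the row-versus-column distinction concrete, here is a small sketch on hypothetical data. It only checks that each sample ends up with unit norm; the input values are made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; rows 0 and 2 point in the same direction.
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# L2: every row is divided by its own Euclidean length.
X_l2 = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_l2, axis=1)

# L1: every row is divided by the sum of its absolute values.
X_l1 = Normalizer(norm="l1").fit_transform(X)
row_abs_sums = np.abs(X_l1).sum(axis=1)

print(X_l2)
print(row_norms)
print(row_abs_sums)
```

Rows that differ only in magnitude become identical after normalization, which is why this step pairs naturally with cosine similarity.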
|
||||
|
||||
**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
|
||||
- Scales features to a given range (default [0, 1])
|
||||
- Formula: X_scaled = (X - X.min) / (X.max - X.min)
|
||||
- Use when: Need bounded values, data not normally distributed
|
||||
- Sensitive to outliers
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import Normalizer
|
||||
normalizer = Normalizer(norm='l2')
|
||||
X_normalized = normalizer.transform(X)
|
||||
from sklearn.preprocessing import MinMaxScaler
|
||||
|
||||
scaler = MinMaxScaler(feature_range=(0, 1))
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
|
||||
# Custom range
|
||||
scaler = MinMaxScaler(feature_range=(-1, 1))
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
```
|
||||
|
||||
## Encoding Categorical Features
|
||||
### RobustScaler
|
||||
|
||||
**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
|
||||
- Scales using median and interquartile range (IQR)
|
||||
- Formula: X_scaled = (X - median) / IQR
|
||||
- Use when: Data contains outliers
|
||||
- Robust to outliers
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import RobustScaler
|
||||
|
||||
scaler = RobustScaler()
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
```
|
||||
|
||||
### Normalizer
|
||||
|
||||
**Normalizer (`sklearn.preprocessing.Normalizer`)**
|
||||
- Normalizes samples individually to unit norm
|
||||
- Common norms: 'l1', 'l2', 'max'
|
||||
- Use when: Need to normalize each sample independently (e.g., text features)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import Normalizer
|
||||
|
||||
normalizer = Normalizer(norm='l2') # Euclidean norm
|
||||
X_normalized = normalizer.fit_transform(X)
|
||||
```
|
||||
|
||||
### MaxAbsScaler
|
||||
|
||||
**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
|
||||
- Scales by maximum absolute value
|
||||
- Range: [-1, 1]
|
||||
- Doesn't shift/center data (preserves sparsity)
|
||||
- Use when: Data is already centered or sparse
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import MaxAbsScaler
|
||||
|
||||
scaler = MaxAbsScaler()
|
||||
X_scaled = scaler.fit_transform(X_sparse)
|
||||
```
|
||||
|
||||
## Encoding Categorical Variables
|
||||
|
||||
### OneHotEncoder
|
||||
|
||||
**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
|
||||
- Creates binary columns for each category
|
||||
- Use when: Nominal categories (no order), tree-based models or linear models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import OneHotEncoder
|
||||
|
||||
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
|
||||
# Get feature names
|
||||
feature_names = encoder.get_feature_names_out(['color', 'size'])
|
||||
|
||||
# Handle unknown categories during transform
|
||||
X_test_encoded = encoder.transform(X_test_categorical)
|
||||
```
|
||||
|
||||
### OrdinalEncoder
|
||||
Converts categories to integers (0 to n_categories - 1).
|
||||
|
||||
**Use cases**:
|
||||
- Ordinal relationships exist (small < medium < large)
|
||||
- Preprocessing before other transformations
|
||||
- Tree-based algorithms (which can handle integers)
|
||||
|
||||
**Parameters**:
|
||||
- `handle_unknown`: 'error' or 'use_encoded_value'
|
||||
- `unknown_value`: Value for unknown categories
|
||||
- `encoded_missing_value`: Value for missing data
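The parameters above can be combined as in the following sketch. The category names and the choice of `-1` for unseen values are illustrative assumptions, not values taken from this reference.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test columns; "extra-large" never appears in training.
X_train = np.array([["small"], ["medium"], ["large"]], dtype=object)
X_test = np.array([["medium"], ["extra-large"]], dtype=object)

encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],  # explicit order: small < medium < large
    handle_unknown="use_encoded_value",
    unknown_value=-1,                           # code assigned to unseen categories
)
encoder.fit(X_train)

X_test_encoded = encoder.transform(X_test)
print(X_test_encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise instead of returning a sentinel code.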
|
||||
|
||||
**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
|
||||
- Encodes categories as integers
|
||||
- Use when: Ordinal categories (ordered), or tree-based models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import OrdinalEncoder
|
||||
|
||||
# Natural ordering
|
||||
encoder = OrdinalEncoder()
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
|
||||
# Custom ordering
|
||||
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
```
|
||||
|
||||
### OneHotEncoder
|
||||
Creates binary columns for each category.
|
||||
|
||||
**Use cases**:
|
||||
- Nominal categories (no order)
|
||||
- Linear models, neural networks
|
||||
- When category relationships shouldn't be assumed
|
||||
|
||||
**Parameters**:
|
||||
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
|
||||
- `sparse_output`: True (default, memory efficient) or False
|
||||
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
|
||||
- `min_frequency`: Group infrequent categories
|
||||
- `max_categories`: Limit number of categories
|
||||
|
||||
**High cardinality handling**:
|
||||
```python
|
||||
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
|
||||
# Groups categories appearing < 100 times into 'infrequent' category
|
||||
```
|
||||
|
||||
**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.
|
||||
|
||||
### TargetEncoder
|
||||
Uses target statistics to encode categories.
|
||||
|
||||
**Use cases**:
|
||||
- High-cardinality categorical features (zip codes, user IDs)
|
||||
- When linear relationships with target are expected
|
||||
- Often improves performance over one-hot encoding
|
||||
|
||||
**How it works**:
|
||||
- Replaces category with mean of target for that category
|
||||
- Uses cross-fitting during fit_transform() to prevent target leakage
|
||||
- Applies smoothing to handle rare categories
|
||||
|
||||
**Parameters**:
|
||||
- `smooth`: Smoothing parameter for rare categories
|
||||
- `cv`: Cross-validation strategy
|
||||
|
||||
**Warning**: Only for supervised learning. Requires target variable.
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import TargetEncoder
|
||||
encoder = TargetEncoder()
|
||||
X_encoded = encoder.fit_transform(X_categorical, y)
|
||||
```
|
||||
|
||||
### LabelEncoder
|
||||
Encodes target labels into integers 0 to n_classes - 1.
|
||||
|
||||
**Use cases**: Encoding target variable for classification (not features!)
|
||||
**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
|
||||
- Encodes target labels (y) as integers
|
||||
- Use for: Target variable encoding
|
||||
- Example:
|
||||
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```

**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.

### Binarizer
Converts numeric values to binary (0 or 1) based on threshold.

**Use cases**: Creating binary features from continuous values
|
||||
### Target Encoding (using category_encoders)
|
||||
|
||||
```python
|
||||
# Install: uv pip install category-encoders
|
||||
from category_encoders import TargetEncoder
|
||||
|
||||
encoder = TargetEncoder()
|
||||
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
|
||||
X_test_encoded = encoder.transform(X_test_categorical)
|
||||
```
|
||||
|
||||
## Non-linear Transformations
|
||||
|
||||
### QuantileTransformer
|
||||
Maps features to uniform or normal distribution using rank transformation.
|
||||
|
||||
**Use cases**:
|
||||
- Unusual distributions (bimodal, heavy tails)
|
||||
- Reducing outlier impact
|
||||
- When normal distribution is desired
|
||||
|
||||
**Parameters**:
|
||||
- `output_distribution`: 'uniform' (default) or 'normal'
|
||||
- `n_quantiles`: Number of quantiles (default 1000, capped at `n_samples`)
|
||||
|
||||
**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.
|
||||
|
||||
### PowerTransformer
|
||||
Applies parametric monotonic transformation to make data more Gaussian.
|
||||
|
||||
**Methods**:
|
||||
- `yeo-johnson`: Works with positive and negative values (default)
|
||||
- `box-cox`: Only positive values
|
||||
|
||||
**Use cases**:
|
||||
- Skewed distributions
|
||||
- When Gaussian assumption is important
|
||||
- Variance stabilization
|
||||
|
||||
**Advantage**: Less radical than QuantileTransformer, preserves more of original relationships.
|
||||
|
||||
## Discretization
|
||||
|
||||
### KBinsDiscretizer
|
||||
Bins continuous features into discrete intervals.
|
||||
|
||||
**Strategies**:
|
||||
- `uniform`: Equal-width bins
|
||||
- `quantile`: Equal-frequency bins
|
||||
- `kmeans`: K-means clustering to determine bins
|
||||
|
||||
**Encoding**:
|
||||
- `ordinal`: Integer encoding (0 to n_bins - 1)
|
||||
- `onehot`: One-hot encoding
|
||||
- `onehot-dense`: Dense one-hot encoding
|
||||
|
||||
**Use cases**:
|
||||
- Making linear models handle non-linear relationships
|
||||
- Reducing noise in features
|
||||
- Making features more interpretable
|
||||
### Power Transforms
|
||||
|
||||
**PowerTransformer**
|
||||
- Makes data more Gaussian-like
|
||||
- Methods: 'yeo-johnson' (works with negative values), 'box-cox' (positive only)
|
||||
- Use when: Data is skewed, algorithm assumes normality
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import KBinsDiscretizer
|
||||
disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
|
||||
X_binned = disc.fit_transform(X)
|
||||
from sklearn.preprocessing import PowerTransformer
|
||||
|
||||
# Yeo-Johnson (handles negative values)
|
||||
pt = PowerTransformer(method='yeo-johnson', standardize=True)
|
||||
X_transformed = pt.fit_transform(X)
|
||||
|
||||
# Box-Cox (positive values only)
|
||||
pt = PowerTransformer(method='box-cox', standardize=True)
|
||||
X_transformed = pt.fit_transform(X)
|
||||
```
|
||||
|
||||
## Feature Generation
|
||||
|
||||
### PolynomialFeatures
|
||||
Generates polynomial and interaction features.
|
||||
|
||||
**Parameters**:
|
||||
- `degree`: Polynomial degree
|
||||
- `interaction_only`: Only multiplicative interactions (no x²)
|
||||
- `include_bias`: Include constant feature
|
||||
|
||||
**Use cases**:
|
||||
- Adding non-linearity to linear models
|
||||
- Feature engineering
|
||||
- Polynomial regression
|
||||
|
||||
**Warning**: The number of output features grows rapidly: `(n + d)! / (d! · n!)` for `n` input features and degree `d`, counting the bias column.
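As a quick check of the growth formula, the count can be computed as a binomial coefficient and compared with what `PolynomialFeatures` reports. The helper name `n_polynomial_features` is hypothetical, and the choice of 10 input features is arbitrary.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_polynomial_features(n_features: int, degree: int) -> int:
    """Number of terms up to `degree`, bias included: C(n + d, d)."""
    return comb(n_features + degree, degree)


# A single dummy row is enough; only the number of columns matters here.
X = np.zeros((1, 10))

counts = {}
for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=True).fit(X)
    counts[degree] = poly.n_output_features_
    print(degree, poly.n_output_features_, n_polynomial_features(10, degree))
```

With `include_bias=False` the count is one lower, and `interaction_only=True` reduces it further.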
|
||||
### Quantile Transformation
|
||||
|
||||
**QuantileTransformer**
|
||||
- Transforms features to follow uniform or normal distribution
|
||||
- Robust to outliers
|
||||
- Use when: Want to reduce outlier impact
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
|
||||
from sklearn.preprocessing import QuantileTransformer
|
||||
|
||||
# Transform to uniform distribution
|
||||
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
|
||||
X_transformed = qt.fit_transform(X)
|
||||
|
||||
# Transform to normal distribution
|
||||
qt = QuantileTransformer(output_distribution='normal', random_state=42)
|
||||
X_transformed = qt.fit_transform(X)
|
||||
```
|
||||
|
||||
### SplineTransformer
|
||||
Generates B-spline basis functions.
|
||||
**Use cases**:

- Smooth non-linear transformations

- Alternative to PolynomialFeatures (less oscillation at boundaries)

- Generalized additive models (GAMs)

**Parameters**:

- `n_knots`: Number of knots

- `degree`: Spline degree

- `knots`: Knot positions ('uniform', 'quantile', or array)

### Log Transform

```python
import numpy as np

# Log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)

# Or use FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```

## Missing Value Handling
|
||||
|
||||
## Missing Value Imputation
|
||||
|
||||
### SimpleImputer
|
||||
Imputes missing values with various strategies.
|
||||
|
||||
**Strategies**:
|
||||
- `mean`: Mean of column (numeric only)
|
||||
- `median`: Median of column (numeric only)
|
||||
- `most_frequent`: Mode (numeric or categorical)
|
||||
- `constant`: Fill with constant value
|
||||
|
||||
**Parameters**:
|
||||
- `strategy`: Imputation strategy
|
||||
- `fill_value`: Value when strategy='constant'
|
||||
- `missing_values`: What represents missing (np.nan, None, specific value)
|
||||
|
||||
**SimpleImputer (`sklearn.impute.SimpleImputer`)**
|
||||
- Basic imputation strategies
|
||||
- Strategies: 'mean', 'median', 'most_frequent', 'constant'
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.impute import SimpleImputer
|
||||
imputer = SimpleImputer(strategy='median')
|
||||
|
||||
# For numerical features
|
||||
imputer = SimpleImputer(strategy='mean')
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
|
||||
# For categorical features
|
||||
imputer = SimpleImputer(strategy='most_frequent')
|
||||
X_imputed = imputer.fit_transform(X_categorical)
|
||||
|
||||
# Fill with constant
|
||||
imputer = SimpleImputer(strategy='constant', fill_value=0)
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
### KNNImputer

Imputes using k-nearest neighbors.

**Use cases**: When relationships between features should inform imputation

**Parameters**:

- `n_neighbors`: Number of neighbors

- `weights`: 'uniform' or 'distance'

### Iterative Imputer

**IterativeImputer**

- Models each feature with missing values as function of other features

- More sophisticated than SimpleImputer

- Example:

```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```
|
||||
|
||||
### IterativeImputer

Models each feature with missing values as function of other features.

**Use cases**:

- Complex relationships between features

- When multiple features have missing values

- Higher quality imputation (but slower)

**Parameters**:

- `estimator`: Estimator for regression (default: BayesianRidge)

- `max_iter`: Maximum iterations

### KNN Imputer

**KNNImputer**

- Imputes using k-nearest neighbors

- Use when: Features are correlated

- Example:

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
|
||||
|
||||
## Function Transformers
|
||||
## Feature Engineering
|
||||
|
||||
### FunctionTransformer
|
||||
Applies custom function to data.
|
||||
### Polynomial Features
|
||||
|
||||
**Use cases**:
|
||||
- Custom transformations in pipelines
|
||||
- Log transformation, square root, etc.
|
||||
- Domain-specific preprocessing
|
||||
**PolynomialFeatures**
|
||||
- Creates polynomial and interaction features
|
||||
- Use when: Need non-linear features for linear models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
|
||||
# Degree 2 output columns, in order: x1, x2, x1^2, x1*x2, x2^2
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
|
||||
# Get feature names
|
||||
feature_names = poly.get_feature_names_out(['x1', 'x2'])
|
||||
|
||||
# Only interactions (no powers)
|
||||
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
|
||||
X_interactions = poly.fit_transform(X)
|
||||
```
|
||||
|
||||
### Binning/Discretization
|
||||
|
||||
**KBinsDiscretizer**
|
||||
- Bins continuous features into discrete intervals
|
||||
- Strategies: 'uniform', 'quantile', 'kmeans'
|
||||
- Encoding: 'onehot', 'ordinal', 'onehot-dense'
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import KBinsDiscretizer
|
||||
|
||||
# Equal-width bins
|
||||
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
|
||||
X_binned = binner.fit_transform(X)
|
||||
|
||||
# Equal-frequency bins (quantile-based)
|
||||
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
|
||||
X_binned = binner.fit_transform(X)
|
||||
```
|
||||
|
||||
### Binarization
|
||||
|
||||
**Binarizer**
|
||||
- Converts features to binary (0 or 1) based on threshold
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import Binarizer
|
||||
|
||||
binarizer = Binarizer(threshold=0.5)
|
||||
X_binary = binarizer.fit_transform(X)
|
||||
```
|
||||
|
||||
### Spline Features
|
||||
|
||||
**SplineTransformer**
|
||||
- Creates spline basis functions
|
||||
- Useful for capturing non-linear relationships
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import SplineTransformer
|
||||
|
||||
spline = SplineTransformer(n_knots=5, degree=3)
|
||||
X_splines = spline.fit_transform(X)
|
||||
```
|
||||
|
||||
## Text Feature Extraction
|
||||
|
||||
### CountVectorizer
|
||||
|
||||
**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
|
||||
- Converts text to token count matrix
|
||||
- Use for: Bag-of-words representation
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import CountVectorizer
|
||||
|
||||
vectorizer = CountVectorizer(
|
||||
max_features=5000, # Keep top 5000 features
|
||||
min_df=2, # Ignore terms appearing in < 2 documents
|
||||
max_df=0.8, # Ignore terms appearing in > 80% documents
|
||||
ngram_range=(1, 2) # Unigrams and bigrams
|
||||
)
|
||||
|
||||
X_counts = vectorizer.fit_transform(documents)
|
||||
feature_names = vectorizer.get_feature_names_out()
|
||||
```
|
||||
|
||||
### TfidfVectorizer
|
||||
|
||||
**TfidfVectorizer**
|
||||
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
|
||||
- Often a better default than raw counts, because terms that appear in most documents are down-weighted
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
|
||||
vectorizer = TfidfVectorizer(
|
||||
max_features=5000,
|
||||
min_df=2,
|
||||
max_df=0.8,
|
||||
ngram_range=(1, 2),
|
||||
stop_words='english' # Remove English stop words
|
||||
)
|
||||
|
||||
X_tfidf = vectorizer.fit_transform(documents)
|
||||
```
|
||||
|
||||
### HashingVectorizer
|
||||
|
||||
**HashingVectorizer**
|
||||
- Uses hashing trick for memory efficiency
|
||||
- Stateless (no fit needed); hashed features cannot be mapped back to tokens
|
||||
- Use when: Very large vocabulary, streaming data
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import HashingVectorizer
|
||||
|
||||
vectorizer = HashingVectorizer(n_features=2**18)
|
||||
X_hashed = vectorizer.transform(documents) # No fit needed
|
||||
```
|
||||
|
||||
## Feature Selection
|
||||
|
||||
### Filter Methods
|
||||
|
||||
**Variance Threshold**
|
||||
- Removes low-variance features
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import VarianceThreshold
|
||||
|
||||
selector = VarianceThreshold(threshold=0.01)
|
||||
X_selected = selector.fit_transform(X)
|
||||
```
|
||||
|
||||
**SelectKBest / SelectPercentile**
|
||||
- Select features based on statistical tests
|
||||
- Tests: f_classif, chi2, mutual_info_classif
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectKBest, f_classif
|
||||
|
||||
# Select top 10 features
|
||||
selector = SelectKBest(score_func=f_classif, k=10)
|
||||
X_selected = selector.fit_transform(X_train, y_train)
|
||||
|
||||
# Get selected feature indices
|
||||
selected_indices = selector.get_support(indices=True)
|
||||
```
|
||||
|
||||
### Wrapper Methods
|
||||
|
||||
**Recursive Feature Elimination (RFE)**
|
||||
- Recursively removes features
|
||||
- Ranks features using the estimator's `coef_` or `feature_importances_`
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import RFE
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
|
||||
X_selected = rfe.fit_transform(X_train, y_train)
|
||||
|
||||
# Get selected features
|
||||
selected_features = rfe.support_
|
||||
feature_ranking = rfe.ranking_
|
||||
```
|
||||
|
||||
**RFECV (with Cross-Validation)**
|
||||
- RFE with cross-validation to find optimal number of features
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import RFECV
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
|
||||
X_selected = rfecv.fit_transform(X_train, y_train)
|
||||
|
||||
print(f"Optimal number of features: {rfecv.n_features_}")
|
||||
```
|
||||
|
||||
### Embedded Methods
|
||||
|
||||
**SelectFromModel**
|
||||
- Select features based on model coefficients/importances
|
||||
- Works with: Linear models (L1), Tree-based models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectFromModel
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
selector = SelectFromModel(model, threshold='median')
|
||||
selector.fit(X_train, y_train)
|
||||
X_selected = selector.transform(X_train)
|
||||
|
||||
# Get selected features
|
||||
selected_features = selector.get_support()
|
||||
```
|
||||
|
||||
**L1-based Feature Selection**
|
||||
```python
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
from sklearn.feature_selection import SelectFromModel
|
||||
|
||||
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
|
||||
selector = SelectFromModel(model)
|
||||
selector.fit(X_train, y_train)
|
||||
X_selected = selector.transform(X_train)
|
||||
```
|
||||
|
||||
## Handling Outliers
|
||||
|
||||
### IQR Method
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
Q1 = np.percentile(X, 25, axis=0)
|
||||
Q3 = np.percentile(X, 75, axis=0)
|
||||
IQR = Q3 - Q1
|
||||
|
||||
# Define outlier boundaries
|
||||
lower_bound = Q1 - 1.5 * IQR
|
||||
upper_bound = Q3 + 1.5 * IQR
|
||||
|
||||
# Remove outliers
|
||||
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
|
||||
X_no_outliers = X[mask]
|
||||
```
|
||||
|
||||
### Winsorization
|
||||
|
||||
```python
|
||||
from scipy.stats import mstats
|
||||
|
||||
# Clip outliers at 5th and 95th percentiles
|
||||
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
|
||||
```
|
||||
|
||||
## Custom Transformers
|
||||
|
||||
### Using FunctionTransformer
|
||||
|
||||
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.transform(X)

def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
X_transformed = transformer.fit_transform(X)
```
|
||||
|
||||
### Creating Custom Transformer
|
||||
|
||||
```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Feature Scaling Guidelines

**Always scale**:

- SVM, neural networks

- K-nearest neighbors

- Linear/Logistic regression with regularization

- PCA, LDA

- Gradient descent-based algorithms

**Don't need to scale**:

- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)

- Naive Bayes

### Fit on Training Data Only

Always fit transformers on training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```

### Pipeline Integration

Always use preprocessing within pipelines to prevent data leakage:
|
||||
|
||||
### Use Pipelines
|
||||
Combine preprocessing with models:
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
@@ -350,64 +568,39 @@ pipeline = Pipeline([
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
pipeline.fit(X_train, y_train) # Scaler fit only on train data
|
||||
y_pred = pipeline.predict(X_test) # Test data is only transformed, using the scaler fitted on train
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Common Transformations by Data Type
|
||||
|
||||
**Numeric - Continuous**:
|
||||
- StandardScaler (most common)
|
||||
- MinMaxScaler (neural networks)
|
||||
- RobustScaler (outliers present)
|
||||
- PowerTransformer (skewed data)
|
||||
|
||||
**Numeric - Count Data**:
|
||||
- sqrt or log transformation
|
||||
- QuantileTransformer
|
||||
- StandardScaler after transformation
|
||||
|
||||
**Categorical - Low Cardinality (<10 categories)**:
|
||||
- OneHotEncoder
|
||||
|
||||
**Categorical - High Cardinality (>10 categories)**:
|
||||
- TargetEncoder (supervised)
|
||||
- Frequency encoding
|
||||
- OneHotEncoder with min_frequency parameter
|
||||
|
||||
**Categorical - Ordinal**:
|
||||
- OrdinalEncoder
|
||||
|
||||
**Text**:
|
||||
- CountVectorizer or TfidfVectorizer
|
||||
- Normalizer after vectorization
|
||||
|
||||
### Data Leakage Prevention
|
||||
|
||||
1. **Fit only on training data**: Never include test data when fitting preprocessors
|
||||
2. **Use pipelines**: Ensures proper fit/transform separation
|
||||
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
|
||||
4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
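Points 2 and 3 can be sketched together: when the scaler lives inside the pipeline, `cross_val_score` refits it on the training part of every fold. The dataset below is synthetic and the estimator choice is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only to have something to fit.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is part of the estimator, so each fold fits it on that fold's
# training rows only and merely transforms the held-out rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```

Scaling `X` once before calling `cross_val_score` would instead let statistics from the held-out folds leak into training.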
|
||||
|
||||
### Handle Categorical and Numerical Separately
|
||||
Use ColumnTransformer:
|
||||
```python
|
||||
# WRONG - data leakage
|
||||
scaler = StandardScaler().fit(X_full)
|
||||
X_train_scaled = scaler.transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
|
||||
# CORRECT - no leakage
|
||||
scaler = StandardScaler().fit(X_train)
|
||||
X_train_scaled = scaler.transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
numeric_features = ['age', 'income']
|
||||
categorical_features = ['gender', 'occupation']
|
||||
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', StandardScaler(), numeric_features),
|
||||
('cat', OneHotEncoder(), categorical_features)
|
||||
]
|
||||
)
|
||||
|
||||
X_transformed = preprocessor.fit_transform(X)
|
||||
```
|
||||
|
||||
## Preprocessing Checklist
|
||||
### Algorithm-Specific Requirements
|
||||
|
||||
Before modeling:
|
||||
1. Handle missing values (imputation or removal)
|
||||
2. Encode categorical variables appropriately
|
||||
3. Scale/normalize numeric features (if needed for algorithm)
|
||||
4. Handle outliers (RobustScaler, clipping, removal)
|
||||
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
|
||||
6. Check for data leakage in preprocessing steps
|
||||
7. Wrap everything in a Pipeline
|
||||
**Require Scaling:**
|
||||
- SVM, KNN, Neural Networks
|
||||
- PCA, Linear/Logistic Regression with regularization
|
||||
- K-Means clustering
|
||||
|
||||
**Don't Require Scaling:**
|
||||
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
|
||||
- Naive Bayes
|
||||
|
||||
**Encoding Requirements:**
|
||||
- Linear models, SVM, KNN: One-hot encoding for nominal features
|
||||
- Tree-based models: Can handle ordinal encoding directly
|
||||
|
||||
@@ -1,546 +1,287 @@
|
||||
# Scikit-learn Quick Reference
|
||||
|
||||
## Essential Imports
|
||||
## Common Import Patterns
|
||||
|
||||
```python
|
||||
# Core
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
# Core scikit-learn
|
||||
import sklearn
|
||||
|
||||
# Data splitting and cross-validation
|
||||
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
|
||||
from sklearn.pipeline import Pipeline, make_pipeline
|
||||
from sklearn.compose import ColumnTransformer
|
||||
|
||||
# Preprocessing
|
||||
from sklearn.preprocessing import (
|
||||
StandardScaler, MinMaxScaler, RobustScaler,
|
||||
OneHotEncoder, OrdinalEncoder, LabelEncoder,
|
||||
PolynomialFeatures
|
||||
)
|
||||
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# Models - Classification
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
# Feature selection
|
||||
from sklearn.feature_selection import SelectKBest, RFE
|
||||
|
||||
# Supervised learning
|
||||
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
|
||||
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
|
||||
from sklearn.svm import SVC, SVR
|
||||
from sklearn.tree import DecisionTreeClassifier
|
||||
from sklearn.ensemble import (
|
||||
RandomForestClassifier,
|
||||
GradientBoostingClassifier,
|
||||
HistGradientBoostingClassifier
|
||||
)
|
||||
from sklearn.svm import SVC
|
||||
from sklearn.neighbors import KNeighborsClassifier
|
||||
|
||||
# Models - Regression
|
||||
from sklearn.linear_model import LinearRegression, Ridge, Lasso
|
||||
from sklearn.ensemble import (
|
||||
RandomForestRegressor,
|
||||
GradientBoostingRegressor,
|
||||
HistGradientBoostingRegressor
|
||||
)
|
||||
|
||||
# Clustering
|
||||
# Unsupervised learning
|
||||
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
|
||||
from sklearn.mixture import GaussianMixture
|
||||
|
||||
# Dimensionality Reduction
|
||||
from sklearn.decomposition import PCA, NMF, TruncatedSVD
|
||||
from sklearn.manifold import TSNE
|
||||
from sklearn.decomposition import PCA, NMF
|
||||
|
||||
# Metrics
|
||||
from sklearn.metrics import (
|
||||
accuracy_score, precision_score, recall_score, f1_score,
|
||||
confusion_matrix, classification_report,
|
||||
mean_squared_error, r2_score, mean_absolute_error
|
||||
mean_squared_error, r2_score, confusion_matrix, classification_report
|
||||
)
|
||||
|
||||
# Pipeline
|
||||
from sklearn.pipeline import Pipeline, make_pipeline
|
||||
from sklearn.compose import ColumnTransformer, make_column_transformer
|
||||
|
||||
# Utilities
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
```
|
||||
|
||||
## Basic Workflow Template
|
||||
## Installation
|
||||
|
||||
### Classification
|
||||
```bash
|
||||
# Using uv (recommended)
|
||||
uv pip install scikit-learn
|
||||
|
||||
# Optional dependencies
|
||||
uv pip install matplotlib # Needed for scikit-learn's plotting utilities
|
||||
uv pip install pandas numpy matplotlib seaborn # Common companions
|
||||
```
|
||||
|
||||
## Quick Workflow Templates
|
||||
|
||||
### Classification Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.metrics import classification_report
|
||||
from sklearn.metrics import classification_report, confusion_matrix
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42, stratify=y
|
||||
X, y, test_size=0.2, stratify=y, random_state=42
|
||||
)
|
||||
|
||||
# Scale features
|
||||
# Preprocess
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# Train model
|
||||
# Train
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
model.fit(X_train_scaled, y_train)
|
||||
|
||||
# Predict and evaluate
|
||||
# Evaluate
|
||||
y_pred = model.predict(X_test_scaled)
|
||||
print(classification_report(y_test, y_pred))
|
||||
print(confusion_matrix(y_test, y_pred))
|
||||
```
|
||||
|
||||
### Regression
|
||||
### Regression Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
from sklearn.ensemble import GradientBoostingRegressor
|
||||
from sklearn.metrics import mean_squared_error, r2_score
|
||||
|
||||
# Split data
|
||||
# Split
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# Scale features
|
||||
# Preprocess and train
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# Train model
|
||||
model = RandomForestRegressor(n_estimators=100, random_state=42)
|
||||
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
|
||||
model.fit(X_train_scaled, y_train)
|
||||
|
||||
# Predict and evaluate
|
||||
# Evaluate
|
||||
y_pred = model.predict(X_test_scaled)
|
||||
print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.3f}")
|
||||
print(f"R²: {r2_score(y_test, y_pred):.3f}")
|
||||
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
|
||||
```
|
||||
|
||||
### With Pipeline (Recommended)
|
||||
### Cross-Validation
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.model_selection import train_test_split, cross_val_score
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Split and train
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Evaluate
|
||||
score = pipeline.score(X_test, y_test)
|
||||
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
|
||||
print(f"Test accuracy: {score:.3f}")
|
||||
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
|
||||
```
|
||||
|
||||
## Common Preprocessing Patterns
|
||||
|
||||
### Numeric Data
|
||||
### Complete Pipeline with Mixed Data Types
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Define feature types
|
||||
numeric_features = ['age', 'income']
|
||||
categorical_features = ['gender', 'occupation']
|
||||
|
||||
# Create preprocessing pipelines
|
||||
numeric_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
```
|
||||
|
||||
### Categorical Data
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
categorical_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('imputer', SimpleImputer(strategy='most_frequent')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
])
|
||||
```
|
||||
|
||||
### Mixed Data with ColumnTransformer
|
||||
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
|
||||
numeric_features = ['age', 'income', 'credit_score']
|
||||
categorical_features = ['country', 'occupation']
|
||||
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Complete pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
pipeline = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier())
|
||||
# Combine transformers
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Full pipeline
|
||||
model = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Fit and predict
|
||||
model.fit(X_train, y_train)
|
||||
y_pred = model.predict(X_test)
|
||||
```
|
||||
|
||||
## Model Selection Cheat Sheet
|
||||
|
||||
### Quick Decision Tree
|
||||
|
||||
```
|
||||
Is it supervised?
|
||||
├─ Yes
|
||||
│ ├─ Predicting categories? → Classification
|
||||
│ │ ├─ Start with: LogisticRegression (baseline)
|
||||
│ │ ├─ Then try: RandomForestClassifier
|
||||
│ │ └─ Best performance: HistGradientBoostingClassifier
|
||||
│ └─ Predicting numbers? → Regression
|
||||
│ ├─ Start with: LinearRegression/Ridge (baseline)
|
||||
│ ├─ Then try: RandomForestRegressor
|
||||
│ └─ Best performance: HistGradientBoostingRegressor
|
||||
└─ No
|
||||
├─ Grouping similar items? → Clustering
|
||||
│ ├─ Know # clusters: KMeans
|
||||
│ └─ Unknown # clusters: DBSCAN or HDBSCAN
|
||||
├─ Reducing dimensions?
|
||||
│ ├─ For preprocessing: PCA
|
||||
│ └─ For visualization: t-SNE or UMAP
|
||||
└─ Finding outliers? → IsolationForest or LocalOutlierFactor
|
||||
```
|
||||
|
||||
### Algorithm Selection by Data Size
|
||||
|
||||
- **Small (<1K samples)**: Any algorithm
|
||||
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
|
||||
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC
|
||||
|
||||
### When to Scale Features
|
||||
|
||||
**Always scale**:
|
||||
- SVM, Neural Networks
|
||||
- K-Nearest Neighbors
|
||||
- Linear/Logistic Regression (with regularization)
|
||||
- PCA, LDA
|
||||
- Any gradient descent algorithm
|
||||
|
||||
**Don't need to scale**:
|
||||
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
|
||||
- Naive Bayes
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### GridSearchCV
|
||||
### Hyperparameter Tuning
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [100, 200, 500],
|
||||
'n_estimators': [100, 200, 300],
|
||||
'max_depth': [10, 20, None],
|
||||
'min_samples_split': [2, 5, 10]
|
||||
}
|
||||
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
grid_search = GridSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_grid,
|
||||
cv=5,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1
|
||||
model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
best_model = grid_search.best_estimator_
|
||||
print(f"Best params: {grid_search.best_params_}")
|
||||
print(f"Best score: {grid_search.best_score_:.3f}")
|
||||
|
||||
# Use best model
|
||||
best_model = grid_search.best_estimator_
|
||||
```
|
||||
|
||||
### RandomizedSearchCV (Faster)
|
||||
## Common Patterns
|
||||
|
||||
### Loading Data
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import RandomizedSearchCV
|
||||
from scipy.stats import randint, uniform
|
||||
# From scikit-learn datasets
|
||||
from sklearn.datasets import load_iris, load_digits, make_classification
|
||||
|
||||
param_distributions = {
|
||||
'n_estimators': randint(100, 1000),
|
||||
'max_depth': randint(5, 50),
|
||||
'min_samples_split': randint(2, 20)
|
||||
}
|
||||
# Built-in datasets
|
||||
iris = load_iris()
|
||||
X, y = iris.data, iris.target
|
||||
|
||||
random_search = RandomizedSearchCV(
|
||||
RandomForestClassifier(random_state=42),
|
||||
param_distributions,
|
||||
n_iter=50, # Number of combinations to try
|
||||
cv=5,
|
||||
n_jobs=-1,
|
||||
random_state=42
|
||||
# Synthetic data
|
||||
X, y = make_classification(
|
||||
n_samples=1000, n_features=20, n_classes=2, random_state=42
|
||||
)
|
||||
|
||||
random_search.fit(X_train, y_train)
|
||||
# From pandas
|
||||
import pandas as pd
|
||||
df = pd.read_csv('data.csv')
|
||||
X = df.drop('target', axis=1)
|
||||
y = df['target']
|
||||
```
|
||||
|
||||
### Pipeline with GridSearchCV
|
||||
### Handling Imbalanced Data
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import SVC
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('svm', SVC())
|
||||
])
|
||||
|
||||
param_grid = {
|
||||
'svm__C': [0.1, 1, 10],
|
||||
'svm__kernel': ['rbf', 'linear'],
|
||||
'svm__gamma': ['scale', 'auto']
|
||||
}
|
||||
|
||||
grid = GridSearchCV(pipeline, param_grid, cv=5)
|
||||
grid.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Cross-Validation
|
||||
|
||||
### Basic Cross-Validation
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Multiple Metrics
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_validate
|
||||
|
||||
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
|
||||
results = cross_validate(model, X, y, cv=5, scoring=scoring)
|
||||
|
||||
for metric in scoring:
|
||||
scores = results[f'test_{metric}']
|
||||
print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Custom CV Strategies
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
|
||||
|
||||
# For imbalanced classification
|
||||
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
|
||||
|
||||
# For time series
|
||||
cv = TimeSeriesSplit(n_splits=5)
|
||||
|
||||
scores = cross_val_score(model, X, y, cv=cv)
|
||||
```
|
||||
|
||||
## Common Metrics
|
||||
|
||||
### Classification
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
accuracy_score, balanced_accuracy_score,
|
||||
precision_score, recall_score, f1_score,
|
||||
confusion_matrix, classification_report,
|
||||
roc_auc_score
|
||||
)
|
||||
|
||||
# Basic metrics
|
||||
accuracy = accuracy_score(y_true, y_pred)
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
|
||||
# Comprehensive report
|
||||
print(classification_report(y_true, y_pred))
|
||||
|
||||
# ROC AUC (requires probabilities)
|
||||
y_proba = model.predict_proba(X_test)[:, 1]
|
||||
auc = roc_auc_score(y_true, y_proba)
|
||||
```
|
||||
|
||||
### Regression
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
mean_squared_error,
|
||||
mean_absolute_error,
|
||||
r2_score
|
||||
)
|
||||
|
||||
mse = mean_squared_error(y_true, y_pred)
|
||||
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # `squared=False` is deprecated since scikit-learn 1.4
|
||||
mae = mean_absolute_error(y_true, y_pred)
|
||||
r2 = r2_score(y_true, y_pred)
|
||||
|
||||
print(f"RMSE: {rmse:.3f}")
|
||||
print(f"MAE: {mae:.3f}")
|
||||
print(f"R²: {r2:.3f}")
|
||||
```
|
||||
|
||||
## Feature Engineering
|
||||
|
||||
### Polynomial Features
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
|
||||
```
|
||||
|
||||
### Feature Selection
|
||||
|
||||
```python
|
||||
from sklearn.feature_selection import (
|
||||
SelectKBest, f_classif,
|
||||
RFE,
|
||||
SelectFromModel
|
||||
)
|
||||
|
||||
# Univariate selection
|
||||
selector = SelectKBest(f_classif, k=10)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
|
||||
# Recursive feature elimination
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
|
||||
X_selected = rfe.fit_transform(X, y)
|
||||
|
||||
# Model-based selection
|
||||
selector = SelectFromModel(
|
||||
RandomForestClassifier(n_estimators=100),
|
||||
threshold='median'
|
||||
)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
# Use class_weight parameter
|
||||
model = RandomForestClassifier(class_weight='balanced', random_state=42)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
# Or use appropriate metrics
|
||||
from sklearn.metrics import balanced_accuracy_score, f1_score
|
||||
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
|
||||
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
|
||||
```
|
||||
|
||||
### Feature Importance
|
||||
|
||||
```python
|
||||
# Tree-based models
|
||||
model = RandomForestClassifier()
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
import pandas as pd
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
model.fit(X_train, y_train)
|
||||
importances = model.feature_importances_
|
||||
|
||||
# Visualize
|
||||
import matplotlib.pyplot as plt
|
||||
indices = np.argsort(importances)[::-1]
|
||||
plt.bar(range(X.shape[1]), importances[indices])
|
||||
plt.xticks(range(X.shape[1]), np.array(feature_names)[indices], rotation=90)
|
||||
plt.show()
|
||||
# Get feature importances
|
||||
importances = pd.DataFrame({
|
||||
'feature': feature_names,
|
||||
'importance': model.feature_importances_
|
||||
}).sort_values('importance', ascending=False)
|
||||
|
||||
# Permutation importance (works for any model)
|
||||
from sklearn.inspection import permutation_importance
|
||||
result = permutation_importance(model, X_test, y_test, n_repeats=10)
|
||||
importances = result.importances_mean
|
||||
print(importances.head(10))
|
||||
```
|
||||
|
||||
## Clustering
|
||||
|
||||
### K-Means
|
||||
### Clustering
|
||||
|
||||
```python
|
||||
from sklearn.cluster import KMeans
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# Always scale for k-means
|
||||
# Scale data first
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Fit k-means
|
||||
# Fit K-Means
|
||||
kmeans = KMeans(n_clusters=3, random_state=42)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
|
||||
# Evaluate
|
||||
from sklearn.metrics import silhouette_score
|
||||
score = silhouette_score(X_scaled, labels)
|
||||
print(f"Silhouette score: {score:.3f}")
|
||||
print(f"Silhouette Score: {score:.3f}")
|
||||
```
|
||||
|
||||
### Elbow Method
|
||||
|
||||
```python
|
||||
inertias = []
|
||||
K_range = range(2, 11)
|
||||
|
||||
for k in K_range:
|
||||
kmeans = KMeans(n_clusters=k, random_state=42)
|
||||
kmeans.fit(X_scaled)
|
||||
inertias.append(kmeans.inertia_)
|
||||
|
||||
plt.plot(K_range, inertias, 'bo-')
|
||||
plt.xlabel('k')
|
||||
plt.ylabel('Inertia')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### DBSCAN
|
||||
|
||||
```python
|
||||
from sklearn.cluster import DBSCAN
|
||||
|
||||
dbscan = DBSCAN(eps=0.5, min_samples=5)
|
||||
labels = dbscan.fit_predict(X_scaled)
|
||||
|
||||
# -1 indicates noise/outliers
|
||||
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
|
||||
n_noise = list(labels).count(-1)
|
||||
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
|
||||
```
|
||||
|
||||
## Dimensionality Reduction
|
||||
|
||||
### PCA
|
||||
### Dimensionality Reduction
|
||||
|
||||
```python
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Always scale before PCA
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Specify n_components
|
||||
# Fit PCA
|
||||
pca = PCA(n_components=2)
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
X_reduced = pca.fit_transform(X_scaled)
|
||||
|
||||
# Or specify variance to retain
|
||||
pca = PCA(n_components=0.95) # Keep 95% variance
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
print(f"Explained variance: {pca.explained_variance_ratio_}")
|
||||
print(f"Components needed: {pca.n_components_}")
|
||||
# Plot
|
||||
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis')
|
||||
plt.xlabel('PC1')
|
||||
plt.ylabel('PC2')
|
||||
plt.title(f'PCA (explained variance: {pca.explained_variance_ratio_.sum():.2%})')
|
||||
```
|
||||
|
||||
### t-SNE (Visualization Only)
|
||||
|
||||
```python
|
||||
from sklearn.manifold import TSNE
|
||||
|
||||
# Reduce to 50 dimensions with PCA first (recommended)
|
||||
pca = PCA(n_components=50)
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
|
||||
# Apply t-SNE
|
||||
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
|
||||
X_tsne = tsne.fit_transform(X_pca)
|
||||
|
||||
# Visualize
|
||||
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
|
||||
plt.colorbar()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Saving and Loading Models
|
||||
### Model Persistence
|
||||
|
||||
```python
|
||||
import joblib
|
||||
@@ -548,78 +289,145 @@ import joblib
|
||||
# Save model
|
||||
joblib.dump(model, 'model.pkl')
|
||||
|
||||
# Save pipeline
|
||||
joblib.dump(pipeline, 'pipeline.pkl')
|
||||
|
||||
# Load
|
||||
model = joblib.load('model.pkl')
|
||||
pipeline = joblib.load('pipeline.pkl')
|
||||
|
||||
# Use loaded model
|
||||
y_pred = model.predict(X_new)
|
||||
# Load model
|
||||
loaded_model = joblib.load('model.pkl')
|
||||
predictions = loaded_model.predict(X_new)
|
||||
```
|
||||
|
||||
## Common Pitfalls and Solutions
|
||||
## Common Gotchas and Solutions
|
||||
|
||||
### Data Leakage
|
||||
❌ **Wrong**: Fit on all data before split
|
||||
```python
|
||||
scaler = StandardScaler().fit(X)
|
||||
X_train, X_test = train_test_split(scaler.transform(X))
|
||||
```
|
||||
# WRONG: Fitting scaler on all data
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
X_train, X_test = train_test_split(X_scaled)
|
||||
|
||||
✅ **Correct**: Use pipeline or fit only on train
|
||||
```python
|
||||
# RIGHT: Fit on training data only
|
||||
X_train, X_test = train_test_split(X)
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
|
||||
pipeline.fit(X_train, y_train)
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# BEST: Use Pipeline
|
||||
from sklearn.pipeline import Pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
pipeline.fit(X_train, y_train) # No leakage!
|
||||
```
|
||||
|
||||
### Not Scaling
|
||||
❌ **Wrong**: Using SVM without scaling
|
||||
```python
|
||||
svm = SVC()
|
||||
svm.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
✅ **Correct**: Scale for SVM
|
||||
```python
|
||||
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Wrong Metric for Imbalanced Data
|
||||
❌ **Wrong**: Using accuracy for 99:1 imbalance
|
||||
```python
|
||||
accuracy = accuracy_score(y_true, y_pred) # Can be misleading
|
||||
```
|
||||
|
||||
✅ **Correct**: Use appropriate metrics
|
||||
```python
|
||||
f1 = f1_score(y_true, y_pred, average='weighted')
|
||||
balanced_acc = balanced_accuracy_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
### Not Using Stratification
|
||||
❌ **Wrong**: Random split for imbalanced data
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
|
||||
```
|
||||
|
||||
✅ **Correct**: Stratify for imbalanced classes
|
||||
### Stratified Splitting for Classification
|
||||
```python
|
||||
# Always use stratify for classification
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=y
|
||||
X, y, test_size=0.2, stratify=y, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Random State for Reproducibility
|
||||
```python
|
||||
# Set random_state for reproducibility
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
```
|
||||
|
||||
### Handling Unknown Categories
|
||||
```python
|
||||
# Use handle_unknown='ignore' for OneHotEncoder
|
||||
encoder = OneHotEncoder(handle_unknown='ignore')
|
||||
```
|
||||
|
||||
### Feature Names with Pipelines
|
||||
```python
|
||||
# Get feature names after transformation
|
||||
preprocessor.fit(X_train)
|
||||
feature_names = preprocessor.get_feature_names_out()
|
||||
```
|
||||
|
||||
## Cheat Sheet: Algorithm Selection
|
||||
|
||||
### Classification
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Binary/Multiclass | Logistic Regression | Fast baseline, interpretability |
|
||||
| Binary/Multiclass | Random Forest | Good default, robust |
|
||||
| Binary/Multiclass | Gradient Boosting | Best accuracy, willing to tune |
|
||||
| Binary/Multiclass | SVM | Small data, complex boundaries |
|
||||
| Binary/Multiclass | Naive Bayes | Text classification, fast |
|
||||
| High dimensions | Linear SVM or Logistic | Text, many features |
|
||||
|
||||
### Regression
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Continuous target | Linear Regression | Fast baseline, interpretability |
|
||||
| Continuous target | Ridge/Lasso | Regularization needed |
|
||||
| Continuous target | Random Forest | Good default, non-linear |
|
||||
| Continuous target | Gradient Boosting | Best accuracy |
|
||||
| Continuous target | SVR | Small data, non-linear |
|
||||
|
||||
### Clustering
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Known K, spherical | K-Means | Fast, simple |
|
||||
| Unknown K, arbitrary shapes | DBSCAN | Noise/outliers present |
|
||||
| Hierarchical structure | Agglomerative | Need dendrogram |
|
||||
| Soft clustering | Gaussian Mixture | Probability estimates |
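
The "soft clustering" row can be made concrete with a short sketch on synthetic blobs: `predict_proba` returns one membership probability per component rather than a single hard label.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignments: each row sums to 1 across the three components
proba = gmm.predict_proba(X)

# Hard labels are the most probable component for each sample
labels = gmm.predict(X)
print(proba[:3].round(3))
```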
|
||||
|
||||
### Dimensionality Reduction
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Linear reduction | PCA | Variance explanation |
|
||||
| Visualization | t-SNE | 2D/3D plots |
|
||||
| Non-negative data | NMF | Images, text |
|
||||
| Sparse data | TruncatedSVD | Text, recommender systems |
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
|
||||
2. **Use HistGradientBoosting** for large datasets (>10K samples)
|
||||
3. **Use MiniBatchKMeans** for large clustering tasks
|
||||
4. **Use IncrementalPCA** for data that doesn't fit in memory
|
||||
5. **Use sparse matrices** for high-dimensional sparse data (text)
|
||||
6. **Cache transformers** in pipelines during grid search
|
||||
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
|
||||
8. **Reduce dimensionality** with PCA before applying expensive algorithms
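
Tip 6 can be sketched as follows, assuming a temporary directory is an acceptable cache location. The dataset and the parameter grid are placeholders. With `memory` set, fitted transformers are cached on disk and reused when the same step is refit on the same data with the same parameters.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Temporary cache directory for fitted transformers
cache_dir = mkdtemp()

pipe = Pipeline(
    [("pca", PCA(n_components=5)), ("clf", LogisticRegression(max_iter=1000))],
    memory=cache_dir,
)

# Only the classifier is tuned, so the PCA fit for each fold can be reused
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["clf__C"]
print(best_C)

# Remove the cache once the search is finished
rmtree(cache_dir)
```

Caching tends to pay off only when the transformer steps are expensive relative to the final estimator.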
|
||||
### Speed Up Training
|
||||
```python
|
||||
# Use n_jobs=-1 for parallel processing
|
||||
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
|
||||
|
||||
# Use warm_start for incremental learning
|
||||
model = RandomForestClassifier(n_estimators=100, warm_start=True)
|
||||
model.fit(X, y)
|
||||
model.n_estimators += 50
|
||||
model.fit(X, y) # Adds 50 more trees
|
||||
|
||||
# Use partial_fit for online learning (assumes numpy as np, `batches` yields (X_batch, y_batch), and `y` holds all labels)
|
||||
from sklearn.linear_model import SGDClassifier
|
||||
model = SGDClassifier()
|
||||
for X_batch, y_batch in batches:
|
||||
model.partial_fit(X_batch, y_batch, classes=np.unique(y))
|
||||
```
|
||||
|
||||
### Memory Efficiency
|
||||
```python
|
||||
# Use sparse matrices
|
||||
from scipy.sparse import csr_matrix
|
||||
X_sparse = csr_matrix(X)
|
||||
|
||||
# Use MiniBatchKMeans for large data
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
|
||||
```
|
||||
|
||||
## Version Check
|
||||
|
||||
```python
|
||||
import sklearn
|
||||
print(f"scikit-learn version: {sklearn.__version__}")
|
||||
```
|
||||
|
||||
## Useful Resources
|
||||
|
||||
- Official Documentation: https://scikit-learn.org/stable/
|
||||
- User Guide: https://scikit-learn.org/stable/user_guide.html
|
||||
- API Reference: https://scikit-learn.org/stable/api/index.html
|
||||
- Examples: https://scikit-learn.org/stable/auto_examples/index.html
|
||||
- Tutorials: https://scikit-learn.org/stable/tutorial/index.html
|
||||
|
||||
@@ -1,261 +1,378 @@
|
||||
# Supervised Learning in scikit-learn
|
||||
# Supervised Learning Reference
|
||||
|
||||
## Overview
|
||||
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
|
||||
|
||||
Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.
|
||||
|
||||
## Linear Models
|
||||
|
||||
### Regression
|
||||
- **LinearRegression**: Ordinary least squares regression
|
||||
- **Ridge**: L2-regularized regression, good for multicollinearity
|
||||
- **Lasso**: L1-regularized regression, performs feature selection
|
||||
- **ElasticNet**: Combined L1/L2 regularization
|
||||
- **LassoLars**: Lasso using Least Angle Regression algorithm
|
||||
- **BayesianRidge**: Bayesian ridge regression; regularization parameters are estimated from the data
|
||||
|
||||
**Linear Regression (`sklearn.linear_model.LinearRegression`)**
|
||||
- Ordinary least squares regression
|
||||
- Fast, interpretable, no hyperparameters
|
||||
- Use when: Linear relationships, interpretability matters
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import LinearRegression
|
||||
|
||||
model = LinearRegression()
|
||||
model.fit(X_train, y_train)
|
||||
predictions = model.predict(X_test)
|
||||
```
|
||||
|
||||
**Ridge Regression (`sklearn.linear_model.Ridge`)**
|
||||
- L2 regularization to prevent overfitting
|
||||
- Key parameter: `alpha` (regularization strength, default=1.0)
|
||||
- Use when: Multicollinearity present, need regularization
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import Ridge
|
||||
|
||||
model = Ridge(alpha=1.0)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**Lasso (`sklearn.linear_model.Lasso`)**
|
||||
- L1 regularization with feature selection
|
||||
- Key parameter: `alpha` (regularization strength)
|
||||
- Use when: Want sparse models, feature selection
|
||||
- Can reduce some coefficients to exactly zero
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import Lasso
|
||||
|
||||
model = Lasso(alpha=0.1)
|
||||
model.fit(X_train, y_train)
|
||||
# Check which features were selected
|
||||
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
|
||||
```
|
||||
|
||||
**ElasticNet (`sklearn.linear_model.ElasticNet`)**
|
||||
- Combines L1 and L2 regularization
|
||||
- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
|
||||
- Use when: Need both feature selection and regularization
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import ElasticNet
|
||||
|
||||
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
|
||||
model.fit(X_train, y_train)
|
||||
```
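To make the role of `l1_ratio` concrete, the sketch below fits `ElasticNet` with `l1_ratio=1.0` and compares it with `Lasso` at the same `alpha`. The synthetic data from `make_regression` and all variable names are illustrative assumptions, not part of the reference above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem (illustrative data only)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# With l1_ratio=1.0 the ElasticNet penalty reduces to the Lasso penalty
same_as_lasso = np.allclose(lasso.coef_, enet_as_lasso.coef_)
print("l1_ratio=1.0 matches Lasso:", same_as_lasso)
print("Zero coefficients (Lasso):", int(np.sum(lasso.coef_ == 0)))
print("Zero coefficients (l1_ratio=0.5):", int(np.sum(enet_mixed.coef_ == 0)))
```

Intermediate `l1_ratio` values trade sparsity against the stability of the L2 penalty; in practice both `alpha` and `l1_ratio` would usually be chosen by cross-validation.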
|
||||
|
||||
### Classification
|
||||
- **LogisticRegression**: Binary and multiclass classification
|
||||
- **RidgeClassifier**: Ridge regression for classification
|
||||
- **SGDClassifier**: Linear classifiers with SGD training
|
||||
|
||||
**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
|
||||
**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
|
||||
- Binary and multiclass classification
|
||||
- Key parameters: `C` (inverse regularization strength; smaller = stronger), `penalty` ('l1', 'l2', 'elasticnet'; availability depends on `solver`)
|
||||
- Returns probability estimates
|
||||
- Use when: Need probabilistic predictions, interpretability
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
**Key parameters**:
|
||||
- `alpha`: Regularization strength (higher = more regularization)
|
||||
- `fit_intercept`: Whether to calculate intercept
|
||||
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
|
||||
model = LogisticRegression(C=1.0, max_iter=1000)
|
||||
model.fit(X_train, y_train)
|
||||
probas = model.predict_proba(X_test)
|
||||
```
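Because `C` is the inverse of the regularization strength, smaller values shrink the coefficients more. The following sketch illustrates this on synthetic data; the dataset and the chosen `C` values are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Smaller C means a stronger L2 penalty, hence a smaller coefficient norm
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: ||coef|| = {coef_norms[C]:.3f}")
```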
|
||||
|
||||
## Support Vector Machines (SVM)
|
||||
**Stochastic Gradient Descent (SGD)**
|
||||
- `SGDClassifier`, `SGDRegressor`
|
||||
- Efficient for large-scale learning
|
||||
- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
|
||||
- Use when: Very large datasets (roughly >10^5 samples)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.linear_model import SGDClassifier
|
||||
|
||||
- **SVC**: Support Vector Classification
|
||||
- **SVR**: Support Vector Regression
|
||||
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
|
||||
- **OneClassSVM**: Unsupervised outlier detection
|
||||
model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
|
||||
## Support Vector Machines
|
||||
|
||||
**Key parameters**:
|
||||
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
|
||||
- `C`: Regularization parameter (lower = more regularization)
|
||||
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
|
||||
- `degree`: Polynomial degree (for poly kernel)
|
||||
**SVC (`sklearn.svm.SVC`)**
|
||||
- Classification with kernel methods
|
||||
- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
|
||||
- Use when: Small to medium datasets, complex decision boundaries
|
||||
- Note: Does not scale well to large datasets (fit time grows at least quadratically with the number of samples)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.svm import SVC
|
||||
|
||||
**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets with linear kernel.
|
||||
# Linear kernel for linearly separable data
|
||||
model_linear = SVC(kernel='linear', C=1.0)
|
||||
|
||||
# RBF kernel for non-linear data
|
||||
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
|
||||
model_rbf.fit(X_train, y_train)
|
||||
```
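The advice to prefer `LinearSVC` for large, linearly separable problems can be sketched as follows. Both models are wrapped in a scaling pipeline because SVMs are sensitive to feature scale; the dataset and the `scores` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

candidates = {
    'SVC (rbf)': make_pipeline(StandardScaler(),
                               SVC(kernel='rbf', C=1.0, gamma='scale')),
    'LinearSVC': make_pipeline(StandardScaler(),
                               LinearSVC(C=1.0, max_iter=5000)),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:10s}: {scores[name]:.3f}")
```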
|
||||
|
||||
**SVR (`sklearn.svm.SVR`)**
|
||||
- Regression with kernel methods
|
||||
- Similar parameters to SVC
|
||||
- Additional parameter: `epsilon` (tube width)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.svm import SVR
|
||||
|
||||
model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Decision Trees
|
||||
|
||||
- **DecisionTreeClassifier**: Classification tree
|
||||
- **DecisionTreeRegressor**: Regression tree
|
||||
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
|
||||
**DecisionTreeClassifier / DecisionTreeRegressor**
|
||||
- Non-parametric model learning decision rules
|
||||
- Key parameters:
|
||||
- `max_depth`: Maximum tree depth (prevents overfitting)
|
||||
- `min_samples_split`: Minimum samples to split a node
|
||||
- `min_samples_leaf`: Minimum samples in leaf
|
||||
- `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
|
||||
- Use when: Need interpretable model, non-linear relationships, mixed feature types
|
||||
- Prone to overfitting - use ensembles or pruning
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.tree import DecisionTreeClassifier
|
||||
|
||||
**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
|
||||
model = DecisionTreeClassifier(
|
||||
max_depth=5,
|
||||
min_samples_split=20,
|
||||
min_samples_leaf=10,
|
||||
criterion='gini'
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
**Key parameters**:
|
||||
- `max_depth`: Maximum tree depth (controls overfitting)
|
||||
- `min_samples_split`: Minimum samples to split a node
|
||||
- `min_samples_leaf`: Minimum samples in leaf node
|
||||
- `max_features`: Number of features to consider for splits
|
||||
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
|
||||
|
||||
**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
|
||||
# Visualize the tree
|
||||
from sklearn.tree import plot_tree
|
||||
plot_tree(model, feature_names=feature_names, class_names=class_names)
|
||||
```
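Cost-complexity pruning with `ccp_alpha` can be explored through `cost_complexity_pruning_path`. The sketch below picks a mid-range alpha from that path purely for illustration; in practice the value would normally be selected by cross-validation. Dataset and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Effective alphas at which subtrees get pruned, in increasing order
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
mid_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]

pruned_tree = DecisionTreeClassifier(ccp_alpha=mid_alpha, random_state=0)
pruned_tree.fit(X_train, y_train)

print("Nodes (unpruned):", full_tree.tree_.node_count)
print("Nodes (pruned):  ", pruned_tree.tree_.node_count)
print("Test accuracy (unpruned):", full_tree.score(X_test, y_test))
print("Test accuracy (pruned):  ", pruned_tree.score(X_test, y_test))
```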
|
||||
|
||||
## Ensemble Methods
|
||||
|
||||
### Random Forests
|
||||
- **RandomForestClassifier**: Ensemble of decision trees
|
||||
- **RandomForestRegressor**: Regression variant
|
||||
|
||||
**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
|
||||
**RandomForestClassifier / RandomForestRegressor**
|
||||
- Ensemble of decision trees with bagging
|
||||
- Key parameters:
|
||||
- `n_estimators`: Number of trees (default=100)
|
||||
- `max_depth`: Maximum tree depth
|
||||
- `max_features`: Features to consider for splits ('sqrt', 'log2', or int)
|
||||
- `min_samples_split`, `min_samples_leaf`: Control tree growth
|
||||
- Use when: High accuracy needed, can afford computation
|
||||
- Provides feature importance
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
**Key parameters**:
|
||||
- `n_estimators`: Number of trees (higher = better but slower)
|
||||
- `max_depth`: Maximum tree depth
|
||||
- `max_features`: Features per split ('sqrt', 'log2', int, float)
|
||||
- `bootstrap`: Whether to use bootstrap samples
|
||||
- `n_jobs`: Parallel processing (-1 uses all cores)
|
||||
model = RandomForestClassifier(
|
||||
n_estimators=100,
|
||||
max_depth=10,
|
||||
max_features='sqrt',
|
||||
n_jobs=-1 # Use all CPU cores
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
# Feature importance
|
||||
importances = model.feature_importances_
|
||||
```
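Impurity-based `feature_importances_` are computed from the training data and can favor high-cardinality features. As a hedged complement, the sketch below also computes permutation importance on held-out data; the dataset and names such as `ranking` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances: normalized to sum to 1
impurity_importances = forest.feature_importances_

# Permutation importances: drop in score when a feature is shuffled
perm = permutation_importance(forest, X_test, y_test,
                              n_repeats=5, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]

for idx in ranking:
    print(f"feature {idx}: impurity={impurity_importances[idx]:.3f}, "
          f"permutation={perm.importances_mean[idx]:.3f}")
```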
|
||||
|
||||
### Gradient Boosting
|
||||
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
|
||||
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets
|
||||
|
||||
**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
|
||||
**GradientBoostingClassifier / GradientBoostingRegressor**
|
||||
- Sequential ensemble building trees on residuals
|
||||
- Key parameters:
|
||||
- `n_estimators`: Number of boosting stages
|
||||
- `learning_rate`: Shrinks contribution of each tree
|
||||
- `max_depth`: Depth of individual trees (typically 3-5)
|
||||
- `subsample`: Fraction of samples for training each tree
|
||||
- Use when: Need high accuracy, can afford training time
|
||||
- Often among the most accurate models on structured/tabular data
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import GradientBoostingClassifier
|
||||
|
||||
**Key parameters**:
|
||||
- `n_estimators`: Number of boosting stages
|
||||
- `learning_rate`: Shrinks contribution of each tree
|
||||
- `max_depth`: Maximum tree depth (typically 3-8)
|
||||
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
|
||||
- `early_stopping`: Stop when validation score stops improving (HistGradientBoosting only; GradientBoosting uses `n_iter_no_change`)
|
||||
model = GradientBoostingClassifier(
|
||||
n_estimators=100,
|
||||
learning_rate=0.1,
|
||||
max_depth=3,
|
||||
subsample=0.8
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**Performance tip**: HistGradientBoosting can be orders of magnitude faster than GradientBoosting once the dataset reaches tens of thousands of samples
|
||||
**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
|
||||
- Faster gradient boosting with histogram-based algorithm
|
||||
- Native support for missing values and categorical features
|
||||
- Key parameters: Similar to GradientBoosting, but the number of boosting iterations is set with `max_iter` rather than `n_estimators`
|
||||
- Use when: Large datasets, need faster training
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import HistGradientBoostingClassifier
|
||||
|
||||
### AdaBoost
|
||||
- **AdaBoostClassifier/Regressor**: Adaptive boosting
|
||||
model = HistGradientBoostingClassifier(
|
||||
max_iter=100,
|
||||
learning_rate=0.1,
|
||||
max_depth=None, # No limit by default
|
||||
categorical_features='from_dtype' # Use pandas categorical dtypes (needs DataFrame input, scikit-learn >= 1.4)
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**Use cases**: Boosting weak learners such as decision stumps; note that AdaBoost can be sensitive to noisy labels and outliers
|
||||
### Other Ensemble Methods
|
||||
|
||||
**Key parameters**:
|
||||
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
|
||||
- `n_estimators`: Number of boosting iterations
|
||||
- `learning_rate`: Weight applied to each classifier
|
||||
**AdaBoost**
|
||||
- Adaptive boosting focusing on misclassified samples
|
||||
- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
|
||||
- Use when: Simple boosting approach needed
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import AdaBoostClassifier
|
||||
|
||||
### Bagging
|
||||
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
|
||||
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
**Use cases**: Reducing variance of unstable models, parallel ensemble creation
|
||||
**Voting Classifier / Regressor**
|
||||
- Combines predictions from multiple models
|
||||
- Types: 'hard' (majority vote) or 'soft' (average probabilities)
|
||||
- Use when: Want to ensemble different model types
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import VotingClassifier
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
from sklearn.tree import DecisionTreeClassifier
|
||||
from sklearn.svm import SVC
|
||||
|
||||
**Key parameters**:
|
||||
- `estimator`: Base estimator to fit
|
||||
- `n_estimators`: Number of estimators
|
||||
- `max_samples`: Samples to draw per estimator
|
||||
- `bootstrap`: Whether to use replacement
|
||||
model = VotingClassifier(
|
||||
estimators=[
|
||||
('lr', LogisticRegression()),
|
||||
('dt', DecisionTreeClassifier()),
|
||||
('svc', SVC(probability=True))
|
||||
],
|
||||
voting='soft'
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Voting & Stacking
|
||||
- **VotingClassifier/Regressor**: Combines different model types
|
||||
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
|
||||
**Stacking Classifier / Regressor**
|
||||
- Trains a meta-model on predictions from base models
|
||||
- Learns how to combine base models instead of using a fixed vote; the meta-model is trained on cross-validated base predictions
|
||||
- Key parameter: `final_estimator` (meta-learner)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.ensemble import StackingClassifier
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
from sklearn.tree import DecisionTreeClassifier
|
||||
from sklearn.svm import SVC
|
||||
|
||||
**Use cases**: Combining diverse models, leveraging different model strengths
|
||||
model = StackingClassifier(
|
||||
estimators=[
|
||||
('dt', DecisionTreeClassifier()),
|
||||
('svc', SVC())
|
||||
],
|
||||
final_estimator=LogisticRegression()
|
||||
)
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Neural Networks
|
||||
## K-Nearest Neighbors
|
||||
|
||||
- **MLPClassifier**: Multi-layer perceptron classifier
|
||||
- **MLPRegressor**: Multi-layer perceptron regressor
|
||||
**KNeighborsClassifier / KNeighborsRegressor**
|
||||
- Non-parametric method based on distance
|
||||
- Key parameters:
|
||||
- `n_neighbors`: Number of neighbors (default=5)
|
||||
- `weights`: 'uniform' or 'distance'
|
||||
- `metric`: Distance metric ('euclidean', 'manhattan', etc.)
|
||||
- Use when: Small dataset, simple baseline needed
|
||||
- Slow prediction on large datasets
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.neighbors import KNeighborsClassifier
|
||||
|
||||
**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
|
||||
|
||||
**Key parameters**:
|
||||
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
|
||||
- `activation`: 'relu', 'tanh', 'logistic'
|
||||
- `solver`: 'adam', 'lbfgs', 'sgd'
|
||||
- `alpha`: L2 regularization term
|
||||
- `learning_rate`: Learning rate schedule
|
||||
- `early_stopping`: Stop when validation score stops improving
|
||||
|
||||
**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
|
||||
|
||||
## Nearest Neighbors
|
||||
|
||||
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
|
||||
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
|
||||
- **NearestCentroid**: Classification using class centroids
|
||||
|
||||
**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
|
||||
|
||||
**Key parameters**:
|
||||
- `n_neighbors`: Number of neighbors (typically 3-11)
|
||||
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
|
||||
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
|
||||
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
|
||||
model = KNeighborsClassifier(n_neighbors=5, weights='distance')
|
||||
model.fit(X_train, y_train)
|
||||
```
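Since KNN relies on distances, features are usually scaled first, and `n_neighbors` is typically tuned. A minimal sketch under those assumptions, using a hypothetical parameter grid and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),   # distances are scale-dependent
    ('knn', KNeighborsClassifier()),
])

# Hypothetical grid; the ranges are illustrative, not recommendations
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```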
|
||||
|
||||
## Naive Bayes
|
||||
|
||||
- **GaussianNB**: Assumes Gaussian distribution of features
|
||||
- **MultinomialNB**: For discrete counts (text classification)
|
||||
- **BernoulliNB**: For binary/boolean features
|
||||
- **CategoricalNB**: For categorical features
|
||||
- **ComplementNB**: Adapted for imbalanced datasets
|
||||
**GaussianNB, MultinomialNB, BernoulliNB**
|
||||
- Probabilistic classifiers based on Bayes' theorem
|
||||
- Fast training and prediction
|
||||
- GaussianNB: Continuous features (assumes Gaussian distribution)
|
||||
- MultinomialNB: Count features (text classification)
|
||||
- BernoulliNB: Binary features
|
||||
- Use when: Text classification, fast baseline, probabilistic predictions
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.naive_bayes import GaussianNB, MultinomialNB
|
||||
|
||||
**Use cases**: Text classification, fast baseline, when features are independent, small training sets
|
||||
# For continuous features
|
||||
model_gaussian = GaussianNB()
|
||||
|
||||
**Key parameters**:
|
||||
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
|
||||
- `fit_prior`: Whether to learn class prior probabilities
|
||||
# For text/count data
|
||||
model_multinomial = MultinomialNB(alpha=1.0) # alpha is smoothing parameter
|
||||
model_multinomial.fit(X_train, y_train)
|
||||
```
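For the text-classification use case, `MultinomialNB` is usually paired with a count-based vectorizer. The tiny corpus and labels below are made up solely to show the wiring; they are not real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration)
texts = [
    "win free money now",
    "free prize claim your money",
    "meeting agenda for the project",
    "project schedule and meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer produces the word counts that MultinomialNB expects
text_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
text_clf.fit(texts, labels)

prediction = text_clf.predict(["win a free prize"])[0]
print("Predicted label:", prediction)
```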
|
||||
|
||||
## Linear/Quadratic Discriminant Analysis
|
||||
## Neural Networks
|
||||
|
||||
- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
|
||||
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
|
||||
**MLPClassifier / MLPRegressor**
|
||||
- Multi-layer perceptron (feedforward neural network)
|
||||
- Key parameters:
|
||||
- `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
|
||||
- `activation`: 'relu', 'tanh', 'logistic'
|
||||
- `solver`: 'adam', 'sgd', 'lbfgs'
|
||||
- `alpha`: L2 regularization parameter
|
||||
- `learning_rate`: 'constant', 'invscaling', 'adaptive' (only used when `solver='sgd'`)
|
||||
- Use when: Complex non-linear patterns, large datasets
|
||||
- Requires feature scaling
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.neural_network import MLPClassifier
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
|
||||
# Scale features first
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
|
||||
## Gaussian Processes
|
||||
model = MLPClassifier(
|
||||
hidden_layer_sizes=(100, 50),
|
||||
activation='relu',
|
||||
solver='adam',
|
||||
alpha=0.0001,
|
||||
max_iter=1000
|
||||
)
|
||||
model.fit(X_train_scaled, y_train)
|
||||
```
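One way to ensure that test data is scaled with the training statistics is to put the scaler and the MLP into a single pipeline. The sketch below assumes synthetic data and a small, arbitrarily chosen network size.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

mlp_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), activation='relu',
                  solver='adam', alpha=1e-4,
                  early_stopping=True,  # holds out part of the training set
                  max_iter=500, random_state=0),
)
mlp_pipeline.fit(X_train, y_train)

# The pipeline applies the fitted scaler to X_test automatically
test_accuracy = mlp_pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```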
|
||||
|
||||
- **GaussianProcessClassifier**: Probabilistic classification
|
||||
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
|
||||
## Algorithm Selection Guide
|
||||
|
||||
**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
|
||||
### Choose based on:
|
||||
|
||||
**Key parameters**:
|
||||
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
|
||||
- `alpha`: Noise level
|
||||
**Dataset size:**
|
||||
- Small (<1k samples): KNN, SVM, Decision Trees
|
||||
- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
|
||||
- Large (>100k): SGD, Linear Models, HistGradientBoosting
|
||||
|
||||
**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
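The uncertainty estimates that motivate Gaussian processes can be obtained with `return_std=True`. This is a small sketch on a noisy sine curve; the kernel choice and data are assumptions for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy samples of sin(x) (illustrative data only)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(30, 1))
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=30)

# Smooth signal (RBF) plus observation noise (WhiteKernel)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                               random_state=0)
gpr.fit(X_train, y_train)

X_new = np.linspace(0, 10, 50).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)

print("Learned kernel:", gpr.kernel_)
print("Largest predictive std:", float(std.max()))
```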
|
||||
**Interpretability:**
|
||||
- High: Linear Models, Decision Trees
|
||||
- Medium: Random Forest (feature importance)
|
||||
- Low: SVM with RBF kernel, Neural Networks
|
||||
|
||||
## Stochastic Gradient Descent
|
||||
**Accuracy vs Speed:**
|
||||
- Fast training: Naive Bayes, Linear Models, KNN
|
||||
- High accuracy: Gradient Boosting, Random Forest, Stacking
|
||||
- Fast prediction: Linear Models, Naive Bayes
|
||||
- Slow prediction: KNN (on large datasets), SVM
|
||||
|
||||
- **SGDClassifier**: Linear classifiers with SGD
|
||||
- **SGDRegressor**: Linear regressors with SGD
|
||||
**Feature types:**
|
||||
- Continuous: Most algorithms work well
|
||||
- Categorical: Trees, HistGradientBoosting (native support)
|
||||
- Mixed: Trees, Gradient Boosting
|
||||
- Text: Naive Bayes, Linear Models with TF-IDF
|
||||
|
||||
**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
|
||||
|
||||
**Key parameters**:
|
||||
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
|
||||
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
|
||||
- `alpha`: Regularization strength
|
||||
- `learning_rate`: Learning rate schedule
|
||||
|
||||
## Semi-Supervised Learning
|
||||
|
||||
- **SelfTrainingClassifier**: Self-training with any base classifier
|
||||
- **LabelPropagation**: Label propagation through graph
|
||||
- **LabelSpreading**: Label spreading (modified label propagation)
|
||||
|
||||
**Use cases**: When labeled data is scarce but unlabeled data is abundant
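A minimal self-training sketch, assuming synthetic data in which about 80% of the labels are hidden. scikit-learn marks unlabeled samples with `-1`; the base classifier and threshold below are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)

# Hide roughly 80% of the labels
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.8
y_partial = y.copy()
y_partial[unlabeled] = -1  # -1 means "unlabeled"

# The base classifier must provide predict_proba
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                       threshold=0.9)
self_training.fit(X, y_partial)

accuracy_on_unlabeled = self_training.score(X[unlabeled], y[unlabeled])
print(f"Accuracy on originally unlabeled samples: {accuracy_on_unlabeled:.3f}")
```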
|
||||
|
||||
## Feature Selection
|
||||
|
||||
- **VarianceThreshold**: Remove low-variance features
|
||||
- **SelectKBest**: Select K highest scoring features
|
||||
- **SelectPercentile**: Select top percentile of features
|
||||
- **RFE**: Recursive feature elimination
|
||||
- **RFECV**: RFE with cross-validation
|
||||
- **SelectFromModel**: Select features based on importance
|
||||
- **SequentialFeatureSelector**: Forward/backward feature selection
|
||||
|
||||
**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
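Feature selectors are ordinary transformers, so placing them inside a pipeline keeps the selection step inside each cross-validation fold. A hedged sketch with `SelectKBest`, where `k=5` and the dataset are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),  # keep the 5 best-scoring features
    LogisticRegression(max_iter=1000),
)

# Selection is re-fit inside every fold, avoiding leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")

pipe.fit(X, y)
selected_mask = pipe.named_steps['selectkbest'].get_support()
print("Selected feature indices:", selected_mask.nonzero()[0])
```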
|
||||
|
||||
## Probability Calibration
|
||||
|
||||
- **CalibratedClassifierCV**: Calibrate classifier probabilities
|
||||
|
||||
**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
|
||||
|
||||
**Methods**:
|
||||
- `sigmoid`: Platt scaling
|
||||
- `isotonic`: Isotonic regression (more flexible, needs more data)
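As a sketch of Platt scaling, `LinearSVC` (which has no `predict_proba`) can be wrapped in `CalibratedClassifierCV`. The dataset and the `cv=5` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem (illustrative data only)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# method='sigmoid' is Platt scaling; 'isotonic' is the alternative
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000),
                                    method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)
print("First three probability rows:")
print(np.round(proba[:3], 3))
```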
|
||||
|
||||
## Multi-Output Methods
|
||||
|
||||
- **MultiOutputClassifier**: Fit one classifier per target
|
||||
- **MultiOutputRegressor**: Fit one regressor per target
|
||||
- **ClassifierChain**: Models dependencies between targets
|
||||
- **RegressorChain**: Regression variant
|
||||
|
||||
**Use cases**: Predicting multiple related targets simultaneously
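The difference between independent per-target models and a chain can be sketched as follows. `Ridge` as the base regressor, the chain order, and the synthetic three-target dataset are all illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

# Synthetic problem with three targets (illustrative data only)
X, y = make_regression(n_samples=200, n_features=8, n_targets=3,
                       noise=1.0, random_state=0)

# One independent Ridge model per target
independent = MultiOutputRegressor(Ridge(alpha=1.0)).fit(X, y)

# Each model in the chain also sees the predictions for earlier targets
chained = RegressorChain(Ridge(alpha=1.0), order=[0, 1, 2]).fit(X, y)

pred_independent = independent.predict(X)
pred_chained = chained.predict(X)
print("Prediction shapes:", pred_independent.shape, pred_chained.shape)
```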
|
||||
|
||||
## Specialized Regression
|
||||
|
||||
- **IsotonicRegression**: Monotonic regression
|
||||
- **QuantileRegressor**: Quantile regression for prediction intervals
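A short sketch of `IsotonicRegression`, which fits a non-decreasing step function to noisy data. The logarithmic trend and noise level below are invented for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Noisy but increasing trend (illustrative data only)
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = np.log1p(x) + rng.normal(scale=0.3, size=50)

iso = IsotonicRegression(increasing=True, out_of_bounds='clip')
y_fit = iso.fit_transform(x, y)

# The fitted values never decrease as x grows
is_monotonic = bool(np.all(np.diff(y_fit) >= -1e-12))
print("Fitted values are non-decreasing:", is_monotonic)

# Inputs outside the training range are clipped to the boundary values
print("Prediction at x=100:", iso.predict([100.0])[0])
```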
|
||||
|
||||
## Algorithm Selection Guidelines
|
||||
|
||||
**Start with**:
|
||||
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
|
||||
2. **RandomForestClassifier/Regressor** for general non-linear problems
|
||||
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed
|
||||
|
||||
**Consider dataset size**:
|
||||
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
|
||||
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
|
||||
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
|
||||
|
||||
**Consider interpretability needs**:
|
||||
- High interpretability: Linear models, Decision Trees, Naive Bayes
|
||||
- Medium: Random Forests (feature importance), Rule extraction
|
||||
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
|
||||
|
||||
**Consider training time**:
|
||||
- Fast: Linear models, Naive Bayes, Decision Trees
|
||||
- Medium: Random Forests (parallelizable), SVM (small data)
|
||||
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
|
||||
**Common starting points:**
|
||||
1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
|
||||
2. Random Forest - good default choice
|
||||
3. Gradient Boosting - optimize for best accuracy
|
||||
|
||||
File diff suppressed because it is too large
@@ -1,219 +1,257 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Complete classification pipeline with preprocessing, training, evaluation, and hyperparameter tuning.
|
||||
Demonstrates best practices for scikit-learn workflows.
|
||||
Complete classification pipeline example with preprocessing, model training,
|
||||
hyperparameter tuning, and evaluation.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
|
||||
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
|
||||
import joblib
|
||||
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
from sklearn.metrics import (
|
||||
classification_report, confusion_matrix, roc_auc_score,
|
||||
accuracy_score, precision_score, recall_score, f1_score
|
||||
)
|
||||
import warnings
|
||||
warnings.filterwarnings('ignore')
|
||||
|
||||
|
||||
def create_preprocessing_pipeline(numeric_features, categorical_features):
|
||||
"""
|
||||
Create preprocessing pipeline for mixed data types.
|
||||
Create a preprocessing pipeline for mixed data types.
|
||||
|
||||
Args:
|
||||
numeric_features: List of numeric column names
|
||||
categorical_features: List of categorical column names
|
||||
Parameters:
|
||||
-----------
|
||||
numeric_features : list
|
||||
List of numeric feature column names
|
||||
categorical_features : list
|
||||
List of categorical feature column names
|
||||
|
||||
Returns:
|
||||
ColumnTransformer with appropriate preprocessing for each data type
|
||||
--------
|
||||
ColumnTransformer
|
||||
Preprocessing pipeline
|
||||
"""
|
||||
# Numeric preprocessing
|
||||
numeric_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
# Categorical preprocessing
|
||||
categorical_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
|
||||
])
|
||||
|
||||
# Combine transformers
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
]
|
||||
)
|
||||
|
||||
return preprocessor
|
||||
|
||||
|
||||
def create_full_pipeline(preprocessor, classifier=None):
|
||||
def train_and_evaluate_model(X, y, numeric_features, categorical_features,
|
||||
test_size=0.2, random_state=42):
|
||||
"""
|
||||
Create complete ML pipeline with preprocessing and classification.
|
||||
Complete pipeline: preprocess, train, tune, and evaluate a classifier.
|
||||
|
||||
Args:
|
||||
preprocessor: Preprocessing ColumnTransformer
|
||||
classifier: Classifier instance (default: RandomForestClassifier)
|
||||
Parameters:
|
||||
-----------
|
||||
X : DataFrame or array
|
||||
Feature matrix
|
||||
y : Series or array
|
||||
Target variable
|
||||
numeric_features : list
|
||||
List of numeric feature names
|
||||
categorical_features : list
|
||||
List of categorical feature names
|
||||
test_size : float
|
||||
Proportion of data for testing
|
||||
random_state : int
|
||||
Random seed
|
||||
|
||||
Returns:
|
||||
Complete Pipeline
|
||||
--------
|
||||
dict
|
||||
Dictionary containing trained model, predictions, and metrics
|
||||
"""
|
||||
if classifier is None:
|
||||
classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
|
||||
# Split data with stratification
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=test_size, stratify=y, random_state=random_state
|
||||
)
|
||||
|
||||
pipeline = Pipeline(steps=[
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', classifier)
|
||||
])
|
||||
print(f"Training set size: {len(X_train)}")
|
||||
print(f"Test set size: {len(X_test)}")
|
||||
print(f"Class distribution in training: {pd.Series(y_train).value_counts().to_dict()}")
|
||||
|
||||
return pipeline
|
||||
# Create preprocessor
|
||||
preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)
|
||||
|
||||
|
||||
def evaluate_model(pipeline, X_train, y_train, X_test, y_test, cv=5):
|
||||
"""
|
||||
Evaluate model using cross-validation and test set.
|
||||
|
||||
Args:
|
||||
pipeline: Trained pipeline
|
||||
X_train, y_train: Training data
|
||||
X_test, y_test: Test data
|
||||
cv: Number of cross-validation folds
|
||||
|
||||
Returns:
|
||||
Dictionary with evaluation results
|
||||
"""
|
||||
# Cross-validation on training set
|
||||
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')
|
||||
|
||||
# Test set evaluation
|
||||
y_pred = pipeline.predict(X_test)
|
||||
test_score = pipeline.score(X_test, y_test)
|
||||
|
||||
# Get probabilities if available
|
||||
try:
|
||||
y_proba = pipeline.predict_proba(X_test)
|
||||
if len(np.unique(y_test)) == 2:
|
||||
# Binary classification
|
||||
auc = roc_auc_score(y_test, y_proba[:, 1])
|
||||
else:
|
||||
# Multiclass
|
||||
auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
|
||||
except (AttributeError, ValueError): # predict_proba unavailable or AUC undefined
|
||||
auc = None
|
||||
|
||||
results = {
|
||||
'cv_mean': cv_scores.mean(),
|
||||
'cv_std': cv_scores.std(),
|
||||
'test_score': test_score,
|
||||
'auc': auc,
|
||||
'classification_report': classification_report(y_test, y_pred),
|
||||
'confusion_matrix': confusion_matrix(y_test, y_pred)
|
||||
# Define models to compare
|
||||
models = {
|
||||
'Logistic Regression': Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', LogisticRegression(max_iter=1000, random_state=random_state))
|
||||
]),
|
||||
'Random Forest': Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=random_state))
|
||||
]),
|
||||
'Gradient Boosting': Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', GradientBoostingClassifier(n_estimators=100, random_state=random_state))
|
||||
])
|
||||
}
|
||||
|
||||
return results
|
||||
# Compare models using cross-validation
|
||||
print("\n" + "="*60)
|
||||
print("Model Comparison (5-Fold Cross-Validation)")
|
||||
print("="*60)
|
||||
|
||||
cv_results = {}
|
||||
for name, model in models.items():
|
||||
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
|
||||
cv_results[name] = scores.mean()
|
||||
print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
|
||||
|
||||
def tune_hyperparameters(pipeline, X_train, y_train, param_grid, cv=5):
|
||||
"""
|
||||
Perform hyperparameter tuning using GridSearchCV.
|
||||
# Select best model based on CV
|
||||
best_model_name = max(cv_results, key=cv_results.get)
|
||||
best_model = models[best_model_name]
|
||||
|
||||
Args:
|
||||
pipeline: Pipeline to tune
|
||||
X_train, y_train: Training data
|
||||
param_grid: Dictionary of parameters to search
|
||||
cv: Number of cross-validation folds
|
||||
print(f"\nBest model: {best_model_name}")
|
||||
|
||||
# Hyperparameter tuning for best model
|
||||
if best_model_name == 'Random Forest':
|
||||
param_grid = {
|
||||
'classifier__n_estimators': [100, 200],
|
||||
'classifier__max_depth': [10, 20, None],
|
||||
'classifier__min_samples_split': [2, 5]
|
||||
}
|
||||
elif best_model_name == 'Gradient Boosting':
|
||||
param_grid = {
|
||||
'classifier__n_estimators': [100, 200],
|
||||
'classifier__learning_rate': [0.01, 0.1],
|
||||
'classifier__max_depth': [3, 5]
|
||||
}
|
||||
else: # Logistic Regression
|
||||
param_grid = {
|
||||
'classifier__C': [0.1, 1.0, 10.0],
|
||||
'classifier__penalty': ['l2']
|
||||
}
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("Hyperparameter Tuning")
|
||||
print("="*60)
|
||||
|
||||
Returns:
|
||||
GridSearchCV object with best model
|
||||
"""
|
||||
grid_search = GridSearchCV(
|
||||
pipeline,
|
||||
param_grid,
|
||||
cv=cv,
|
||||
scoring='f1_weighted',
|
||||
n_jobs=-1,
|
||||
verbose=1
|
||||
best_model, param_grid, cv=5, scoring='accuracy',
|
||||
n_jobs=-1, verbose=0
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best CV score: {grid_search.best_score_:.3f}")
|
||||
print(f"Best CV score: {grid_search.best_score_:.4f}")
|
||||
|
||||
return grid_search
|
||||
# Evaluate on test set
|
||||
tuned_model = grid_search.best_estimator_
|
||||
y_pred = tuned_model.predict(X_test)
|
||||
y_pred_proba = tuned_model.predict_proba(X_test)
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("Test Set Evaluation")
|
||||
print("="*60)
|
||||
|
||||
def main():
|
||||
"""
|
||||
Example usage of the classification pipeline.
|
||||
"""
|
||||
# Load your data here
|
||||
# X, y = load_data()
|
||||
# Calculate metrics
|
||||
accuracy = accuracy_score(y_test, y_pred)
|
||||
precision = precision_score(y_test, y_pred, average='weighted')
|
||||
recall = recall_score(y_test, y_pred, average='weighted')
|
||||
f1 = f1_score(y_test, y_pred, average='weighted')
|
||||
|
||||
# Example with synthetic data
|
||||
from sklearn.datasets import make_classification
|
||||
X, y = make_classification(
|
||||
n_samples=1000,
|
||||
n_features=20,
|
||||
n_informative=15,
|
||||
n_redundant=5,
|
||||
random_state=42
|
||||
)
|
||||
print(f"Accuracy: {accuracy:.4f}")
|
||||
print(f"Precision: {precision:.4f}")
|
||||
print(f"Recall: {recall:.4f}")
|
||||
print(f"F1-Score: {f1:.4f}")
|
||||
|
||||
# Convert to DataFrame for demonstration
|
||||
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
|
||||
X = pd.DataFrame(X, columns=feature_names)
|
||||
# ROC AUC (if binary classification)
|
||||
if len(np.unique(y)) == 2:
|
||||
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
|
||||
print(f"ROC AUC: {roc_auc:.4f}")
|
||||
|
||||
# Split features into numeric and categorical (all numeric in this example)
|
||||
numeric_features = feature_names
|
||||
categorical_features = []
|
||||
|
||||
# Split data (use stratify for imbalanced classes)
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42, stratify=y
|
||||
)
|
||||
|
||||
# Create preprocessing pipeline
|
||||
preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)
|
||||
|
||||
# Create full pipeline
|
||||
pipeline = create_full_pipeline(preprocessor)
|
||||
|
||||
# Train model
|
||||
print("Training model...")
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Evaluate model
|
||||
print("\nEvaluating model...")
|
||||
results = evaluate_model(pipeline, X_train, y_train, X_test, y_test)
|
||||
|
||||
print(f"CV Accuracy: {results['cv_mean']:.3f} (+/- {results['cv_std']:.3f})")
|
||||
print(f"Test Accuracy: {results['test_score']:.3f}")
|
||||
if results['auc']:
|
||||
print(f"ROC-AUC: {results['auc']:.3f}")
|
||||
print("\nClassification Report:")
|
||||
print(results['classification_report'])
|
||||
|
||||
# Hyperparameter tuning (optional)
|
||||
print("\nTuning hyperparameters...")
|
||||
param_grid = {
|
||||
'classifier__n_estimators': [100, 200],
|
||||
'classifier__max_depth': [10, 20, None],
|
||||
'classifier__min_samples_split': [2, 5]
|
||||
}
|
||||
|
||||
grid_search = tune_hyperparameters(pipeline, X_train, y_train, param_grid)
|
||||
|
||||
# Evaluate best model
|
||||
print("\nEvaluating tuned model...")
|
||||
best_pipeline = grid_search.best_estimator_
|
||||
y_pred = best_pipeline.predict(X_test)
|
||||
print("\n" + "="*60)
|
||||
print("Classification Report")
|
||||
print("="*60)
|
||||
print(classification_report(y_test, y_pred))
|
||||
|
||||
# Save model
|
||||
print("\nSaving model...")
|
||||
joblib.dump(best_pipeline, 'best_model.pkl')
|
||||
print("Model saved as 'best_model.pkl'")
|
||||
print("\n" + "="*60)
|
||||
print("Confusion Matrix")
|
||||
print("="*60)
|
||||
print(confusion_matrix(y_test, y_pred))
|
||||
|
||||
# Feature importance (if available)
|
||||
if hasattr(tuned_model.named_steps['classifier'], 'feature_importances_'):
|
||||
print("\n" + "="*60)
|
||||
print("Top 10 Most Important Features")
|
||||
print("="*60)
|
||||
|
||||
feature_names = tuned_model.named_steps['preprocessor'].get_feature_names_out()
|
||||
importances = tuned_model.named_steps['classifier'].feature_importances_
|
||||
|
||||
feature_importance_df = pd.DataFrame({
|
||||
'feature': feature_names,
|
||||
'importance': importances
|
||||
}).sort_values('importance', ascending=False).head(10)
|
||||
|
||||
print(feature_importance_df.to_string(index=False))
|
||||
|
||||
return {
|
||||
'model': tuned_model,
|
||||
'y_test': y_test,
|
||||
'y_pred': y_pred,
|
||||
'y_pred_proba': y_pred_proba,
|
||||
'metrics': {
|
||||
'accuracy': accuracy,
|
||||
'precision': precision,
|
||||
'recall': recall,
|
||||
'f1': f1
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
# Example usage
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
# Load example dataset
|
||||
from sklearn.datasets import load_breast_cancer
|
||||
|
||||
# Load data
|
||||
data = load_breast_cancer()
|
||||
X = pd.DataFrame(data.data, columns=data.feature_names)
|
||||
y = data.target
|
||||
|
||||
# For demonstration, treat all features as numeric
|
||||
numeric_features = X.columns.tolist()
|
||||
categorical_features = []
|
||||
|
||||
print("="*60)
|
||||
print("Classification Pipeline Example")
|
||||
print("Dataset: Breast Cancer Wisconsin")
|
||||
print("="*60)
|
||||
|
||||
# Run complete pipeline
|
||||
results = train_and_evaluate_model(
|
||||
X, y, numeric_features, categorical_features,
|
||||
test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("Pipeline Complete!")
|
||||
print("="*60)
|
||||
|
||||
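The model-selection step in this script stores the mean cross-validation score per model and keeps the model with the highest mean. A minimal standalone sketch of that selection logic follows. The fold scores are made-up placeholder values, not results from this commit, and `summarize_cv` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def summarize_cv(fold_scores):
    """Return (mean, 2 * std) for one model's per-fold scores.

    pstdev is the population standard deviation, which matches
    NumPy's default std() (ddof=0) used in the script.
    """
    return mean(fold_scores), 2 * pstdev(fold_scores)


# Placeholder per-fold accuracies, for illustration only.
fold_scores_by_model = {
    "Logistic Regression": [0.90, 0.92, 0.91, 0.89, 0.93],
    "Random Forest": [0.94, 0.95, 0.93, 0.96, 0.92],
}

cv_results = {}
for name, scores in fold_scores_by_model.items():
    avg, spread = summarize_cv(scores)
    cv_results[name] = avg
    print(f"{name:20s}: {avg:.4f} (+/- {spread:.4f})")

# Same selection rule as the script: the highest mean CV score wins.
best_model_name = max(cv_results, key=cv_results.get)
print(f"Best model: {best_model_name}")
```

Because the script computes these scores on the training split only, the test split stays untouched until the final evaluation.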
@@ -1,291 +1,386 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Clustering analysis script with multiple algorithms and evaluation.
|
||||
Demonstrates k-means, DBSCAN, and hierarchical clustering with visualization.
|
||||
Clustering analysis example with multiple algorithms, evaluation, and visualization.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
|
||||
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
|
||||
from sklearn.decomposition import PCA
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
|
||||
from sklearn.mixture import GaussianMixture
|
||||
from sklearn.metrics import (
|
||||
silhouette_score, calinski_harabasz_score, davies_bouldin_score
|
||||
)
|
||||
import warnings
|
||||
warnings.filterwarnings('ignore')
|
||||
|
||||
|
||||
def scale_data(X):
|
||||
def preprocess_for_clustering(X, scale=True, pca_components=None):
|
||||
"""
|
||||
Scale features using StandardScaler.
|
||||
ALWAYS scale data before clustering!
|
||||
Preprocess data for clustering.
|
||||
|
||||
Args:
|
||||
X: Feature matrix
|
||||
Parameters:
|
||||
-----------
|
||||
X : array-like
|
||||
Feature matrix
|
||||
scale : bool
|
||||
Whether to standardize features
|
||||
pca_components : int or None
|
||||
Number of PCA components (None to skip PCA)
|
||||
|
||||
Returns:
|
||||
Scaled feature matrix and fitted scaler
|
||||
--------
|
||||
array
|
||||
Preprocessed data
|
||||
"""
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
return X_scaled, scaler
|
||||
X_processed = X.copy()
|
||||
|
||||
if scale:
|
||||
scaler = StandardScaler()
|
||||
X_processed = scaler.fit_transform(X_processed)
|
||||
|
||||
if pca_components is not None:
|
||||
pca = PCA(n_components=pca_components)
|
||||
X_processed = pca.fit_transform(X_processed)
|
||||
print(f"PCA: Cumulative explained variance ratio = {pca.explained_variance_ratio_.sum():.3f}")
|
||||
|
||||
return X_processed
|
||||
|
||||
|
||||
def find_optimal_k(X_scaled, k_range=range(2, 11)):
|
||||
def find_optimal_k_kmeans(X, k_range=range(2, 11)):
|
||||
"""
|
||||
Find optimal number of clusters using elbow method and silhouette score.
|
||||
Find optimal K for K-Means using elbow method and silhouette score.
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
k_range: Range of k values to try
|
||||
Parameters:
|
||||
-----------
|
||||
X : array-like
|
||||
Feature matrix (should be scaled)
|
||||
k_range : range
|
||||
Range of K values to test
|
||||
|
||||
Returns:
|
||||
Dictionary with inertias and silhouette scores
|
||||
--------
|
||||
dict
|
||||
Dictionary with inertia and silhouette scores for each K
|
||||
"""
|
||||
inertias = []
|
||||
silhouette_scores = []
|
||||
|
||||
for k in k_range:
|
||||
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
labels = kmeans.fit_predict(X)
|
||||
|
||||
inertias.append(kmeans.inertia_)
|
||||
silhouette_scores.append(silhouette_score(X_scaled, labels))
|
||||
silhouette_scores.append(silhouette_score(X, labels))
|
||||
|
||||
# Plot results
|
||||
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
|
||||
|
||||
# Elbow plot
|
||||
ax1.plot(k_range, inertias, 'bo-')
|
||||
ax1.set_xlabel('Number of clusters (K)')
|
||||
ax1.set_ylabel('Inertia')
|
||||
ax1.set_title('Elbow Method')
|
||||
ax1.grid(True)
|
||||
|
||||
# Silhouette plot
|
||||
ax2.plot(k_range, silhouette_scores, 'ro-')
|
||||
ax2.set_xlabel('Number of clusters (K)')
|
||||
ax2.set_ylabel('Silhouette Score')
|
||||
ax2.set_title('Silhouette Analysis')
|
||||
ax2.grid(True)
|
||||
|
||||
plt.tight_layout()
|
||||
plt.savefig('clustering_optimization.png', dpi=300, bbox_inches='tight')
|
||||
print("Saved: clustering_optimization.png")
|
||||
plt.close()
|
||||
|
||||
# Find best K based on silhouette score
|
||||
best_k = k_range[np.argmax(silhouette_scores)]
|
||||
print(f"\nRecommended K based on silhouette score: {best_k}")
|
||||
|
||||
return {
|
||||
'k_values': list(k_range),
|
||||
'inertias': inertias,
|
||||
'silhouette_scores': silhouette_scores
|
||||
'silhouette_scores': silhouette_scores,
|
||||
'best_k': best_k
|
||||
}
|
||||
|
||||
|
||||
def plot_elbow_silhouette(results):
|
||||
def compare_clustering_algorithms(X, n_clusters=3):
|
||||
"""
|
||||
Plot elbow method and silhouette scores.
|
||||
Compare different clustering algorithms.
|
||||
|
||||
Args:
|
||||
results: Dictionary from find_optimal_k
|
||||
"""
|
||||
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
|
||||
|
||||
# Elbow plot
|
||||
ax1.plot(results['k_values'], results['inertias'], 'bo-')
|
||||
ax1.set_xlabel('Number of clusters (k)')
|
||||
ax1.set_ylabel('Inertia')
|
||||
ax1.set_title('Elbow Method')
|
||||
ax1.grid(True, alpha=0.3)
|
||||
|
||||
# Silhouette plot
|
||||
ax2.plot(results['k_values'], results['silhouette_scores'], 'ro-')
|
||||
ax2.set_xlabel('Number of clusters (k)')
|
||||
ax2.set_ylabel('Silhouette Score')
|
||||
ax2.set_title('Silhouette Score vs k')
|
||||
ax2.grid(True, alpha=0.3)
|
||||
|
||||
plt.tight_layout()
|
||||
plt.savefig('elbow_silhouette.png', dpi=300, bbox_inches='tight')
|
||||
print("Saved elbow and silhouette plots to 'elbow_silhouette.png'")
|
||||
plt.close()
|
||||
|
||||
|
||||
def evaluate_clustering(X_scaled, labels, algorithm_name):
|
||||
"""
|
||||
Evaluate clustering using multiple metrics.
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
labels: Cluster labels
|
||||
algorithm_name: Name of clustering algorithm
|
||||
Parameters:
|
||||
-----------
|
||||
X : array-like
|
||||
Feature matrix (should be scaled)
|
||||
n_clusters : int
|
||||
Number of clusters
|
||||
|
||||
Returns:
|
||||
Dictionary with evaluation metrics
|
||||
--------
|
||||
dict
|
||||
Dictionary with results for each algorithm
|
||||
"""
|
||||
# Filter out noise points for DBSCAN (-1 labels)
|
||||
mask = labels != -1
|
||||
X_filtered = X_scaled[mask]
|
||||
labels_filtered = labels[mask]
|
||||
print("="*60)
|
||||
print(f"Comparing Clustering Algorithms (n_clusters={n_clusters})")
|
||||
print("="*60)
|
||||
|
||||
n_clusters = len(set(labels_filtered))
|
||||
n_noise = list(labels).count(-1)
|
||||
|
||||
results = {
|
||||
'algorithm': algorithm_name,
|
||||
'n_clusters': n_clusters,
|
||||
'n_noise': n_noise
|
||||
algorithms = {
|
||||
'K-Means': KMeans(n_clusters=n_clusters, random_state=42, n_init=10),
|
||||
'Agglomerative': AgglomerativeClustering(n_clusters=n_clusters, linkage='ward'),
|
||||
'Gaussian Mixture': GaussianMixture(n_components=n_clusters, random_state=42)
|
||||
}
|
||||
|
||||
# Calculate metrics if we have valid clusters
|
||||
if n_clusters > 1:
|
||||
results['silhouette'] = silhouette_score(X_filtered, labels_filtered)
|
||||
results['davies_bouldin'] = davies_bouldin_score(X_filtered, labels_filtered)
|
||||
results['calinski_harabasz'] = calinski_harabasz_score(X_filtered, labels_filtered)
|
||||
# DBSCAN infers the number of clusters from eps and min_samples,
|
||||
# so it is run separately from the algorithms above.
|
||||
dbscan = DBSCAN(eps=0.5, min_samples=5)
|
||||
dbscan_labels = dbscan.fit_predict(X)
|
||||
|
||||
results = {}
|
||||
|
||||
for name, algorithm in algorithms.items():
|
||||
labels = algorithm.fit_predict(X)
|
||||
|
||||
# Calculate metrics
|
||||
silhouette = silhouette_score(X, labels)
|
||||
calinski = calinski_harabasz_score(X, labels)
|
||||
davies = davies_bouldin_score(X, labels)
|
||||
|
||||
results[name] = {
|
||||
'labels': labels,
|
||||
'n_clusters': n_clusters,
|
||||
'silhouette': silhouette,
|
||||
'calinski_harabasz': calinski,
|
||||
'davies_bouldin': davies
|
||||
}
|
||||
|
||||
print(f"\n{name}:")
|
||||
print(f" Silhouette Score: {silhouette:.4f} (higher is better)")
|
||||
print(f" Calinski-Harabasz: {calinski:.4f} (higher is better)")
|
||||
print(f" Davies-Bouldin: {davies:.4f} (lower is better)")
|
||||
|
||||
# DBSCAN results
|
||||
n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
|
||||
n_noise = list(dbscan_labels).count(-1)
|
||||
|
||||
if n_clusters_dbscan > 1:
|
||||
# Only calculate metrics if we have multiple clusters
|
||||
mask = dbscan_labels != -1 # Exclude noise
|
||||
if mask.sum() > 0:
|
||||
silhouette = silhouette_score(X[mask], dbscan_labels[mask])
|
||||
calinski = calinski_harabasz_score(X[mask], dbscan_labels[mask])
|
||||
davies = davies_bouldin_score(X[mask], dbscan_labels[mask])
|
||||
|
||||
results['DBSCAN'] = {
|
||||
'labels': dbscan_labels,
|
||||
'n_clusters': n_clusters_dbscan,
|
||||
'n_noise': n_noise,
|
||||
'silhouette': silhouette,
|
||||
'calinski_harabasz': calinski,
|
||||
'davies_bouldin': davies
|
||||
}
|
||||
|
||||
print("\nDBSCAN:")
|
||||
print(f" Clusters found: {n_clusters_dbscan}")
|
||||
print(f" Noise points: {n_noise}")
|
||||
print(f" Silhouette Score: {silhouette:.4f} (higher is better)")
|
||||
print(f" Calinski-Harabasz: {calinski:.4f} (higher is better)")
|
||||
print(f" Davies-Bouldin: {davies:.4f} (lower is better)")
|
||||
else:
|
||||
results['silhouette'] = None
|
||||
results['davies_bouldin'] = None
|
||||
results['calinski_harabasz'] = None
|
||||
print("\nDBSCAN:")
|
||||
print(f" Clusters found: {n_clusters_dbscan}")
|
||||
print(f" Noise points: {n_noise}")
|
||||
print(" Note: Insufficient clusters for metric calculation")
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def perform_kmeans(X_scaled, n_clusters=3):
|
||||
def visualize_clusters(X, results, true_labels=None):
|
||||
"""
|
||||
Perform k-means clustering.
|
||||
Visualize clustering results using PCA for 2D projection.
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
n_clusters: Number of clusters
|
||||
|
||||
Returns:
|
||||
Fitted KMeans model and labels
|
||||
Parameters:
|
||||
-----------
|
||||
X : array-like
|
||||
Feature matrix
|
||||
results : dict
|
||||
Dictionary with clustering results
|
||||
true_labels : array-like or None
|
||||
True labels (if available) for comparison
|
||||
"""
|
||||
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
return kmeans, labels
|
||||
# Reduce to 2D using PCA
|
||||
pca = PCA(n_components=2)
|
||||
X_2d = pca.fit_transform(X)
|
||||
|
||||
# Determine number of subplots
|
||||
n_plots = len(results)
|
||||
if true_labels is not None:
|
||||
n_plots += 1
|
||||
|
||||
def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
|
||||
"""
|
||||
Perform DBSCAN clustering.
|
||||
n_cols = min(3, n_plots)
|
||||
n_rows = (n_plots + n_cols - 1) // n_cols
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
eps: Maximum distance between neighbors
|
||||
min_samples: Minimum points to form dense region
|
||||
fig, axes = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows))
|
||||
if n_plots == 1:
|
||||
axes = np.array([axes])
|
||||
axes = axes.flatten()
|
||||
|
||||
Returns:
|
||||
Fitted DBSCAN model and labels
|
||||
"""
|
||||
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
|
||||
labels = dbscan.fit_predict(X_scaled)
|
||||
return dbscan, labels
|
||||
plot_idx = 0
|
||||
|
||||
# Plot true labels if available
|
||||
if true_labels is not None:
|
||||
ax = axes[plot_idx]
|
||||
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=true_labels, cmap='viridis', alpha=0.6)
|
||||
ax.set_title('True Labels')
|
||||
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
|
||||
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
|
||||
plt.colorbar(scatter, ax=ax)
|
||||
plot_idx += 1
|
||||
|
||||
def perform_hierarchical(X_scaled, n_clusters=3, linkage='ward'):
|
||||
"""
|
||||
Perform hierarchical clustering.
|
||||
# Plot clustering results
|
||||
for name, result in results.items():
|
||||
ax = axes[plot_idx]
|
||||
labels = result['labels']
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
n_clusters: Number of clusters
|
||||
linkage: Linkage criterion ('ward', 'complete', 'average', 'single')
|
||||
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6)
|
||||
|
||||
Returns:
|
||||
Fitted AgglomerativeClustering model and labels
|
||||
"""
|
||||
hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
|
||||
labels = hierarchical.fit_predict(X_scaled)
|
||||
return hierarchical, labels
|
||||
# Highlight noise points for DBSCAN
|
||||
if name == 'DBSCAN' and -1 in labels:
|
||||
noise_mask = labels == -1
|
||||
ax.scatter(X_2d[noise_mask, 0], X_2d[noise_mask, 1],
|
||||
c='red', marker='x', s=100, label='Noise', alpha=0.8)
|
||||
ax.legend()
|
||||
|
||||
title = f"{name} (K={result['n_clusters']})"
|
||||
if 'silhouette' in result:
|
||||
title += f"\nSilhouette: {result['silhouette']:.3f}"
|
||||
ax.set_title(title)
|
||||
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
|
||||
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
|
||||
plt.colorbar(scatter, ax=ax)
|
||||
|
||||
def visualize_clusters_2d(X_scaled, labels, algorithm_name, method='pca'):
|
||||
"""
|
||||
Visualize clusters in 2D using PCA or t-SNE.
|
||||
plot_idx += 1
|
||||
|
||||
Args:
|
||||
X_scaled: Scaled feature matrix
|
||||
labels: Cluster labels
|
||||
algorithm_name: Name of algorithm for title
|
||||
method: 'pca' or 'tsne'
|
||||
"""
|
||||
# Reduce to 2D
|
||||
if method == 'pca':
|
||||
pca = PCA(n_components=2, random_state=42)
|
||||
X_2d = pca.fit_transform(X_scaled)
|
||||
variance = pca.explained_variance_ratio_
|
||||
xlabel = f'PC1 ({variance[0]:.1%} variance)'
|
||||
ylabel = f'PC2 ({variance[1]:.1%} variance)'
|
||||
else:
|
||||
from sklearn.manifold import TSNE
|
||||
# Use PCA first to speed up t-SNE
|
||||
pca = PCA(n_components=min(50, X_scaled.shape[1]), random_state=42)
|
||||
X_pca = pca.fit_transform(X_scaled)
|
||||
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
|
||||
X_2d = tsne.fit_transform(X_pca)
|
||||
xlabel = 't-SNE 1'
|
||||
ylabel = 't-SNE 2'
|
||||
# Hide unused subplots
|
||||
for idx in range(plot_idx, len(axes)):
|
||||
axes[idx].axis('off')
|
||||
|
||||
# Plot
|
||||
plt.figure(figsize=(10, 8))
|
||||
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6, s=50)
|
||||
plt.colorbar(scatter, label='Cluster')
|
||||
plt.xlabel(xlabel)
|
||||
plt.ylabel(ylabel)
|
||||
plt.title(f'{algorithm_name} Clustering ({method.upper()})')
|
||||
plt.grid(True, alpha=0.3)
|
||||
|
||||
filename = f'{algorithm_name.lower().replace(" ", "_")}_{method}.png'
|
||||
plt.savefig(filename, dpi=300, bbox_inches='tight')
|
||||
print(f"Saved visualization to '{filename}'")
|
||||
plt.tight_layout()
|
||||
plt.savefig('clustering_results.png', dpi=300, bbox_inches='tight')
|
||||
print("\nSaved: clustering_results.png")
|
||||
plt.close()
|
||||
|
||||
|
||||
def main():
|
||||
def complete_clustering_analysis(X, true_labels=None, scale=True,
|
||||
find_k=True, k_range=range(2, 11), n_clusters=3):
|
||||
"""
|
||||
Example clustering analysis workflow.
|
||||
Complete clustering analysis workflow.
|
||||
|
||||
Parameters:
|
||||
-----------
|
||||
X : array-like
|
||||
Feature matrix
|
||||
true_labels : array-like or None
|
||||
True labels (for comparison only, not used in clustering)
|
||||
scale : bool
|
||||
Whether to scale features
|
||||
find_k : bool
|
||||
Whether to search for optimal K
|
||||
k_range : range
|
||||
Range of K values to test
|
||||
n_clusters : int
|
||||
Number of clusters to use in comparison
|
||||
|
||||
Returns:
|
||||
--------
|
||||
dict
|
||||
Dictionary with all analysis results
|
||||
"""
|
||||
# Load your data here
|
||||
# X = load_data()
|
||||
print("="*60)
|
||||
print("Clustering Analysis")
|
||||
print("="*60)
|
||||
print(f"Data shape: {X.shape}")
|
||||
|
||||
# Example with synthetic data
|
||||
from sklearn.datasets import make_blobs
|
||||
X, y_true = make_blobs(
|
||||
n_samples=500,
|
||||
n_features=10,
|
||||
centers=4,
|
||||
cluster_std=1.0,
|
||||
random_state=42
|
||||
)
|
||||
# Preprocess data
|
||||
X_processed = preprocess_for_clustering(X, scale=scale)
|
||||
|
||||
print(f"Dataset shape: {X.shape}")
|
||||
# Find optimal K if requested
|
||||
optimization_results = None
|
||||
if find_k:
|
||||
print("\n" + "="*60)
|
||||
print("Finding Optimal Number of Clusters")
|
||||
print("="*60)
|
||||
optimization_results = find_optimal_k_kmeans(X_processed, k_range=k_range)
|
||||
|
||||
# Scale data (ALWAYS scale for clustering!)
|
||||
print("\nScaling data...")
|
||||
X_scaled, scaler = scale_data(X)
|
||||
# Use recommended K
|
||||
if optimization_results:
|
||||
n_clusters = optimization_results['best_k']
|
||||
|
||||
# Find optimal k
|
||||
print("\nFinding optimal number of clusters...")
|
||||
results = find_optimal_k(X_scaled)
|
||||
plot_elbow_silhouette(results)
|
||||
# Compare clustering algorithms
|
||||
comparison_results = compare_clustering_algorithms(X_processed, n_clusters=n_clusters)
|
||||
|
||||
# Based on elbow/silhouette, choose optimal k
|
||||
optimal_k = 4 # Adjust based on plots
|
||||
|
||||
# Perform k-means
|
||||
print(f"\nPerforming k-means with k={optimal_k}...")
|
||||
kmeans, kmeans_labels = perform_kmeans(X_scaled, n_clusters=optimal_k)
|
||||
kmeans_results = evaluate_clustering(X_scaled, kmeans_labels, 'K-Means')
|
||||
|
||||
# Perform DBSCAN
|
||||
print("\nPerforming DBSCAN...")
|
||||
dbscan, dbscan_labels = perform_dbscan(X_scaled, eps=0.5, min_samples=5)
|
||||
dbscan_results = evaluate_clustering(X_scaled, dbscan_labels, 'DBSCAN')
|
||||
|
||||
# Perform hierarchical clustering
|
||||
print("\nPerforming hierarchical clustering...")
|
||||
hierarchical, hier_labels = perform_hierarchical(X_scaled, n_clusters=optimal_k)
|
||||
hier_results = evaluate_clustering(X_scaled, hier_labels, 'Hierarchical')
|
||||
|
||||
# Print results
|
||||
# Visualize results
|
||||
print("\n" + "="*60)
|
||||
print("CLUSTERING RESULTS")
|
||||
print("Visualizing Results")
|
||||
print("="*60)
|
||||
visualize_clusters(X_processed, comparison_results, true_labels=true_labels)
|
||||
|
||||
return {
|
||||
'X_processed': X_processed,
|
||||
'optimization': optimization_results,
|
||||
'comparison': comparison_results
|
||||
}
|
||||
|
||||
|
||||
# Example usage
|
||||
if __name__ == "__main__":
|
||||
from sklearn.datasets import load_iris, make_blobs
|
||||
|
||||
print("="*60)
|
||||
print("Example 1: Iris Dataset")
|
||||
print("="*60)
|
||||
|
||||
for results in [kmeans_results, dbscan_results, hier_results]:
|
||||
print(f"\n{results['algorithm']}:")
|
||||
print(f" Clusters: {results['n_clusters']}")
|
||||
if results['n_noise'] > 0:
|
||||
print(f" Noise points: {results['n_noise']}")
|
||||
if results['silhouette']:
|
||||
print(f" Silhouette Score: {results['silhouette']:.3f}")
|
||||
print(f" Davies-Bouldin Index: {results['davies_bouldin']:.3f} (lower is better)")
|
||||
print(f" Calinski-Harabasz Index: {results['calinski_harabasz']:.1f} (higher is better)")
|
||||
# Load Iris dataset
|
||||
iris = load_iris()
|
||||
X_iris = iris.data
|
||||
y_iris = iris.target
|
||||
|
||||
# Visualize clusters
|
||||
print("\nCreating visualizations...")
|
||||
visualize_clusters_2d(X_scaled, kmeans_labels, 'K-Means', method='pca')
|
||||
visualize_clusters_2d(X_scaled, dbscan_labels, 'DBSCAN', method='pca')
|
||||
visualize_clusters_2d(X_scaled, hier_labels, 'Hierarchical', method='pca')
|
||||
results_iris = complete_clustering_analysis(
|
||||
X_iris,
|
||||
true_labels=y_iris,
|
||||
scale=True,
|
||||
find_k=True,
|
||||
k_range=range(2, 8),
|
||||
n_clusters=3
|
||||
)
|
||||
|
||||
print("\nClustering analysis complete!")
|
||||
print("\n" + "="*60)
|
||||
print("Example 2: Synthetic Dataset with Noise")
|
||||
print("="*60)
|
||||
|
||||
# Create synthetic dataset
|
||||
X_synth, y_synth = make_blobs(
|
||||
n_samples=500, n_features=2, centers=4,
|
||||
cluster_std=0.5, random_state=42
|
||||
)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
# Add noise points
|
||||
noise = np.random.default_rng(42).normal(size=(50, 2)) * 3  # seeded for reproducibility
|
||||
X_synth = np.vstack([X_synth, noise])
|
||||
y_synth_with_noise = np.concatenate([y_synth, np.full(50, -1)])
|
||||
|
||||
results_synth = complete_clustering_analysis(
|
||||
X_synth,
|
||||
true_labels=y_synth_with_noise,
|
||||
scale=True,
|
||||
find_k=True,
|
||||
k_range=range(2, 8),
|
||||
n_clusters=4
|
||||
)
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("Analysis Complete!")
|
||||
print("="*60)
|
||||
|
||||
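The DBSCAN branch of `compare_clustering_algorithms` counts clusters by setting aside the noise label `-1` and only computes metrics when more than one cluster remains. A small sketch of that bookkeeping, assuming a plain list of labels; `summarize_dbscan_labels` is a hypothetical helper, not part of the script.

```python
def summarize_dbscan_labels(labels, noise_label=-1):
    """Count clusters and noise points in DBSCAN-style labels."""
    labels = list(labels)
    n_noise = labels.count(noise_label)
    # The noise label is not a cluster, so subtract it from the distinct labels.
    n_clusters = len(set(labels)) - (1 if noise_label in labels else 0)
    # Internal metrics such as the silhouette score need at least two clusters.
    can_score = n_clusters > 1
    return {"n_clusters": n_clusters, "n_noise": n_noise, "can_score": can_score}


summary = summarize_dbscan_labels([0, 0, 1, -1, 1, 2, -1])
print(summary)
```

Returning a flag instead of raising keeps the caller free to report the cluster and noise counts even when no metric can be computed, which is what the script's fallback branch prints.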