Improve the scikit-learn skill

Timothy Kassis
2025-11-04 10:11:46 -08:00
parent 63a4293f1a
commit 4ad4f9970f
10 changed files with 3293 additions and 3606 deletions


# Data Preprocessing and Feature Engineering Reference
## Overview
Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering. Many algorithms require standardized or normalized data to perform well.
## Feature Scaling and Normalization
### StandardScaler
**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
- Standardizes features to zero mean and unit variance (z-score normalization)
- Formula: z = (x - mean) / std
- Use when: Features have different units or scales, or the algorithm assumes roughly Gaussian-distributed data
- Required for: SVM, KNN, neural networks, PCA, linear regression with regularization
- Important: Fit only on training data, then transform both train and test sets
- Example:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training
# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```
### MinMaxScaler
**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
- Scales features to a given range (default [0, 1])
- Formula: X_scaled = (X - X.min) / (X.max - X.min)
- Use when: Bounded values are needed (e.g. neural networks, image pixel values) or the data is not normally distributed
- Parameter: `feature_range`, a (min, max) tuple, default (0, 1)
- Warning: Sensitive to outliers, since it uses the min and max
- Example:
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)
# Custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
```
### RobustScaler
**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
- Scales using median and interquartile range (IQR)
- Formula: X_scaled = (X - median) / IQR
- Use when: Data contains outliers or StandardScaler produces skewed results
- Parameter: `quantile_range`, a (q_min, q_max) tuple, default (25.0, 75.0)
- Example:
```python
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```
### Normalizer
**Normalizer (`sklearn.preprocessing.Normalizer`)**
- Normalizes each sample (row) individually to unit norm; operates on rows, not feature columns
- Common norms: 'l1', 'l2' (most common), 'max'
- Use when: Each sample should be normalized independently, e.g. text features (TF-IDF vectors) used with dot-product or cosine similarity
- Example:
```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2') # Euclidean norm
X_normalized = normalizer.fit_transform(X)
```
### MaxAbsScaler
**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
- Scales by maximum absolute value
- Range: [-1, 1]
- Doesn't shift/center data (preserves sparsity)
- Use when: Data is already centered or sparse
- Example:
```python
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)
```
## Encoding Categorical Variables
### OneHotEncoder
**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
- Creates binary columns for each category
- Use when: Nominal categories (no order) and no relationship between categories should be assumed; the standard choice for linear models, SVM, KNN, and neural networks
- Key parameters: `drop` ('first', 'if_binary', or array-like; prevents multicollinearity), `sparse_output` (True by default, memory efficient), `handle_unknown` ('error', 'ignore', 'infrequent_if_exist'), `min_frequency` and `max_categories` (group or limit infrequent categories)
- Memory tip: Keep `sparse_output=True` (the default) for high-cardinality features
- Example:
```python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)
# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])
# Handle unknown categories during transform
X_test_encoded = encoder.transform(X_test_categorical)
# High cardinality: group categories appearing < 100 times into an 'infrequent' category
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
```
### OrdinalEncoder
**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
- Encodes categories as integers (0 to n_categories - 1)
- Use when: Categories have a natural order (small < medium < large), or with tree-based models that handle integer codes directly
- Parameters: `categories` (custom ordering), `handle_unknown` ('error' or 'use_encoded_value'), `unknown_value`, `encoded_missing_value`
- Example:
```python
from sklearn.preprocessing import OrdinalEncoder
# Default ordering (categories sorted automatically)
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
# Custom ordering
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_categorical)
```
### TargetEncoder
**TargetEncoder (`sklearn.preprocessing.TargetEncoder`)**
- Replaces each category with target statistics (the mean of the target for that category)
- Use when: High-cardinality categorical features (zip codes, user IDs) or a linear relationship with the target is expected; often outperforms one-hot encoding
- Uses cross-fitting during fit_transform() to prevent target leakage, and smoothing to handle rare categories
- Parameters: `smooth` (smoothing for rare categories), `cv` (cross-validation strategy)
- Warning: Only for supervised learning; requires the target variable
- Example:
```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```
### LabelEncoder
**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
- Encodes target labels (y) as integers 0 to n_classes - 1
- Use for: Target variable encoding only
- Important: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.
- Example:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```
### Target Encoding (using category_encoders)
```python
# Install: uv pip install category-encoders
from category_encoders import TargetEncoder
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
X_test_encoded = encoder.transform(X_test_categorical)
```
## Non-linear Transformations
### Power Transforms
**PowerTransformer (`sklearn.preprocessing.PowerTransformer`)**
- Applies a parametric, monotonic transformation to make data more Gaussian-like
- Methods: 'yeo-johnson' (works with negative values, default), 'box-cox' (positive values only)
- Use when: Data is skewed, the algorithm assumes normality, or variance stabilization is needed
- Less radical than QuantileTransformer; preserves more of the original relationships
- Example:
```python
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson (handles negative values)
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)
# Box-Cox (positive values only)
pt = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt.fit_transform(X)
```
### Quantile Transformation
**QuantileTransformer (`sklearn.preprocessing.QuantileTransformer`)**
- Maps features to a uniform or normal distribution using a rank transformation
- Strong transformation: robust to outliers and greatly reduces their influence
- Parameters: `output_distribution` ('uniform', default, or 'normal'), `n_quantiles` (default min(1000, n_samples))
- Use when: Distributions are unusual (bimodal, heavy tails) or outlier impact should be reduced
- Example:
```python
from sklearn.preprocessing import QuantileTransformer
# Transform to uniform distribution
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
X_transformed = qt.fit_transform(X)
# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_transformed = qt.fit_transform(X)
```
### Log Transform
- Simple transformation for right-skewed, non-negative data such as counts
- Example:
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)
# Or use FunctionTransformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```
## Missing Value Imputation
### SimpleImputer
**SimpleImputer (`sklearn.impute.SimpleImputer`)**
- Basic imputation strategies: 'mean', 'median' (numeric only), 'most_frequent' (numeric or categorical), 'constant'
- Parameters: `strategy`, `fill_value` (used when strategy='constant'), `missing_values` (what counts as missing: np.nan, None, or a specific value)
- Example:
```python
from sklearn.impute import SimpleImputer
# For numerical features (mean or median)
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)
# Fill with constant
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
```
### Iterative Imputer
**IterativeImputer (`sklearn.impute.IterativeImputer`)**
- Models each feature with missing values as a function of the other features
- More sophisticated than SimpleImputer; higher-quality imputation, but slower
- Parameters: `estimator` (regressor used per feature, default BayesianRidge), `max_iter`
- Use when: Relationships between features are complex or several features have missing values
- Example:
```python
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```
### KNN Imputer
**KNNImputer (`sklearn.impute.KNNImputer`)**
- Imputes missing values using the k-nearest neighbors of each sample
- Parameters: `n_neighbors`, `weights` ('uniform' or 'distance')
- Use when: Features are correlated and similar samples should inform the imputation
- Example:
```python
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
## Feature Engineering
### Polynomial Features
**PolynomialFeatures (`sklearn.preprocessing.PolynomialFeatures`)**
- Creates polynomial and interaction features
- Parameters: `degree`, `interaction_only` (only interaction terms, no powers), `include_bias` (include a constant feature)
- Use when: Adding non-linearity to linear models, polynomial regression, feature engineering
- Warning: The number of features grows rapidly: (n+d)!/(d!·n!) for n features and degree d
- Example:
```python
from sklearn.preprocessing import PolynomialFeatures
# Degree 2: includes x1, x2, x1^2, x2^2, x1*x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get feature names
feature_names = poly.get_feature_names_out(['x1', 'x2'])
# Only interactions (no powers)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
```
### Binning/Discretization
**KBinsDiscretizer**
- Bins continuous features into discrete intervals
- Strategies: 'uniform' (equal-width bins), 'quantile' (equal-frequency bins), 'kmeans' (bin edges from k-means clustering)
- Encoding: 'onehot', 'onehot-dense', 'ordinal'
- Use when: Letting linear models capture non-linear relationships, reducing noise, or improving interpretability
- Example:
```python
from sklearn.preprocessing import KBinsDiscretizer
# Equal-width bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# Equal-frequency bins (quantile-based)
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = binner.fit_transform(X)
```
### Binarization
**Binarizer**
- Converts features to binary (0 or 1) based on a threshold
- Use when: Creating binary indicator features from continuous values
- Example:
```python
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)
```
### Spline Features
**SplineTransformer**
- Creates B-spline basis functions for each feature
- Smooth alternative to PolynomialFeatures with less oscillation at the boundaries; useful for capturing non-linear relationships and in generalized additive models (GAMs)
- Parameters: `n_knots`, `degree`, `knots` ('uniform', 'quantile', or an array of knot positions)
- Example:
```python
from sklearn.preprocessing import SplineTransformer
spline = SplineTransformer(n_knots=5, degree=3)
X_splines = spline.fit_transform(X)
```
## Text Feature Extraction
### CountVectorizer
**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
- Converts text to token count matrix
- Use for: Bag-of-words representation
- Example:
```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    max_features=5000,   # Keep top 5000 features
    min_df=2,            # Ignore terms appearing in < 2 documents
    max_df=0.8,          # Ignore terms appearing in > 80% of documents
    ngram_range=(1, 2)   # Unigrams and bigrams
)
X_counts = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```
### TfidfVectorizer
**TfidfVectorizer**
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
- Often preferable to raw counts (CountVectorizer), since it down-weights very common terms
- Example:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2),
    stop_words='english'  # Remove English stop words
)
X_tfidf = vectorizer.fit_transform(documents)
```
### HashingVectorizer
**HashingVectorizer**
- Uses hashing trick for memory efficiency
- No fit needed, can't reverse transform
- Use when: Very large vocabulary, streaming data
- Example:
```python
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=2**18)
X_hashed = vectorizer.transform(documents) # No fit needed
```
## Feature Selection
### Filter Methods
**Variance Threshold**
- Removes low-variance features
- Example:
```python
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```
**SelectKBest / SelectPercentile**
- Select features based on statistical tests
- Tests: f_classif, chi2, mutual_info_classif
- Example:
```python
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
```
### Wrapper Methods
**Recursive Feature Elimination (RFE)**
- Recursively removes features
- Uses model feature importances
- Example:
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
X_selected = rfe.fit_transform(X_train, y_train)
# Get selected features
selected_features = rfe.support_
feature_ranking = rfe.ranking_
```
**RFECV (with Cross-Validation)**
- RFE with cross-validation to find optimal number of features
- Example:
```python
from sklearn.feature_selection import RFECV
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
X_selected = rfecv.fit_transform(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
```
### Embedded Methods
**SelectFromModel**
- Select features based on model coefficients/importances
- Works with: Linear models (L1), Tree-based models
- Example:
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
# Get selected features
selected_features = selector.get_support()
```
**L1-based Feature Selection**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```
## Handling Outliers
### IQR Method
```python
import numpy as np
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```
### Winsorization
```python
from scipy.stats import mstats
# Clip outliers at 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```
## Custom Transformers
### Using FunctionTransformer
Wraps a custom function (log, square root, other domain-specific transforms) so it can be used as a transformer inside pipelines:
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1, validate=True)
X_transformed = transformer.fit_transform(X)
```
### Creating Custom Transformer
```python
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```
## Best Practices
### Fit on Training Data Only
Always fit transformers on the training data only, then apply them to the test data:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```
### Use Pipelines
Combine preprocessing with models so the fit/transform split is handled automatically and data leakage is prevented:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)     # Scaler fit only on train data
y_pred = pipeline.predict(X_test)  # Scaler only transforms the test data
```
### Common Transformations by Data Type
Typical choices by data type; a combined sketch follows the lists below.
**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)
**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation
**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder
**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with min_frequency parameter
**Categorical - Ordinal**:
- OrdinalEncoder
**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
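A minimal sketch wiring several of these recommendations into one ColumnTransformer. The column names (`income`, `zip_code`, `size`) and the specific transformer choices are illustrative assumptions, not requirements:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder, OrdinalEncoder

# Hypothetical column names, used for illustration only (X assumed to be a DataFrame)
skewed_numeric = ['income']
high_cardinality = ['zip_code']
ordinal = ['size']

preprocessor = ColumnTransformer(transformers=[
    # Skewed continuous feature -> PowerTransformer
    ('num', PowerTransformer(method='yeo-johnson'), skewed_numeric),
    # High-cardinality nominal feature -> OneHotEncoder with infrequent-category grouping
    ('high_card', OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist'), high_cardinality),
    # Ordinal feature with an explicit category order
    ('ord', OrdinalEncoder(categories=[['small', 'medium', 'large']]), ordinal),
])
X_transformed = preprocessor.fit_transform(X)
```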
### Data Leakage Prevention
1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation (see the sketch below)
4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
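A minimal sketch of points 2 and 3, assuming a feature matrix `X` and target `y`; the scaler and model choices are illustrative:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fit inside each training fold, so no information
# from the validation fold leaks into preprocessing.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```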
### Handle Categorical and Numerical Separately
Use ColumnTransformer to apply different preprocessing to different column groups:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
X_transformed = preprocessor.fit_transform(X)
```
### Algorithm-Specific Requirements
**Require Scaling:**
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering
**Don't Require Scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes
**Encoding Requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly
## Preprocessing Checklist
Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for the algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline (see the end-to-end sketch below)
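A minimal end-to-end sketch of the checklist, assuming a DataFrame `X` with hypothetical numeric columns `age`, `income` and categorical columns `gender`, `occupation`, plus a target `y`; the estimator and parameters are illustrative:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'income']            # hypothetical column names
categorical_features = ['gender', 'occupation']

# Checklist steps 1-3: impute, then scale or encode, per column type
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Checklist step 7: wrap preprocessing and model together to avoid leakage
model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```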