Improve the scikit-learn skill

Timothy Kassis
2025-11-04 10:11:46 -08:00
parent 63a4293f1a
commit 4ad4f9970f
10 changed files with 3293 additions and 3606 deletions


# Data Preprocessing and Feature Engineering Reference
## Overview
Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering. Many algorithms require standardized or normalized data to perform well.
## Feature Scaling and Normalization
### StandardScaler
**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
- Standardizes features to zero mean and unit variance (z-score normalization)
- Formula: z = (x - mean) / std
- Use when: Features have different units or scales, or the algorithm assumes roughly Gaussian-distributed data
- Required for: SVM, KNN, neural networks, PCA, linear regression with regularization
- Important: Fit only on training data, then transform both train and test sets
- Example:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training
# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```
### MinMaxScaler
**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
- Scales features to a given range (default [0, 1])
- Formula: X_scaled = (X - X.min) / (X.max - X.min)
- Use when: Bounded values are needed (e.g. neural networks, image pixel values) or the data is not normally distributed
- Parameter: `feature_range`, a (min, max) tuple, default (0, 1)
- Warning: Sensitive to outliers, since it uses the min and max
- Example:
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)
# Custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
```
### RobustScaler
**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
- Scales using median and interquartile range (IQR)
- Formula: X_scaled = (X - median) / IQR
- Use when: Data contains outliers or StandardScaler produces skewed results
- Parameter: `quantile_range`, a (q_min, q_max) tuple, default (25.0, 75.0)
- Example:
```python
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```
### Normalizer
**Normalizer (`sklearn.preprocessing.Normalizer`)**
- Normalizes each sample (row) individually to unit norm; operates on rows, not feature columns
- Common norms: 'l1', 'l2' (most common), 'max'
- Use when: Each sample should be normalized independently, e.g. text features (TF-IDF vectors) used with dot-product or cosine similarity
- Example:
```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2') # Euclidean norm
X_normalized = normalizer.fit_transform(X)
```
### MaxAbsScaler
**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
- Scales by maximum absolute value
- Range: [-1, 1]
- Doesn't shift/center data (preserves sparsity)
- Use when: Data is already centered or sparse
- Example:
```python
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)
```
## Encoding Categorical Variables
### OneHotEncoder
**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
- Creates binary columns for each category
- Use when: Nominal categories (no order) and no relationship between categories should be assumed; the standard choice for linear models, SVM, KNN, and neural networks
- Key parameters: `drop` ('first', 'if_binary', or array-like; prevents multicollinearity), `sparse_output` (True by default, memory efficient), `handle_unknown` ('error', 'ignore', 'infrequent_if_exist'), `min_frequency` and `max_categories` (group or limit infrequent categories)
- Memory tip: Keep `sparse_output=True` (the default) for high-cardinality features
- Example:
```python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)
# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])
# Handle unknown categories during transform
X_test_encoded = encoder.transform(X_test_categorical)
# High cardinality: group categories appearing < 100 times into an 'infrequent' category
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
```
### OrdinalEncoder
**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
- Encodes categories as integers (0 to n_categories - 1)
- Use when: Categories have a natural order (small < medium < large), or with tree-based models that handle integer codes directly
- Parameters: `categories` (custom ordering), `handle_unknown` ('error' or 'use_encoded_value'), `unknown_value`, `encoded_missing_value`
- Example:
```python
from sklearn.preprocessing import OrdinalEncoder
# Default ordering (categories sorted automatically)
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
# Custom ordering
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_categorical)
```
### TargetEncoder
**TargetEncoder (`sklearn.preprocessing.TargetEncoder`)**
- Replaces each category with target statistics (the mean of the target for that category)
- Use when: High-cardinality categorical features (zip codes, user IDs) or a linear relationship with the target is expected; often outperforms one-hot encoding
- Uses cross-fitting during fit_transform() to prevent target leakage, and smoothing to handle rare categories
- Parameters: `smooth` (smoothing for rare categories), `cv` (cross-validation strategy)
- Warning: Only for supervised learning; requires the target variable
- Example:
```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```
### LabelEncoder
**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
- Encodes target labels (y) as integers 0 to n_classes - 1
- Use for: Target variable encoding only
- Important: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.
- Example:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```
### Target Encoding (using category_encoders)
```python
# Install: uv pip install category-encoders
from category_encoders import TargetEncoder
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
X_test_encoded = encoder.transform(X_test_categorical)
```
## Non-linear Transformations
### Power Transforms
**PowerTransformer (`sklearn.preprocessing.PowerTransformer`)**
- Applies a parametric, monotonic transformation to make data more Gaussian-like
- Methods: 'yeo-johnson' (works with negative values, default), 'box-cox' (positive values only)
- Use when: Data is skewed, the algorithm assumes normality, or variance stabilization is needed
- Less radical than QuantileTransformer; preserves more of the original relationships
- Example:
```python
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson (handles negative values)
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)
# Box-Cox (positive values only)
pt = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt.fit_transform(X)
```
### Quantile Transformation
**QuantileTransformer (`sklearn.preprocessing.QuantileTransformer`)**
- Maps features to a uniform or normal distribution using a rank transformation
- Strong transformation: robust to outliers and greatly reduces their influence
- Parameters: `output_distribution` ('uniform', default, or 'normal'), `n_quantiles` (default min(1000, n_samples))
- Use when: Distributions are unusual (bimodal, heavy tails) or outlier impact should be reduced
- Example:
```python
from sklearn.preprocessing import QuantileTransformer
# Transform to uniform distribution
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
X_transformed = qt.fit_transform(X)
# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_transformed = qt.fit_transform(X)
```
### Log Transform
- Simple transformation for right-skewed, non-negative data such as counts
- Example:
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)
# Or use FunctionTransformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```
## Missing Value Imputation
### SimpleImputer
**SimpleImputer (`sklearn.impute.SimpleImputer`)**
- Basic imputation strategies: 'mean', 'median' (numeric only), 'most_frequent' (numeric or categorical), 'constant'
- Parameters: `strategy`, `fill_value` (used when strategy='constant'), `missing_values` (what counts as missing: np.nan, None, or a specific value)
- Example:
```python
from sklearn.impute import SimpleImputer
# For numerical features (mean or median)
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)
# Fill with constant
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
```
### Iterative Imputer
**IterativeImputer (`sklearn.impute.IterativeImputer`)**
- Models each feature with missing values as a function of the other features
- More sophisticated than SimpleImputer; higher-quality imputation, but slower
- Parameters: `estimator` (regressor used per feature, default BayesianRidge), `max_iter`
- Use when: Relationships between features are complex or several features have missing values
- Example:
```python
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```
### KNN Imputer
**KNNImputer (`sklearn.impute.KNNImputer`)**
- Imputes missing values using the k-nearest neighbors of each sample
- Parameters: `n_neighbors`, `weights` ('uniform' or 'distance')
- Use when: Features are correlated and similar samples should inform the imputation
- Example:
```python
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
## Feature Engineering
### Polynomial Features
**PolynomialFeatures (`sklearn.preprocessing.PolynomialFeatures`)**
- Creates polynomial and interaction features
- Parameters: `degree`, `interaction_only` (only interaction terms, no powers), `include_bias` (include a constant feature)
- Use when: Adding non-linearity to linear models, polynomial regression, feature engineering
- Warning: The number of features grows rapidly: (n+d)!/(d!·n!) for n features and degree d
- Example:
```python
from sklearn.preprocessing import PolynomialFeatures
# Degree 2: includes x1, x2, x1^2, x2^2, x1*x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get feature names
feature_names = poly.get_feature_names_out(['x1', 'x2'])
# Only interactions (no powers)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
```
### Binning/Discretization
**KBinsDiscretizer**
- Bins continuous features into discrete intervals
- Strategies: 'uniform' (equal-width bins), 'quantile' (equal-frequency bins), 'kmeans' (bin edges from k-means clustering)
- Encoding: 'onehot', 'onehot-dense', 'ordinal'
- Use when: Letting linear models capture non-linear relationships, reducing noise, or improving interpretability
- Example:
```python
from sklearn.preprocessing import KBinsDiscretizer
# Equal-width bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# Equal-frequency bins (quantile-based)
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = binner.fit_transform(X)
```
### Binarization
**Binarizer**
- Converts features to binary (0 or 1) based on a threshold
- Use when: Creating binary indicator features from continuous values
- Example:
```python
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)
```
### Spline Features
**SplineTransformer**
- Creates B-spline basis functions for each feature
- Smooth alternative to PolynomialFeatures with less oscillation at the boundaries; useful for capturing non-linear relationships and in generalized additive models (GAMs)
- Parameters: `n_knots`, `degree`, `knots` ('uniform', 'quantile', or an array of knot positions)
- Example:
```python
from sklearn.preprocessing import SplineTransformer
spline = SplineTransformer(n_knots=5, degree=3)
X_splines = spline.fit_transform(X)
```
## Text Feature Extraction
### CountVectorizer
**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
- Converts text to token count matrix
- Use for: Bag-of-words representation
- Example:
```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    max_features=5000,   # Keep top 5000 features
    min_df=2,            # Ignore terms appearing in < 2 documents
    max_df=0.8,          # Ignore terms appearing in > 80% of documents
    ngram_range=(1, 2)   # Unigrams and bigrams
)
X_counts = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```
### TfidfVectorizer
**TfidfVectorizer**
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
- Often preferable to raw counts (CountVectorizer), since it down-weights very common terms
- Example:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2),
    stop_words='english'  # Remove English stop words
)
X_tfidf = vectorizer.fit_transform(documents)
```
### HashingVectorizer
**HashingVectorizer**
- Uses hashing trick for memory efficiency
- No fit needed, can't reverse transform
- Use when: Very large vocabulary, streaming data
- Example:
```python
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=2**18)
X_hashed = vectorizer.transform(documents) # No fit needed
```
## Feature Selection
### Filter Methods
**Variance Threshold**
- Removes low-variance features
- Example:
```python
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```
**SelectKBest / SelectPercentile**
- Select features based on statistical tests
- Tests: f_classif, chi2, mutual_info_classif
- Example:
```python
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
```
### Wrapper Methods
**Recursive Feature Elimination (RFE)**
- Recursively removes features
- Uses model feature importances
- Example:
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
X_selected = rfe.fit_transform(X_train, y_train)
# Get selected features
selected_features = rfe.support_
feature_ranking = rfe.ranking_
```
**RFECV (with Cross-Validation)**
- RFE with cross-validation to find optimal number of features
- Example:
```python
from sklearn.feature_selection import RFECV
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
X_selected = rfecv.fit_transform(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
```
### Embedded Methods
**SelectFromModel**
- Select features based on model coefficients/importances
- Works with: Linear models (L1), Tree-based models
- Example:
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
# Get selected features
selected_features = selector.get_support()
```
**L1-based Feature Selection**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```
## Handling Outliers
### IQR Method
```python
import numpy as np
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```
### Winsorization
```python
from scipy.stats import mstats
# Clip outliers at 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```
## Custom Transformers
### Using FunctionTransformer
Wraps a custom function (log, square root, other domain-specific transforms) so it can be used as a transformer inside pipelines:
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1, validate=True)
X_transformed = transformer.fit_transform(X)
```
### Creating Custom Transformer
```python
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```
## Best Practices
### Fit on Training Data Only
Always fit transformers on the training data only, then apply them to the test data:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```
### Use Pipelines
Combine preprocessing with models so the fit/transform split is handled automatically and data leakage is prevented:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)     # Scaler fit only on train data
y_pred = pipeline.predict(X_test)  # Scaler only transforms the test data
```
### Common Transformations by Data Type
Typical choices by data type; a combined sketch follows the lists below.
**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)
**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation
**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder
**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with min_frequency parameter
**Categorical - Ordinal**:
- OrdinalEncoder
**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
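A minimal sketch wiring several of these recommendations into one ColumnTransformer. The column names (`income`, `zip_code`, `size`) and the specific transformer choices are illustrative assumptions, not requirements:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder, OrdinalEncoder

# Hypothetical column names, used for illustration only (X assumed to be a DataFrame)
skewed_numeric = ['income']
high_cardinality = ['zip_code']
ordinal = ['size']

preprocessor = ColumnTransformer(transformers=[
    # Skewed continuous feature -> PowerTransformer
    ('num', PowerTransformer(method='yeo-johnson'), skewed_numeric),
    # High-cardinality nominal feature -> OneHotEncoder with infrequent-category grouping
    ('high_card', OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist'), high_cardinality),
    # Ordinal feature with an explicit category order
    ('ord', OrdinalEncoder(categories=[['small', 'medium', 'large']]), ordinal),
])
X_transformed = preprocessor.fit_transform(X)
```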
### Data Leakage Prevention
1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation (see the sketch below)
4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
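A minimal sketch of points 2 and 3, assuming a feature matrix `X` and target `y`; the scaler and model choices are illustrative:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fit inside each training fold, so no information
# from the validation fold leaks into preprocessing.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```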
### Handle Categorical and Numerical Separately
Use ColumnTransformer to apply different preprocessing to different column groups:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
X_transformed = preprocessor.fit_transform(X)
```
### Algorithm-Specific Requirements
**Require Scaling:**
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering
**Don't Require Scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes
**Encoding Requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly
## Preprocessing Checklist
Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for the algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline (see the end-to-end sketch below)
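A minimal end-to-end sketch of the checklist, assuming a DataFrame `X` with hypothetical numeric columns `age`, `income` and categorical columns `gender`, `occupation`, plus a target `y`; the estimator and parameters are illustrative:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'income']            # hypothetical column names
categorical_features = ['gender', 'occupation']

# Checklist steps 1-3: impute, then scale or encode, per column type
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Checklist step 7: wrap preprocessing and model together to avoid leakage
model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```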