# Pipelines and Composite Estimators in scikit-learn
## Overview
Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."
## Pipeline Basics
### Creating Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Method 2: Using make_pipeline (auto-generates names)
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)
```
### Using Pipelines
```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)
# Access steps
pipeline.named_steps['scaler']
pipeline.steps[0] # Returns ('scaler', StandardScaler(...))
pipeline[0] # Returns StandardScaler(...) object
pipeline['scaler'] # Returns StandardScaler(...) object
# Get final estimator
pipeline[-1] # Returns LogisticRegression(...) object
```
### Pipeline Rules
**All steps except the last must be transformers** (have `fit()` and `transform()` methods).
**The final step** can be:
- Predictor (classifier/regressor) with `fit()` and `predict()`
- Transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`
### Pipeline Benefits
1. **Convenience**: Single `fit()` and `predict()` call
2. **Prevents data leakage**: Ensures proper fit/transform on train/test
3. **Joint parameter selection**: Tune all steps together with GridSearchCV
4. **Reproducibility**: Encapsulates entire workflow
## Accessing and Setting Parameters
### Nested Parameters
Access step parameters using `stepname__parameter` syntax:
```python
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both 'l1' and 'l2' penalties
    # (the default 'lbfgs' solver would fail on 'l1')
    ('clf', LogisticRegression(solver='liblinear'))
])

# Grid search over pipeline parameters
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l1', 'l2']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
### Setting Parameters
```python
# Set parameters
pipeline.set_params(clf__C=10.0, scaler__with_std=False)
# Get parameters
params = pipeline.get_params()
```
## Caching Intermediate Results
Cache fitted transformers to avoid recomputation:
```python
from tempfile import mkdtemp
from shutil import rmtree
# Create cache directory
cachedir = mkdtemp()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', LogisticRegression())
], memory=cachedir)
# When doing grid search, scaler and PCA only fit once per fold
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Clean up cache
rmtree(cachedir)
# Or use joblib for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```
**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search over final estimator parameters only
- Multiple experiments with same preprocessing
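As a small end-to-end illustration (toy data from `make_classification`; the step names are illustrative), refitting after changing only the final estimator's parameter lets the pipeline reuse the cached transformer outputs instead of recomputing them:

```python
from tempfile import mkdtemp
from shutil import rmtree

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

cachedir = mkdtemp()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('clf', LogisticRegression())
], memory=cachedir)

pipe.fit(X, y)
# Only clf__C changed, so the cached scaler/PCA fits are reused.
pipe.set_params(clf__C=10.0).fit(X, y)

rmtree(cachedir)  # clean up the cache directory
```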
## ColumnTransformer
Apply different transformations to different columns (essential for heterogeneous data).
### Basic Usage
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define which transformations for which columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'  # What to do with remaining columns
)
X_transformed = preprocessor.fit_transform(X)
```
### Column Selection Methods
```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])
# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])
# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])
# Method 4: Slice
('num', StandardScaler(), slice(0, 3))
# Method 5: make_column_selector (by dtype or pattern)
from sklearn.compose import make_column_selector as selector
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), selector(dtype_include='number')),
    ('cat', OneHotEncoder(), selector(dtype_include='object'))
])
# Select by pattern
selector(pattern='.*_score$') # All columns ending with '_score'
```
### Remainder Parameter
Controls what happens to columns not specified:
```python
# Drop remaining columns (default)
remainder='drop'
# Pass through remaining columns unchanged
remainder='passthrough'
# Apply transformer to remaining columns
remainder=StandardScaler()
```
### Full Pipeline with ColumnTransformer
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
# Separate preprocessing for numeric and categorical
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Grid search over preprocessing and model parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__onehot__max_categories': [10, 20, None],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
## FeatureUnion
Combine multiple transformer outputs by concatenating features side-by-side.
```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# Combine PCA and feature selection
combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])
X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)

# In a pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression())
])
```
### FeatureUnion with Transformers on Different Data
```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def get_numeric_data(X):
    return X[:, :3]  # First 3 columns

def get_text_data(X):
    return X[:, 3]  # 4th column (text)

combined = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(get_numeric_data)),
        ('scaler', StandardScaler())
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(get_text_data)),
        ('tfidf', TfidfVectorizer())
    ]))
])
```
**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.
## Common Pipeline Patterns
### Classification Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', SVC(kernel='rbf'))
])
```
### Regression Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])
```
### Text Classification Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', MultinomialNB())
])
# Works directly with text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```
### Image Processing Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```
### Dimensionality Reduction + Clustering
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('kmeans', KMeans(n_clusters=5))
])
labels = pipeline.fit_predict(X)
```
## Custom Transformers
### Using FunctionTransformer
```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Log transformation (log1p handles zeros safely)
log_transformer = FunctionTransformer(np.log1p)

# Custom function
def custom_transform(X):
    # Your transformation logic here
    return X  # replace with the transformed array

custom_transformer = FunctionTransformer(custom_transform)

# In pipeline
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
```
### Creating Custom Transformer Class
```python
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X
        self.learned_param_ = X.mean()  # Example
        return self

    def transform(self, X):
        # Transform X using learned parameters
        return X * self.parameter - self.learned_param_

    # Optional: for pipelines that need inverse_transform
    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

# Use in pipeline
pipeline = Pipeline([
    ('custom', CustomTransformer(parameter=2.0)),
    ('model', LinearRegression())
])
```
**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use trailing underscore for learned attributes (`learned_param_`)
- Constructor parameters should be stored as attributes
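Because these conventions are what `get_params`/`clone` rely on, a conforming transformer drops straight into grid search. A minimal sketch (`ScaleShift` is an illustrative name, not a scikit-learn class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ScaleShift(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.0):
        self.factor = factor  # constructor arg stored as-is

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)  # learned state: trailing underscore
        return self

    def transform(self, X):
        return (X - self.mean_) * self.factor

t = ScaleShift(factor=2.0).fit(np.array([[1.0], [3.0]]))
print(t.get_params())           # {'factor': 2.0} -- usable in param grids
fresh = clone(t)                # unfitted copy; keeps factor, drops mean_
print(hasattr(fresh, 'mean_'))  # False
```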
### Transformer for Pandas DataFrames
```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            if self.columns:
                return X[self.columns].values
            return X.values
        return X
```
## Visualization
### Display Pipeline in Jupyter
```python
from sklearn import set_config
# Enable HTML display
set_config(display='diagram')
# Now displaying the pipeline shows interactive diagram
pipeline
```
### Print Pipeline Structure
```python
from sklearn.utils import estimator_html_repr
# Get HTML representation
html = estimator_html_repr(pipeline)
# Or just print
print(pipeline)
```
## Advanced Patterns
### Conditional Transformations
A `FunctionTransformer` can branch on a keyword argument, but beware: anything fitted inside the function is refit on every call, including at predict time on test data, which leaks information. For a simple on/off switch, prefer setting the step to the string `'passthrough'`.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def conditional_scale(X, scale=True):
    if scale:
        # Caution: this refits the scaler on whatever data it receives
        return StandardScaler().fit_transform(X)
    return X

pipeline = Pipeline([
    ('conditional_scaler', FunctionTransformer(
        conditional_scale,
        kw_args={'scale': True}
    )),
    ('model', LogisticRegression())
])

# Leak-free alternative: disable a step entirely
pipeline.set_params(conditional_scaler='passthrough')
```
### Multiple Preprocessing Paths
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
    # Numeric: impute + scale
    ('num_standard', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'income']),
    # Numeric: impute + log + scale
    ('num_skewed', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log', FunctionTransformer(np.log1p)),
        ('scaler', StandardScaler())
    ]), ['price', 'revenue']),
    # Categorical: impute + one-hot
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), ['category', 'region']),
    # Text: TF-IDF (note: a single column name, not a list)
    ('text', TfidfVectorizer(), 'description')
])
```
### Feature Engineering Pipeline
```python
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Add engineered features
        X['age_income_ratio'] = X['age'] / (X['income'] + 1)
        X['total_score'] = X['score1'] + X['score2'] + X['score3']
        return X

pipeline = Pipeline([
    ('engineer', FeatureEngineer()),
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
```
## Best Practices
### Always Use Pipelines When
1. **Preprocessing is needed**: Scaling, encoding, imputation
2. **Cross-validation**: Ensures proper fit/transform split
3. **Hyperparameter tuning**: Joint optimization of preprocessing and model
4. **Production deployment**: Single object to serialize
5. **Multiple steps**: Any workflow with >1 step
### Pipeline Do's
- ✅ Fit pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug issues
- ✅ Use remainder='passthrough' when appropriate
### Pipeline Don'ts
- ❌ Fit preprocessing on full dataset before split (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip using pipelines for "just one preprocessing step"
### Data Leakage Prevention
```python
# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X) # Fit on all data
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ✅ CORRECT - No leakage with pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train) # Scaler fits only on train
y_pred = pipeline.predict(X_test) # Scaler transforms only on test
# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on train folds, transforms on test fold
```
### Debugging Pipelines
```python
# Examine intermediate outputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
])
# Fit pipeline
pipeline.fit(X_train, y_train)
# Get output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)
# Get output after PCA
X_pca = pipeline[:-1].transform(X_train) # All steps except last
# Or build partial pipeline
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```
### Saving and Loading Pipelines
```python
import joblib
# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load pipeline
pipeline = joblib.load('model_pipeline.pkl')
# Use loaded pipeline
y_pred = pipeline.predict(X_new)
```
## Common Errors and Solutions
**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features not encoded
- **Solution**: Add OneHotEncoder or OrdinalEncoder to pipeline
**Error**: `All intermediate steps should be transformers`
- **Cause**: Non-transformer in non-final position
- **Solution**: Ensure only last step is predictor
**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Ensure consistent column handling, use `handle_unknown='ignore'` in OneHotEncoder
**Issue**: Different results in cross-validation vs. train-test split
- **Cause**: Data leakage (preprocessing fitted on the full dataset)
- **Solution**: Always use a Pipeline for preprocessing
**Issue**: Pipeline too slow during grid search
- **Solution**: Cache expensive transformers with the `memory` parameter