mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-28 07:33:45 +08:00

# Pipelines and Composite Estimators in scikit-learn

## Overview

Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."

## Pipeline Basics

### Creating Pipelines

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Method 2: make_pipeline (auto-generates step names)
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)
```

### Using Pipelines

```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Access steps
pipeline.named_steps['scaler']  # StandardScaler(...) object
pipeline.steps[0]               # ('scaler', StandardScaler(...)) tuple
pipeline[0]                     # StandardScaler(...) object
pipeline['scaler']              # StandardScaler(...) object

# Get the final estimator
pipeline[-1]                    # LogisticRegression(...) object
```

### Pipeline Rules

**All steps except the last must be transformers** (implement `fit()` and `transform()`).

**The final step** can be:
- A predictor (classifier/regressor) with `fit()` and `predict()`
- A transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`

### Pipeline Benefits

1. **Convenience**: a single `fit()` and `predict()` call runs every step
2. **Prevents data leakage**: transformers fit on training data only, then transform test data
3. **Joint parameter selection**: tune all steps together with GridSearchCV
4. **Reproducibility**: the entire workflow is encapsulated in one object
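
These benefits can be seen in a minimal end-to-end sketch; the synthetic dataset from `make_classification` is an assumption purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# One call fits scaler, PCA, and classifier in sequence on the training data
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # mean accuracy on held-out data
```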

## Accessing and Setting Parameters

### Nested Parameters

Access step parameters with the `stepname__parameter` syntax:

```python
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both penalties searched below; the default
    # lbfgs solver would raise an error for penalty='l1'
    ('clf', LogisticRegression(solver='liblinear'))
])

# Grid search over pipeline parameters
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

### Setting Parameters

```python
# Set parameters on existing steps
pipeline.set_params(clf__C=10.0, scaler__with_std=False)

# Get all parameters as a dict
params = pipeline.get_params()
```
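
A quick round trip shows that `set_params` and `get_params` use the same nested naming convention:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Nested names follow the stepname__parameter convention
pipeline.set_params(clf__C=10.0, scaler__with_std=False)
params = pipeline.get_params()
```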

## Caching Intermediate Results

Cache fitted transformers to avoid recomputation:

```python
from tempfile import mkdtemp
from shutil import rmtree

# Create a temporary cache directory
cachedir = mkdtemp()

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', LogisticRegression())
], memory=cachedir)

# During grid search, identical scaler/PCA fits are reused from the
# cache instead of being recomputed for every parameter combination
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Clean up the cache
rmtree(cachedir)

# Or use joblib.Memory for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```

**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search that varies only final-estimator parameters
- Multiple experiments sharing the same preprocessing

## ColumnTransformer

Apply different transformations to different columns (essential for heterogeneous data).

### Basic Usage

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Map each transformer to the columns it should handle
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'  # what to do with the remaining columns
)

X_transformed = preprocessor.fit_transform(X)
```
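
On a concrete DataFrame the output width is the scaled numeric columns plus one column per observed category; the tiny frame below is made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Tiny illustrative DataFrame (values are made up)
X = pd.DataFrame({
    'age': [25, 40, 33, 58],
    'income': [40_000, 90_000, 60_000, 120_000],
    'credit_score': [650, 720, 700, 780],
    'country': ['US', 'DE', 'US', 'FR'],
    'occupation': ['eng', 'doc', 'eng', 'law'],
})

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'
)

# 3 scaled numeric columns + 3 country categories + 3 occupation categories
X_t = preprocessor.fit_transform(X)
```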

### Column Selection Methods

```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])

# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])

# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])

# Method 4: Slice
('num', StandardScaler(), slice(0, 3))

# Method 5: make_column_selector (by dtype or name pattern)
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), selector(dtype_include='number')),
    ('cat', OneHotEncoder(), selector(dtype_include='object'))
])

# Select by name pattern
selector(pattern='.*_score$')  # all columns ending with '_score'
```

### Remainder Parameter

Controls what happens to columns not listed in any transformer:

```python
# Drop the remaining columns (default)
remainder='drop'

# Pass the remaining columns through unchanged
remainder='passthrough'

# Apply a transformer to the remaining columns
remainder=StandardScaler()
```
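
The effect of `remainder` shows up directly in the output width; here is a minimal sketch with a hypothetical two-column frame where only `'a'` is listed in a transformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame; column 'b' is not assigned to any transformer
X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# remainder='drop' discards 'b'
dropped = ColumnTransformer(
    [('num', StandardScaler(), ['a'])], remainder='drop'
).fit_transform(X)

# remainder='passthrough' keeps 'b' unchanged
kept = ColumnTransformer(
    [('num', StandardScaler(), ['a'])], remainder='passthrough'
).fit_transform(X)
```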

### Full Pipeline with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Separate preprocessing for numeric and categorical features
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Grid search over preprocessing and model parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__onehot__max_categories': [10, 20, None],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

## FeatureUnion

Combine multiple transformer outputs by concatenating their features side by side.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA components with univariately selected features
combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)

# In a pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression())
])
```
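
The 10 + 5 = 15 feature count can be verified on synthetic data (the `make_classification` dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# 30 input features, reduced/selected down to 10 + 5
X, y = make_classification(n_samples=100, n_features=30, random_state=0)

combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
```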

### FeatureUnion with Transformers on Different Data

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def get_numeric_data(X):
    return X[:, :3]  # first 3 columns (numeric)

def get_text_data(X):
    return X[:, 3]   # 4th column (text)

combined = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(get_numeric_data)),
        ('scaler', StandardScaler())
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(get_text_data)),
        ('tfidf', TfidfVectorizer())
    ]))
])
```

**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.

## Common Pipeline Patterns

### Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', SVC(kernel='rbf'))
])
```

### Regression Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])
```
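
Run against a synthetic linear target (an assumption for illustration), this pipeline fits well because the degree-2 polynomial expansion includes the original linear terms:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Low-noise synthetic regression problem
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])

pipeline.fit(X, y)
r2 = pipeline.score(X, y)  # R^2 on the training data
```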

### Text Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', MultinomialNB())
])

# Works directly with raw text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```
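
A toy run on a made-up six-document corpus (labels and texts are invented for illustration) shows the pipeline consuming raw strings end to end:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative
docs = ["good movie", "great film", "wonderful acting",
        "bad movie", "terrible film", "awful plot"]
labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# No manual vectorization: fit and predict take raw strings
pipeline.fit(docs, labels)
pred = pipeline.predict(["great acting"])
```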

### Image Processing Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```

### Dimensionality Reduction + Clustering

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('kmeans', KMeans(n_clusters=5))
])

labels = pipeline.fit_predict(X)
```
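
`fit_predict` works on the pipeline because the final step (KMeans) implements it; a sketch on synthetic blobs (an assumption for illustration) returns one cluster label per sample:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 5 well-separated blobs in 20 dimensions
X, _ = make_blobs(n_samples=200, n_features=20, centers=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    # explicit n_init keeps behavior stable across sklearn versions
    ('kmeans', KMeans(n_clusters=5, n_init=10, random_state=0))
])

# Scale, project, cluster, and return labels in one call
labels = pipeline.fit_predict(X)
```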

## Custom Transformers

### Using FunctionTransformer

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LinearRegression

# Log transformation (log1p handles zeros safely)
log_transformer = FunctionTransformer(np.log1p)

# Custom stateless function
def custom_transform(X):
    # Your transformation logic; clipping negatives is just a placeholder
    return np.clip(X, 0, None)

custom_transformer = FunctionTransformer(custom_transform)

# In a pipeline
pipeline = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
```
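
Pairing the forward function with `inverse_func` makes a FunctionTransformer step invertible, which a quick round trip confirms:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# expm1 is the exact inverse of log1p
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X = np.array([[0.0], [1.0], [9.0]])
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)  # recovers the original
```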

### Creating a Custom Transformer Class

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X
        self.learned_param_ = X.mean()  # example statistic
        return self

    def transform(self, X):
        # Transform X using the learned parameters
        return X * self.parameter - self.learned_param_

    # Optional: for pipelines that need inverse transforms
    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

# Use in a pipeline
pipeline = Pipeline([
    ('custom', CustomTransformer(parameter=2.0)),
    ('model', LinearRegression())
])
```

**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use a trailing underscore for learned attributes (`learned_param_`)
- Store constructor parameters as attributes, unmodified
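
These requirements can be exercised with a quick round-trip check (re-stating the `CustomTransformer` from the example so the snippet is self-contained):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        self.learned_param_ = X.mean()
        return self

    def transform(self, X):
        return X * self.parameter - self.learned_param_

    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

X = np.arange(6.0).reshape(3, 2)
t = CustomTransformer(parameter=2.0)

fitted = t.fit(X)                  # fit() must return self
X_t = t.transform(X)
X_back = t.inverse_transform(X_t)  # round-trips to the original
params = t.get_params()            # inherited from BaseEstimator
```

Because the constructor stores `parameter` unmodified, the inherited `get_params()`/`set_params()` work automatically, which is what makes the transformer grid-searchable.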

### Transformer for Pandas DataFrames

```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            if self.columns:
                return X[self.columns].values
            return X.values
        return X
```

## Visualization

### Display Pipeline in Jupyter

```python
from sklearn import set_config

# Enable HTML display
set_config(display='diagram')

# Displaying the pipeline now shows an interactive diagram
pipeline
```

### Print Pipeline Structure

```python
from sklearn.utils import estimator_html_repr

# Get the HTML representation
html = estimator_html_repr(pipeline)

# Or just print the nested structure
print(pipeline)
```

## Advanced Patterns

### Conditional Transformations

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def conditional_scale(X, scale=True):
    # Caution: this fits a fresh scaler on every call, including at
    # predict time; for leak-free scaling, prefer a StandardScaler step
    # that can be disabled via set_params(conditional_scaler='passthrough')
    if scale:
        return StandardScaler().fit_transform(X)
    return X

pipeline = Pipeline([
    ('conditional_scaler', FunctionTransformer(
        conditional_scale,
        kw_args={'scale': True}
    )),
    ('model', LogisticRegression())
])
```

### Multiple Preprocessing Paths

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
    # Numeric: impute + scale
    ('num_standard', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'income']),

    # Skewed numeric: impute + log + scale
    ('num_skewed', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log', FunctionTransformer(np.log1p)),
        ('scaler', StandardScaler())
    ]), ['price', 'revenue']),

    # Categorical: impute + one-hot
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), ['category', 'region']),

    # Text: TF-IDF (note the scalar column name, not a list)
    ('text', TfidfVectorizer(), 'description')
])
```

### Feature Engineering Pipeline

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Add engineered features
        X['age_income_ratio'] = X['age'] / (X['income'] + 1)
        X['total_score'] = X['score1'] + X['score2'] + X['score3']
        return X

pipeline = Pipeline([
    ('engineer', FeatureEngineer()),
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
```

## Best Practices

### Always Use Pipelines When

1. **Preprocessing is needed**: scaling, encoding, imputation
2. **Cross-validating**: ensures the fit/transform split is respected in every fold
3. **Tuning hyperparameters**: jointly optimizes preprocessing and model
4. **Deploying to production**: a single object to serialize
5. **Chaining multiple steps**: any workflow with more than one step

### Pipeline Do's

- ✅ Fit the pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug slow or failing steps
- ✅ Use remainder='passthrough' when appropriate

### Pipeline Don'ts

- ❌ Fit preprocessing on the full dataset before splitting (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip pipelines for "just one preprocessing step"

### Data Leakage Prevention

```python
from sklearn.model_selection import train_test_split, cross_val_score

# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X)  # fit on all data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ CORRECT - No leakage with a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)     # scaler fits only on train
y_pred = pipeline.predict(X_test)  # scaler only transforms X_test

# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on the train folds, transforms the test fold
```
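
The cross-validation case can be sketched end to end on synthetic data (the `make_classification` dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# The scaler is re-fit inside each training fold, so no test-fold
# statistics ever leak into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)
```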

### Debugging Pipelines

```python
# Examine intermediate outputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)

# Output after PCA (the slice runs every step except the last)
X_pca = pipeline[:-1].transform(X_train)

# Or build a partial pipeline from the already-fitted steps
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```

### Saving and Loading Pipelines

```python
import joblib

# Save the fitted pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Load it back
pipeline = joblib.load('model_pipeline.pkl')

# Use the loaded pipeline
y_pred = pipeline.predict(X_new)
```
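
A save/load round trip should reproduce predictions exactly; the sketch below uses a temporary directory and synthetic data (both assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
]).fit(X, y)

# Round-trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), 'model_pipeline.pkl')
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline makes identical predictions
same = np.array_equal(pipeline.predict(X), restored.predict(X))
```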

## Common Errors and Solutions

**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features were not encoded
- **Solution**: Add a OneHotEncoder or OrdinalEncoder step to the pipeline

**Error**: `All intermediate steps should be transformers`
- **Cause**: A non-transformer in a non-final position
- **Solution**: Ensure only the last step is a predictor

**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Keep column handling consistent; use `handle_unknown='ignore'` in OneHotEncoder

**Error**: Different results in cross-validation vs train-test split
- **Cause**: Data leakage (preprocessing fit on all data)
- **Solution**: Always use a Pipeline for preprocessing

**Error**: Pipeline too slow during grid search
- **Solution**: Enable caching with the `memory` parameter