mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-28 07:33:45 +08:00

# Pipelines and Composite Estimators in scikit-learn

## Overview

Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."

## Pipeline Basics

### Creating Pipelines

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Method 2: make_pipeline (auto-generates step names)
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)
```

### Using Pipelines

```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Access steps
pipeline.named_steps['scaler']  # StandardScaler(...) object
pipeline.steps[0]               # ('scaler', StandardScaler(...)) tuple
pipeline[0]                     # StandardScaler(...) object
pipeline['scaler']              # StandardScaler(...) object

# Get the final estimator
pipeline[-1]                    # LogisticRegression(...) object
```

### Pipeline Rules

**All steps except the last must be transformers** (implement `fit()` and `transform()`).

**The final step** can be:
- A predictor (classifier/regressor) with `fit()` and `predict()`
- A transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`

### Pipeline Benefits

1. **Convenience**: a single `fit()` and `predict()` call runs every step
2. **Prevents data leakage**: transformers fit on training data only, then transform test data
3. **Joint parameter selection**: tune all steps together with GridSearchCV
4. **Reproducibility**: the entire workflow is encapsulated in one object
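
These benefits can be seen in a minimal end-to-end sketch; the synthetic dataset from `make_classification` is an assumption purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# One call fits scaler, PCA, and classifier in sequence on the training data
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # mean accuracy on held-out data
```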

## Accessing and Setting Parameters

### Nested Parameters

Access step parameters with the `stepname__parameter` syntax:

```python
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both penalties searched below; the default
    # lbfgs solver would raise an error for penalty='l1'
    ('clf', LogisticRegression(solver='liblinear'))
])

# Grid search over pipeline parameters
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

### Setting Parameters

```python
# Set parameters on existing steps
pipeline.set_params(clf__C=10.0, scaler__with_std=False)

# Get all parameters as a dict
params = pipeline.get_params()
```
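
A quick round trip shows that `set_params` and `get_params` use the same nested naming convention:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Nested names follow the stepname__parameter convention
pipeline.set_params(clf__C=10.0, scaler__with_std=False)
params = pipeline.get_params()
```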

## Caching Intermediate Results

Cache fitted transformers to avoid recomputation:

```python
from tempfile import mkdtemp
from shutil import rmtree

# Create a temporary cache directory
cachedir = mkdtemp()

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', LogisticRegression())
], memory=cachedir)

# During grid search, identical scaler/PCA fits are reused from the
# cache instead of being recomputed for every parameter combination
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Clean up the cache
rmtree(cachedir)

# Or use joblib.Memory for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```

**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search that varies only final-estimator parameters
- Multiple experiments sharing the same preprocessing

## ColumnTransformer

Apply different transformations to different columns (essential for heterogeneous data).

### Basic Usage

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Map each transformer to the columns it should handle
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'  # what to do with the remaining columns
)

X_transformed = preprocessor.fit_transform(X)
```
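
On a concrete DataFrame the output width is the scaled numeric columns plus one column per observed category; the tiny frame below is made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Tiny illustrative DataFrame (values are made up)
X = pd.DataFrame({
    'age': [25, 40, 33, 58],
    'income': [40_000, 90_000, 60_000, 120_000],
    'credit_score': [650, 720, 700, 780],
    'country': ['US', 'DE', 'US', 'FR'],
    'occupation': ['eng', 'doc', 'eng', 'law'],
})

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'
)

# 3 scaled numeric columns + 3 country categories + 3 occupation categories
X_t = preprocessor.fit_transform(X)
```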

### Column Selection Methods

```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])

# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])

# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])

# Method 4: Slice
('num', StandardScaler(), slice(0, 3))

# Method 5: make_column_selector (by dtype or name pattern)
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), selector(dtype_include='number')),
    ('cat', OneHotEncoder(), selector(dtype_include='object'))
])

# Select by name pattern
selector(pattern='.*_score$')  # all columns ending with '_score'
```

### Remainder Parameter

Controls what happens to columns not listed in any transformer:

```python
# Drop the remaining columns (default)
remainder='drop'

# Pass the remaining columns through unchanged
remainder='passthrough'

# Apply a transformer to the remaining columns
remainder=StandardScaler()
```
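
The effect of `remainder` shows up directly in the output width; here is a minimal sketch with a hypothetical two-column frame where only `'a'` is listed in a transformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame; column 'b' is not assigned to any transformer
X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# remainder='drop' discards 'b'
dropped = ColumnTransformer(
    [('num', StandardScaler(), ['a'])], remainder='drop'
).fit_transform(X)

# remainder='passthrough' keeps 'b' unchanged
kept = ColumnTransformer(
    [('num', StandardScaler(), ['a'])], remainder='passthrough'
).fit_transform(X)
```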

### Full Pipeline with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Separate preprocessing for numeric and categorical features
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Grid search over preprocessing and model parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__onehot__max_categories': [10, 20, None],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

## FeatureUnion

Combine multiple transformer outputs by concatenating their features side by side.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA components with univariately selected features
combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)

# In a pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression())
])
```
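
The 10 + 5 = 15 feature count can be verified on synthetic data (the `make_classification` dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# 30 input features, reduced/selected down to 10 + 5
X, y = make_classification(n_samples=100, n_features=30, random_state=0)

combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
```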

### FeatureUnion with Transformers on Different Data

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def get_numeric_data(X):
    return X[:, :3]  # first 3 columns (numeric)

def get_text_data(X):
    return X[:, 3]   # 4th column (text)

combined = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(get_numeric_data)),
        ('scaler', StandardScaler())
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(get_text_data)),
        ('tfidf', TfidfVectorizer())
    ]))
])
```

**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.

## Common Pipeline Patterns

### Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', SVC(kernel='rbf'))
])
```

### Regression Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])
```
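
Run against a synthetic linear target (an assumption for illustration), this pipeline fits well because the degree-2 polynomial expansion includes the original linear terms:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Low-noise synthetic regression problem
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])

pipeline.fit(X, y)
r2 = pipeline.score(X, y)  # R^2 on the training data
```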

### Text Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', MultinomialNB())
])

# Works directly with raw text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```
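
A toy run on a made-up six-document corpus (labels and texts are invented for illustration) shows the pipeline consuming raw strings end to end:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative
docs = ["good movie", "great film", "wonderful acting",
        "bad movie", "terrible film", "awful plot"]
labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# No manual vectorization: fit and predict take raw strings
pipeline.fit(docs, labels)
pred = pipeline.predict(["great acting"])
```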

### Image Processing Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```

### Dimensionality Reduction + Clustering

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('kmeans', KMeans(n_clusters=5))
])

labels = pipeline.fit_predict(X)
```
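
`fit_predict` works on the pipeline because the final step (KMeans) implements it; a sketch on synthetic blobs (an assumption for illustration) returns one cluster label per sample:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 5 well-separated blobs in 20 dimensions
X, _ = make_blobs(n_samples=200, n_features=20, centers=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    # explicit n_init keeps behavior stable across sklearn versions
    ('kmeans', KMeans(n_clusters=5, n_init=10, random_state=0))
])

# Scale, project, cluster, and return labels in one call
labels = pipeline.fit_predict(X)
```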

## Custom Transformers

### Using FunctionTransformer

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LinearRegression

# Log transformation (log1p handles zeros safely)
log_transformer = FunctionTransformer(np.log1p)

# Custom stateless function
def custom_transform(X):
    # Your transformation logic; clipping negatives is just a placeholder
    return np.clip(X, 0, None)

custom_transformer = FunctionTransformer(custom_transform)

# In a pipeline
pipeline = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
```
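
Pairing the forward function with `inverse_func` makes a FunctionTransformer step invertible, which a quick round trip confirms:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# expm1 is the exact inverse of log1p
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X = np.array([[0.0], [1.0], [9.0]])
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)  # recovers the original
```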

### Creating a Custom Transformer Class

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X
        self.learned_param_ = X.mean()  # example statistic
        return self

    def transform(self, X):
        # Transform X using the learned parameters
        return X * self.parameter - self.learned_param_

    # Optional: for pipelines that need inverse transforms
    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

# Use in a pipeline
pipeline = Pipeline([
    ('custom', CustomTransformer(parameter=2.0)),
    ('model', LinearRegression())
])
```

**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use a trailing underscore for learned attributes (`learned_param_`)
- Store constructor parameters as attributes, unmodified
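
These requirements can be exercised with a quick round-trip check (re-stating the `CustomTransformer` from the example so the snippet is self-contained):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        self.learned_param_ = X.mean()
        return self

    def transform(self, X):
        return X * self.parameter - self.learned_param_

    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

X = np.arange(6.0).reshape(3, 2)
t = CustomTransformer(parameter=2.0)

fitted = t.fit(X)                  # fit() must return self
X_t = t.transform(X)
X_back = t.inverse_transform(X_t)  # round-trips to the original
params = t.get_params()            # inherited from BaseEstimator
```

Because the constructor stores `parameter` unmodified, the inherited `get_params()`/`set_params()` work automatically, which is what makes the transformer grid-searchable.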

### Transformer for Pandas DataFrames

```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            if self.columns:
                return X[self.columns].values
            return X.values
        return X
```

## Visualization

### Display Pipeline in Jupyter

```python
from sklearn import set_config

# Enable HTML display
set_config(display='diagram')

# Displaying the pipeline now shows an interactive diagram
pipeline
```

### Print Pipeline Structure

```python
from sklearn.utils import estimator_html_repr

# Get the HTML representation
html = estimator_html_repr(pipeline)

# Or just print the nested structure
print(pipeline)
```

## Advanced Patterns

### Conditional Transformations

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def conditional_scale(X, scale=True):
    # Caution: this fits a fresh scaler on every call, including at
    # predict time; for leak-free scaling, prefer a StandardScaler step
    # that can be disabled via set_params(conditional_scaler='passthrough')
    if scale:
        return StandardScaler().fit_transform(X)
    return X

pipeline = Pipeline([
    ('conditional_scaler', FunctionTransformer(
        conditional_scale,
        kw_args={'scale': True}
    )),
    ('model', LogisticRegression())
])
```

### Multiple Preprocessing Paths

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
    # Numeric: impute + scale
    ('num_standard', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'income']),

    # Skewed numeric: impute + log + scale
    ('num_skewed', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log', FunctionTransformer(np.log1p)),
        ('scaler', StandardScaler())
    ]), ['price', 'revenue']),

    # Categorical: impute + one-hot
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), ['category', 'region']),

    # Text: TF-IDF (note the scalar column name, not a list)
    ('text', TfidfVectorizer(), 'description')
])
```

### Feature Engineering Pipeline

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Add engineered features
        X['age_income_ratio'] = X['age'] / (X['income'] + 1)
        X['total_score'] = X['score1'] + X['score2'] + X['score3']
        return X

pipeline = Pipeline([
    ('engineer', FeatureEngineer()),
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
```

## Best Practices

### Always Use Pipelines When

1. **Preprocessing is needed**: scaling, encoding, imputation
2. **Cross-validating**: ensures the fit/transform split is respected in every fold
3. **Tuning hyperparameters**: jointly optimizes preprocessing and model
4. **Deploying to production**: a single object to serialize
5. **Chaining multiple steps**: any workflow with more than one step

### Pipeline Do's

- ✅ Fit the pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug slow or failing steps
- ✅ Use remainder='passthrough' when appropriate

### Pipeline Don'ts

- ❌ Fit preprocessing on the full dataset before splitting (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip pipelines for "just one preprocessing step"

### Data Leakage Prevention

```python
from sklearn.model_selection import train_test_split, cross_val_score

# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X)  # fit on all data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ CORRECT - No leakage with a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)     # scaler fits only on train
y_pred = pipeline.predict(X_test)  # scaler only transforms X_test

# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on the train folds, transforms the test fold
```
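
The cross-validation case can be sketched end to end on synthetic data (the `make_classification` dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# The scaler is re-fit inside each training fold, so no test-fold
# statistics ever leak into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)
```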

### Debugging Pipelines

```python
# Examine intermediate outputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)

# Output after PCA (the slice runs every step except the last)
X_pca = pipeline[:-1].transform(X_train)

# Or build a partial pipeline from the already-fitted steps
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```

### Saving and Loading Pipelines

```python
import joblib

# Save the fitted pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Load it back
pipeline = joblib.load('model_pipeline.pkl')

# Use the loaded pipeline
y_pred = pipeline.predict(X_new)
```
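
A save/load round trip should reproduce predictions exactly; the sketch below uses a temporary directory and synthetic data (both assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
]).fit(X, y)

# Round-trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), 'model_pipeline.pkl')
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline makes identical predictions
same = np.array_equal(pipeline.predict(X), restored.predict(X))
```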

## Common Errors and Solutions

**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features were not encoded
- **Solution**: Add a OneHotEncoder or OrdinalEncoder step to the pipeline

**Error**: `All intermediate steps should be transformers`
- **Cause**: A non-transformer in a non-final position
- **Solution**: Ensure only the last step is a predictor

**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Keep column handling consistent; use `handle_unknown='ignore'` in OneHotEncoder

**Error**: Different results in cross-validation vs train-test split
- **Cause**: Data leakage (preprocessing fit on all data)
- **Solution**: Always use a Pipeline for preprocessing

**Error**: Pipeline too slow during grid search
- **Solution**: Enable caching with the `memory` parameter