---
name: scikit-learn
description: Comprehensive guide for scikit-learn, Python's machine learning library. This skill should be used when building classification or regression models, performing clustering analysis, reducing dimensionality, preprocessing data (scaling, encoding, imputation), evaluating models with cross-validation and metrics, tuning hyperparameters, creating ML pipelines, detecting anomalies, or implementing any supervised or unsupervised learning tasks. Provides algorithm selection guidance, best practices for preventing data leakage, handling imbalanced data, and working with mixed data types.
---

# Scikit-learn: Machine Learning in Python

## Overview

This skill provides comprehensive guidance for using scikit-learn, Python's premier machine learning library. Scikit-learn offers simple, efficient tools for predictive data analysis, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

This skill should be used when implementing machine learning workflows, building predictive models, analyzing datasets using supervised or unsupervised learning, preprocessing data for ML tasks, evaluating model performance, or optimizing hyperparameters.

## When to Use This Skill

Activate this skill when:
- Building classification models (spam detection, image recognition, medical diagnosis)
- Creating regression models (price prediction, forecasting, trend analysis)
- Performing clustering analysis (customer segmentation, pattern discovery)
- Reducing dimensionality (PCA, t-SNE for visualization)
- Preprocessing data (scaling, encoding, imputation)
- Evaluating model performance (cross-validation, metrics)
- Tuning hyperparameters (grid search, random search)
- Creating machine learning pipelines
- Detecting anomalies or outliers
- Implementing ensemble methods

## Core Machine Learning Workflow

### Standard ML Pipeline

Follow this general workflow for supervised learning tasks (the quick starts that follow assume `X` and `y` are already loaded; see the sketch after this list):

1. **Data Preparation**
   - Load and explore data
   - Split into train/test sets
   - Handle missing values
   - Encode categorical features
   - Scale/normalize features
2. **Model Selection**
   - Start with a baseline model
   - Try more complex models
   - Use domain knowledge to guide selection
3. **Model Training**
   - Fit the model on training data
   - Use pipelines to prevent data leakage
   - Apply cross-validation
4. **Model Evaluation**
   - Evaluate on the test set
   - Use appropriate metrics
   - Analyze errors
5. **Model Optimization**
   - Tune hyperparameters
   - Engineer features
   - Try ensemble methods
6. **Deployment**
   - Save the model using joblib
   - Create a prediction pipeline
   - Monitor performance
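A minimal sketch of the setup the quick-start snippets assume: a feature matrix `X` and target vector `y`. The breast cancer dataset bundled with scikit-learn stands in for your own data here.

```python
# Stand-in data for the quick starts below; replace with your own X, y.
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)  # (569, 30) (569,)
```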
### Classification Quick Start

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Create pipeline (prevents data leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split data (use stratify for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Cross-validation for robust evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Regression Quick Start

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
# root_mean_squared_error requires scikit-learn >= 1.4; on older
# versions use mean_squared_error(y_test, y_pred, squared=False)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")
```

## Algorithm Selection Guide

### Classification Algorithms

**Start with a baseline**: LogisticRegression
- Fast, interpretable, works well for linearly separable data
- Good for high-dimensional data (text classification)

**General-purpose**: RandomForestClassifier
- Handles non-linear relationships
- Robust to outliers
- Provides feature importances
- Good default choice

**Best performance**: HistGradientBoostingClassifier
- State-of-the-art for tabular data
- Fast on large datasets (>10K samples)
- Gradient-boosted trees of this kind frequently win tabular ML competitions

**Special cases**:
- **Small datasets (<1K)**: SVC with RBF kernel
- **Very large datasets (>100K)**: SGDClassifier or LinearSVC
- **Interpretability critical**: LogisticRegression or DecisionTreeClassifier
- **Probabilistic predictions**: GaussianNB or calibrated models
- **Text classification**: LogisticRegression with TfidfVectorizer

### Regression Algorithms

**Start with a baseline**: LinearRegression or Ridge
- Fast, interpretable
- Works well when relationships are linear

**General-purpose**: RandomForestRegressor
- Handles non-linear relationships
- Robust to outliers
- Good default choice

**Best performance**: HistGradientBoostingRegressor
- State-of-the-art for tabular data
- Fast on large datasets

**Special cases**:
- **Regularization needed**: Ridge (L2) or Lasso (L1 + feature selection)
- **Very large datasets**: SGDRegressor
- **Outliers present**: HuberRegressor or RANSAC

### Clustering Algorithms

**Known number of clusters**: KMeans
- Fast and scalable
- Assumes spherical clusters

**Unknown number of clusters**: DBSCAN or HDBSCAN
- Handles arbitrary cluster shapes
- Detects outliers automatically

**Hierarchical relationships**: AgglomerativeClustering
- Creates a hierarchy of clusters
- Good for visualization (dendrograms)

**Soft clustering (probabilities)**: GaussianMixture
- Provides cluster membership probabilities
- Handles elliptical clusters
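To make the soft-clustering option concrete, a minimal sketch of `GaussianMixture` on synthetic data (`make_blobs` and the variable names are assumptions for the example):

```python
# Soft clustering: predict_proba returns a per-sample probability
# for each component instead of a single hard label.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42)
hard_labels = gmm.fit_predict(X_demo)       # hard assignments
probabilities = gmm.predict_proba(X_demo)   # shape (n_samples, 3)
print(probabilities[:3].round(3))
```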
### Dimensionality Reduction

**Preprocessing/feature extraction**: PCA
- Fast and efficient
- Linear transformation
- ALWAYS standardize first

**Visualization only**: t-SNE or UMAP
- Preserves local structure
- Non-linear
- DO NOT use for preprocessing

**Sparse data (text)**: TruncatedSVD
- Works with sparse matrices
- Latent Semantic Analysis

**Non-negative data**: NMF
- Interpretable components
- Topic modeling

## Working with Different Data Types

### Numeric Features

**Continuous features**:
1. Check the distribution
2. Handle outliers (remove, clip, or use RobustScaler)
3. Scale using StandardScaler (most algorithms) or MinMaxScaler (neural networks)

**Count data**:
1. Consider a log or square-root transformation
2. Scale after transformation

**Skewed data**:
1. Use PowerTransformer (Yeo-Johnson; Box-Cox requires strictly positive values)
2. Or QuantileTransformer for stronger normalization

### Categorical Features

**Low cardinality (<10 categories)**:
```python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=True)
```

**High cardinality (>10 categories)**:
```python
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3
encoder = TargetEncoder()  # Uses target statistics; prevents leakage via internal cross-fitting
```

**Ordinal relationships**:
```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
```

### Text Data

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])
text_pipeline.fit(X_train_text, y_train)
```

### Mixed Data Types

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']

# Separate preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
```

## Model Evaluation

### Classification Metrics

**Balanced datasets**: Use accuracy or F1-score

**Imbalanced datasets**: Use balanced accuracy, weighted F1, or ROC-AUC

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

balanced_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# ROC-AUC requires probabilities
y_proba = model.predict_proba(X_test)
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')  # binary tasks: pass y_proba[:, 1]
```

**Cost-sensitive**: Define a custom scorer or adjust the decision threshold, as in the sketch below.
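A minimal sketch of both cost-sensitive options. The cost values and threshold are hypothetical placeholders; tune them to your application.

```python
import numpy as np
from sklearn.metrics import make_scorer

# Option 1: custom scorer, usable as scoring= in cross_val_score/GridSearchCV
def cost_score(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Negative total cost; higher is better, as scikit-learn scorers expect."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return -(fn * fn_cost + fp * fp_cost)

cost_scorer = make_scorer(cost_score)

# Option 2: move the decision threshold on predicted probabilities
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba >= 0.3).astype(int)  # lower threshold catches more positives
```

Note that scikit-learn 1.5+ also provides `TunedThresholdClassifierCV` in `sklearn.model_selection` to tune the threshold automatically.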
**Comprehensive report**:
```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```

### Regression Metrics

**Standard use**: RMSE and R²
```python
from sklearn.metrics import root_mean_squared_error, r2_score

# root_mean_squared_error requires scikit-learn >= 1.4; on older
# versions use mean_squared_error(y_true, y_pred, squared=False)
rmse = root_mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

**Outliers present**: Use MAE (robust to outliers)
```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```

**Percentage errors matter**: Use MAPE
```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```

### Cross-Validation

**Standard approach** (5-10 folds):
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

**Imbalanced classes** (use stratification):
```python
from sklearn.model_selection import StratifiedKFold

# With an integer cv, classifiers already get stratified folds; an
# explicit StratifiedKFold adds control over shuffling and seeding.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
```

**Time series** (respect temporal order):
```python
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```

**Multiple metrics**:
```python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f}")
```

## Hyperparameter Tuning

### Grid Search (Exhaustive)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
```

### Random Search (Faster)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'max_features': uniform(0.1, 0.9)  # floats in [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # Number of combinations to try
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
```

### Pipeline Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Use double underscore for nested parameters
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
```
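When the grid is large, successive halving can be much cheaper than exhaustive search. A minimal sketch, assuming the same estimator and a reduced grid; the searcher is still experimental, hence the explicit `enable_*` import:

```python
# Successive halving: evaluates many candidates on small budgets and
# promotes only the best fraction to the next round.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
}

halving_search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    factor=3,        # keep roughly the top 1/3 of candidates each round
    cv=5,
    n_jobs=-1,
    random_state=42
)
halving_search.fit(X_train, y_train)
print(halving_search.best_params_)
```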
## Feature Engineering and Selection

### Feature Importance

```python
# Tree-based models have built-in feature importances
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

importances = model.feature_importances_
# feature_names is assumed to be the list of column names for X
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

# Permutation importance (works for any fitted model)
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=10, random_state=42, n_jobs=-1
)

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
    'std': result.importances_std
}).sort_values('importance', ascending=False)
```

### Feature Selection Methods

**Univariate selection**:
```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = selector.get_support(indices=True)
```

**Recursive Feature Elimination**:
```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(
    RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    n_jobs=-1
)
X_selected = selector.fit_transform(X, y)
print(f"Optimal features: {selector.n_features_}")
```

**Model-based selection**:
```python
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'  # or '0.5*mean', or a specific value
)
X_selected = selector.fit_transform(X, y)
```

### Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])
pipeline.fit(X_train, y_train)
```

## Common Patterns and Best Practices

### Always Use Pipelines

Pipelines prevent data leakage and ensure a proper workflow:

✅ **Correct**:
```python
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

❌ **Wrong** (data leakage):
```python
scaler = StandardScaler().fit(X)  # Fit on all data!
X_train, X_test = train_test_split(scaler.transform(X))
```
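The same contrast matters under cross-validation: cross-validating the whole pipeline refits the scaler inside each training fold, so no validation-fold statistics leak in. A minimal sketch:

```python
# The scaler is refit on each training fold; validation folds are
# only transformed, never used to compute scaling statistics.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Leakage-free CV accuracy: {scores.mean():.3f}")
```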
### Stratify for Imbalanced Classes

```python
# Always use stratify for classification with imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

### Scale When Necessary

**Scale for**: SVM, neural networks, KNN, linear models with regularization, PCA, gradient descent

**Don't scale for**: tree-based models (Random Forest, Gradient Boosting), Naive Bayes

### Handle Missing Values

```python
from sklearn.impute import SimpleImputer

# Numeric: use median (robust to outliers)
imputer = SimpleImputer(strategy='median')

# Categorical: use a constant value or most_frequent
imputer = SimpleImputer(strategy='constant', fill_value='missing')
```

### Use Appropriate Metrics

- **Balanced classification**: accuracy, F1
- **Imbalanced classification**: balanced accuracy, weighted F1, ROC-AUC
- **Regression with outliers**: MAE instead of RMSE
- **Cost-sensitive**: custom scorer

### Set Random States

```python
# For reproducibility
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)
```

### Use Parallel Processing

```python
# Use all CPU cores
model = RandomForestClassifier(n_jobs=-1)
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)
```

## Unsupervised Learning

### Clustering Workflow

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Always scale for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method and silhouette analysis to find the optimal k
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot and choose k
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia')
ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('k')
ax2.set_ylabel('Silhouette Score')
plt.show()

# Fit final model
optimal_k = 5  # Based on elbow/silhouette
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels = kmeans.fit_predict(X_scaled)
```

### Dimensionality Reduction

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ALWAYS scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify the variance to retain
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")

# Visualize explained variance
import matplotlib.pyplot as plt
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```
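PCA also works as a preprocessing step inside a supervised pipeline, which keeps the standardize-then-project order automatic and leakage-free. A minimal sketch:

```python
# Scaling, projection, and classification chained so each step is fit
# only on training data (and on training folds during cross-validation).
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),   # keep 95% of the variance
    ('classifier', LogisticRegression(max_iter=1000))
])
pca_pipeline.fit(X_train, y_train)
print(pca_pipeline.score(X_test, y_test))
```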
### Visualization with t-SNE

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Reduce to 50 dimensions with PCA first (faster)
pca = PCA(n_components=min(50, X.shape[1]))
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE (only for visualization!)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar()
plt.title('t-SNE Visualization')
plt.show()
```

## Saving and Loading Models

```python
import joblib

# Save model or pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')

# Load
loaded_model = joblib.load('model.pkl')
loaded_pipeline = joblib.load('pipeline.pkl')

# Use loaded model
predictions = loaded_model.predict(X_new)
```

## Reference Documentation

This skill includes comprehensive reference files:

- **`references/supervised_learning.md`**: Detailed coverage of all classification and regression algorithms, parameters, use cases, and selection guidelines
- **`references/preprocessing.md`**: Complete guide to data preprocessing including scaling, encoding, imputation, transformations, and best practices
- **`references/model_evaluation.md`**: In-depth coverage of cross-validation strategies, metrics, hyperparameter tuning, and validation techniques
- **`references/unsupervised_learning.md`**: Comprehensive guide to clustering, dimensionality reduction, anomaly detection, and evaluation methods
- **`references/pipelines_and_composition.md`**: Complete guide to Pipeline, ColumnTransformer, FeatureUnion, custom transformers, and composition patterns
- **`references/quick_reference.md`**: Quick lookup guide with code snippets, common patterns, and decision trees for algorithm selection

Read these files when you:
- Need detailed parameter explanations for specific algorithms
- Are comparing multiple algorithms for a task
- Want to understand evaluation metrics in depth
- Are building complex preprocessing workflows
- Are troubleshooting common issues

Example search patterns:

```bash
# To find information about specific algorithms
grep -r "GradientBoosting" references/

# To find preprocessing techniques
grep -r "OneHotEncoder" references/preprocessing.md

# To find evaluation metrics
grep -r "f1_score" references/model_evaluation.md
```

## Common Pitfalls to Avoid

1. **Data leakage**: Always use pipelines; fit only on training data
2. **Not scaling**: Scale for distance-based algorithms (SVM, KNN, neural networks)
3. **Wrong metrics**: Use appropriate metrics for imbalanced data
4. **Not using cross-validation**: A single train-test split can be misleading
5. **Forgetting stratification**: Stratify for imbalanced classification
6. **Using t-SNE for preprocessing**: t-SNE is for visualization only!
7. **Not setting random_state**: Results won't be reproducible
8. **Ignoring class imbalance**: Use stratification, appropriate metrics, or resampling (see the sketch below)
9. **PCA without scaling**: Components will be dominated by high-variance features
10. **Testing on training data**: Always evaluate on a held-out test set
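For pitfall 8, one built-in alternative (or complement) to resampling is class weighting; many scikit-learn classifiers accept `class_weight='balanced'`. A minimal sketch:

```python
# class_weight='balanced' reweights the loss inversely to class
# frequency, boosting the influence of minority-class samples.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
```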