Improve the scikit-learn skill

2026-01-26 16:58:56 +08:00 · 2025-11-04 10:11:46 -08:00
parent 63a4293f1a
commit 4ad4f9970f
10 changed files with 3293 additions and 3606 deletions
--- a/scientific-packages/scikit-learn/references/supervised_learning.md
+++ b/scientific-packages/scikit-learn/references/supervised_learning.md
@@ -1,261 +1,378 @@
-# Supervised Learning in scikit-learn
+# Supervised Learning Reference

 ## Overview
-Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
+
+Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.

 ## Linear Models

 ### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination
+
+**Linear Regression (`sklearn.linear_model.LinearRegression`)**
+- Ordinary least squares regression
+- Fast and interpretable, with no regularization hyperparameters to tune
+- Use when: Linear relationships, interpretability matters
+- Example:
+```python
+from sklearn.linear_model import LinearRegression
+
+model = LinearRegression()
+model.fit(X_train, y_train)
+predictions = model.predict(X_test)
+```
+
+**Ridge Regression (`sklearn.linear_model.Ridge`)**
+- L2 regularization to prevent overfitting
+- Key parameter: `alpha` (regularization strength, default=1.0)
+- Use when: Multicollinearity present, need regularization
+- Example:
+```python
+from sklearn.linear_model import Ridge
+
+model = Ridge(alpha=1.0)
+model.fit(X_train, y_train)
+```
+
+**Lasso (`sklearn.linear_model.Lasso`)**
+- L1 regularization with feature selection
+- Key parameter: `alpha` (regularization strength)
+- Use when: Want sparse models, feature selection
+- Can reduce some coefficients to exactly zero
+- Example:
+```python
+from sklearn.linear_model import Lasso
+
+model = Lasso(alpha=0.1)
+model.fit(X_train, y_train)
+# Count the features Lasso kept (those with non-zero coefficients)
+print(f"Non-zero coefficients: {(model.coef_ != 0).sum()}")
+```
+
+**ElasticNet (`sklearn.linear_model.ElasticNet`)**
+- Combines L1 and L2 regularization
+- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
+- Use when: Need both feature selection and regularization
+- Example:
+```python
+from sklearn.linear_model import ElasticNet
+
+model = ElasticNet(alpha=0.1, l1_ratio=0.5)
+model.fit(X_train, y_train)
+```
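To make the two `l1_ratio` endpoints concrete, here is a minimal pure-Python sketch of the penalty term that the scikit-learn documentation gives for `ElasticNet`. The helper name `elastic_net_penalty` and the sample coefficients are illustrative assumptions, not part of the library.

```python
def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty added to the least-squares loss (illustrative helper, not a scikit-learn API)."""
    l1 = sum(abs(w) for w in coef)          # L1 norm of the coefficients
    l2_squared = sum(w * w for w in coef)   # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


coef = [0.5, -2.0, 0.0]  # made-up coefficient vector
lasso_like = elastic_net_penalty(coef, alpha=0.1, l1_ratio=1.0)  # pure L1 term
ridge_like = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.0)  # pure L2 term
mixed = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.5)       # half of each
print(lasso_like, ridge_like, mixed)
```

With `l1_ratio=1.0` only the absolute-value term survives, which is what drives coefficients to exactly zero; with `l1_ratio=0.0` only the squared term remains, which shrinks coefficients without zeroing them.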

 ### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training

-**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
+**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
+- Binary and multiclass classification
+- Key parameters: `C` (inverse of regularization strength; smaller values regularize more), `penalty` ('l1', 'l2', 'elasticnet'; must be supported by the chosen `solver`)
+- Returns probability estimates
+- Use when: Need probabilistic predictions, interpretability
+- Example:
+```python
+from sklearn.linear_model import LogisticRegression

-**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
+model = LogisticRegression(C=1.0, max_iter=1000)
+model.fit(X_train, y_train)
+probas = model.predict_proba(X_test)
+```

-## Support Vector Machines (SVM)
+**Stochastic Gradient Descent (SGD)**
+- `SGDClassifier`, `SGDRegressor`
+- Efficient for large-scale learning; sensitive to feature scaling, so standardize inputs first
+- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
+- Use when: Very large datasets (>10^5 samples)
+- Example:
+```python
+from sklearn.linear_model import SGDClassifier

- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection
+model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
+model.fit(X_train, y_train)
+```

-**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
+## Support Vector Machines

-**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)
+**SVC (`sklearn.svm.SVC`)**
+- Classification with kernel methods
+- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
+- Use when: Small to medium datasets, complex decision boundaries
+- Note: Fit time grows at least quadratically with the number of samples, so it becomes impractical beyond tens of thousands of samples
+- Example:
+```python
+from sklearn.svm import SVC

-**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets with linear kernel.
+# Linear kernel for linearly separable data
+model_linear = SVC(kernel='linear', C=1.0)
+
+# RBF kernel for non-linear data
+model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
+model_rbf.fit(X_train, y_train)
+```
+
+**SVR (`sklearn.svm.SVR`)**
+- Regression with kernel methods
+- Shares `C`, `kernel`, and `gamma` with SVC
+- Additional parameter: `epsilon` (half-width of the tube within which errors are not penalized)
+- Example:
+```python
+from sklearn.svm import SVR
+
+model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
+model.fit(X_train, y_train)
+```

 ## Decision Trees

- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
+**DecisionTreeClassifier / DecisionTreeRegressor**
+- Non-parametric model learning decision rules
+- Key parameters:
+  - `max_depth`: Maximum tree depth (prevents overfitting)
+  - `min_samples_split`: Minimum samples to split a node
+  - `min_samples_leaf`: Minimum samples in leaf
+  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
+- Use when: Need interpretable model, non-linear relationships, mixed feature types
+- Prone to overfitting; limit `max_depth`, prune with `ccp_alpha`, or use an ensemble
+- Example:
+```python
+from sklearn.tree import DecisionTreeClassifier

-**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
+model = DecisionTreeClassifier(
+    max_depth=5,
+    min_samples_split=20,
+    min_samples_leaf=10,
+    criterion='gini'
+)
+model.fit(X_train, y_train)

-**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
-
-**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
+# Visualize the tree (needs matplotlib; feature_names and class_names are lists of strings you supply)
+from sklearn.tree import plot_tree
+plot_tree(model, feature_names=feature_names, class_names=class_names)
+```
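The two classification criteria listed above measure how mixed the labels in a node are. The following is a minimal sketch of both formulas in plain Python; the function names and the toy label lists are illustrative assumptions, not scikit-learn internals.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]   # one class only
mixed_node = ["a", "a", "b", "b"]  # evenly split between two classes

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and largest for an even split, so a tree prefers splits whose child nodes score lower than the parent.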

 ## Ensemble Methods

 ### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant

-**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
+**RandomForestClassifier / RandomForestRegressor**
+- Ensemble of decision trees with bagging
+- Key parameters:
+  - `n_estimators`: Number of trees (default=100)
+  - `max_depth`: Maximum tree depth
+  - `max_features`: Features to consider for splits ('sqrt', 'log2', or int)
+  - `min_samples_split`, `min_samples_leaf`: Control tree growth
+- Use when: High accuracy needed, can afford computation
+- Provides feature importance
+- Example:
+```python
+from sklearn.ensemble import RandomForestClassifier

-**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)
+model = RandomForestClassifier(
+    n_estimators=100,
+    max_depth=10,
+    max_features='sqrt',
+    n_jobs=-1  # Use all CPU cores
+)
+model.fit(X_train, y_train)
+
+# Feature importance
+importances = model.feature_importances_
+```

 ### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets

-**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
+**GradientBoostingClassifier / GradientBoostingRegressor**
+- Sequential ensemble building trees on residuals
+- Key parameters:
+  - `n_estimators`: Number of boosting stages
+  - `learning_rate`: Shrinks contribution of each tree
+  - `max_depth`: Depth of individual trees (typically 3-5)
+  - `subsample`: Fraction of samples for training each tree
+- Use when: Need high accuracy, can afford training time
+- Often among the most accurate models on tabular data
+- Example:
+```python
+from sklearn.ensemble import GradientBoostingClassifier

-**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving
+model = GradientBoostingClassifier(
+    n_estimators=100,
+    learning_rate=0.1,
+    max_depth=3,
+    subsample=0.8
+)
+model.fit(X_train, y_train)
+```
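As a rough illustration of "building trees on residuals" and of how `learning_rate` shrinks each contribution, the sketch below boosts one-split stumps on a tiny 1-D regression problem with squared error. It is a simplified stand-in written in plain Python, not the scikit-learn implementation; `fit_stump`, `boost`, and the data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (stand-in for a small tree)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators, learning_rate):
    """Return the training MSE after each boosting stage."""
    base = sum(y) / len(y)            # start from the mean prediction
    predictions = [base] * len(y)
    errors = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        # Add only a fraction (learning_rate) of the new stump's output.
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
        errors.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return errors


x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 3.1, 2.9, 5.2, 4.8]  # made-up targets
errors = boost(x, y, n_estimators=20, learning_rate=0.1)
print(errors[0], errors[-1])
```

A smaller `learning_rate` makes each stage correct less of the remaining error, which is why it is usually paired with a larger `n_estimators`.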

-**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
+**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
+- Faster gradient boosting with histogram-based algorithm
+- Native support for missing values and categorical features
+- Key parameters: `max_iter` (takes the place of `n_estimators`), `learning_rate`, `max_depth`, `max_leaf_nodes`
+- Use when: Large datasets, need faster training
+- Example:
+```python
+from sklearn.ensemble import HistGradientBoostingClassifier

-### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting
+model = HistGradientBoostingClassifier(
+    max_iter=100,
+    learning_rate=0.1,
+    max_depth=None,  # No limit by default
+    categorical_features='from_dtype'  # Use pandas categorical-dtype columns as categorical (needs DataFrame input)
+)
+model.fit(X_train, y_train)
+```

-**Use cases**: Boosting weak learners, less prone to overfitting than other methods
+### Other Ensemble Methods

-**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier
+**AdaBoost**
+- Adaptive boosting focusing on misclassified samples
+- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
+- Use when: Simple boosting approach needed
+- Example:
+```python
+from sklearn.ensemble import AdaBoostClassifier

-### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
+model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
+model.fit(X_train, y_train)
+```

-**Use cases**: Reducing variance of unstable models, parallel ensemble creation
+**Voting Classifier / Regressor**
+- Combines predictions from multiple models
+- `voting`: 'hard' (majority vote) or 'soft' (average predicted probabilities; every estimator must implement `predict_proba`)
+- Use when: Want to ensemble different model types
+- Example:
+```python
+from sklearn.ensemble import VotingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.svm import SVC

-**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether to use replacement
+model = VotingClassifier(
+    estimators=[
+        ('lr', LogisticRegression()),
+        ('dt', DecisionTreeClassifier()),
+        ('svc', SVC(probability=True))
+    ],
+    voting='soft'
+)
+model.fit(X_train, y_train)
+```
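Hard and soft voting can disagree on the same sample. The sketch below works through one case by hand in plain Python; the probability values are made up for illustration, and the code mirrors the idea rather than the `VotingClassifier` internals.

```python
from collections import Counter

# Hypothetical outputs of three classifiers for one sample,
# each ordered as [P(class 0), P(class 1)].
probas = [
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.60, 0.40],  # classifier B leans slightly toward class 0
    [0.10, 0.90],  # classifier C is confident in class 1
]
n_classes = len(probas[0])

# Hard voting: each classifier casts one vote for its most likely class.
votes = [max(range(n_classes), key=p.__getitem__) for p in probas]
hard = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities, then pick the largest mean.
mean_proba = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
soft = max(range(n_classes), key=mean_proba.__getitem__)

print(f"hard vote -> class {hard}, soft vote -> class {soft}")
```

Here two weakly confident classifiers win the hard vote, while the one strongly confident classifier tips the averaged probabilities the other way.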

-### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
+**Stacking Classifier / Regressor**
+- Trains a meta-model on predictions from base models
+- Learns how to combine base models instead of using a fixed vote; the meta-model is trained on cross-validated base predictions
+- Key parameter: `final_estimator` (meta-learner)
+- Example:
+```python
+from sklearn.ensemble import StackingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.svm import SVC

-**Use cases**: Combining diverse models, leveraging different model strengths
+model = StackingClassifier(
+    estimators=[
+        ('dt', DecisionTreeClassifier()),
+        ('svc', SVC())
+    ],
+    final_estimator=LogisticRegression()
+)
+model.fit(X_train, y_train)
+```

-## Neural Networks
+## K-Nearest Neighbors

- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor
+**KNeighborsClassifier / KNeighborsRegressor**
+- Non-parametric method based on distance
+- Key parameters:
+  - `n_neighbors`: Number of neighbors (default=5)
+  - `weights`: 'uniform' or 'distance'
+  - `metric`: Distance metric ('euclidean', 'manhattan', etc.)
+- Use when: Small dataset, simple baseline needed
+- Prediction slows as the training set grows, because neighbors are searched at predict time
+- Example:
+```python
+from sklearn.neighbors import KNeighborsClassifier

-**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
-
-**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving
-
-**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
-
-## Nearest Neighbors
-
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids
-
-**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
-
-**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
+model = KNeighborsClassifier(n_neighbors=5, weights='distance')
+model.fit(X_train, y_train)
+```
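The `weights` option changes how the neighbors' labels are combined. The following is a minimal plain-Python sketch of the two schemes for a single query point; the neighbor list and the `knn_vote` helper are hypothetical, and it assumes all distances are non-zero.

```python
from collections import defaultdict

# (distance to query, label) for k=3 hypothetical nearest neighbors
neighbors = [(0.1, "red"), (0.9, "blue"), (1.0, "blue")]


def knn_vote(neighbors, weights):
    """Combine neighbor labels with 'uniform' or inverse-'distance' weights."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        # 'uniform': every neighbor counts once; 'distance': closer neighbors count more.
        scores[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(scores, key=scores.get)


uniform_label = knn_vote(neighbors, "uniform")
distance_label = knn_vote(neighbors, "distance")
print(uniform_label, distance_label)
```

With uniform weights the two farther "blue" neighbors outvote the single "red" one; with distance weights the very close "red" neighbor dominates.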

 ## Naive Bayes

- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets
+**GaussianNB, MultinomialNB, BernoulliNB**
+- Probabilistic classifiers based on Bayes' theorem
+- Fast training and prediction
+- GaussianNB: Continuous features (assumes Gaussian distribution)
+- MultinomialNB: Count features (text classification)
+- BernoulliNB: Binary features
+- Use when: Text classification, fast baseline, probabilistic predictions
+- Example:
+```python
+from sklearn.naive_bayes import GaussianNB, MultinomialNB

-**Use cases**: Text classification, fast baseline, when features are independent, small training sets
+# For continuous features
+model_gaussian = GaussianNB()

-**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities
+# For text/count data
+model_multinomial = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
+model_multinomial.fit(X_train, y_train)
+```

-## Linear/Quadratic Discriminant Analysis
+## Neural Networks

- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
+**MLPClassifier / MLPRegressor**
+- Multi-layer perceptron (feedforward neural network)
+- Key parameters:
+  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
+  - `activation`: 'relu', 'tanh', 'logistic'
+  - `solver`: 'adam', 'sgd', 'lbfgs'
+  - `alpha`: L2 regularization parameter
+  - `learning_rate`: 'constant', 'invscaling', 'adaptive' (only used when `solver='sgd'`)
+- Use when: Complex non-linear patterns, large datasets
+- Requires feature scaling
+- Example:
+```python
+from sklearn.neural_network import MLPClassifier
+from sklearn.preprocessing import StandardScaler

-**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
+# Scale features first (fit on training data only; apply scaler.transform to X_test before predicting)
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)

-## Gaussian Processes
+model = MLPClassifier(
+    hidden_layer_sizes=(100, 50),
+    activation='relu',
+    solver='adam',
+    alpha=0.0001,
+    max_iter=1000
+)
+model.fit(X_train_scaled, y_train)
+```
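One way to keep the scaling step and the network together, so that test data is transformed with the statistics learned from the training data, is to wrap both in a pipeline. This is a sketch on synthetic data, assuming scikit-learn is installed; the dataset sizes, layer size, and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only and
# reuses those statistics whenever it predicts or scores.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

The same pipeline object can be passed to cross-validation utilities, which refit the scaler inside each fold and so avoid leaking test statistics into training.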

- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
+## Algorithm Selection Guide

-**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
+### Choose based on:

-**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level
+**Dataset size:**
+- Small (<1k samples): KNN, SVM, Decision Trees
+- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
+- Large (>100k): SGD, Linear Models, HistGradientBoosting

-**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
+**Interpretability:**
+- High: Linear Models, Decision Trees
+- Medium: Random Forest (feature importance)
+- Low: SVM with RBF kernel, Neural Networks

-## Stochastic Gradient Descent
+**Accuracy vs Speed:**
+- Fast training: Naive Bayes, Linear Models, KNN
+- High accuracy: Gradient Boosting, Random Forest, Stacking
+- Fast prediction: Linear Models, Naive Bayes
+- Slow prediction: KNN (on large datasets), SVM

- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD
+**Feature types:**
+- Continuous: Most algorithms work well
+- Categorical: HistGradientBoosting (native support); other tree-based models after ordinal or one-hot encoding
+- Mixed: Trees, Gradient Boosting
+- Text: Naive Bayes, Linear Models with TF-IDF

-**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
-
-**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
-
-## Semi-Supervised Learning
-
- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)
-
-**Use cases**: When labeled data is scarce but unlabeled data is abundant
-
-## Feature Selection
-
- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection
-
-**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
-
-## Probability Calibration
-
- **CalibratedClassifierCV**: Calibrate classifier probabilities
-
-**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
-
-**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
-
-## Multi-Output Methods
-
- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant
-
-**Use cases**: Predicting multiple related targets simultaneously
-
-## Specialized Regression
-
- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
-
-## Algorithm Selection Guidelines
-
-**Start with**:
-1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
-2. **RandomForestClassifier/Regressor** for general non-linear problems
-3. **HistGradientBoostingClassifier/Regressor** when best performance is needed
-
-**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
-
-**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
-
-**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
+**Common starting points:**
+1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
+2. Random Forest - good default choice
+3. Gradient Boosting - optimize for best accuracy