Add more scientific skills

Timothy Kassis
2025-10-19 14:12:02 -07:00
parent 78d5ac2b56
commit 660c8574d0
210 changed files with 88957 additions and 1 deletion


@@ -0,0 +1,261 @@
# Supervised Learning in scikit-learn
## Overview
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
## Linear Models
### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian ridge regression with priors over the coefficients (for automatic relevance determination, see ARDRegression)
### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training
**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
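A minimal sketch of Ridge and LogisticRegression on synthetic data; the datasets and parameter values below are illustrative, not recommendations:

```python
# Minimal sketch: Ridge regression and LogisticRegression on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Regression with L2 regularization (alpha controls the strength).
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ridge = Ridge(alpha=1.0, fit_intercept=True).fit(X_train, y_train)
print("Ridge R^2:", ridge.score(X_test, y_test))

# Multiclass classification with the lbfgs solver.
Xc, yc = make_classification(n_samples=300, n_features=10, n_classes=3,
                             n_informative=5, random_state=0)
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Xc, yc)
print("LogisticRegression accuracy:", clf.score(Xc, yc))
```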
## Support Vector Machines (SVM)
- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection
**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)
**Performance tip**: Kernel SVMs don't scale well beyond tens of thousands of samples; for larger datasets where a linear kernel suffices, use LinearSVC.
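A small sketch of both variants, assuming features are standardized first (scaling matters for kernel SVMs); the data and the C/gamma values are placeholders:

```python
# Minimal sketch: RBF-kernel SVC in a scaling pipeline vs. LinearSVC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Non-linear decision boundary: C and gamma are the main knobs to tune.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
rbf_svm.fit(X_train, y_train)
print("SVC (RBF):", rbf_svm.score(X_test, y_test))

# Linear kernel via liblinear: scales better to large sample counts.
lin_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
lin_svm.fit(X_train, y_train)
print("LinearSVC:", lin_svm.score(X_test, y_test))
```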
## Decision Trees
- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
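A compact sketch of a depth-limited tree on the iris dataset, printing the learned rules; the parameter values are illustrative:

```python
# Minimal sketch: a depth-limited decision tree and its learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth / min_samples_leaf keep the tree small and reduce overfitting.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                              criterion="gini", random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
print("Feature importances:", tree.feature_importances_)
```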
## Ensemble Methods
### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant
**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)
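A short sketch of a random forest with cross-validated accuracy and impurity-based importances; the settings are illustrative defaults, not tuned values:

```python
# Minimal sketch: random forest with parallel training and feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                n_jobs=-1, random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

forest.fit(X, y)
# Impurity-based importances, one value per feature (they sum to 1).
print(forest.feature_importances_)
```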
### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets
**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving (a HistGradientBoosting parameter; GradientBoosting uses `n_iter_no_change` and `validation_fraction`)
**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
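A minimal sketch of histogram-based boosting with early stopping on synthetic data; all hyperparameters shown are placeholders:

```python
# Minimal sketch: histogram-based gradient boosting with early stopping.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# early_stopping holds out validation_fraction of the training data and
# stops adding iterations once the validation score stops improving.
hgb = HistGradientBoostingClassifier(learning_rate=0.1,
                                     early_stopping=True,
                                     validation_fraction=0.1,
                                     random_state=0)
hgb.fit(X_train, y_train)
print("Test accuracy:", hgb.score(X_test, y_test))
print("Boosting iterations used:", hgb.n_iter_)
```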
### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting
**Use cases**: Combining many weak learners (typically decision stumps) into a strong classifier; often resistant to overfitting, though sensitive to noisy data and outliers
**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier
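A minimal sketch, assuming scikit-learn ≥ 1.2 where the parameter is named `estimator`; the stump base learner and settings are illustrative:

```python
# Minimal sketch: AdaBoost over depth-1 decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# estimator defaults to a depth-1 stump; shown explicitly here.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```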
### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
**Use cases**: Reducing variance of unstable models, parallel ensemble creation
**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether samples are drawn with replacement
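A short sketch along the same lines, again assuming the `estimator` parameter name from scikit-learn ≥ 1.2:

```python
# Minimal sketch: bagging an unstable base estimator (a fully grown tree).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, max_samples=0.8,
                        bootstrap=True, n_jobs=-1, random_state=0)
print("CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```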
### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
**Use cases**: Combining diverse models, leveraging different model strengths
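A minimal stacking sketch combining three deliberately different base models; the choice of estimators is illustrative:

```python
# Minimal sketch: stacking a linear model, a forest, and naive Bayes,
# with logistic regression as the meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```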
## Neural Networks
- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor
**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving
**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
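A minimal sketch that bakes the scaling advice into a pipeline; the layer sizes and other settings are placeholders:

```python
# Minimal sketch: MLP classifier with scaling (critical for neural networks).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 50), activation="relu",
                  solver="adam", alpha=1e-4, early_stopping=True,
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```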
## Nearest Neighbors
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids
**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
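A small sketch of distance-weighted k-NN with scaled features; k and the metric are illustrative choices:

```python
# Minimal sketch: distance-weighted k-NN with scaled features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean"),
)
print("CV accuracy:", cross_val_score(knn, X, y, cv=5).mean())
```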
## Naive Bayes
- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets
**Use cases**: Text classification, fast baseline, when features are independent, small training sets
**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities
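A toy text-classification sketch with MultinomialNB on word counts; the four documents and their labels are made up purely for illustration:

```python
# Minimal sketch: MultinomialNB on word counts (toy text classification).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money win prize", "meeting schedule agenda",
        "win a free prize now", "project agenda and notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# alpha=1.0 applies Laplace smoothing to unseen word counts.
nb = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
nb.fit(docs, labels)
print(nb.predict(["free prize meeting"]))
```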
## Linear/Quadratic Discriminant Analysis
- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
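A minimal sketch using LDA both as a classifier and as a supervised 2-D projection (iris has 3 classes, so at most 2 discriminant axes):

```python
# Minimal sketch: LDA as a classifier and as a supervised dimensionality reducer.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)           # project onto 2 discriminant axes
print("Projected shape:", X_2d.shape)     # (150, 2)
print("Training accuracy:", lda.score(X, y))
```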
## Gaussian Processes
- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level
**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
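A small sketch of GP regression on noisy 1-D data, showing the per-point uncertainty; the kernel and noise level are illustrative:

```python
# Minimal sketch: GP regression with an RBF kernel and uncertainty estimates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gpr.fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # predictive mean and std dev
print(mean, std)
```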
## Stochastic Gradient Descent
- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD
**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
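A minimal sketch of the out-of-core pattern with `partial_fit`, here simulated by splitting an in-memory array into chunks:

```python
# Minimal sketch: SGDClassifier trained incrementally with partial_fit
# (the online-learning pattern for data that does not fit in memory).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=0)
classes = np.unique(y)  # all classes must be declared up front

sgd = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    sgd.partial_fit(X_batch, y_batch, classes=classes)  # one pass per chunk

print("Training accuracy:", sgd.score(X, y))
```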
## Semi-Supervised Learning
- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)
**Use cases**: When labeled data is scarce but unlabeled data is abundant
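A minimal self-training sketch where unlabeled samples are marked with `-1`; the 10% labeled fraction is arbitrary:

```python
# Minimal sketch: self-training where unlabeled samples are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend only ~10% of the labels are known; the rest are set to -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

self_train = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
self_train.fit(X, y_partial)
print("Accuracy on true labels:", self_train.score(X, y))
```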
## Feature Selection
- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection
**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
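A short sketch of univariate and model-based selection, each wrapped in a pipeline so selection happens inside cross-validation; the feature counts are illustrative:

```python
# Minimal sketch: univariate selection and model-based selection in pipelines.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)

# Keep the 10 features with the highest ANOVA F-score, then classify.
univariate = make_pipeline(SelectKBest(f_classif, k=10),
                           LogisticRegression(max_iter=1000))
print("SelectKBest CV:", cross_val_score(univariate, X, y, cv=5).mean())

# Keep features with non-zero L1 coefficients, then classify.
model_based = make_pipeline(
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")),
    LogisticRegression(max_iter=1000),
)
print("SelectFromModel CV:", cross_val_score(model_based, X, y, cv=5).mean())
```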
## Probability Calibration
- **CalibratedClassifierCV**: Calibrate classifier probabilities
**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
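A minimal sketch calibrating GaussianNB with isotonic regression and comparing Brier scores (lower means better-calibrated probabilities); the dataset is synthetic:

```python
# Minimal sketch: calibrating naive Bayes probabilities with isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Lower Brier score = better-calibrated probabilities.
print("Raw:       ", brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]))
print("Calibrated:", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```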
## Multi-Output Methods
- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant
**Use cases**: Predicting multiple related targets simultaneously
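A minimal sketch wrapping a single-output estimator (GradientBoostingRegressor, which does not natively handle multiple targets) for a 3-target problem; the data is synthetic:

```python
# Minimal sketch: one regressor per target with MultiOutputRegressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Three regression targets per sample.
X, Y = make_regression(n_samples=500, n_features=10, n_targets=3, random_state=0)

multi = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
multi.fit(X, Y)
print("Predictions shape:", multi.predict(X[:5]).shape)  # (5, 3)
```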
## Specialized Regression
- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
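A small sketch using QuantileRegressor for a rough 5th–95th percentile interval; `alpha=0` disables the L1 penalty, and the data is synthetic:

```python
# Minimal sketch: 5th/95th percentile predictions as a rough prediction interval.
from sklearn.datasets import make_regression
from sklearn.linear_model import QuantileRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=0)

# alpha=0 disables the L1 penalty so only the pinball loss is minimized.
lower = QuantileRegressor(quantile=0.05, alpha=0).fit(X, y)
upper = QuantileRegressor(quantile=0.95, alpha=0).fit(X, y)
print("Interval for first sample:", lower.predict(X[:1]), upper.predict(X[:1]))
```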
## Algorithm Selection Guidelines
**Start with**:
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
2. **RandomForestClassifier/Regressor** for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed
**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
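As a hedged sketch of this "start simple, then escalate" advice, the snippet below compares the three suggested starting points with cross-validation on synthetic data; the models and settings are placeholders, not a prescription:

```python
# Minimal sketch of the baseline-first workflow described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           random_state=0)

candidates = {
    "baseline (logistic)": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, n_jobs=-1,
                                            random_state=0),
    "hist gradient boosting": HistGradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```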