Supervised Learning in scikit-learn
Overview
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
Linear Models
Regression
- LinearRegression: Ordinary least squares regression
- Ridge: L2-regularized regression, good for multicollinearity
- Lasso: L1-regularized regression, performs feature selection
- ElasticNet: Combined L1/L2 regularization
- LassoLars: Lasso using Least Angle Regression algorithm
- BayesianRidge: Bayesian ridge regression that estimates the regularization from the data (see ARDRegression for automatic relevance determination)
Classification
- LogisticRegression: Binary and multiclass classification
- RidgeClassifier: Ridge regression for classification
- SGDClassifier: Linear classifiers with SGD training
Use cases: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
Key parameters:
- alpha: Regularization strength (higher = more regularization)
- fit_intercept: Whether to calculate the intercept
- solver: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
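As a quick illustration, here is a minimal sketch comparing ordinary least squares with its L2- and L1-regularized variants; the synthetic dataset and alpha values are arbitrary choices, not recommendations:

```python
# Compare OLS, Ridge (L2), and Lasso (L1) on the same synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))  # held-out R^2
```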
Support Vector Machines (SVM)
- SVC: Support Vector Classification
- SVR: Support Vector Regression
- LinearSVC: Linear SVM using liblinear (faster for large datasets)
- OneClassSVM: Unsupervised outlier detection
Use cases: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
Key parameters:
- kernel: 'linear', 'poly', 'rbf', 'sigmoid'
- C: Regularization parameter (lower = more regularization)
- gamma: Kernel coefficient ('scale', 'auto', or float)
- degree: Polynomial degree (for poly kernel)
Performance tip: SVMs don't scale well beyond tens of thousands of samples; use LinearSVC for large datasets when a linear kernel suffices.
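A minimal sketch of an RBF-kernel SVC follows. The pipeline standardizes features first, since SVMs are sensitive to feature scale; the C and gamma values are illustrative defaults:

```python
# RBF-kernel SVC with feature scaling in a single pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))  # mean accuracy on the test split
```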
Decision Trees
- DecisionTreeClassifier: Classification tree
- DecisionTreeRegressor: Regression tree
- ExtraTreeClassifier/Regressor: Extremely randomized tree
Use cases: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
Key parameters:
- max_depth: Maximum tree depth (controls overfitting)
- min_samples_split: Minimum samples to split a node
- min_samples_leaf: Minimum samples in a leaf node
- max_features: Number of features to consider for splits
- criterion: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
Overfitting prevention: Limit max_depth, increase min_samples_split/min_samples_leaf, or apply cost-complexity pruning via ccp_alpha.
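A minimal sketch of the overfitting-prevention point: an unconstrained tree versus a depth-limited one under cross-validation. The depth and leaf settings are starting points, not tuned values:

```python
# Unconstrained trees tend to memorize training data; limiting growth
# usually generalizes better.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
unconstrained = DecisionTreeClassifier(random_state=0)
limited = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

print(round(cross_val_score(unconstrained, X, y, cv=5).mean(), 3))
print(round(cross_val_score(limited, X, y, cv=5).mean(), 3))
```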
Ensemble Methods
Random Forests
- RandomForestClassifier: Ensemble of decision trees
- RandomForestRegressor: Regression variant
Use cases: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
Key parameters:
- n_estimators: Number of trees (higher = better but slower)
- max_depth: Maximum tree depth
- max_features: Features per split ('sqrt', 'log2', int, float)
- bootstrap: Whether to use bootstrap samples
- n_jobs: Parallel processing (-1 uses all cores)
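A minimal sketch of a parallelized forest, including the impurity-based feature importances; n_estimators is an arbitrary illustrative value:

```python
# Random forest trained on all cores, then inspect feature importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(round(rf.score(X_test, y_test), 3))
print(rf.feature_importances_[:5])  # impurity-based importances, first 5 features
```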
Gradient Boosting
- HistGradientBoostingClassifier/Regressor: Histogram-based, fast for large datasets (>10k samples)
- GradientBoostingClassifier/Regressor: Traditional implementation, better for small datasets
Use cases: High-performance predictions, winning Kaggle competitions, structured/tabular data
Key parameters:
- n_estimators: Number of boosting stages
- learning_rate: Shrinks the contribution of each tree
- max_depth: Maximum tree depth (typically 3-8)
- subsample: Fraction of samples per tree (enables stochastic gradient boosting)
- early_stopping: Stop when the validation score stops improving
Performance tip: HistGradientBoosting can be orders of magnitude faster than the traditional implementation on large datasets.
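A minimal sketch of the histogram-based estimator with its built-in early stopping (which holds out an internal validation split); the dataset size and learning rate are illustrative:

```python
# Histogram-based gradient boosting with early stopping.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = HistGradientBoostingClassifier(learning_rate=0.1, max_iter=500,
                                    early_stopping=True, random_state=0)
gb.fit(X_train, y_train)
print(gb.n_iter_, round(gb.score(X_test, y_test), 3))  # iterations actually run, accuracy
```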
AdaBoost
- AdaBoostClassifier/Regressor: Adaptive boosting
Use cases: Boosting weak learners into a strong ensemble; often resistant to overfitting in practice, though sensitive to noisy data and outliers
Key parameters:
- estimator: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- n_estimators: Number of boosting iterations
- learning_rate: Weight applied to each classifier
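A minimal sketch using the default depth-1 decision stumps as weak learners; the iteration count and learning rate are arbitrary:

```python
# AdaBoost over decision stumps (the default base estimator).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print(round(cross_val_score(ada, X, y, cv=5).mean(), 3))
```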
Bagging
- BaggingClassifier/Regressor: Bootstrap aggregating with any base estimator
Use cases: Reducing variance of unstable models, parallel ensemble creation
Key parameters:
- estimator: Base estimator to fit
- n_estimators: Number of estimators
- max_samples: Samples to draw per estimator
- bootstrap: Whether to sample with replacement
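A minimal sketch bagging an unstable base estimator (a fully grown tree) to reduce variance. The estimator keyword assumes scikit-learn >= 1.2 (earlier versions call it base_estimator), and max_samples is illustrative:

```python
# Bootstrap aggregation over deep decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, max_samples=0.8,
                        bootstrap=True, n_jobs=-1, random_state=0)
print(round(cross_val_score(bag, X, y, cv=5).mean(), 3))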
Voting & Stacking
- VotingClassifier/Regressor: Combines different model types
- StackingClassifier/Regressor: Meta-learner trained on base predictions
Use cases: Combining diverse models, leveraging different model strengths
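A minimal stacking sketch: three deliberately different base models feed a logistic-regression meta-learner. The choice of base estimators is arbitrary; scale-sensitive models are wrapped with a scaler:

```python
# Stack a linear model, a forest, and an SVM behind a meta-learner.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", make_pipeline(StandardScaler(), SVC())),
    ],
    final_estimator=LogisticRegression(),
)
print(round(cross_val_score(stack, X, y, cv=5).mean(), 3))
```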
Neural Networks
- MLPClassifier: Multi-layer perceptron classifier
- MLPRegressor: Multi-layer perceptron regressor
Use cases: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
Key parameters:
- hidden_layer_sizes: Tuple of hidden layer sizes (e.g., (100, 50))
- activation: 'relu', 'tanh', 'logistic'
- solver: 'adam', 'lbfgs', 'sgd'
- alpha: L2 regularization term
- learning_rate: Learning rate schedule
- early_stopping: Stop when the validation score stops improving
Important: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
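A minimal sketch that bakes the scaling requirement into a pipeline; the layer sizes and other settings are illustrative starting points:

```python
# MLP behind a StandardScaler so inputs are standardized before training.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 50), activation="relu",
                  solver="adam", alpha=1e-4, early_stopping=True,
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print(round(mlp.score(X_test, y_test), 3))
```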
Nearest Neighbors
- KNeighborsClassifier/Regressor: K-nearest neighbors
- RadiusNeighborsClassifier/Regressor: Radius-based neighbors
- NearestCentroid: Classification using class centroids
Use cases: Simple baseline, irregular decision boundaries, when interpretability isn't critical
Key parameters:
- n_neighbors: Number of neighbors (typically 3-11)
- weights: 'uniform' or 'distance' (distance-weighted voting)
- metric: Distance metric ('euclidean', 'manhattan', 'minkowski')
- algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'
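A minimal sketch of distance-weighted k-NN; scaling matters here too because the model is distance-based, and k=5 is just a common default:

```python
# Distance-weighted k-NN behind a scaler.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
print(round(cross_val_score(knn, X, y, cv=5).mean(), 3))
```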
Naive Bayes
- GaussianNB: Assumes Gaussian distribution of features
- MultinomialNB: For discrete counts (text classification)
- BernoulliNB: For binary/boolean features
- CategoricalNB: For categorical features
- ComplementNB: Adapted for imbalanced datasets
Use cases: Text classification, fast baseline, when features are independent, small training sets
Key parameters:
- alpha: Smoothing parameter (Laplace/Lidstone smoothing)
- fit_prior: Whether to learn class prior probabilities
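A minimal sketch of the classic pairing: MultinomialNB on bag-of-words counts. The toy corpus is made up purely for illustration:

```python
# Text classification with counts + MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the rocket launched into orbit", "engine thrust and fuel tanks",
         "the team scored a late goal", "midfielder passes and shots on goal"]
labels = [0, 0, 1, 1]  # 0 = space, 1 = sports (toy labels)

nb = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
nb.fit(texts, labels)
print(nb.predict(["fuel for the rocket engine"]))  # expected: [0]
```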
Linear/Quadratic Discriminant Analysis
- LinearDiscriminantAnalysis: Linear decision boundary with dimensionality reduction
- QuadraticDiscriminantAnalysis: Quadratic decision boundary
Use cases: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
Gaussian Processes
- GaussianProcessClassifier: Probabilistic classification
- GaussianProcessRegressor: Probabilistic regression with uncertainty estimates
Use cases: When uncertainty quantification is important, small datasets, smooth function approximation
Key parameters:
- kernel: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- alpha: Noise level
Limitation: Doesn't scale well to large datasets (training cost grows as O(n³) in the number of samples)
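A minimal sketch of the uncertainty-quantification use case: fit a noisy sine curve and ask for per-point predictive standard deviations. The kernel and noise level are illustrative:

```python
# GP regression returning mean and uncertainty for query points.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)
mean, std = gp.predict([[2.5], [9.0]], return_std=True)
print(mean, std)  # predictive mean and standard deviation per query point
```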
Stochastic Gradient Descent
- SGDClassifier: Linear classifiers with SGD
- SGDRegressor: Linear regressors with SGD
Use cases: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
Key parameters:
- loss: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- penalty: Regularization ('l2', 'l1', 'elasticnet')
- alpha: Regularization strength
- learning_rate: Learning rate schedule
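A minimal sketch of the out-of-core pattern: partial_fit is fed the data in chunks, as you would when streaming batches that don't fit in memory (here the chunking is simulated with array_split):

```python
# Simulated out-of-core training with partial_fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4, random_state=0)

classes = np.unique(y)  # partial_fit must be told all classes up front
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
print(round(clf.score(X, y), 3))
```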
Semi-Supervised Learning
- SelfTrainingClassifier: Self-training with any base classifier
- LabelPropagation: Label propagation through graph
- LabelSpreading: Label spreading (modified label propagation)
Use cases: When labeled data is scarce but unlabeled data is abundant
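A minimal sketch of self-training where most labels are hidden; -1 marks unlabeled samples, per the scikit-learn convention, and the 70% masking rate is arbitrary:

```python
# Self-training: the base classifier labels its own confident predictions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1  # hide ~70% of the labels

self_train = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
self_train.fit(X, y_partial)
print(round(self_train.score(X, y), 3))  # scored against the true labels
```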
Feature Selection
- VarianceThreshold: Remove low-variance features
- SelectKBest: Select K highest scoring features
- SelectPercentile: Select top percentile of features
- RFE: Recursive feature elimination
- RFECV: RFE with cross-validation
- SelectFromModel: Select features based on importance
- SequentialFeatureSelector: Forward/backward feature selection
Use cases: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
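A minimal sketch of model-based selection with SelectFromModel, keeping only the features whose L1-penalized logistic coefficients survive the penalty; C=0.1 is an illustrative strength:

```python
# Select features via an L1-penalized linear model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1])  # feature count before/after
```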
Probability Calibration
- CalibratedClassifierCV: Calibrate classifier probabilities
Use cases: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
Methods:
- sigmoid: Platt scaling
- isotonic: Isotonic regression (more flexible, needs more data)
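A minimal sketch calibrating an SVM (which has no native predict_proba) into usable probabilities via cross-validated Platt scaling:

```python
# Sigmoid calibration of a linear SVM's decision scores.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = make_pipeline(StandardScaler(), LinearSVC())
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
print(calibrated.predict_proba(X_test)[:3])  # calibrated class probabilities
```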
Multi-Output Methods
- MultiOutputClassifier: Fit one classifier per target
- MultiOutputRegressor: Fit one regressor per target
- ClassifierChain: Models dependencies between targets
- RegressorChain: Regression variant
Use cases: Predicting multiple related targets simultaneously
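A minimal sketch of the one-estimator-per-target wrapper on a synthetic problem with three targets; the base regressor is an arbitrary choice:

```python
# Fit one Ridge regressor per target.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=500, n_features=10, n_targets=3, random_state=0)
multi = MultiOutputRegressor(Ridge(alpha=1.0))
multi.fit(X, y)
print(multi.predict(X[:2]).shape)  # (2, 3): one prediction per target
```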
Specialized Regression
- IsotonicRegression: Monotonic regression
- QuantileRegressor: Quantile regression for prediction intervals
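A minimal sketch of the prediction-interval idea: two quantile fits bracket a rough 90% interval. QuantileRegressor assumes scikit-learn >= 1.0, and the quantile levels are illustrative:

```python
# Bracket a rough 90% prediction interval with two quantile fits.
from sklearn.datasets import make_regression
from sklearn.linear_model import QuantileRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
lower = QuantileRegressor(quantile=0.05, alpha=0.0).fit(X, y)
upper = QuantileRegressor(quantile=0.95, alpha=0.0).fit(X, y)
print(lower.predict(X[:1]), upper.predict(X[:1]))  # interval endpoints
```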
Algorithm Selection Guidelines
Start with:
- Logistic Regression (classification) or LinearRegression/Ridge (regression) as baseline
- RandomForestClassifier/Regressor for general non-linear problems
- HistGradientBoostingClassifier/Regressor when best performance is needed
Consider dataset size:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
Consider interpretability needs:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
Consider training time:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes