# Model Evaluation and Selection in scikit-learn

## Overview

Model evaluation assesses how well models generalize to unseen data. Scikit-learn provides three main APIs for evaluation:

1. **Estimator score methods**: Built-in evaluation (accuracy for classifiers, R² for regressors)
2. **Scoring parameter**: Used in cross-validation and hyperparameter tuning
3. **Metric functions**: Specialized evaluation in `sklearn.metrics`

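As a quick illustration, here are the three APIs applied to the same fitted model (the iris dataset and logistic regression are chosen purely for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# 1. Estimator score method: accuracy for classifiers
print(model.score(X, y))

# 2. Scoring parameter: consumed by cross-validation and tuning utilities
print(cross_val_score(model, X, y, cv=5, scoring='f1_macro').mean())

# 3. Metric function: computed directly from predictions
print(f1_score(y, model.predict(X), average='macro'))
```
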
## Cross-Validation

Cross-validation evaluates model performance by splitting data into multiple train/test sets. This addresses overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data."

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Cross-Validation Strategies

#### For i.i.d. Data

**KFold**: Standard k-fold cross-validation
- Splits data into k equal folds
- Each fold used once as test set
- `n_splits`: Number of folds (typically 5 or 10)

```python
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```

**RepeatedKFold**: Repeats KFold with different randomization
- More robust estimation
- Computationally expensive

**LeaveOneOut (LOO)**: Each sample is a test set
- Maximum training data usage
- Very computationally expensive
- High variance in estimates
- Use only for small datasets (<1000 samples)

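A brief usage sketch of both strategies (model, X, y as in the earlier examples):

```python
from sklearn.model_selection import RepeatedKFold, LeaveOneOut, cross_val_score

# RepeatedKFold: 5 folds repeated 3 times -> 15 scores
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

# LeaveOneOut: one score per sample; only practical for small datasets
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
```
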
**ShuffleSplit**: Random train/test splits
- Flexible train/test sizes
- Can control number of iterations
- Good for quick evaluation

```python
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```

#### For Imbalanced Classes

**StratifiedKFold**: Preserves class proportions in each fold
- Essential for imbalanced datasets
- Default for classification in cross_val_score()

```python
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

**StratifiedShuffleSplit**: Stratified random splits

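Usage mirrors ShuffleSplit above:

```python
from sklearn.model_selection import StratifiedShuffleSplit
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```
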
#### For Grouped Data

Use when samples are not independent (e.g., multiple measurements from same subject).

**GroupKFold**: Groups don't appear in both train and test

```python
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
```

**StratifiedGroupKFold**: Combines stratification with group separation

**LeaveOneGroupOut**: Each group becomes a test set

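Usage mirrors GroupKFold, with `groups` as the same per-sample group labels:

```python
from sklearn.model_selection import LeaveOneGroupOut
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
```
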
#### For Time Series

**TimeSeriesSplit**: Expanding window approach
- Successive training sets are supersets
- Respects temporal ordering
- No data leakage from future to past

```python
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
    # Train on indices 0 to t, test on t+1 to t+k
    pass
```

### Cross-Validation Functions

**cross_val_score**: Returns array of scores
```python
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
```

**cross_validate**: Returns multiple metrics and timing
```python
from sklearn.model_selection import cross_validate

results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'f1_weighted', 'roc_auc'],
    return_train_score=True,
    return_estimator=True  # Returns fitted estimators
)
print(results['test_accuracy'])
print(results['fit_time'])
```

**cross_val_predict**: Returns predictions for model blending/visualization
```python
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
# Use for confusion matrix, error analysis, etc.
```

## Hyperparameter Tuning

### GridSearchCV
Exhaustively searches all parameter combinations.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # Use all CPU cores
    verbose=2
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Use best model
best_model = grid_search.best_estimator_
```

**When to use**:
- Small parameter spaces
- When computational resources allow
- When exhaustive search is desired

### RandomizedSearchCV
Samples parameter combinations from distributions.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # uniform(loc, scale) spans [loc, loc+scale] = [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # Number of parameter settings sampled
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```

**When to use**:
- Large parameter spaces
- When budget is limited
- Often finds good parameters faster than GridSearchCV

**Advantage**: "Budget can be chosen independent of the number of parameters and possible values"

### Successive Halving

**HalvingGridSearchCV** and **HalvingRandomSearchCV**: Tournament-style selection

**How it works**:
1. Start with many candidates, minimal resources
2. Eliminate poor performers
3. Increase resources for remaining candidates
4. Repeat until best candidates found

**When to use**:
- Large parameter spaces
- Expensive model training
- When many parameter combinations are clearly inferior

```python
from sklearn.experimental import enable_halving_search_cv  # enables the experimental halving estimators
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(
    estimator,
    param_grid,
    factor=3,  # Only 1/factor of candidates survive each round; resources grow by factor
    cv=5
)
```

## Classification Metrics

### Accuracy-Based Metrics

**Accuracy**: Proportion of correct predictions
```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```

**When to use**: Balanced datasets only
**When NOT to use**: Imbalanced datasets (misleading)

**Balanced Accuracy**: Average recall per class
```python
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_true, y_pred)
```

**When to use**: Imbalanced datasets, ensures all classes matter equally

### Precision, Recall, F-Score

**Precision**: Of predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- Answers: "How reliable are positive predictions?"

**Recall** (Sensitivity): Of actual positives, how many are predicted positive
- Formula: TP / (TP + FN)
- Answers: "How complete is positive detection?"

**F1-Score**: Harmonic mean of precision and recall
- Formula: 2 * (precision * recall) / (precision + recall)
- Balanced measure when both precision and recall are important

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='weighted'
)

# Or individually
f1 = f1_score(y_true, y_pred, average='weighted')
```

**Averaging strategies for multiclass**:
- `binary`: Binary classification only
- `micro`: Calculate globally (total TP, FP, FN)
- `macro`: Calculate per class, unweighted mean (all classes equal)
- `weighted`: Calculate per class, weighted by support (class frequency)
- `samples`: For multilabel classification

**When to use**:
- `macro`: When all classes equally important (even rare ones)
- `weighted`: When class frequency matters
- `micro`: When overall performance across all samples matters

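The averaging strategies are easiest to tell apart on a small imbalanced example (labels invented for illustration):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # ~0.67: the rare class weighs as much as the majority
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.75: per-class scores weighted by class frequency
print(f1_score(y_true, y_pred, average='micro'))     # 0.75: TP/FP/FN pooled over all samples
```
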
### Confusion Matrix

Shows true positives, false positives, true negatives, false negatives.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
```

### ROC Curve and AUC

**ROC (Receiver Operating Characteristic)**: Plot of true positive rate vs false positive rate at different thresholds

**AUC (Area Under Curve)**: Measures overall ability to discriminate between classes
- 1.0 = perfect classifier
- 0.5 = random classifier
- <0.5 = worse than random

```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Requires probability predictions
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for positive class

auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```

**Multiclass ROC**: Use `multi_class='ovr'` (one-vs-rest) or `'ovo'` (one-vs-one)

```python
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
```

### Log Loss

Measures probability calibration quality.

```python
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_proba)
```

**When to use**: When probability quality matters, not just class predictions
**Lower is better**: Perfect predictions have log loss of 0

### Classification Report

Comprehensive summary of precision, recall, f1-score per class.

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```

## Regression Metrics

### Mean Squared Error (MSE)
Average squared difference between predictions and true values.

```python
from sklearn.metrics import mean_squared_error, root_mean_squared_error

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # Root MSE (sklearn >= 1.4; older versions: mean_squared_error(..., squared=False))
```

**Characteristics**:
- Penalizes large errors heavily (squared term)
- Same units as target² (use RMSE for same units as target)
- Lower is better

### Mean Absolute Error (MAE)
Average absolute difference between predictions and true values.

```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
```

**Characteristics**:
- More robust to outliers than MSE
- Same units as target
- More interpretable
- Lower is better

**MSE vs MAE**: Use MAE when outliers shouldn't dominate the metric

### R² Score (Coefficient of Determination)
Proportion of variance explained by the model.

```python
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
```

**Interpretation**:
- 1.0 = perfect predictions
- 0.0 = model only as good as predicting the mean
- <0.0 = model worse than predicting the mean (possible!)
- Higher is better

**Note**: Can be negative for models that perform worse than predicting the mean.

### Mean Absolute Percentage Error (MAPE)
Percentage-based error metric.

```python
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)
```

**When to use**: When relative errors matter more than absolute errors
**Warning**: Undefined when true values are zero

### Median Absolute Error
Median of absolute errors (robust to outliers).

```python
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
```

### Max Error
Maximum residual error.

```python
from sklearn.metrics import max_error
max_err = max_error(y_true, y_pred)
```

**When to use**: When worst-case performance matters

## Custom Scoring Functions

Create custom scorers for GridSearchCV and cross_val_score:

```python
from sklearn.metrics import make_scorer, fbeta_score

# F2 score (weights recall higher than precision)
f2_scorer = make_scorer(fbeta_score, beta=2)

# Custom function: takes (y_true, y_pred) and returns a single float
def custom_metric(y_true, y_pred):
    # Your custom logic; plain accuracy shown as a placeholder
    return (y_true == y_pred).mean()

custom_scorer = make_scorer(custom_metric, greater_is_better=True)

# Use in cross-validation or grid search
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
```

## Scoring Parameter Options

Common scoring strings for `scoring` parameter:

**Classification**:
- `'accuracy'`, `'balanced_accuracy'`
- `'precision'`, `'recall'`, `'f1'` (add `_macro`, `_micro`, `_weighted` for multiclass)
- `'roc_auc'`, `'roc_auc_ovr'`, `'roc_auc_ovo'`
- `'neg_log_loss'` (log loss negated so that higher is better)
- `'jaccard'` (Jaccard similarity)

**Regression**:
- `'r2'`
- `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`
- `'neg_mean_absolute_error'`
- `'neg_mean_absolute_percentage_error'`
- `'neg_median_absolute_error'`

**Note**: Error metrics are negated (`neg_*`) so GridSearchCV can always maximize the score.

## Validation Strategies

### Train-Test Split
Simple single split.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # For classification with imbalanced classes
)
```

**When to use**: Large datasets, quick evaluation
**Parameters**:
- `test_size`: Proportion for test (typically 0.2-0.3)
- `stratify`: Preserves class proportions
- `random_state`: Reproducibility

### Train-Validation-Test Split
Three-way split for hyperparameter tuning.

```python
# First split: train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42
)

# Or use GridSearchCV with train+val, then evaluate on test
```

**When to use**: Model selection and final evaluation
**Strategy**:
1. Train: Model training
2. Validation: Hyperparameter tuning
3. Test: Final, unbiased evaluation (touch only once!)

### Learning Curves

Diagnose bias vs variance issues.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
```

**Interpretation**:
- Large gap between train and validation: **Overfitting** (high variance)
- Both scores low: **Underfitting** (high bias)
- Scores converging but low: Need better features or a more complex model
- Validation score still improving: More data would help

## Best Practices

### Metric Selection Guidelines

**Classification - Balanced classes**:
- Accuracy or F1-score

**Classification - Imbalanced classes**:
- Balanced accuracy
- F1-score (weighted or macro)
- ROC-AUC
- Precision-Recall curve

**Classification - Cost-sensitive**:
- Custom scorer with cost matrix (see the sketch after these lists)
- Adjust threshold on probabilities

**Regression - Typical use**:
- RMSE (sensitive to outliers)
- R² (proportion of variance explained)

**Regression - Outliers present**:
- MAE (robust to outliers)
- Median absolute error

**Regression - Percentage errors matter**:
- MAPE

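A minimal sketch of the cost-matrix idea above, assuming a binary problem where a false negative costs 5 and a false positive costs 1 (the costs and function name are invented for illustration):

```python
from sklearn.metrics import confusion_matrix, make_scorer

def negative_cost(y_true, y_pred, fn_cost=5, fp_cost=1):
    # Total misclassification cost, negated so that higher is better
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(fn * fn_cost + fp * fp_cost)

cost_scorer = make_scorer(negative_cost)

# Usable anywhere a scoring parameter is accepted, e.g.:
# GridSearchCV(model, param_grid, scoring=cost_scorer, cv=5)
```
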
### Cross-Validation Guidelines

**Number of folds**:
- 5-10 folds typical
- More folds = more computation, less variance in estimate
- LeaveOneOut only for small datasets

**Stratification**:
- Always use for classification with imbalanced classes
- Use StratifiedKFold by default for classification

**Grouping**:
- Always use when samples are not independent
- Time series: Always use TimeSeriesSplit

**Nested cross-validation** (see the sketch after this list):
- For unbiased performance estimate when doing hyperparameter tuning
- Outer loop: Performance estimation
- Inner loop: Hyperparameter selection

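A compact sketch of nested cross-validation: the grid search (inner loop) is itself evaluated by an outer cross-validation, so the reported score never uses data seen during tuning. The estimator and grid are placeholders:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}

inner_search = GridSearchCV(SVC(), param_grid, cv=5)      # inner loop: hyperparameter selection
outer_scores = cross_val_score(inner_search, X, y, cv=5)  # outer loop: performance estimation
print(outer_scores.mean())
```
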
### Avoiding Common Pitfalls

1. **Data leakage**: Fit preprocessors only on training data within each CV fold (use Pipeline!)
2. **Test set leakage**: Never use the test set for model selection
3. **Improper metric**: Use metrics appropriate for the problem (balanced_accuracy for imbalanced data)
4. **Multiple testing**: More models evaluated = higher chance of random good results
5. **Temporal leakage**: For time series, use TimeSeriesSplit, not random splits
6. **Target leakage**: Features shouldn't contain information not available at prediction time

# Pipelines and Composite Estimators in scikit-learn

## Overview
Pipelines chain multiple estimators into a single unit, ensuring proper workflow sequencing and preventing data leakage. As the documentation states: "Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification."

## Pipeline Basics

### Creating Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Method 1: List of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Method 2: Using make_pipeline (auto-generates names)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)
```

### Using Pipelines

```python
# Fit and predict like any estimator
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Access steps
pipeline.named_steps['scaler']
pipeline.steps[0]   # Returns ('scaler', StandardScaler(...))
pipeline[0]         # Returns StandardScaler(...) object
pipeline['scaler']  # Returns StandardScaler(...) object

# Get final estimator
pipeline[-1]  # Returns LogisticRegression(...) object
```

### Pipeline Rules

**All steps except the last must be transformers** (have `fit()` and `transform()` methods).

**The final step** can be (see the sketch below):
- Predictor (classifier/regressor) with `fit()` and `predict()`
- Transformer with `fit()` and `transform()`
- Any estimator with at least `fit()`

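A short sketch of the second case, a pipeline whose final step is itself a transformer; the whole pipeline then acts as a transformer (`fit_transform`) rather than a predictor:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Every step is a transformer, so the pipeline is a transformer too
transform_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10))
])
X_reduced = transform_pipe.fit_transform(X)
```
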
### Pipeline Benefits

1. **Convenience**: Single `fit()` and `predict()` call
2. **Prevents data leakage**: Ensures proper fit/transform on train/test
3. **Joint parameter selection**: Tune all steps together with GridSearchCV
4. **Reproducibility**: Encapsulates entire workflow

## Accessing and Setting Parameters

### Nested Parameters

Access step parameters using `stepname__parameter` syntax:

```python
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2 penalties
])

# Grid search over pipeline parameters
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

### Setting Parameters

```python
# Set parameters
pipeline.set_params(clf__C=10.0, scaler__with_std=False)

# Get parameters
params = pipeline.get_params()
```

## Caching Intermediate Results

Cache fitted transformers to avoid recomputation:

```python
from tempfile import mkdtemp
from shutil import rmtree

# Create cache directory
cachedir = mkdtemp()

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', LogisticRegression())
], memory=cachedir)

# During grid search, identical scaler/PCA fits are reused instead of recomputed
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Clean up cache
rmtree(cachedir)

# Or use joblib for persistent caching
from joblib import Memory
memory = Memory(location='./cache', verbose=0)
pipeline = Pipeline([...], memory=memory)
```

**When to use caching**:
- Expensive transformations (PCA, feature selection)
- Grid search over final estimator parameters only
- Multiple experiments with same preprocessing

## ColumnTransformer

Apply different transformations to different columns (essential for heterogeneous data).

### Basic Usage

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define which transformations for which columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('cat', OneHotEncoder(), ['country', 'occupation'])
    ],
    remainder='drop'  # What to do with remaining columns
)

X_transformed = preprocessor.fit_transform(X)
```

### Column Selection Methods

```python
# Method 1: Column names (list of strings)
('num', StandardScaler(), ['age', 'income'])

# Method 2: Column indices (list of integers)
('num', StandardScaler(), [0, 1, 2])

# Method 3: Boolean mask
('num', StandardScaler(), [True, True, False, True, False])

# Method 4: Slice
('num', StandardScaler(), slice(0, 3))

# Method 5: make_column_selector (by dtype or pattern)
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), selector(dtype_include='number')),
    ('cat', OneHotEncoder(), selector(dtype_include='object'))
])

# Select by pattern
selector(pattern='.*_score$')  # All columns ending with '_score'
```

### Remainder Parameter

Controls what happens to columns not specified:

```python
# Drop remaining columns (default)
remainder='drop'

# Pass through remaining columns unchanged
remainder='passthrough'

# Apply transformer to remaining columns
remainder=StandardScaler()
```

### Full Pipeline with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Separate preprocessing for numeric and categorical
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Grid search over preprocessing and model parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__onehot__max_categories': [10, 20, None],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

## FeatureUnion

Combine multiple transformer outputs by concatenating features side-by-side.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA and feature selection
combined_features = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('univ_select', SelectKBest(k=5))
])

X_features = combined_features.fit_transform(X, y)
# Result: 15 features (10 from PCA + 5 from SelectKBest)

# In a pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression())
])
```

### FeatureUnion with Transformers on Different Data

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def get_numeric_data(X):
    return X[:, :3]  # First 3 columns

def get_text_data(X):
    return X[:, 3]  # 4th column (text)

combined = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(get_numeric_data)),
        ('scaler', StandardScaler())
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(get_text_data)),
        ('tfidf', TfidfVectorizer())
    ]))
])
```

**Note**: ColumnTransformer is usually more convenient than FeatureUnion for heterogeneous data.

## Common Pipeline Patterns

### Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', SVC(kernel='rbf'))
])
```

### Regression Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0))
])
```

### Text Classification Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', MultinomialNB())
])

# Works directly with text
pipeline.fit(X_train_text, y_train)
y_pred = pipeline.predict(X_test_text)
```

### Image Processing Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50)))
])
```

### Dimensionality Reduction + Clustering

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('kmeans', KMeans(n_clusters=5))
])

labels = pipeline.fit_predict(X)
```

## Custom Transformers

### Using FunctionTransformer

```python
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
import numpy as np

# Log transformation
log_transformer = FunctionTransformer(np.log1p)

# Custom function: takes an array, returns the transformed array
def custom_transform(X):
    # Your transformation logic; clipping negatives shown as a placeholder
    return np.clip(X, 0, None)

custom_transformer = FunctionTransformer(custom_transform)

# In pipeline
pipeline = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
```

### Creating Custom Transformer Class

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1.0):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X
        self.learned_param_ = X.mean()  # Example
        return self

    def transform(self, X):
        # Transform X using learned parameters
        return X * self.parameter - self.learned_param_

    # Optional: for pipelines that need inverse transform
    def inverse_transform(self, X):
        return (X + self.learned_param_) / self.parameter

# Use in pipeline
pipeline = Pipeline([
    ('custom', CustomTransformer(parameter=2.0)),
    ('model', LinearRegression())
])
```

**Key requirements**:
- Inherit from `BaseEstimator` and `TransformerMixin`
- Implement `fit()` and `transform()` methods
- `fit()` must return `self`
- Use trailing underscore for learned attributes (`learned_param_`)
- Constructor parameters should be stored as attributes

### Transformer for Pandas DataFrames

```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            if self.columns:
                return X[self.columns].values
            return X.values
        return X
```

## Visualization

### Display Pipeline in Jupyter

```python
from sklearn import set_config

# Enable HTML display
set_config(display='diagram')

# Now displaying the pipeline shows an interactive diagram
pipeline
```

### Print Pipeline Structure

```python
from sklearn.utils import estimator_html_repr

# Get HTML representation
html = estimator_html_repr(pipeline)

# Or just print
print(pipeline)
```

## Advanced Patterns

### Conditional Transformations

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def conditional_scale(X, scale=True):
    # Note: this refits the scaler on whatever data it receives -- fine for
    # illustration, but prefer a proper transformer step in real workflows
    if scale:
        return StandardScaler().fit_transform(X)
    return X

pipeline = Pipeline([
    ('conditional_scaler', FunctionTransformer(
        conditional_scale,
        kw_args={'scale': True}
    )),
    ('model', LogisticRegression())
])
```

### Multiple Preprocessing Paths

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Different preprocessing for different feature types
preprocessor = ColumnTransformer([
    # Numeric: impute + scale
    ('num_standard', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'income']),

    # Numeric: impute + log + scale
    ('num_skewed', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log', FunctionTransformer(np.log1p)),
        ('scaler', StandardScaler())
    ]), ['price', 'revenue']),

    # Categorical: impute + one-hot
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), ['category', 'region']),

    # Text: TF-IDF (note: a single column name, not a list, for text columns)
    ('text', TfidfVectorizer(), 'description')
])
```

### Feature Engineering Pipeline

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Expects a pandas DataFrame
        X = X.copy()
        # Add engineered features
        X['age_income_ratio'] = X['age'] / (X['income'] + 1)
        X['total_score'] = X['score1'] + X['score2'] + X['score3']
        return X

# 'preprocessor' is the ColumnTransformer defined above
pipeline = Pipeline([
    ('engineer', FeatureEngineer()),
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
```

## Best Practices

### Always Use Pipelines When

1. **Preprocessing is needed**: Scaling, encoding, imputation
2. **Cross-validation**: Ensures proper fit/transform split
3. **Hyperparameter tuning**: Joint optimization of preprocessing and model
4. **Production deployment**: Single object to serialize
5. **Multiple steps**: Any workflow with >1 step

### Pipeline Do's

- ✅ Fit pipeline only on training data
- ✅ Use ColumnTransformer for heterogeneous data
- ✅ Cache expensive transformations during grid search
- ✅ Use make_pipeline for simple cases
- ✅ Set verbose=True to debug issues
- ✅ Use remainder='passthrough' when appropriate

### Pipeline Don'ts

- ❌ Fit preprocessing on the full dataset before splitting (data leakage!)
- ❌ Manually transform test data (use pipeline.predict())
- ❌ Forget to handle missing values before scaling
- ❌ Mix pandas DataFrames and arrays inconsistently
- ❌ Skip using pipelines for "just one preprocessing step"

### Data Leakage Prevention

```python
# ❌ WRONG - Data leakage
scaler = StandardScaler().fit(X)  # Fit on all data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ CORRECT - No leakage with pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)     # Scaler fits only on train
y_pred = pipeline.predict(X_test)  # Train-fitted scaler transforms test data

# ✅ CORRECT - No leakage in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold: scaler fits on train folds, transforms the test fold
```

### Debugging Pipelines

```python
# Examine intermediate outputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('model', LogisticRegression())
])

# Fit pipeline
pipeline.fit(X_train, y_train)

# Get output after scaling
X_scaled = pipeline.named_steps['scaler'].transform(X_train)

# Get output after PCA
X_pca = pipeline[:-1].transform(X_train)  # All steps except last

# Or build a partial pipeline from the already-fitted steps
partial_pipeline = Pipeline(pipeline.steps[:-1])
X_transformed = partial_pipeline.transform(X_train)
```

### Saving and Loading Pipelines

```python
import joblib

# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Load pipeline
pipeline = joblib.load('model_pipeline.pkl')

# Use loaded pipeline
y_pred = pipeline.predict(X_new)
```

## Common Errors and Solutions

**Error**: `ValueError: could not convert string to float`
- **Cause**: Categorical features not encoded
- **Solution**: Add OneHotEncoder or OrdinalEncoder to pipeline

**Error**: `All intermediate steps should be transformers`
- **Cause**: Non-transformer in non-final position
- **Solution**: Ensure only the last step is a predictor

**Error**: `X has different number of features than during fitting`
- **Cause**: Different columns in train and test
- **Solution**: Ensure consistent column handling, use `handle_unknown='ignore'` in OneHotEncoder

**Symptom**: Different results in cross-validation vs train-test split
- **Cause**: Data leakage (fitting preprocessing on all data)
- **Solution**: Always use Pipeline for preprocessing

**Symptom**: Pipeline too slow during grid search
- **Solution**: Use caching with the `memory` parameter

# Data Preprocessing in scikit-learn

## Overview
Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.

## Standardization and Scaling

### StandardScaler
Removes mean and scales to unit variance (z-score normalization).

**Formula**: `z = (x - μ) / σ`

**Use cases**:
- Most ML algorithms (especially SVM, neural networks, PCA)
- When features have different units or scales
- When assuming Gaussian-like distribution

**Important**: Fit only on training data, then transform both train and test sets.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters
```

### MinMaxScaler
Scales features to a specified range, typically [0, 1].

**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`

**Use cases**:
- When a bounded range is needed
- Neural networks (often prefer [0, 1] range)
- When the distribution is not Gaussian
- Image pixel values

**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)

**Warning**: Sensitive to outliers since it uses min/max.

### MaxAbsScaler
Scales to [-1, 1] by dividing by the maximum absolute value.

**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When the sign of values is meaningful

**Advantage**: Doesn't shift/center the data, preserves zero entries.

### RobustScaler
Uses median and interquartile range (IQR) instead of mean and standard deviation.

**Formula**: `X_scaled = (X - median) / IQR`

**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- When robust statistics are preferred

**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)

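All three scalers above share the same fit/transform API as StandardScaler; a minimal side-by-side sketch:

```python
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)
X_maxabs = MaxAbsScaler().fit_transform(X_train)
X_robust = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X_train)
```
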
## Normalization

### normalize() function and Normalizer
Scales individual samples (rows) to unit norm, not features (columns).

**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight

**Norms**:
- `l1`: Manhattan norm (sum of absolute values = 1)
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
- `max`: Maximum absolute value = 1

**Key difference from scalers**: Operates on rows (samples), not columns (features).

```python
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2')
X_normalized = normalizer.fit_transform(X)  # Stateless: fit learns nothing, but fit_transform keeps the API uniform
```

## Encoding Categorical Features

### OrdinalEncoder
Converts categories to integers (0 to n_categories - 1).

**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integers)

**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data

```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)
```

### OneHotEncoder
Creates binary columns for each category.

**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When category relationships shouldn't be assumed

**Parameters**:
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit number of categories

**High cardinality handling**:
```python
encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
# Groups categories appearing < 100 times into an 'infrequent' category
```

**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.

### TargetEncoder
Uses target statistics to encode categories.

**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When linear relationships with target are expected
- Often improves performance over one-hot encoding

**How it works**:
- Replaces each category with the mean of the target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories

**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy

**Warning**: Only for supervised learning. Requires the target variable.

```python
from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X_categorical, y)
```

### LabelEncoder
Encodes target labels into integers 0 to n_classes - 1.

**Use cases**: Encoding the target variable for classification (not features!)

**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.

### Binarizer
Converts numeric values to binary (0 or 1) based on a threshold.

**Use cases**: Creating binary features from continuous values

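A short sketch of both (outputs shown as comments):

```python
from sklearn.preprocessing import LabelEncoder, Binarizer

le = LabelEncoder()
y_encoded = le.fit_transform(['cat', 'dog', 'cat', 'bird'])  # [1, 2, 1, 0] (classes sorted alphabetically)

binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform([[0.2], [0.7], [0.5]])    # [[0.], [1.], [0.]] (only values > threshold become 1)
```
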
## Non-linear Transformations

### QuantileTransformer
Maps features to a uniform or normal distribution using rank transformation.

**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When a normal distribution is desired

**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))

**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.

### PowerTransformer
Applies a parametric monotonic transformation to make data more Gaussian.

**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Only positive values

**Use cases**:
- Skewed distributions
- When the Gaussian assumption is important
- Variance stabilization

**Advantage**: Less radical than QuantileTransformer, preserves more of the original relationships.

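A minimal sketch of both transformers:

```python
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_quantile = qt.fit_transform(X_train)

pt = PowerTransformer(method='yeo-johnson')  # handles negative values, unlike box-cox
X_power = pt.fit_transform(X_train)
```
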
## Discretization

### KBinsDiscretizer
Bins continuous features into discrete intervals.

**Strategies**:
- `uniform`: Equal-width bins
- `quantile`: Equal-frequency bins
- `kmeans`: K-means clustering to determine bins

**Encoding**:
- `ordinal`: Integer encoding (0 to n_bins - 1)
- `onehot`: One-hot encoding (sparse)
- `onehot-dense`: Dense one-hot encoding

**Use cases**:
- Making linear models handle non-linear relationships
- Reducing noise in features
- Making features more interpretable

```python
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = disc.fit_transform(X)
```

## Feature Generation

### PolynomialFeatures
Generates polynomial and interaction features.

**Parameters**:
- `degree`: Polynomial degree
- `interaction_only`: Only multiplicative interactions (no x²)
- `include_bias`: Include constant feature

**Use cases**:
- Adding non-linearity to linear models
- Feature engineering
- Polynomial regression

**Warning**: The number of output features grows combinatorially: for n input features and degree d there are (n+d)!/(d!·n!) features (including the bias term).

```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```

### SplineTransformer
Generates B-spline basis functions.

**Use cases**:
- Smooth non-linear transformations
- Alternative to PolynomialFeatures (less oscillation at the boundaries)
- Generalized additive models (GAMs)

**Parameters**:
- `n_knots`: Number of knots
- `degree`: Spline degree
- `knots`: Knot positions ('uniform', 'quantile', or array)

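A minimal usage sketch:

```python
from sklearn.preprocessing import SplineTransformer

spline = SplineTransformer(n_knots=5, degree=3, knots='quantile')
X_splines = spline.fit_transform(X_train)
```
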
## Missing Value Handling

### SimpleImputer
Imputes missing values with various strategies.

**Strategies**:
- `mean`: Mean of column (numeric only)
- `median`: Median of column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with a constant value

**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing (np.nan, None, specific value)

```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
```

### KNNImputer
Imputes using k-nearest neighbors.

**Use cases**: When relationships between features should inform imputation

**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'

### IterativeImputer
Models each feature with missing values as a function of the other features.

**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher quality imputation (but slower)

**Parameters**:
- `estimator`: Estimator for regression (default: BayesianRidge)
- `max_iter`: Maximum iterations

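A minimal sketch of both imputers; note that IterativeImputer is experimental and must be enabled explicitly:

```python
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # exposes IterativeImputer
from sklearn.impute import IterativeImputer

X_knn = KNNImputer(n_neighbors=5, weights='distance').fit_transform(X)
X_iter = IterativeImputer(max_iter=10, random_state=42).fit_transform(X)
```
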
## Function Transformers

### FunctionTransformer
Applies a custom function to the data.

**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.transform(X)
```

## Best Practices

### Feature Scaling Guidelines

**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/Logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms

**Don't need to scale**:
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes

### Pipeline Integration

Always use preprocessing within pipelines to prevent data leakage:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)     # Scaler fit only on train data
y_pred = pipeline.predict(X_test)  # Train-fitted scaler transforms test data
```

### Common Transformations by Data Type

**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)

**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation

**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder

**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with the min_frequency parameter

**Categorical - Ordinal**:
- OrdinalEncoder

**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization

### Data Leakage Prevention

1. **Fit only on training data**: Never include test data when fitting preprocessors
2. **Use pipelines**: Ensures proper fit/transform separation
3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
4. **Target encoding**: Use the cv parameter in TargetEncoder for cross-fitting

```python
# WRONG - data leakage
scaler = StandardScaler().fit(X_full)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# CORRECT - no leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```


## Preprocessing Checklist

Before modeling:
1. Handle missing values (imputation or removal)
2. Encode categorical variables appropriately
3. Scale/normalize numeric features (if needed for algorithm)
4. Handle outliers (RobustScaler, clipping, removal)
5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
6. Check for data leakage in preprocessing steps
7. Wrap everything in a Pipeline (see the sketch below)
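
A sketch tying the checklist together (the column names are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric_features = ['age', 'income']   # hypothetical columns
categorical_features = ['country']     # hypothetical column

preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),    # step 1
                      ('scale', RobustScaler())]),                     # steps 3-4
     numeric_features),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]),  # step 2
     categorical_features),
])

# Step 7: one Pipeline, so fit/transform stay leak-free (step 6)
model = Pipeline([('prep', preprocessor), ('clf', LogisticRegression(max_iter=1000))])
```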

# Scikit-learn Quick Reference

## Essential Imports

```python
# Core
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer

# Models - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor
)

# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Dimensionality Reduction
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.manifold import TSNE

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, r2_score, mean_absolute_error
)
```

## Basic Workflow Template

### Classification

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (optional for Random Forests; kept for a generic template)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
```

### Regression

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
# (sklearn < 1.4: mean_squared_error(y_test, y_pred, squared=False))
y_pred = model.predict(X_test_scaled)
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```

### With Pipeline (Recommended)

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Test accuracy: {score:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

## Common Preprocessing Patterns

### Numeric Data

```python
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
```

### Categorical Data

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
```

### Mixed Data with ColumnTransformer

```python
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
```

## Model Selection Cheat Sheet

### Quick Decision Tree

```
Is it supervised?
├─ Yes
│  ├─ Predicting categories? → Classification
│  │  ├─ Start with: LogisticRegression (baseline)
│  │  ├─ Then try: RandomForestClassifier
│  │  └─ Best performance: HistGradientBoostingClassifier
│  └─ Predicting numbers? → Regression
│     ├─ Start with: LinearRegression/Ridge (baseline)
│     ├─ Then try: RandomForestRegressor
│     └─ Best performance: HistGradientBoostingRegressor
└─ No
   ├─ Grouping similar items? → Clustering
   │  ├─ Know # clusters: KMeans
   │  └─ Unknown # clusters: DBSCAN or HDBSCAN
   ├─ Reducing dimensions?
   │  ├─ For preprocessing: PCA
   │  └─ For visualization: t-SNE or UMAP
   └─ Finding outliers? → IsolationForest or LocalOutlierFactor
```

### Algorithm Selection by Data Size

- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC

### When to Scale Features

**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm

**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes

## Hyperparameter Tuning

### GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
```

### RandomizedSearchCV (Faster)

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # Number of combinations to try
    cv=5,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```

### Pipeline with GridSearchCV

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# The step-name prefix ('svm__') targets parameters of that pipeline step
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto']
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
```

## Cross-Validation

### Basic Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Multiple Metrics

```python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

### Custom CV Strategies

```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# For time series
cv = TimeSeriesSplit(n_splits=5)

scores = cross_val_score(model, X, y, cv=cv)
```

## Common Metrics

### Classification

```python
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score
)

# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# Comprehensive report
print(classification_report(y_true, y_pred))

# ROC AUC (requires probabilities)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_true, y_proba)
```

### Regression

```python
from sklearn.metrics import (
    mean_squared_error,
    root_mean_squared_error,  # sklearn >= 1.4
    mean_absolute_error,
    r2_score
)

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)  # older sklearn: mean_squared_error(..., squared=False)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
```

## Feature Engineering

### Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
```

### Feature Selection

```python
from sklearn.feature_selection import (
    SelectKBest, f_classif,
    RFE,
    SelectFromModel
)

# Univariate selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive feature elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)

# Model-based selection
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'
)
X_selected = selector.fit_transform(X, y)
```

### Feature Importance

```python
import numpy as np
import matplotlib.pyplot as plt

# Tree-based models
model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_

# Visualize (feature_names should be an array of column names)
indices = np.argsort(importances)[::-1]
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.show()

# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
```

## Clustering

### K-Means

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for k-means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Evaluate
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")
```

### Elbow Method

```python
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
```

### DBSCAN

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# -1 indicates noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```

## Dimensionality Reduction

### PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify n_components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Or specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Components needed: {pca.n_components_}")
```

### t-SNE (Visualization Only)

```python
from sklearn.manifold import TSNE

# Reduce to 50 dimensions with PCA first (recommended)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.show()
```

## Saving and Loading Models

```python
import joblib

# Save model
joblib.dump(model, 'model.pkl')

# Save pipeline
joblib.dump(pipeline, 'pipeline.pkl')

# Load
model = joblib.load('model.pkl')
pipeline = joblib.load('pipeline.pkl')

# Use loaded model
y_pred = model.predict(X_new)
```

## Common Pitfalls and Solutions

### Data Leakage
❌ **Wrong**: Fit on all data before split
```python
scaler = StandardScaler().fit(X)
X_train, X_test = train_test_split(scaler.transform(X))
```

✅ **Correct**: Use pipeline or fit only on train
```python
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
pipeline.fit(X_train, y_train)
```

### Not Scaling
❌ **Wrong**: Using SVM without scaling
```python
svm = SVC()
svm.fit(X_train, y_train)
```

✅ **Correct**: Scale for SVM
```python
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
pipeline.fit(X_train, y_train)
```

### Wrong Metric for Imbalanced Data
❌ **Wrong**: Using accuracy for 99:1 imbalance
```python
accuracy = accuracy_score(y_true, y_pred)  # Can be misleading
```

✅ **Correct**: Use appropriate metrics
```python
f1 = f1_score(y_true, y_pred, average='weighted')
balanced_acc = balanced_accuracy_score(y_true, y_pred)
```

### Not Using Stratification
❌ **Wrong**: Random split for imbalanced data
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

✅ **Correct**: Stratify for imbalanced classes
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)
```

## Performance Tips

1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
2. **Use HistGradientBoosting** for large datasets (>10K samples)
3. **Use MiniBatchKMeans** for large clustering tasks
4. **Use IncrementalPCA** for data that doesn't fit in memory
5. **Use sparse matrices** for high-dimensional sparse data (text)
6. **Cache transformers** in pipelines during grid search (see the sketch below)
7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
8. **Reduce dimensionality** with PCA before applying expensive algorithms
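
A sketch of tip 6 — `Pipeline`'s `memory` parameter caches fitted transformers so grid search doesn't refit the preprocessor for every parameter combination (`memory` accepts a directory path or a `joblib.Memory` object):

```python
from tempfile import mkdtemp
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cache_dir = mkdtemp()  # fitted transformers are cached here and reused
pipeline = Pipeline(
    [('scaler', StandardScaler()), ('svm', SVC())],
    memory=cache_dir,
)

param_grid = {'svm__C': [0.1, 1, 10]}  # only the SVM step varies between fits
grid = GridSearchCV(pipeline, param_grid, cv=5)
```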

# Supervised Learning in scikit-learn

## Overview
Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.

## Linear Models

### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination

### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training

**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected

**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
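
A quick sketch of the regularization knob on synthetic data (the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives some coefficients to exactly zero
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))          # zeros show the implicit feature selection
```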

## Support Vector Machines (SVM)

- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection

**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists

**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)

**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets when a linear kernel suffices.

## Decision Trees

- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree

**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types

**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)

**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
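
A minimal pruning sketch: `cost_complexity_pruning_path` enumerates candidate `ccp_alpha` values, and larger alphas yield smaller trees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate pruning strengths computed from the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::10]:  # sample a few alphas
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test acc={tree.score(X_test, y_test):.3f}")
```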

## Ensemble Methods

### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant

**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships

**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)

### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets

**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data

**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving

**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
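
A minimal HistGradientBoosting sketch with early stopping (it bins features internally, so no scaling is needed, and it handles NaN values natively):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=500,           # boosting stages (n_estimators equivalent)
    early_stopping=True,    # holds out a validation fraction internally
    random_state=42,
)
clf.fit(X_train, y_train)
print(f"stages used: {clf.n_iter_}, test acc: {clf.score(X_test, y_test):.3f}")
```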

### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting

**Use cases**: Boosting weak learners, less prone to overfitting than other methods

**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier

### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator

**Use cases**: Reducing variance of unstable models, parallel ensemble creation

**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether to use replacement

### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions

**Use cases**: Combining diverse models, leveraging different model strengths
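
A minimal sketch combining diverse base models (soft voting averages predicted probabilities; stacking fits a meta-learner on cross-validated base predictions):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

base = [('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('nb', GaussianNB())]

voter = VotingClassifier(estimators=base, voting='soft')  # averages predict_proba
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression())

# Both expose the usual API:
# voter.fit(X_train, y_train); stacker.fit(X_train, y_train)
```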

## Neural Networks

- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor

**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning

**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving

**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
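
A minimal sketch that bakes the required scaling into a pipeline:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mlp = Pipeline([
    ('scaler', StandardScaler()),  # critical: unscaled inputs stall gradient descent
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50),
                          alpha=1e-4,
                          early_stopping=True,
                          random_state=42)),
])
# mlp.fit(X_train, y_train); mlp.score(X_test, y_test)
```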

## Nearest Neighbors

- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids

**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical

**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'

## Naive Bayes

- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets

**Use cases**: Text classification, fast baseline, when features are independent, small training sets

**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities

## Linear/Quadratic Discriminant Analysis

- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary

**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold

## Gaussian Processes

- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates

**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation

**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level

**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)

## Stochastic Gradient Descent

- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD

**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory

**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
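
A minimal out-of-core sketch: `partial_fit` updates the model one mini-batch at a time, so the full dataset never has to sit in memory (the batch generator here is a stand-in for reading from disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', alpha=1e-4, random_state=42)
classes = np.array([0, 1])  # must be declared up front for partial_fit

rng = np.random.default_rng(0)
for _ in range(100):  # stand-in for streaming batches
    X_batch = rng.normal(size=(256, 20))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```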

## Semi-Supervised Learning

- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)

**Use cases**: When labeled data is scarce but unlabeled data is abundant

## Feature Selection

- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection

**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting

## Probability Calibration

- **CalibratedClassifierCV**: Calibrate classifier probabilities

**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes

**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
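
A minimal calibration sketch wrapping an SVM, whose raw decision scores are not probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

base = LinearSVC()  # has no predict_proba of its own
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
# calibrated.fit(X_train, y_train)
# calibrated.predict_proba(X_test)  # now returns calibrated probabilities
```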

## Multi-Output Methods

- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant

**Use cases**: Predicting multiple related targets simultaneously
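
A minimal sketch; `Y` is 2-D with one column per target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
Y = np.column_stack([(X[:, 0] > 0), (X[:, 1] > 0)]).astype(int)  # two binary targets

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:3]))  # shape (3, 2): one prediction per target
```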

## Specialized Regression

- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals

## Algorithm Selection Guidelines

**Start with**:
1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
2. **RandomForestClassifier/Regressor** for general non-linear problems
3. **HistGradientBoostingClassifier/Regressor** when best performance is needed

**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC

**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel

**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes

# Unsupervised Learning in scikit-learn

## Overview
Unsupervised learning discovers patterns in data without labeled targets. Main tasks include clustering (grouping similar samples), dimensionality reduction (reducing feature count), and anomaly detection (finding outliers).

## Clustering Algorithms

### K-Means

Groups data into k clusters by minimizing within-cluster variance.

**Algorithm**:
1. Initialize k centroids (k-means++ initialization recommended)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # Smart initialization
    n_init=10,  # Number of times to run with different seeds
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```

**Use cases**:
- Customer segmentation
- Image compression
- Data preprocessing (clustering as features)

**Strengths**:
- Fast and scalable
- Simple to understand
- Works well with spherical clusters

**Limitations**:
- Assumes spherical clusters of similar size
- Sensitive to initialization (mitigated by k-means++)
- Must specify k beforehand
- Sensitive to outliers

**Choosing k**: Use elbow method, silhouette score, or domain knowledge

**Variants**:
- **MiniBatchKMeans**: Faster for large datasets, uses mini-batches
- **KMeans with n_init='auto'**: Adaptive number of initializations

### DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Identifies clusters as dense regions separated by sparse areas.

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(
    eps=0.5,  # Maximum distance between neighbors
    min_samples=5,  # Minimum points to form dense region
    metric='euclidean'
)
labels = dbscan.fit_predict(X)
# -1 indicates noise/outliers
```

**Use cases**:
- Arbitrary cluster shapes
- Outlier detection
- When cluster count is unknown
- Geographic/spatial data

**Strengths**:
- Discovers arbitrary-shaped clusters
- Automatically detects outliers
- Doesn't require specifying number of clusters
- Robust to outliers

**Limitations**:
- Struggles with varying densities
- Sensitive to eps and min_samples parameters
- Not deterministic (border points may vary)

**Parameter tuning**:
- `eps`: Plot k-distance graph, look for elbow (see the sketch below)
- `min_samples`: Rule of thumb: 2 * dimensions
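
A minimal k-distance sketch for picking `eps`: sort each point's distance to its k-th nearest neighbor and look for the elbow:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th neighbor, sorted ascending; the elbow suggests eps
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.ylabel(f'distance to neighbor {k}')
plt.show()
```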

### HDBSCAN

Hierarchical DBSCAN that handles variable cluster densities (built into scikit-learn ≥ 1.3).

```python
from sklearn.cluster import HDBSCAN

hdbscan = HDBSCAN(
    min_cluster_size=5,
    min_samples=None,  # Defaults to min_cluster_size
    metric='euclidean'
)
labels = hdbscan.fit_predict(X)
```

**Advantages over DBSCAN**:
- Handles variable density clusters
- More robust parameter selection
- Provides cluster membership probabilities
- Hierarchical structure

**Use cases**: When DBSCAN struggles with varying densities

### Hierarchical Clustering

Builds nested cluster hierarchies using agglomerative (bottom-up) approach.

```python
from sklearn.cluster import AgglomerativeClustering

agg_clust = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',  # 'ward', 'complete', 'average', 'single'
    metric='euclidean'
)
labels = agg_clust.fit_predict(X)

# Visualize with dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage as scipy_linkage
import matplotlib.pyplot as plt

linkage_matrix = scipy_linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
```

**Linkage methods**:
- `ward`: Minimizes variance (only with Euclidean) - **most common**
- `complete`: Maximum distance between clusters
- `average`: Average distance between clusters
- `single`: Minimum distance between clusters

**Use cases**:
- When hierarchical structure is meaningful
- Taxonomy/phylogenetic trees
- When visualization is important (dendrograms)

**Strengths**:
- No need to specify k initially (cut dendrogram at desired level)
- Produces hierarchy of clusters
- Deterministic

**Limitations**:
- Computationally expensive (O(n²) to O(n³))
- Not suitable for large datasets
- Cannot undo previous merges

### Spectral Clustering

Performs dimensionality reduction using affinity matrix before clustering.

```python
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(
    n_clusters=3,
    affinity='rbf',  # 'rbf', 'nearest_neighbors', 'precomputed'
    gamma=1.0,
    n_neighbors=10,
    random_state=42
)
labels = spectral.fit_predict(X)
```

**Use cases**:
- Non-convex clusters
- Image segmentation
- Graph clustering
- When similarity matrix is available

**Strengths**:
- Handles non-convex clusters
- Works with similarity matrices
- Often better than k-means for complex shapes

**Limitations**:
- Computationally expensive
- Requires specifying number of clusters
- Memory intensive

### Mean Shift

Discovers clusters through iterative centroid updates based on density.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)
cluster_centers = mean_shift.cluster_centers_
```

**Use cases**:
- When cluster count is unknown
- Computer vision applications
- Object tracking

**Strengths**:
- Automatically determines number of clusters
- Handles arbitrary shapes
- No assumptions about cluster shape

**Limitations**:
- Computationally expensive
- Very sensitive to bandwidth parameter
- Doesn't scale well

### Affinity Propagation

Uses message-passing between samples to identify exemplars.

```python
from sklearn.cluster import AffinityPropagation

affinity_prop = AffinityPropagation(
    damping=0.5,  # Damping factor (0.5-1.0)
    preference=None,  # Self-preference (controls number of clusters)
    random_state=42
)
labels = affinity_prop.fit_predict(X)
exemplars = affinity_prop.cluster_centers_indices_
```

**Use cases**:
- When number of clusters is unknown
- When exemplars (representative samples) are needed

**Strengths**:
- Automatically determines number of clusters
- Identifies exemplar samples
- No initialization required

**Limitations**:
- Very slow: O(n²t) where t is iterations
- Not suitable for large datasets
- Memory intensive

### Gaussian Mixture Models (GMM)

Probabilistic model assuming data comes from a mixture of Gaussian distributions.

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X)  # Soft clustering
```

**Covariance types**:
- `full`: Each component has its own covariance matrix
- `tied`: All components share same covariance
- `diag`: Diagonal covariance (independent features)
- `spherical`: Spherical covariance (isotropic)

**Use cases**:
- When soft clustering is needed (probabilities)
- When clusters have different shapes/sizes
- Generative modeling
- Density estimation

**Strengths**:
- Provides probabilities (soft clustering)
- Can handle elliptical clusters
- Generative model (can sample new data)
- Model selection with BIC/AIC

**Limitations**:
- Assumes Gaussian distributions
- Sensitive to initialization
- Can converge to local optima

**Model selection**:
```python
from sklearn.mixture import GaussianMixture
import numpy as np

n_components_range = range(2, 10)
bic_scores = []

for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bic_scores.append(gmm.bic(X))

# Lower BIC indicates a better fit-complexity trade-off
optimal_n = n_components_range[np.argmin(bic_scores)]
```

### BIRCH

Builds Clustering Feature Tree for memory-efficient processing of large datasets.

```python
from sklearn.cluster import Birch

birch = Birch(
    n_clusters=3,
    threshold=0.5,
    branching_factor=50
)
labels = birch.fit_predict(X)
```

**Use cases**:
- Very large datasets
- Streaming data
- Memory constraints

**Strengths**:
- Memory efficient
- Single pass over data
- Incremental learning

## Dimensionality Reduction

### Principal Component Analysis (PCA)

Finds orthogonal components that explain maximum variance.

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Specify number of components
pca = PCA(n_components=2, random_state=42)
X_transformed = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())

# Or specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_transformed = pca.fit_transform(X)
print(f"Components needed: {pca.n_components_}")

# Visualize explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```

**Use cases**:
- Visualization (reduce to 2-3 dimensions)
- Remove multicollinearity
- Noise reduction
- Speed up training
- Feature extraction

**Strengths**:
- Fast and efficient
- Reduces multicollinearity
- Works well for linear relationships
- Interpretable components

**Limitations**:
- Only linear transformations
- Sensitive to scaling (always standardize first!)
- Components may be hard to interpret

**Variants**:
- **IncrementalPCA**: For datasets that don't fit in memory
- **KernelPCA**: Non-linear dimensionality reduction
- **SparsePCA**: Sparse loadings for interpretability

### t-SNE

t-Distributed Stochastic Neighbor Embedding for visualization.

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,  # Balance local vs global structure (5-50)
    learning_rate='auto',
    n_iter=1000,  # renamed to max_iter in newer scikit-learn
    random_state=42
)
X_embedded = tsne.fit_transform(X)

# Visualize
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.show()
```

**Use cases**:
- Visualization only (do not use for preprocessing!)
- Exploring high-dimensional data
- Finding clusters visually

**Important notes**:
- **Only for visualization**, not for preprocessing
- Each run produces different results (use random_state for reproducibility)
- Slow for large datasets
- Cannot transform new data (no transform() method)

**Parameter tuning**:
- `perplexity`: 5-50, larger for larger datasets
- Lower perplexity = focus on local structure
- Higher perplexity = focus on global structure

### UMAP

Uniform Manifold Approximation and Projection (requires umap-learn package).

**Advantages over t-SNE**:
- Preserves global structure better
- Faster
- Can transform new data
- Can be used for preprocessing (not just visualization)
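
A minimal sketch, assuming `umap-learn` is installed (`pip install umap-learn`); unlike TSNE, the fitted object can transform unseen data:

```python
import umap  # third-party: umap-learn, not part of scikit-learn

reducer = umap.UMAP(n_components=2, n_neighbors=15, random_state=42)
X_embedded = reducer.fit_transform(X)

# Works on new data too, so it can sit in a preprocessing workflow
# X_new_embedded = reducer.transform(X_new)
```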

### Truncated SVD (LSA)

Similar to PCA but works with sparse matrices (e.g., TF-IDF).

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
```

**Use cases**:
- Text data (after TF-IDF)
- Sparse matrices
- Latent Semantic Analysis (LSA)

### Non-negative Matrix Factorization (NMF)

Factorizes data into non-negative components.

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)  # Document-topic matrix
H = nmf.components_  # Topic-word matrix
```

**Use cases**:
- Topic modeling
- Audio source separation
- Image processing
- When non-negativity is important (e.g., counts)

**Strengths**:
- Interpretable components (additive, non-negative)
- Sparse representations

### Independent Component Analysis (ICA)

Separates multivariate signal into independent components.

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=42)
X_independent = ica.fit_transform(X)
```

**Use cases**:
- Blind source separation
- Signal processing
- Feature extraction when independence is expected

### Factor Analysis

Models observed variables as linear combinations of latent factors plus noise.

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=5, random_state=42)
X_factors = fa.fit_transform(X)
```

**Use cases**:
- When noise is heteroscedastic
- Latent variable modeling
- Psychology/social science research

**Difference from PCA**: Models noise explicitly, assumes features have independent noise

## Anomaly Detection

### One-Class SVM

Learns boundary around normal data.

```python
from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(
    nu=0.1,  # Proportion of outliers expected
    kernel='rbf',
    gamma='auto'
)
oc_svm.fit(X_train)
predictions = oc_svm.predict(X_test)  # 1 for inliers, -1 for outliers
```

**Use cases**:
- Novelty detection
- When only normal data is available for training

### Isolation Forest

Isolates outliers using an ensemble of random partitioning trees: anomalies are isolated in fewer splits than normal points.

```python
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    contamination=0.1,  # Expected proportion of outliers
    random_state=42
)
predictions = iso_forest.fit_predict(X)  # 1 for inliers, -1 for outliers
scores = iso_forest.score_samples(X)  # Anomaly scores
```

**Use cases**:
- General anomaly detection
- Works well with high-dimensional data
- Fast and scalable

**Strengths**:
- Fast
- Effective in high dimensions
- Low memory requirements

### Local Outlier Factor (LOF)

Detects outliers based on local density deviation.

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
predictions = lof.fit_predict(X)  # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_  # Anomaly scores (negative)
```

**Use cases**:
- Finding local outliers
- When global methods fail

## Clustering Evaluation

### With Ground Truth Labels

When true labels are available (for validation):

**Adjusted Rand Index (ARI)**:
```python
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_true, y_pred)
# Range: [-1, 1], 1 = perfect, 0 = random
```

**Normalized Mutual Information (NMI)**:
```python
from sklearn.metrics import normalized_mutual_info_score
nmi = normalized_mutual_info_score(y_true, y_pred)
# Range: [0, 1], 1 = perfect
```

**V-Measure**:
```python
from sklearn.metrics import v_measure_score
v = v_measure_score(y_true, y_pred)
# Range: [0, 1], harmonic mean of homogeneity and completeness
```

### Without Ground Truth Labels

When true labels are unavailable (unsupervised evaluation):

**Silhouette Score**:
Measures how similar objects are to their own cluster vs other clusters.

```python
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt

score = silhouette_score(X, labels)
# Range: [-1, 1], higher is better
# >0.7: Strong structure
# 0.5-0.7: Reasonable structure
# 0.25-0.5: Weak structure
# <0.25: No substantial structure

# Per-sample scores for detailed analysis
sample_scores = silhouette_samples(X, labels)

# Visualize silhouette plot (stack clusters so bars don't overlap)
n_clusters = len(set(labels))
y_lower = 0
for i in range(n_clusters):
    cluster_scores = sample_scores[labels == i]
    cluster_scores.sort()
    plt.barh(range(y_lower, y_lower + len(cluster_scores)), cluster_scores)
    y_lower += len(cluster_scores)
plt.axvline(x=score, color='red', linestyle='--')
plt.show()
```

**Davies-Bouldin Index**:
```python
from sklearn.metrics import davies_bouldin_score
db = davies_bouldin_score(X, labels)
# Lower is better, 0 = perfect
```

**Calinski-Harabasz Index** (Variance Ratio Criterion):
```python
from sklearn.metrics import calinski_harabasz_score
ch = calinski_harabasz_score(X, labels)
# Higher is better
```

**Inertia** (K-Means specific):
```python
inertia = kmeans.inertia_
# Sum of squared distances to nearest cluster center
# Use for elbow method
```

### Elbow Method (K-Means)

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia starts decreasing more slowly
```

## Best Practices

### Clustering Algorithm Selection

**Use K-Means when**:
- Clusters are spherical and similar size
- Speed is important
- Data is not too high-dimensional

**Use DBSCAN when**:
- Arbitrary cluster shapes
- Number of clusters unknown
- Outlier detection needed

**Use Hierarchical when**:
- Hierarchy is meaningful
- Small to medium datasets
- Visualization is important

**Use GMM when**:
- Soft clustering needed
- Clusters have different shapes/sizes
- Probabilistic interpretation needed

**Use Spectral Clustering when**:
- Non-convex clusters
- A similarity matrix is available
- Moderate dataset size

### Preprocessing for Clustering

1. **Always scale features**: Use StandardScaler or MinMaxScaler
2. **Handle outliers**: Remove or use robust algorithms (DBSCAN, HDBSCAN)
3. **Reduce dimensionality if needed**: PCA for speed, careful with interpretation
4. **Check for categorical variables**: Encode appropriately or use specialized algorithms

### Dimensionality Reduction Guidelines

**For preprocessing/feature extraction**:
- PCA (linear relationships)
- TruncatedSVD (sparse data)
- NMF (non-negative data)

**For visualization only**:
- t-SNE (preserves local structure)
- UMAP (preserves both local and global structure)

**Always**:
- Standardize features before PCA
- Use appropriate n_components (elbow plot, explained variance)
- Don't use t-SNE for anything except visualization

### Common Pitfalls

1. **Not scaling data**: Most algorithms sensitive to scale
2. **Using t-SNE for preprocessing**: Only for visualization!
3. **Overfitting cluster count**: Too many clusters = overfitting noise
4. **Ignoring outliers**: Can severely affect centroid-based methods
5. **Wrong metric**: Euclidean assumes all features equally important
6. **Not validating results**: Always check with multiple metrics and domain knowledge
7. **PCA without standardization**: Components dominated by high-variance features