Improve the scikit-learn skill

2026-03-27 07:09:27 +08:00 · 2025-11-04 10:11:46 -08:00
parent 63a4293f1a
commit 4ad4f9970f
10 changed files with 3293 additions and 3606 deletions
--- a/scientific-packages/scikit-learn/SKILL.md
+++ b/scientific-packages/scikit-learn/SKILL.md
--- a/scientific-packages/scikit-learn/references/model_evaluation.md
+++ b/scientific-packages/scikit-learn/references/model_evaluation.md
--- a/scientific-packages/scikit-learn/references/pipelines_and_composition.md
+++ b/scientific-packages/scikit-learn/references/pipelines_and_composition.md
--- a/scientific-packages/scikit-learn/references/preprocessing.md
+++ b/scientific-packages/scikit-learn/references/preprocessing.md
@@ -1,345 +1,563 @@
-# Data Preprocessing in scikit-learn
+# Data Preprocessing and Feature Engineering Reference

 ## Overview
-Preprocessing transforms raw data into a format suitable for machine learning algorithms. Many algorithms require standardized or normalized data to perform well.

-## Standardization and Scaling
+Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering.
+
+## Feature Scaling and Normalization

 ### StandardScaler
-Removes mean and scales to unit variance (z-score normalization).
-
-**Formula**: `z = (x - μ) / σ`
-
-**Use cases**:
- Most ML algorithms (especially SVM, neural networks, PCA)
- When features have different units or scales
- When assuming Gaussian-like distribution
-
-**Important**: Fit only on training data, then transform both train and test sets.

+**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
+- Standardizes features to zero mean and unit variance
+- Formula: z = (x - mean) / std
+- Use when: Features are on different scales and the algorithm is sensitive to feature magnitude (standardizing shifts and rescales; it does not make the data Gaussian)
+- Strongly recommended for: SVM, KNN, Neural Networks, PCA, Linear Regression with regularization
+- Example:
 ```python
 from sklearn.preprocessing import StandardScaler
+
 scaler = StandardScaler()
 X_train_scaled = scaler.fit_transform(X_train)
-X_test_scaled = scaler.transform(X_test)  # Use same parameters
+X_test_scaled = scaler.transform(X_test)  # Use same parameters as training
+
+# Access learned parameters
+print(f"Mean: {scaler.mean_}")
+print(f"Std: {scaler.scale_}")
 ```
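As a quick check of the formula above, the sketch below recomputes the z-score by hand and compares it with `StandardScaler`. The arrays are made-up toy data; the only assumption is that the standard deviation is the population one (`ddof=0`), which is what `scale_` holds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

# Fit on the training set only, then reuse the learned mean and std
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual z-score with the training mean and population std (ddof=0)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(manual, X_test_scaled))
```

The test row sits exactly at the training mean of both features, so both scaled values come out as zero.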

 ### MinMaxScaler
-Scales features to a specified range, typically [0, 1].
-
-**Formula**: `X_scaled = (X - X_min) / (X_max - X_min)`
-
-**Use cases**:
- When bounded range is needed
- Neural networks (often prefer [0, 1] range)
- When distribution is not Gaussian
- Image pixel values
-
-**Parameters**:
- `feature_range`: Tuple (min, max), default (0, 1)
-
-**Warning**: Sensitive to outliers since it uses min/max.
-
-### MaxAbsScaler
-Scales to [-1, 1] by dividing by maximum absolute value.
-
-**Use cases**:
- Sparse data (preserves sparsity)
- Data already centered at zero
- When sign of values is meaningful
-
-**Advantage**: Doesn't shift/center the data, preserves zero entries.
-
-### RobustScaler
-Uses median and interquartile range (IQR) instead of mean and standard deviation.
-
-**Formula**: `X_scaled = (X - median) / IQR`
-
-**Use cases**:
- When outliers are present
- When StandardScaler produces skewed results
- Robust statistics preferred
-
-**Parameters**:
- `quantile_range`: Tuple (q_min, q_max), default (25.0, 75.0)
-
-## Normalization
-
-### normalize() function and Normalizer
-Scales individual samples (rows) to unit norm, not features (columns).
-
-**Use cases**:
- Text classification (TF-IDF vectors)
- When similarity metrics (dot product, cosine) are used
- When each sample should have equal weight
-
-**Norms**:
- `l1`: Manhattan norm (sum of absolutes = 1)
- `l2`: Euclidean norm (sum of squares = 1) - **most common**
- `max`: Maximum absolute value = 1
-
-**Key difference from scalers**: Operates on rows (samples), not columns (features).

+**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
+- Scales features to a given range (default [0, 1])
+- Formula: X_scaled = (X - X.min) / (X.max - X.min)
+- Use when: Need bounded values, data not normally distributed
+- Sensitive to outliers
+- Example:
 ```python
-from sklearn.preprocessing import Normalizer
-normalizer = Normalizer(norm='l2')
-X_normalized = normalizer.transform(X)
+from sklearn.preprocessing import MinMaxScaler
+
+scaler = MinMaxScaler(feature_range=(0, 1))
+X_scaled = scaler.fit_transform(X_train)
+
+# Custom range
+scaler = MinMaxScaler(feature_range=(-1, 1))
+X_scaled = scaler.fit_transform(X_train)
 ```

-## Encoding Categorical Features
+### RobustScaler
+
+**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
+- Scales using median and interquartile range (IQR)
+- Formula: X_scaled = (X - median) / IQR
+- Use when: Data contains outliers
+- Robust to outliers
+- Example:
+```python
+from sklearn.preprocessing import RobustScaler
+
+scaler = RobustScaler()
+X_scaled = scaler.fit_transform(X_train)
+```
+
+### Normalizer
+
+**Normalizer (`sklearn.preprocessing.Normalizer`)**
+- Normalizes samples individually to unit norm
+- Common norms: 'l1', 'l2', 'max'
+- Use when: Need to normalize each sample independently (e.g., text features)
+- Example:
+```python
+from sklearn.preprocessing import Normalizer
+
+normalizer = Normalizer(norm='l2')  # Euclidean norm
+X_normalized = normalizer.fit_transform(X)
+```
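Because `Normalizer` works per row rather than per column, it helps to see it on a tiny matrix. A minimal sketch with made-up values: after L2 normalization every row has length 1, and rows that point in the same direction become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (hypothetical): rows 0 and 2 point in the same direction
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

X_l2 = Normalizer(norm='l2').fit_transform(X)

# Each row (sample) now has unit Euclidean norm
row_norms = np.linalg.norm(X_l2, axis=1)
print(X_l2)
print(row_norms)
```

Column statistics are not used at all, which is why `fit` learns nothing here and the transformer is safe to apply to train and test data independently.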
+
+### MaxAbsScaler
+
+**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
+- Scales by maximum absolute value
+- Range: [-1, 1]
+- Doesn't shift/center data (preserves sparsity)
+- Use when: Data is already centered or sparse
+- Example:
+```python
+from sklearn.preprocessing import MaxAbsScaler
+
+scaler = MaxAbsScaler()
+X_scaled = scaler.fit_transform(X_sparse)
+```
+
+## Encoding Categorical Variables
+
+### OneHotEncoder
+
+**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
+- Creates binary columns for each category
+- Use when: Nominal categories (no order); the usual choice for linear models, SVM, and KNN (tree-based models accept it too)
+- Example:
+```python
+from sklearn.preprocessing import OneHotEncoder
+
+encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
+X_encoded = encoder.fit_transform(X_categorical)
+
+# Get feature names
+feature_names = encoder.get_feature_names_out(['color', 'size'])
+
+# Handle unknown categories during transform
+X_test_encoded = encoder.transform(X_test_categorical)
+```
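To make the effect of `handle_unknown='ignore'` concrete, here is a small sketch with made-up color values. It uses the default sparse output and converts it with `.toarray()` for printing; a category never seen during `fit` is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): 'purple' never appears in the training set
X_train_cat = np.array([['red'], ['green'], ['blue']])
X_test_cat = np.array([['green'], ['purple']])

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train_cat)

# Columns follow the sorted categories: blue, green, red
encoded = encoder.transform(X_test_cat).toarray()
print(encoder.categories_[0])
print(encoded)
```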

 ### OrdinalEncoder
-Converts categories to integers (0 to n_categories - 1).
-
-**Use cases**:
- Ordinal relationships exist (small < medium < large)
- Preprocessing before other transformations
- Tree-based algorithms (which can handle integers)
-
-**Parameters**:
- `handle_unknown`: 'error' or 'use_encoded_value'
- `unknown_value`: Value for unknown categories
- `encoded_missing_value`: Value for missing data

+**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
+- Encodes categories as integers
+- Use when: Ordinal categories (ordered), or tree-based models
+- Example:
 ```python
 from sklearn.preprocessing import OrdinalEncoder
+
+# Natural ordering
 encoder = OrdinalEncoder()
 X_encoded = encoder.fit_transform(X_categorical)
+
+# Custom ordering
+encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
+X_encoded = encoder.fit_transform(X_categorical)
 ```

-### OneHotEncoder
-Creates binary columns for each category.
-
-**Use cases**:
- Nominal categories (no order)
- Linear models, neural networks
- When category relationships shouldn't be assumed
-
-**Parameters**:
- `drop`: 'first', 'if_binary', array-like (prevents multicollinearity)
- `sparse_output`: True (default, memory efficient) or False
- `handle_unknown`: 'error', 'ignore', 'infrequent_if_exist'
- `min_frequency`: Group infrequent categories
- `max_categories`: Limit number of categories
-
-**High cardinality handling**:
-```python
-encoder = OneHotEncoder(min_frequency=100, handle_unknown='infrequent_if_exist')
-# Groups categories appearing < 100 times into 'infrequent' category
-```
-
-**Memory tip**: Use `sparse_output=True` (default) for high-cardinality features.
-
-### TargetEncoder
-Uses target statistics to encode categories.
-
-**Use cases**:
- High-cardinality categorical features (zip codes, user IDs)
- When linear relationships with target are expected
- Often improves performance over one-hot encoding
-
-**How it works**:
- Replaces category with mean of target for that category
- Uses cross-fitting during fit_transform() to prevent target leakage
- Applies smoothing to handle rare categories
-
-**Parameters**:
- `smooth`: Smoothing parameter for rare categories
- `cv`: Cross-validation strategy
-
-**Warning**: Only for supervised learning. Requires target variable.
-
-```python
-from sklearn.preprocessing import TargetEncoder
-encoder = TargetEncoder()
-X_encoded = encoder.fit_transform(X_categorical, y)
-```
-
 ### LabelEncoder
-Encodes target labels into integers 0 to n_classes - 1.

-**Use cases**: Encoding target variable for classification (not features!)
+**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
+- Encodes target labels (y) as integers
+- Use for: Target variable encoding
+- Example:
+```python
+from sklearn.preprocessing import LabelEncoder

-**Important**: Use `LabelEncoder` for targets, not features. For features, use OrdinalEncoder or OneHotEncoder.
+le = LabelEncoder()
+y_encoded = le.fit_transform(y)

-### Binarizer
-Converts numeric values to binary (0 or 1) based on threshold.
+# Decode back
+y_decoded = le.inverse_transform(y_encoded)
+print(f"Classes: {le.classes_}")
+```

-**Use cases**: Creating binary features from continuous values
+### Target Encoding (using category_encoders)
+
+```python
+# Install: uv pip install category-encoders
+from category_encoders import TargetEncoder
+
+encoder = TargetEncoder()
+X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
+X_test_encoded = encoder.transform(X_test_categorical)
+```

 ## Non-linear Transformations

-### QuantileTransformer
-Maps features to uniform or normal distribution using rank transformation.
-
-**Use cases**:
- Unusual distributions (bimodal, heavy tails)
- Reducing outlier impact
- When normal distribution is desired
-
-**Parameters**:
- `output_distribution`: 'uniform' (default) or 'normal'
- `n_quantiles`: Number of quantiles (default: min(1000, n_samples))
-
-**Effect**: Strong transformation that reduces outlier influence and makes data more Gaussian-like.
-
-### PowerTransformer
-Applies parametric monotonic transformation to make data more Gaussian.
-
-**Methods**:
- `yeo-johnson`: Works with positive and negative values (default)
- `box-cox`: Only positive values
-
-**Use cases**:
- Skewed distributions
- When Gaussian assumption is important
- Variance stabilization
-
-**Advantage**: Less radical than QuantileTransformer, preserves more of original relationships.
-
-## Discretization
-
-### KBinsDiscretizer
-Bins continuous features into discrete intervals.
-
-**Strategies**:
- `uniform`: Equal-width bins
- `quantile`: Equal-frequency bins
- `kmeans`: K-means clustering to determine bins
-
-**Encoding**:
- `ordinal`: Integer encoding (0 to n_bins - 1)
- `onehot`: One-hot encoding
- `onehot-dense`: Dense one-hot encoding
-
-**Use cases**:
- Making linear models handle non-linear relationships
- Reducing noise in features
- Making features more interpretable
+### Power Transforms

+**PowerTransformer**
+- Makes data more Gaussian-like
+- Methods: 'yeo-johnson' (works with negative values), 'box-cox' (positive only)
+- Use when: Data is skewed, algorithm assumes normality
+- Example:
 ```python
-from sklearn.preprocessing import KBinsDiscretizer
-disc = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
-X_binned = disc.fit_transform(X)
+from sklearn.preprocessing import PowerTransformer
+
+# Yeo-Johnson (handles negative values)
+pt = PowerTransformer(method='yeo-johnson', standardize=True)
+X_transformed = pt.fit_transform(X)
+
+# Box-Cox (positive values only)
+pt = PowerTransformer(method='box-cox', standardize=True)
+X_transformed = pt.fit_transform(X)
 ```

-## Feature Generation
-
-### PolynomialFeatures
-Generates polynomial and interaction features.
-
-**Parameters**:
- `degree`: Polynomial degree
- `interaction_only`: Only multiplicative interactions (no x²)
- `include_bias`: Include constant feature
-
-**Use cases**:
- Adding non-linearity to linear models
- Feature engineering
- Polynomial regression
-
-**Warning**: Number of features grows rapidly: (n+d)!/d!n! for degree d.
+### Quantile Transformation

+**QuantileTransformer**
+- Transforms features to follow uniform or normal distribution
+- Robust to outliers
+- Use when: Want to reduce outlier impact
+- Example:
 ```python
-from sklearn.preprocessing import PolynomialFeatures
-poly = PolynomialFeatures(degree=2, include_bias=False)
-X_poly = poly.fit_transform(X)
-# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
+from sklearn.preprocessing import QuantileTransformer
+
+# Transform to uniform distribution
+qt = QuantileTransformer(output_distribution='uniform', random_state=42)
+X_transformed = qt.fit_transform(X)
+
+# Transform to normal distribution
+qt = QuantileTransformer(output_distribution='normal', random_state=42)
+X_transformed = qt.fit_transform(X)
 ```

-### SplineTransformer
-Generates B-spline basis functions.
+### Log Transform

-**Use cases**:
- Smooth non-linear transformations
- Alternative to PolynomialFeatures (less oscillation at boundaries)
- Generalized additive models (GAMs)
+```python
+import numpy as np

-**Parameters**:
- `n_knots`: Number of knots
- `degree`: Spline degree
- `knots`: Knot positions ('uniform', 'quantile', or array)
+# Log1p (log(1 + x)) - handles zeros
+X_log = np.log1p(X)

-## Missing Value Handling
+# Or use FunctionTransformer
+from sklearn.preprocessing import FunctionTransformer
+
+log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
+X_log = log_transformer.fit_transform(X)
+```
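Passing `inverse_func` makes the log transform reversible, which is useful when predictions have to be reported on the original scale. A minimal round-trip sketch on made-up non-negative values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy data (hypothetical): non-negative values spanning several orders of magnitude
X = np.array([[0.0, 9.0], [99.0, 999.0]])

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)

# expm1 undoes log1p, so the original values are recovered
X_back = log_transformer.inverse_transform(X_log)
print(np.allclose(X_back, X))
```

Note that `log1p` is only defined for values greater than -1, so this pattern assumes counts or other non-negative features.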
+
+## Missing Value Imputation

 ### SimpleImputer
-Imputes missing values with various strategies.
-
-**Strategies**:
- `mean`: Mean of column (numeric only)
- `median`: Median of column (numeric only)
- `most_frequent`: Mode (numeric or categorical)
- `constant`: Fill with constant value
-
-**Parameters**:
- `strategy`: Imputation strategy
- `fill_value`: Value when strategy='constant'
- `missing_values`: What represents missing (np.nan, None, specific value)

+**SimpleImputer (`sklearn.impute.SimpleImputer`)**
+- Basic imputation strategies
+- Strategies: 'mean', 'median', 'most_frequent', 'constant'
+- Example:
 ```python
 from sklearn.impute import SimpleImputer
-imputer = SimpleImputer(strategy='median')
+
+# For numerical features
+imputer = SimpleImputer(strategy='mean')
+X_imputed = imputer.fit_transform(X)
+
+# For categorical features
+imputer = SimpleImputer(strategy='most_frequent')
+X_imputed = imputer.fit_transform(X_categorical)
+
+# Fill with constant
+imputer = SimpleImputer(strategy='constant', fill_value=0)
 X_imputed = imputer.fit_transform(X)
 ```

-### KNNImputer
-Imputes using k-nearest neighbors.
+### Iterative Imputer

-**Use cases**: When relationships between features should inform imputation
+**IterativeImputer**
+- Models each feature with missing values as function of other features
+- More sophisticated than SimpleImputer
+- Example:
+```python
+from sklearn.experimental import enable_iterative_imputer
+from sklearn.impute import IterativeImputer

-**Parameters**:
- `n_neighbors`: Number of neighbors
- `weights`: 'uniform' or 'distance'
+imputer = IterativeImputer(max_iter=10, random_state=42)
+X_imputed = imputer.fit_transform(X)
+```

-### IterativeImputer
-Models each feature with missing values as function of other features.
+### KNN Imputer

-**Use cases**:
- Complex relationships between features
- When multiple features have missing values
- Higher quality imputation (but slower)
+**KNNImputer**
+- Imputes using k-nearest neighbors
+- Use when: Features are correlated
+- Example:
+```python
+from sklearn.impute import KNNImputer

-**Parameters**:
- `estimator`: Estimator for regression (default: BayesianRidge)
- `max_iter`: Maximum iterations
+imputer = KNNImputer(n_neighbors=5)
+X_imputed = imputer.fit_transform(X)
+```

-## Function Transformers
+## Feature Engineering

-### FunctionTransformer
-Applies custom function to data.
+### Polynomial Features

-**Use cases**:
- Custom transformations in pipelines
- Log transformation, square root, etc.
- Domain-specific preprocessing
+**PolynomialFeatures**
+- Creates polynomial and interaction features
+- Use when: Need non-linear features for linear models
+- Example:
+```python
+from sklearn.preprocessing import PolynomialFeatures
+
+# Degree 2 on [x1, x2] yields, in order: x1, x2, x1^2, x1*x2, x2^2
+poly = PolynomialFeatures(degree=2, include_bias=False)
+X_poly = poly.fit_transform(X)
+
+# Get feature names
+feature_names = poly.get_feature_names_out(['x1', 'x2'])
+
+# Only interactions (no powers)
+poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
+X_interactions = poly.fit_transform(X)
+```
+
+### Binning/Discretization
+
+**KBinsDiscretizer**
+- Bins continuous features into discrete intervals
+- Strategies: 'uniform', 'quantile', 'kmeans'
+- Encoding: 'onehot', 'ordinal', 'onehot-dense'
+- Example:
+```python
+from sklearn.preprocessing import KBinsDiscretizer
+
+# Equal-width bins
+binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
+X_binned = binner.fit_transform(X)
+
+# Equal-frequency bins (quantile-based)
+binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
+X_binned = binner.fit_transform(X)
+```
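The difference between the `uniform` and `quantile` strategies shows up most clearly on skewed data. A minimal sketch with one made-up feature containing a single large value; recent scikit-learn releases may emit a `FutureWarning` about quantile defaults, which does not affect this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy data (hypothetical): one large value stretches the range
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])

# Equal-width bins: edges at 0, 50, 100, so almost everything lands in bin 0
uniform = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
bins_uniform = uniform.fit_transform(X).ravel()

# Equal-frequency bins: the edge sits at the median, so bins are balanced
quantile = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
bins_quantile = quantile.fit_transform(X).ravel()

print(bins_uniform)
print(bins_quantile)
```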
+
+### Binarization
+
+**Binarizer**
+- Converts features to binary (0 or 1) based on threshold
+- Example:
+```python
+from sklearn.preprocessing import Binarizer
+
+binarizer = Binarizer(threshold=0.5)
+X_binary = binarizer.fit_transform(X)
+```
+
+### Spline Features
+
+**SplineTransformer**
+- Creates spline basis functions
+- Useful for capturing non-linear relationships
+- Example:
+```python
+from sklearn.preprocessing import SplineTransformer
+
+spline = SplineTransformer(n_knots=5, degree=3)
+X_splines = spline.fit_transform(X)
+```
+
+## Text Feature Extraction
+
+### CountVectorizer
+
+**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
+- Converts text to token count matrix
+- Use for: Bag-of-words representation
+- Example:
+```python
+from sklearn.feature_extraction.text import CountVectorizer
+
+vectorizer = CountVectorizer(
+    max_features=5000,  # Keep top 5000 features
+    min_df=2,  # Ignore terms appearing in < 2 documents
+    max_df=0.8,  # Ignore terms appearing in > 80% documents
+    ngram_range=(1, 2)  # Unigrams and bigrams
+)
+
+X_counts = vectorizer.fit_transform(documents)
+feature_names = vectorizer.get_feature_names_out()
+```
+
+### TfidfVectorizer
+
+**TfidfVectorizer**
+- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
+- Down-weights terms that appear in many documents; often a stronger default than raw counts for classification and retrieval
+- Example:
+```python
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+vectorizer = TfidfVectorizer(
+    max_features=5000,
+    min_df=2,
+    max_df=0.8,
+    ngram_range=(1, 2),
+    stop_words='english'  # Remove English stop words
+)
+
+X_tfidf = vectorizer.fit_transform(documents)
+```
+
+### HashingVectorizer
+
+**HashingVectorizer**
+- Uses hashing trick for memory efficiency
+- Stateless (no fit needed); hashing is one-way, so feature names cannot be recovered
+- Use when: Very large vocabulary, streaming data
+- Example:
+```python
+from sklearn.feature_extraction.text import HashingVectorizer
+
+vectorizer = HashingVectorizer(n_features=2**18)
+X_hashed = vectorizer.transform(documents)  # No fit needed
+```
+
+## Feature Selection
+
+### Filter Methods
+
+**Variance Threshold**
+- Removes low-variance features
+- Example:
+```python
+from sklearn.feature_selection import VarianceThreshold
+
+selector = VarianceThreshold(threshold=0.01)
+X_selected = selector.fit_transform(X)
+```
+
+**SelectKBest / SelectPercentile**
+- Select features based on statistical tests
+- Tests: f_classif, chi2, mutual_info_classif
+- Example:
+```python
+from sklearn.feature_selection import SelectKBest, f_classif
+
+# Select top 10 features
+selector = SelectKBest(score_func=f_classif, k=10)
+X_selected = selector.fit_transform(X_train, y_train)
+
+# Get selected feature indices
+selected_indices = selector.get_support(indices=True)
+```
+
+### Wrapper Methods
+
+**Recursive Feature Elimination (RFE)**
+- Recursively removes features
+- Ranks features using the estimator's `coef_` or `feature_importances_`
+- Example:
+```python
+from sklearn.feature_selection import RFE
+from sklearn.ensemble import RandomForestClassifier
+
+model = RandomForestClassifier(n_estimators=100, random_state=42)
+rfe = RFE(estimator=model, n_features_to_select=10, step=1)
+X_selected = rfe.fit_transform(X_train, y_train)
+
+# Get selected features
+selected_features = rfe.support_
+feature_ranking = rfe.ranking_
+```
+
+**RFECV (with Cross-Validation)**
+- RFE with cross-validation to find optimal number of features
+- Example:
+```python
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.feature_selection import RFECV
+
+model = RandomForestClassifier(n_estimators=100, random_state=42)
+rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
+X_selected = rfecv.fit_transform(X_train, y_train)
+
+print(f"Optimal number of features: {rfecv.n_features_}")
+```
+
+### Embedded Methods
+
+**SelectFromModel**
+- Select features based on model coefficients/importances
+- Works with: Linear models (L1), Tree-based models
+- Example:
+```python
+from sklearn.feature_selection import SelectFromModel
+from sklearn.ensemble import RandomForestClassifier
+
+model = RandomForestClassifier(n_estimators=100, random_state=42)
+selector = SelectFromModel(model, threshold='median')
+selector.fit(X_train, y_train)
+X_selected = selector.transform(X_train)
+
+# Get selected features
+selected_features = selector.get_support()
+```
+
+**L1-based Feature Selection**
+```python
+from sklearn.linear_model import LogisticRegression
+from sklearn.feature_selection import SelectFromModel
+
+model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
+selector = SelectFromModel(model)
+selector.fit(X_train, y_train)
+X_selected = selector.transform(X_train)
+```
+
+## Handling Outliers
+
+### IQR Method
+
+```python
+import numpy as np
+
+Q1 = np.percentile(X, 25, axis=0)
+Q3 = np.percentile(X, 75, axis=0)
+IQR = Q3 - Q1
+
+# Define outlier boundaries
+lower_bound = Q1 - 1.5 * IQR
+upper_bound = Q3 + 1.5 * IQR
+
+# Remove outliers
+mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
+X_no_outliers = X[mask]
+```
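Dropping rows is not always acceptable, for example when every sample must be kept. As an alternative to the removal step above (a swapped-in variation, not part of the original snippet), the same IQR bounds can be used to clip values instead. The data below is made up.

```python
import numpy as np

# Toy data (hypothetical): the last row has an extreme value in column 0
X = np.array([
    [1.0, 10.0], [2.0, 11.0], [3.0, 12.0],
    [4.0, 13.0], [5.0, 14.0], [100.0, 12.0],
])

Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Rows that would be removed by the mask-based approach
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)

# Alternative: keep every row, but pull extreme values back to the bounds
X_clipped = np.clip(X, lower_bound, upper_bound)
print(mask)
print(X_clipped[-1])
```

In a train/test setting the bounds should be computed on the training data only and then reused on the test data.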
+
+### Winsorization
+
+```python
+from scipy.stats import mstats
+
+# Clip outliers at 5th and 95th percentiles
+X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
+```
+
+## Custom Transformers
+
+### Using FunctionTransformer

 ```python
 from sklearn.preprocessing import FunctionTransformer
 import numpy as np

-log_transformer = FunctionTransformer(np.log1p, validate=True)
-X_log = log_transformer.transform(X)
+def log_transform(X):
+    return np.log1p(X)
+
+transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
+X_transformed = transformer.fit_transform(X)
+```
+
+### Creating Custom Transformer
+
+```python
+from sklearn.base import BaseEstimator, TransformerMixin
+
+class CustomTransformer(BaseEstimator, TransformerMixin):
+    def __init__(self, parameter=1):
+        self.parameter = parameter
+
+    def fit(self, X, y=None):
+        # Learn parameters from X if needed
+        return self
+
+    def transform(self, X):
+        # Transform X
+        return X * self.parameter
+
+transformer = CustomTransformer(parameter=2)
+X_transformed = transformer.fit_transform(X)
 ```
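The template above has an empty `fit`. A sketch of a transformer that actually learns state may make the contract clearer: `fit` stores per-feature statistics in attributes ending with an underscore, and `transform` only reuses them. `QuantileClipper` and its parameters are hypothetical names chosen for illustration, not a scikit-learn class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each feature to quantiles learned in fit."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by scikit-learn convention
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the bounds learned from the training data
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)

# Toy data (hypothetical): bounds come from train, then apply to test
X_train = np.arange(1.0, 101.0).reshape(-1, 1)
X_test = np.array([[-50.0], [50.0], [500.0]])

clipper = QuantileClipper(lower=0.05, upper=0.95).fit(X_train)
X_test_clipped = clipper.transform(X_test)
print(X_test_clipped.ravel())
```

Keeping `__init__` limited to storing its arguments is what lets `get_params`, `set_params`, and cloning inside `GridSearchCV` work without extra code.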

 ## Best Practices

-### Feature Scaling Guidelines
+### Fit on Training Data Only
+Always fit transformers on training data only:
+```python
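+import numpy as np
+from sklearn.preprocessing import StandardScaler
+
+# Correct: scaling statistics come from the training set only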
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+X_test_scaled = scaler.transform(X_test)

-**Always scale**:
- SVM, neural networks
- K-nearest neighbors
- Linear/Logistic regression with regularization
- PCA, LDA
- Gradient descent-based algorithms
-
-**Don't need to scale**:
- Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
-
-### Pipeline Integration
-
-Always use preprocessing within pipelines to prevent data leakage:
+# Wrong: test rows influence the learned mean and std (data leakage)
+scaler = StandardScaler()
+X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
+```

+### Use Pipelines
+Combine preprocessing with models:
 ```python
 from sklearn.pipeline import Pipeline
 from sklearn.preprocessing import StandardScaler
@@ -350,64 +568,39 @@ pipeline = Pipeline([
    ('classifier', LogisticRegression())
 ])

-pipeline.fit(X_train, y_train)  # Scaler fit only on train data
-y_pred = pipeline.predict(X_test)  # Scaler transform only on test data
+pipeline.fit(X_train, y_train)
 ```
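A pipeline pays off most under cross-validation, where the scaler is refit on the training portion of every fold. The sketch below assumes a synthetic dataset from `make_classification` purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for a real X, y)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Each fold fits the scaler on its own training split, so no test-fold
# statistics leak into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Scaling `X` once up front and then cross-validating only the classifier would let every fold see statistics from its own held-out rows.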

-### Common Transformations by Data Type
-
-**Numeric - Continuous**:
- StandardScaler (most common)
- MinMaxScaler (neural networks)
- RobustScaler (outliers present)
- PowerTransformer (skewed data)
-
-**Numeric - Count Data**:
- sqrt or log transformation
- QuantileTransformer
- StandardScaler after transformation
-
-**Categorical - Low Cardinality (<10 categories)**:
- OneHotEncoder
-
-**Categorical - High Cardinality (>10 categories)**:
- TargetEncoder (supervised)
- Frequency encoding
- OneHotEncoder with min_frequency parameter
-
-**Categorical - Ordinal**:
- OrdinalEncoder
-
-**Text**:
- CountVectorizer or TfidfVectorizer
- Normalizer after vectorization
-
-### Data Leakage Prevention
-
-1. **Fit only on training data**: Never include test data when fitting preprocessors
-2. **Use pipelines**: Ensures proper fit/transform separation
-3. **Cross-validation**: Use Pipeline with cross_val_score() for proper evaluation
-4. **Target encoding**: Use cv parameter in TargetEncoder for cross-fitting
-
+### Handle Categorical and Numerical Separately
+Use ColumnTransformer:
 ```python
-# WRONG - data leakage
-scaler = StandardScaler().fit(X_full)
-X_train_scaled = scaler.transform(X_train)
-X_test_scaled = scaler.transform(X_test)
+from sklearn.compose import ColumnTransformer
+from sklearn.preprocessing import StandardScaler, OneHotEncoder

-# CORRECT - no leakage
-scaler = StandardScaler().fit(X_train)
-X_train_scaled = scaler.transform(X_train)
-X_test_scaled = scaler.transform(X_test)
+numeric_features = ['age', 'income']
+categorical_features = ['gender', 'occupation']
+
+preprocessor = ColumnTransformer(
+    transformers=[
+        ('num', StandardScaler(), numeric_features),
+        ('cat', OneHotEncoder(), categorical_features)
+    ]
+)
+
+X_transformed = preprocessor.fit_transform(X)
 ```

-## Preprocessing Checklist
+### Algorithm-Specific Requirements

-Before modeling:
-1. Handle missing values (imputation or removal)
-2. Encode categorical variables appropriately
-3. Scale/normalize numeric features (if needed for algorithm)
-4. Handle outliers (RobustScaler, clipping, removal)
-5. Create additional features if beneficial (PolynomialFeatures, domain knowledge)
-6. Check for data leakage in preprocessing steps
-7. Wrap everything in a Pipeline
+**Require Scaling:**
+- SVM, KNN, Neural Networks
+- PCA, Linear/Logistic Regression with regularization
+- K-Means clustering
+
+**Don't Require Scaling:**
+- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
+- Naive Bayes
+
+**Encoding Requirements:**
+- Linear models, SVM, KNN: One-hot encoding for nominal features
+- Tree-based models: Can handle ordinal encoding directly
--- a/scientific-packages/scikit-learn/references/quick_reference.md
+++ b/scientific-packages/scikit-learn/references/quick_reference.md
@@ -1,546 +1,287 @@
 # Scikit-learn Quick Reference

-## Essential Imports
+## Common Import Patterns

 ```python
-# Core
-import numpy as np
-import pandas as pd
+# Core scikit-learn
+import sklearn
+
+# Data splitting and cross-validation
 from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
-from sklearn.pipeline import Pipeline, make_pipeline
-from sklearn.compose import ColumnTransformer

 # Preprocessing
-from sklearn.preprocessing import (
-    StandardScaler, MinMaxScaler, RobustScaler,
-    OneHotEncoder, OrdinalEncoder, LabelEncoder,
-    PolynomialFeatures
-)
+from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
 from sklearn.impute import SimpleImputer

-# Models - Classification
-from sklearn.linear_model import LogisticRegression
+# Feature selection
+from sklearn.feature_selection import SelectKBest, RFE
+
+# Supervised learning
+from sklearn.linear_model import LogisticRegression, Ridge, Lasso
+from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
+from sklearn.svm import SVC, SVR
 from sklearn.tree import DecisionTreeClassifier
-from sklearn.ensemble import (
-    RandomForestClassifier,
-    GradientBoostingClassifier,
-    HistGradientBoostingClassifier
-)
-from sklearn.svm import SVC
-from sklearn.neighbors import KNeighborsClassifier

-# Models - Regression
-from sklearn.linear_model import LinearRegression, Ridge, Lasso
-from sklearn.ensemble import (
-    RandomForestRegressor,
-    GradientBoostingRegressor,
-    HistGradientBoostingRegressor
-)
-
-# Clustering
+# Unsupervised learning
 from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
-from sklearn.mixture import GaussianMixture
-
-# Dimensionality Reduction
-from sklearn.decomposition import PCA, NMF, TruncatedSVD
-from sklearn.manifold import TSNE
+from sklearn.decomposition import PCA, NMF

 # Metrics
 from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
-    confusion_matrix, classification_report,
-    mean_squared_error, r2_score, mean_absolute_error
+    mean_squared_error, r2_score, confusion_matrix, classification_report
 )
+
+# Pipeline
+from sklearn.pipeline import Pipeline, make_pipeline
+from sklearn.compose import ColumnTransformer, make_column_transformer
+
+# Utilities
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
 ```

-## Basic Workflow Template
+## Installation

-### Classification
+```bash
+# Using uv (recommended)
+uv pip install scikit-learn
+
+# Optional dependencies
+uv pip install matplotlib  # Needed by scikit-learn's plotting utilities (the *Display classes)
+uv pip install pandas numpy seaborn  # Common companions
+```
+
+## Quick Workflow Templates
+
+### Classification Pipeline

 ```python
 from sklearn.model_selection import train_test_split
 from sklearn.preprocessing import StandardScaler
 from sklearn.ensemble import RandomForestClassifier
-from sklearn.metrics import classification_report
+from sklearn.metrics import classification_report, confusion_matrix

 # Split data
 X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.2, random_state=42, stratify=y
+    X, y, test_size=0.2, stratify=y, random_state=42
 )

-# Scale features
+# Preprocess
 scaler = StandardScaler()
 X_train_scaled = scaler.fit_transform(X_train)
 X_test_scaled = scaler.transform(X_test)

-# Train model
+# Train
 model = RandomForestClassifier(n_estimators=100, random_state=42)
 model.fit(X_train_scaled, y_train)

-# Predict and evaluate
+# Evaluate
 y_pred = model.predict(X_test_scaled)
 print(classification_report(y_test, y_pred))
+print(confusion_matrix(y_test, y_pred))
 ```

-### Regression
+### Regression Pipeline

 ```python
 from sklearn.model_selection import train_test_split
 from sklearn.preprocessing import StandardScaler
-from sklearn.ensemble import RandomForestRegressor
+from sklearn.ensemble import GradientBoostingRegressor
 from sklearn.metrics import mean_squared_error, r2_score

-# Split data
+# Split
 X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
 )

-# Scale features
+# Preprocess and train
 scaler = StandardScaler()
 X_train_scaled = scaler.fit_transform(X_train)
 X_test_scaled = scaler.transform(X_test)

-# Train model
-model = RandomForestRegressor(n_estimators=100, random_state=42)
+model = GradientBoostingRegressor(n_estimators=100, random_state=42)
 model.fit(X_train_scaled, y_train)

-# Predict and evaluate
+# Evaluate
 y_pred = model.predict(X_test_scaled)
 # The squared=False argument was deprecated in scikit-learn 1.4; take the root explicitly
 print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.3f}")
-print(f"R²: {r2_score(y_test, y_pred):.3f}")
+print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
 ```

-### With Pipeline (Recommended)
+### Cross-Validation

 ```python
-from sklearn.pipeline import Pipeline
-from sklearn.preprocessing import StandardScaler
+from sklearn.model_selection import cross_val_score
 from sklearn.ensemble import RandomForestClassifier
-from sklearn.model_selection import train_test_split, cross_val_score

-# Create pipeline
-pipeline = Pipeline([
-    ('scaler', StandardScaler()),
-    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
-])
-
-# Split and train
-X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.2, random_state=42
-)
-pipeline.fit(X_train, y_train)
-
-# Evaluate
-score = pipeline.score(X_test, y_test)
-cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
-print(f"Test accuracy: {score:.3f}")
-print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
+model = RandomForestClassifier(n_estimators=100, random_state=42)
+scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
+print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
 ```
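
The call above scores a bare estimator. When preprocessing such as scaling is involved, wrapping it in a pipeline keeps each fold's scaler fitted on that fold's training portion only. A minimal sketch, assuming synthetic data from `make_classification` and an arbitrary pair of metrics:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is refit inside every fold, so no test-fold statistics leak in
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "f1_weighted"]
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

for metric in scoring:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

`cross_validate` returns one array per metric under the key `test_<metric>`, which is why the loop builds the key from the scoring name.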

-## Common Preprocessing Patterns
-
-### Numeric Data
+### Complete Pipeline with Mixed Data Types

 ```python
-from sklearn.preprocessing import StandardScaler
-from sklearn.impute import SimpleImputer
 from sklearn.pipeline import Pipeline
+from sklearn.compose import ColumnTransformer
+from sklearn.preprocessing import StandardScaler, OneHotEncoder
+from sklearn.impute import SimpleImputer
+from sklearn.ensemble import RandomForestClassifier

+# Define feature types
+numeric_features = ['age', 'income']
+categorical_features = ['gender', 'occupation']
+
+# Create preprocessing pipelines
 numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
 ])
-```
-
-### Categorical Data
-
-```python
-from sklearn.preprocessing import OneHotEncoder
-from sklearn.impute import SimpleImputer
-from sklearn.pipeline import Pipeline

 categorical_transformer = Pipeline([
-    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
+    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
 ])
-```

-### Mixed Data with ColumnTransformer
-
-```python
-from sklearn.compose import ColumnTransformer
-
-numeric_features = ['age', 'income', 'credit_score']
-categorical_features = ['country', 'occupation']
-
-preprocessor = ColumnTransformer(
-    transformers=[
-        ('num', numeric_transformer, numeric_features),
-        ('cat', categorical_transformer, categorical_features)
-    ])
-
-# Complete pipeline
-from sklearn.ensemble import RandomForestClassifier
-pipeline = Pipeline([
-    ('preprocessor', preprocessor),
-    ('classifier', RandomForestClassifier())
+# Combine transformers
+preprocessor = ColumnTransformer([
+    ('num', numeric_transformer, numeric_features),
+    ('cat', categorical_transformer, categorical_features)
 ])
+
+# Full pipeline
+model = Pipeline([
+    ('preprocessor', preprocessor),
+    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
+])
+
+# Fit and predict
+model.fit(X_train, y_train)
+y_pred = model.predict(X_test)
 ```

-## Model Selection Cheat Sheet
-
-### Quick Decision Tree
-
-```
-Is it supervised?
-├─ Yes
-│  ├─ Predicting categories? → Classification
-│  │  ├─ Start with: LogisticRegression (baseline)
-│  │  ├─ Then try: RandomForestClassifier
-│  │  └─ Best performance: HistGradientBoostingClassifier
-│  └─ Predicting numbers? → Regression
-│     ├─ Start with: LinearRegression/Ridge (baseline)
-│     ├─ Then try: RandomForestRegressor
-│     └─ Best performance: HistGradientBoostingRegressor
-└─ No
-   ├─ Grouping similar items? → Clustering
-   │  ├─ Know # clusters: KMeans
-   │  └─ Unknown # clusters: DBSCAN or HDBSCAN
-   ├─ Reducing dimensions?
-   │  ├─ For preprocessing: PCA
-   │  └─ For visualization: t-SNE or UMAP
-   └─ Finding outliers? → IsolationForest or LocalOutlierFactor
-```
-
-### Algorithm Selection by Data Size
-
- **Small (<1K samples)**: Any algorithm
- **Medium (1K-100K)**: Random Forests, Gradient Boosting, Neural Networks
- **Large (>100K)**: SGDClassifier/Regressor, HistGradientBoosting, LinearSVC
-
-### When to Scale Features
-
-**Always scale**:
- SVM, Neural Networks
- K-Nearest Neighbors
- Linear/Logistic Regression (with regularization)
- PCA, LDA
- Any gradient descent algorithm
-
-**Don't need to scale**:
- Tree-based (Decision Trees, Random Forests, Gradient Boosting)
- Naive Bayes
-
-## Hyperparameter Tuning
-
-### GridSearchCV
+### Hyperparameter Tuning

 ```python
 from sklearn.model_selection import GridSearchCV
+from sklearn.ensemble import RandomForestClassifier

 param_grid = {
-    'n_estimators': [100, 200, 500],
+    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
 }

+model = RandomForestClassifier(random_state=42)
 grid_search = GridSearchCV(
-    RandomForestClassifier(random_state=42),
-    param_grid,
-    cv=5,
-    scoring='f1_weighted',
-    n_jobs=-1
+    model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
 )

 grid_search.fit(X_train, y_train)
-best_model = grid_search.best_estimator_
 print(f"Best params: {grid_search.best_params_}")
+print(f"Best score: {grid_search.best_score_:.3f}")
+
+# Use best model
+best_model = grid_search.best_estimator_
 ```
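
Grid search cost grows multiplicatively with each added parameter. A randomized search over a pipeline is one common alternative; the sketch below assumes synthetic data and arbitrary search ranges, and shows the `<step>__<parameter>` naming that pipelines require:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Step parameters are addressed as <step name>__<parameter name>
param_distributions = {
    "svm__C": loguniform(1e-2, 1e2),
    "svm__gamma": loguniform(1e-4, 1e0),
}

# n_iter caps the number of sampled configurations
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=10, cv=3, random_state=0
)
search.fit(X, y)
print(search.best_params_)
```

Because the scaler lives inside the pipeline, it is refit on each training fold during the search.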

-### RandomizedSearchCV (Faster)
+## Common Patterns
+
+### Loading Data

 ```python
-from sklearn.model_selection import RandomizedSearchCV
-from scipy.stats import randint, uniform
+# From scikit-learn datasets
+from sklearn.datasets import load_iris, load_digits, make_classification

-param_distributions = {
-    'n_estimators': randint(100, 1000),
-    'max_depth': randint(5, 50),
-    'min_samples_split': randint(2, 20)
-}
+# Built-in datasets
+iris = load_iris()
+X, y = iris.data, iris.target

-random_search = RandomizedSearchCV(
-    RandomForestClassifier(random_state=42),
-    param_distributions,
-    n_iter=50,  # Number of combinations to try
-    cv=5,
-    n_jobs=-1,
-    random_state=42
+# Synthetic data
+X, y = make_classification(
+    n_samples=1000, n_features=20, n_classes=2, random_state=42
 )

-random_search.fit(X_train, y_train)
+# From pandas
+import pandas as pd
+df = pd.read_csv('data.csv')
+X = df.drop('target', axis=1)
+y = df['target']
 ```

-### Pipeline with GridSearchCV
+### Handling Imbalanced Data

 ```python
-from sklearn.pipeline import Pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.svm import SVC
-from sklearn.model_selection import GridSearchCV
-
-pipeline = Pipeline([
-    ('scaler', StandardScaler()),
-    ('svm', SVC())
-])
-
-param_grid = {
-    'svm__C': [0.1, 1, 10],
-    'svm__kernel': ['rbf', 'linear'],
-    'svm__gamma': ['scale', 'auto']
-}
-
-grid = GridSearchCV(pipeline, param_grid, cv=5)
-grid.fit(X_train, y_train)
-```
-
-## Cross-Validation
-
-### Basic Cross-Validation
-
-```python
-from sklearn.model_selection import cross_val_score
-
-scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
-print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
-```
-
-### Multiple Metrics
-
-```python
-from sklearn.model_selection import cross_validate
-
-scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
-results = cross_validate(model, X, y, cv=5, scoring=scoring)
-
-for metric in scoring:
-    scores = results[f'test_{metric}']
-    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
-```
-
-### Custom CV Strategies
-
-```python
-from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
-
-# For imbalanced classification
-cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
-
-# For time series
-cv = TimeSeriesSplit(n_splits=5)
-
-scores = cross_val_score(model, X, y, cv=cv)
-```
-
-## Common Metrics
-
-### Classification
-
-```python
-from sklearn.metrics import (
-    accuracy_score, balanced_accuracy_score,
-    precision_score, recall_score, f1_score,
-    confusion_matrix, classification_report,
-    roc_auc_score
-)
-
-# Basic metrics
-accuracy = accuracy_score(y_true, y_pred)
-f1 = f1_score(y_true, y_pred, average='weighted')
-
-# Comprehensive report
-print(classification_report(y_true, y_pred))
-
-# ROC AUC (requires probabilities)
-y_proba = model.predict_proba(X_test)[:, 1]
-auc = roc_auc_score(y_true, y_proba)
-```
-
-### Regression
-
-```python
-from sklearn.metrics import (
-    mean_squared_error,
-    mean_absolute_error,
-    r2_score
-)
-
-mse = mean_squared_error(y_true, y_pred)
-rmse = mean_squared_error(y_true, y_pred, squared=False)
-mae = mean_absolute_error(y_true, y_pred)
-r2 = r2_score(y_true, y_pred)
-
-print(f"RMSE: {rmse:.3f}")
-print(f"MAE: {mae:.3f}")
-print(f"R²: {r2:.3f}")
-```
-
-## Feature Engineering
-
-### Polynomial Features
-
-```python
-from sklearn.preprocessing import PolynomialFeatures
-
-poly = PolynomialFeatures(degree=2, include_bias=False)
-X_poly = poly.fit_transform(X)
-# [x1, x2] → [x1, x2, x1², x1·x2, x2²]
-```
-
-### Feature Selection
-
-```python
-from sklearn.feature_selection import (
-    SelectKBest, f_classif,
-    RFE,
-    SelectFromModel
-)
-
-# Univariate selection
-selector = SelectKBest(f_classif, k=10)
-X_selected = selector.fit_transform(X, y)
-
-# Recursive feature elimination
 from sklearn.ensemble import RandomForestClassifier
-rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
-X_selected = rfe.fit_transform(X, y)

-# Model-based selection
-selector = SelectFromModel(
-    RandomForestClassifier(n_estimators=100),
-    threshold='median'
-)
-X_selected = selector.fit_transform(X, y)
+# Use class_weight parameter
+model = RandomForestClassifier(class_weight='balanced', random_state=42)
+model.fit(X_train, y_train)
+
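+# Also evaluate with metrics that account for imbalance
+from sklearn.metrics import balanced_accuracy_score, f1_score
+y_pred = model.predict(X_test)
+print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
+# Default average='binary' assumes two classes; pass average='macro' or 'weighted' for multiclass
+print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")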
 ```
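
To see why plain accuracy misleads on imbalanced data, it can help to score a baseline that ignores the features entirely. A minimal sketch, assuming a synthetic dataset with roughly a 95:5 class ratio:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Roughly 95:5 class imbalance (synthetic, for illustration only)
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}, balanced accuracy: {bal_acc:.3f}")
```

The baseline scores high accuracy while its balanced accuracy sits at 0.5, the chance level for two classes, so any real model should be compared against that figure.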

 ### Feature Importance

 ```python
-# Tree-based models
-model = RandomForestClassifier()
+from sklearn.ensemble import RandomForestClassifier
+import pandas as pd
+
+model = RandomForestClassifier(n_estimators=100, random_state=42)
 model.fit(X_train, y_train)
-importances = model.feature_importances_

-# Visualize
-import matplotlib.pyplot as plt
-indices = np.argsort(importances)[::-1]
-plt.bar(range(X.shape[1]), importances[indices])
-plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
-plt.show()
+# Get feature importances
+importances = pd.DataFrame({
+    'feature': feature_names,
+    'importance': model.feature_importances_
+}).sort_values('importance', ascending=False)

-# Permutation importance (works for any model)
-from sklearn.inspection import permutation_importance
-result = permutation_importance(model, X_test, y_test, n_repeats=10)
-importances = result.importances_mean
+print(importances.head(10))
 ```
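
`feature_importances_` is specific to tree-based models and is computed on training data. Permutation importance is a model-agnostic alternative measured on held-out data; the sketch below assumes synthetic data and a small `n_repeats` to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 8 (assumption)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)
for idx in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```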

-## Clustering
-
-### K-Means
+### Clustering

 ```python
 from sklearn.cluster import KMeans
 from sklearn.preprocessing import StandardScaler

-# Always scale for k-means
+# Scale data first
 scaler = StandardScaler()
 X_scaled = scaler.fit_transform(X)

-# Fit k-means
+# Fit K-Means
 kmeans = KMeans(n_clusters=3, random_state=42)
 labels = kmeans.fit_predict(X_scaled)

 # Evaluate
 from sklearn.metrics import silhouette_score
 score = silhouette_score(X_scaled, labels)
-print(f"Silhouette score: {score:.3f}")
+print(f"Silhouette Score: {score:.3f}")
 ```

-### Elbow Method
-
-```python
-inertias = []
-K_range = range(2, 11)
-
-for k in K_range:
-    kmeans = KMeans(n_clusters=k, random_state=42)
-    kmeans.fit(X_scaled)
-    inertias.append(kmeans.inertia_)
-
-plt.plot(K_range, inertias, 'bo-')
-plt.xlabel('k')
-plt.ylabel('Inertia')
-plt.show()
-```
-
-### DBSCAN
-
-```python
-from sklearn.cluster import DBSCAN
-
-dbscan = DBSCAN(eps=0.5, min_samples=5)
-labels = dbscan.fit_predict(X_scaled)
-
-# -1 indicates noise/outliers
-n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
-n_noise = list(labels).count(-1)
-print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
-```
-
-## Dimensionality Reduction
-
-### PCA
+### Dimensionality Reduction

 ```python
 from sklearn.decomposition import PCA
-from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt

-# Always scale before PCA
-scaler = StandardScaler()
-X_scaled = scaler.fit_transform(X)
-
-# Specify n_components
+# Fit PCA (standardize X first if features are on different scales)
 pca = PCA(n_components=2)
-X_pca = pca.fit_transform(X_scaled)
+X_reduced = pca.fit_transform(X)

-# Or specify variance to retain
-pca = PCA(n_components=0.95)  # Keep 95% variance
-X_pca = pca.fit_transform(X_scaled)
-
-print(f"Explained variance: {pca.explained_variance_ratio_}")
-print(f"Components needed: {pca.n_components_}")
+# Plot
+plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis')
+plt.xlabel('PC1')
+plt.ylabel('PC2')
+plt.title(f'PCA (explained variance: {pca.explained_variance_ratio_.sum():.2%})')
 ```
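
PCA is sensitive to feature scale, and the number of components can be chosen by a variance target instead of a fixed count. A minimal sketch on the iris dataset, assuming a 95% variance target:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# A float n_components in (0, 1) keeps just enough components to reach
# that fraction of explained variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)

# make_pipeline names each step after its lowercased class name
pca = reducer.named_steps["pca"]
print(f"Components kept: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
```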

-### t-SNE (Visualization Only)
-
-```python
-from sklearn.manifold import TSNE
-
-# Reduce to 50 dimensions with PCA first (recommended)
-pca = PCA(n_components=50)
-X_pca = pca.fit_transform(X_scaled)
-
-# Apply t-SNE
-tsne = TSNE(n_components=2, random_state=42, perplexity=30)
-X_tsne = tsne.fit_transform(X_pca)
-
-# Visualize
-plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
-plt.colorbar()
-plt.show()
-```
-
-## Saving and Loading Models
+### Model Persistence

 ```python
 import joblib
@@ -548,78 +289,145 @@ import joblib
 # Save model
 joblib.dump(model, 'model.pkl')

-# Save pipeline
-joblib.dump(pipeline, 'pipeline.pkl')
-
-# Load
-model = joblib.load('model.pkl')
-pipeline = joblib.load('pipeline.pkl')
-
-# Use loaded model
-y_pred = model.predict(X_new)
+# Load model
+loaded_model = joblib.load('model.pkl')
+predictions = loaded_model.predict(X_new)
 ```

-## Common Pitfalls and Solutions
+## Common Gotchas and Solutions

 ### Data Leakage
-❌ **Wrong**: Fit on all data before split
 ```python
-scaler = StandardScaler().fit(X)
-X_train, X_test = train_test_split(scaler.transform(X))
-```
+# WRONG: Fitting scaler on all data
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X)
+X_train, X_test = train_test_split(X_scaled)

-✅ **Correct**: Use pipeline or fit only on train
-```python
+# RIGHT: Fit on training data only
 X_train, X_test = train_test_split(X)
-pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
-pipeline.fit(X_train, y_train)
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+X_test_scaled = scaler.transform(X_test)
+
+# BEST: Use Pipeline
+from sklearn.pipeline import Pipeline
+from sklearn.linear_model import LogisticRegression
+pipeline = Pipeline([
+    ('scaler', StandardScaler()),
+    ('model', LogisticRegression())
+])
+pipeline.fit(X_train, y_train)  # No leakage!
 ```

-### Not Scaling
-❌ **Wrong**: Using SVM without scaling
-```python
-svm = SVC()
-svm.fit(X_train, y_train)
-```
-
-✅ **Correct**: Scale for SVM
-```python
-pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
-pipeline.fit(X_train, y_train)
-```
-
-### Wrong Metric for Imbalanced Data
-❌ **Wrong**: Using accuracy for 99:1 imbalance
-```python
-accuracy = accuracy_score(y_true, y_pred)  # Can be misleading
-```
-
-✅ **Correct**: Use appropriate metrics
-```python
-f1 = f1_score(y_true, y_pred, average='weighted')
-balanced_acc = balanced_accuracy_score(y_true, y_pred)
-```
-
-### Not Using Stratification
-❌ **Wrong**: Random split for imbalanced data
-```python
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
-```
-
-✅ **Correct**: Stratify for imbalanced classes
+### Stratified Splitting for Classification
 ```python
+# Pass stratify=y so train and test keep the same class proportions
 X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.2, stratify=y
+    X, y, test_size=0.2, stratify=y, random_state=42
 )
 ```

+### Random State for Reproducibility
+```python
+# Set random_state for reproducibility
+model = RandomForestClassifier(n_estimators=100, random_state=42)
+```
+
+### Handling Unknown Categories
+```python
+# Use handle_unknown='ignore' so categories unseen during fit encode as all zeros
+from sklearn.preprocessing import OneHotEncoder
+encoder = OneHotEncoder(handle_unknown='ignore')
+```
+
+### Feature Names with Pipelines
+```python
+# Get feature names after transformation
+preprocessor.fit(X_train)
+feature_names = preprocessor.get_feature_names_out()
+```
+
+## Cheat Sheet: Algorithm Selection
+
+### Classification
+
+| Problem | Algorithm | When to Use |
+|---------|-----------|-------------|
+| Binary/Multiclass | Logistic Regression | Fast baseline, interpretability |
+| Binary/Multiclass | Random Forest | Good default, robust |
+| Binary/Multiclass | Gradient Boosting | Often the most accurate on tabular data, needs tuning |
+| Binary/Multiclass | SVM | Small data, complex boundaries |
+| Binary/Multiclass | Naive Bayes | Text classification, fast |
+| High dimensions | Linear SVM or Logistic | Text, many features |
+
+### Regression
+
+| Problem | Algorithm | When to Use |
+|---------|-----------|-------------|
+| Continuous target | Linear Regression | Fast baseline, interpretability |
+| Continuous target | Ridge/Lasso | Regularization needed |
+| Continuous target | Random Forest | Good default, non-linear |
+| Continuous target | Gradient Boosting | Often the most accurate on tabular data, needs tuning |
+| Continuous target | SVR | Small data, non-linear |
+
+### Clustering
+
+| Problem | Algorithm | When to Use |
+|---------|-----------|-------------|
+| Known K, spherical | K-Means | Fast, simple |
+| Unknown K, arbitrary shapes | DBSCAN | Noise/outliers present |
+| Hierarchical structure | Agglomerative | Need dendrogram |
+| Soft clustering | Gaussian Mixture | Probability estimates |
+
+### Dimensionality Reduction
+
+| Problem | Algorithm | When to Use |
+|---------|-----------|-------------|
+| Linear reduction | PCA | Variance explanation |
+| Visualization | t-SNE | 2D/3D plots |
+| Non-negative data | NMF | Images, text |
+| Sparse data | TruncatedSVD | Text, recommender systems |
+
 ## Performance Tips

-1. **Use n_jobs=-1** for parallel processing (RandomForest, GridSearchCV)
-2. **Use HistGradientBoosting** for large datasets (>10K samples)
-3. **Use MiniBatchKMeans** for large clustering tasks
-4. **Use IncrementalPCA** for data that doesn't fit in memory
-5. **Use sparse matrices** for high-dimensional sparse data (text)
-6. **Cache transformers** in pipelines during grid search
-7. **Use RandomizedSearchCV** instead of GridSearchCV for large parameter spaces
-8. **Reduce dimensionality** with PCA before applying expensive algorithms
+### Speed Up Training
+```python
+# Use n_jobs=-1 for parallel processing
+model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
+
+# Use warm_start to add trees to an already fitted forest (existing trees are kept, not refit)
+model = RandomForestClassifier(n_estimators=100, warm_start=True)
+model.fit(X, y)
+model.n_estimators += 50
+model.fit(X, y)  # Adds 50 more trees
+
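+# Use partial_fit for online learning
+import numpy as np
+from sklearn.linear_model import SGDClassifier
+model = SGDClassifier()
+classes = np.unique(y)  # every label must be declared on the first call
+for X_batch, y_batch in batches:  # batches: any iterable of (X, y) chunks
+    model.partial_fit(X_batch, y_batch, classes=classes)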
+```
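
The `partial_fit` loop above depends on a `batches` iterable that is not defined in the snippet. A runnable sketch, assuming synthetic data sliced into ten chunks to imitate a stream:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
classes = np.unique(y)  # must cover every label that can ever appear

model = SGDClassifier(random_state=42)

# Simulate a stream by slicing the arrays into 10 mini-batches
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    model.partial_fit(X_batch, y_batch, classes=classes)

print(f"Accuracy on the streamed data: {model.score(X, y):.3f}")
```

With real data the batches would typically come from a file reader or database cursor, and features should be scaled consistently across batches since SGD is sensitive to scale.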
+
+### Memory Efficiency
+```python
+# Use sparse matrices when most entries are zero (e.g. bag-of-words text features)
+from scipy.sparse import csr_matrix
+X_sparse = csr_matrix(X)
+
+# Use MiniBatchKMeans for large data
+from sklearn.cluster import MiniBatchKMeans
+model = MiniBatchKMeans(n_clusters=8, batch_size=100)
+```
+
+## Version Check
+
+```python
+import sklearn
+print(f"scikit-learn version: {sklearn.__version__}")
+```
+
+## Useful Resources
+
+- Official Documentation: https://scikit-learn.org/stable/
+- User Guide: https://scikit-learn.org/stable/user_guide.html
+- API Reference: https://scikit-learn.org/stable/api/index.html
+- Examples: https://scikit-learn.org/stable/auto_examples/index.html
+- Tutorials: https://scikit-learn.org/stable/tutorial/index.html
--- a/scientific-packages/scikit-learn/references/supervised_learning.md
+++ b/scientific-packages/scikit-learn/references/supervised_learning.md
@@ -1,261 +1,378 @@
-# Supervised Learning in scikit-learn
+# Supervised Learning Reference

 ## Overview
-Supervised learning algorithms learn patterns from labeled training data to make predictions on new data. Scikit-learn organizes supervised learning into 17 major categories.
+
+Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.

 ## Linear Models

 ### Regression
- **LinearRegression**: Ordinary least squares regression
- **Ridge**: L2-regularized regression, good for multicollinearity
- **Lasso**: L1-regularized regression, performs feature selection
- **ElasticNet**: Combined L1/L2 regularization
- **LassoLars**: Lasso using Least Angle Regression algorithm
- **BayesianRidge**: Bayesian approach with automatic relevance determination
+
+**Linear Regression (`sklearn.linear_model.LinearRegression`)**
+- Ordinary least squares regression
+- Fast, interpretable, no regularization hyperparameters to tune
+- Use when: Linear relationships, interpretability matters
+- Example:
+```python
+from sklearn.linear_model import LinearRegression
+
+model = LinearRegression()
+model.fit(X_train, y_train)
+predictions = model.predict(X_test)
+```
+
+**Ridge Regression (`sklearn.linear_model.Ridge`)**
+- L2 regularization to prevent overfitting
+- Key parameter: `alpha` (regularization strength, default=1.0)
+- Use when: Multicollinearity present, need regularization
+- Example:
+```python
+from sklearn.linear_model import Ridge
+
+model = Ridge(alpha=1.0)
+model.fit(X_train, y_train)
+```
+
+**Lasso (`sklearn.linear_model.Lasso`)**
+- L1 regularization with feature selection
+- Key parameter: `alpha` (regularization strength)
+- Use when: Want sparse models, feature selection
+- Can reduce some coefficients to exactly zero
+- Example:
+```python
+from sklearn.linear_model import Lasso
+
+model = Lasso(alpha=0.1)
+model.fit(X_train, y_train)
+# Check which features were selected
+print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
+```
+
+**ElasticNet (`sklearn.linear_model.ElasticNet`)**
+- Combines L1 and L2 regularization
+- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
+- Use when: Need both feature selection and regularization
+- Example:
+```python
+from sklearn.linear_model import ElasticNet
+
+model = ElasticNet(alpha=0.1, l1_ratio=0.5)
+model.fit(X_train, y_train)
+```

 ### Classification
- **LogisticRegression**: Binary and multiclass classification
- **RidgeClassifier**: Ridge regression for classification
- **SGDClassifier**: Linear classifiers with SGD training

-**Use cases**: Baseline models, interpretable predictions, high-dimensional data, when linear relationships are expected
+**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
+- Binary and multiclass classification
+- Key parameters: `C` (inverse regularization strength; smaller = stronger), `penalty` ('l1', 'l2', 'elasticnet'; 'l1' and 'elasticnet' need a compatible solver such as 'saga')
+- Returns probability estimates
+- Use when: Need probabilistic predictions, interpretability
+- Example:
+```python
+from sklearn.linear_model import LogisticRegression

-**Key parameters**:
- `alpha`: Regularization strength (higher = more regularization)
- `fit_intercept`: Whether to calculate intercept
- `solver`: Optimization algorithm ('lbfgs', 'saga', 'liblinear')
+model = LogisticRegression(C=1.0, max_iter=1000)
+model.fit(X_train, y_train)
+probas = model.predict_proba(X_test)
+```

-## Support Vector Machines (SVM)
+**Stochastic Gradient Descent (SGD)**
+- `SGDClassifier`, `SGDRegressor`
+- Efficient for large-scale learning
+- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
+- Use when: Very large datasets (on the order of 10^5 samples or more)
+- Example:
+```python
+from sklearn.linear_model import SGDClassifier

- **SVC**: Support Vector Classification
- **SVR**: Support Vector Regression
- **LinearSVC**: Linear SVM using liblinear (faster for large datasets)
- **OneClassSVM**: Unsupervised outlier detection
+model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
+model.fit(X_train, y_train)
+```

-**Use cases**: Complex non-linear decision boundaries, high-dimensional spaces, when clear margin of separation exists
+## Support Vector Machines

-**Key parameters**:
- `kernel`: 'linear', 'poly', 'rbf', 'sigmoid'
- `C`: Regularization parameter (lower = more regularization)
- `gamma`: Kernel coefficient ('scale', 'auto', or float)
- `degree`: Polynomial degree (for poly kernel)
+**SVC (`sklearn.svm.SVC`)**
+- Classification with kernel methods
+- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
+- Use when: Small to medium datasets, complex decision boundaries
+- Note: Does not scale well to large datasets
+- Example:
+```python
+from sklearn.svm import SVC

-**Performance tip**: SVMs don't scale well beyond tens of thousands of samples. Use LinearSVC for large datasets with linear kernel.
+# Linear kernel for linearly separable data
+model_linear = SVC(kernel='linear', C=1.0)
+
+# RBF kernel for non-linear data
+model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
+model_rbf.fit(X_train, y_train)
+```
+
+**SVR (`sklearn.svm.SVR`)**
+- Regression with kernel methods
+- Similar parameters to SVC
+- Additional parameter: `epsilon` (tube width)
+- Example:
+```python
+from sklearn.svm import SVR
+
+model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
+model.fit(X_train, y_train)
+```

 ## Decision Trees

- **DecisionTreeClassifier**: Classification tree
- **DecisionTreeRegressor**: Regression tree
- **ExtraTreeClassifier/Regressor**: Extremely randomized tree
+**DecisionTreeClassifier / DecisionTreeRegressor**
+- Non-parametric model learning decision rules
+- Key parameters:
+  - `max_depth`: Maximum tree depth (prevents overfitting)
+  - `min_samples_split`: Minimum samples to split a node
+  - `min_samples_leaf`: Minimum samples in leaf
+  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
+- Use when: Need interpretable model, non-linear relationships, mixed feature types
+- Prone to overfitting; use ensembles or cost-complexity pruning (`ccp_alpha`)
+- Example:
+```python
+from sklearn.tree import DecisionTreeClassifier

-**Use cases**: Non-linear relationships, feature importance analysis, interpretable rules, handling mixed data types
+model = DecisionTreeClassifier(
+    max_depth=5,
+    min_samples_split=20,
+    min_samples_leaf=10,
+    criterion='gini'
+)
+model.fit(X_train, y_train)

-**Key parameters**:
- `max_depth`: Maximum tree depth (controls overfitting)
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node
- `max_features`: Number of features to consider for splits
- `criterion`: 'gini', 'entropy' (classification); 'squared_error', 'absolute_error' (regression)
-
-**Overfitting prevention**: Limit `max_depth`, increase `min_samples_split/leaf`, use pruning with `ccp_alpha`
+# Visualize the tree
+from sklearn.tree import plot_tree
+plot_tree(model, feature_names=feature_names, class_names=class_names)
+```
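
Pruning is mentioned above as a remedy for overfitting. One way to apply it is cost-complexity pruning via `ccp_alpha`; the sketch below uses the built-in breast cancer dataset and picks an alpha from the middle of the pruning path, which is an arbitrary choice for illustration (in practice alpha would be selected by cross-validation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree grows until leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas come from the pruning path of the training data
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-path pick

pruned_tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
pruned_tree.fit(X_train, y_train)

print(f"Leaves: {full_tree.get_n_leaves()} -> {pruned_tree.get_n_leaves()}")
print(f"Test accuracy: {full_tree.score(X_test, y_test):.3f} -> "
      f"{pruned_tree.score(X_test, y_test):.3f}")
```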

 ## Ensemble Methods

 ### Random Forests
- **RandomForestClassifier**: Ensemble of decision trees
- **RandomForestRegressor**: Regression variant

-**Use cases**: Robust general-purpose algorithm, reduces overfitting vs single trees, handles non-linear relationships
+**RandomForestClassifier / RandomForestRegressor**
+- Ensemble of decision trees with bagging
+- Key parameters:
+  - `n_estimators`: Number of trees (default=100)
+  - `max_depth`: Maximum tree depth
+  - `max_features`: Features to consider for splits ('sqrt', 'log2', or int)
+  - `min_samples_split`, `min_samples_leaf`: Control tree growth
+- Use when: High accuracy needed, can afford computation
+- Provides feature importance
+- Example:
+```python
+from sklearn.ensemble import RandomForestClassifier

-**Key parameters**:
- `n_estimators`: Number of trees (higher = better but slower)
- `max_depth`: Maximum tree depth
- `max_features`: Features per split ('sqrt', 'log2', int, float)
- `bootstrap`: Whether to use bootstrap samples
- `n_jobs`: Parallel processing (-1 uses all cores)
+model = RandomForestClassifier(
+    n_estimators=100,
+    max_depth=10,
+    max_features='sqrt',
+    n_jobs=-1  # Use all CPU cores
+)
+model.fit(X_train, y_train)
+
+# Feature importance
+importances = model.feature_importances_
+```

 ### Gradient Boosting
- **HistGradientBoostingClassifier/Regressor**: Histogram-based, fast for large datasets (>10k samples)
- **GradientBoostingClassifier/Regressor**: Traditional implementation, better for small datasets

-**Use cases**: High-performance predictions, winning Kaggle competitions, structured/tabular data
+**GradientBoostingClassifier / GradientBoostingRegressor**
+- Sequential ensemble building trees on residuals
+- Key parameters:
+  - `n_estimators`: Number of boosting stages
+  - `learning_rate`: Shrinks contribution of each tree
+  - `max_depth`: Depth of individual trees (typically 3-5)
+  - `subsample`: Fraction of samples for training each tree
+- Use when: Need high accuracy, can afford training time
+- Often achieves best performance
+- Example:
+```python
+from sklearn.ensemble import GradientBoostingClassifier

-**Key parameters**:
- `n_estimators`: Number of boosting stages
- `learning_rate`: Shrinks contribution of each tree
- `max_depth`: Maximum tree depth (typically 3-8)
- `subsample`: Fraction of samples per tree (enables stochastic gradient boosting)
- `early_stopping`: Stop when validation score stops improving
+model = GradientBoostingClassifier(
+    n_estimators=100,
+    learning_rate=0.1,
+    max_depth=3,
+    subsample=0.8
+)
+model.fit(X_train, y_train)
+```

-**Performance tip**: HistGradientBoosting is orders of magnitude faster for large datasets
+**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
+- Faster gradient boosting with histogram-based algorithm
+- Native support for missing values and categorical features
+- Key parameters: Similar to GradientBoosting
+- Use when: Large datasets, need faster training
+- Example:
+```python
+from sklearn.ensemble import HistGradientBoostingClassifier

-### AdaBoost
- **AdaBoostClassifier/Regressor**: Adaptive boosting
+model = HistGradientBoostingClassifier(
+    max_iter=100,
+    learning_rate=0.1,
+    max_depth=None,  # No limit by default
+    categorical_features='from_dtype'  # Treat pandas 'category' columns as categorical (needs DataFrame input)
+)
+model.fit(X_train, y_train)
+```

-**Use cases**: Boosting weak learners, less prone to overfitting than other methods
+### Other Ensemble Methods

-**Key parameters**:
- `estimator`: Base estimator (default: DecisionTreeClassifier with max_depth=1)
- `n_estimators`: Number of boosting iterations
- `learning_rate`: Weight applied to each classifier
+**AdaBoost**
+- Adaptive boosting focusing on misclassified samples
+- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
+- Use when: Simple boosting approach needed
+- Example:
+```python
+from sklearn.ensemble import AdaBoostClassifier

-### Bagging
- **BaggingClassifier/Regressor**: Bootstrap aggregating with any base estimator
+model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
+model.fit(X_train, y_train)
+```

-**Use cases**: Reducing variance of unstable models, parallel ensemble creation
+**Voting Classifier / Regressor**
+- Combines predictions from multiple models
+- Types (classifier): 'hard' (majority vote) or 'soft' (average predicted probabilities); VotingRegressor averages the base predictions
+- Use when: Want to ensemble different model types
+- Example:
+```python
+from sklearn.ensemble import VotingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.svm import SVC

-**Key parameters**:
- `estimator`: Base estimator to fit
- `n_estimators`: Number of estimators
- `max_samples`: Samples to draw per estimator
- `bootstrap`: Whether to use replacement
+model = VotingClassifier(
+    estimators=[
+        ('lr', LogisticRegression()),
+        ('dt', DecisionTreeClassifier()),
+        ('svc', SVC(probability=True))
+    ],
+    voting='soft'
+)
+model.fit(X_train, y_train)
+```

-### Voting & Stacking
- **VotingClassifier/Regressor**: Combines different model types
- **StackingClassifier/Regressor**: Meta-learner trained on base predictions
+**Stacking Classifier / Regressor**
+- Trains a meta-model on predictions from base models
+- More sophisticated than voting
+- Key parameter: `final_estimator` (meta-learner)
+- Example:
+```python
+from sklearn.ensemble import StackingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.svm import SVC

-**Use cases**: Combining diverse models, leveraging different model strengths
+model = StackingClassifier(
+    estimators=[
+        ('dt', DecisionTreeClassifier()),
+        ('svc', SVC())
+    ],
+    final_estimator=LogisticRegression()
+)
+model.fit(X_train, y_train)
+```

-## Neural Networks
+## K-Nearest Neighbors

- **MLPClassifier**: Multi-layer perceptron classifier
- **MLPRegressor**: Multi-layer perceptron regressor
+**KNeighborsClassifier / KNeighborsRegressor**
+- Non-parametric method based on distance
+- Key parameters:
+  - `n_neighbors`: Number of neighbors (default=5)
+  - `weights`: 'uniform' or 'distance'
+  - `metric`: Distance metric ('euclidean', 'manhattan', etc.)
+- Use when: Small dataset, simple baseline needed
+- Slow prediction on large datasets; distance-based, so scale features before fitting
+- Example:
+```python
+from sklearn.neighbors import KNeighborsClassifier

-**Use cases**: Complex non-linear patterns, when gradient boosting is too slow, deep feature learning
-
-**Key parameters**:
- `hidden_layer_sizes`: Tuple of hidden layer sizes (e.g., (100, 50))
- `activation`: 'relu', 'tanh', 'logistic'
- `solver`: 'adam', 'lbfgs', 'sgd'
- `alpha`: L2 regularization term
- `learning_rate`: Learning rate schedule
- `early_stopping`: Stop when validation score stops improving
-
-**Important**: Feature scaling is critical for neural networks. Always use StandardScaler or similar.
-
-## Nearest Neighbors
-
- **KNeighborsClassifier/Regressor**: K-nearest neighbors
- **RadiusNeighborsClassifier/Regressor**: Radius-based neighbors
- **NearestCentroid**: Classification using class centroids
-
-**Use cases**: Simple baseline, irregular decision boundaries, when interpretability isn't critical
-
-**Key parameters**:
- `n_neighbors`: Number of neighbors (typically 3-11)
- `weights`: 'uniform' or 'distance' (distance-weighted voting)
- `metric`: Distance metric ('euclidean', 'manhattan', 'minkowski')
- `algorithm`: 'auto', 'ball_tree', 'kd_tree', 'brute'
+model = KNeighborsClassifier(n_neighbors=5, weights='distance')
+model.fit(X_train, y_train)
+```

 ## Naive Bayes

- **GaussianNB**: Assumes Gaussian distribution of features
- **MultinomialNB**: For discrete counts (text classification)
- **BernoulliNB**: For binary/boolean features
- **CategoricalNB**: For categorical features
- **ComplementNB**: Adapted for imbalanced datasets
+**GaussianNB, MultinomialNB, BernoulliNB**
+- Probabilistic classifiers based on Bayes' theorem
+- Fast training and prediction
+- GaussianNB: Continuous features (assumes Gaussian distribution)
+- MultinomialNB: Count features (text classification)
+- BernoulliNB: Binary features
+- Use when: Text classification, fast baseline, probabilistic predictions
+- Example:
+```python
+from sklearn.naive_bayes import GaussianNB, MultinomialNB

-**Use cases**: Text classification, fast baseline, when features are independent, small training sets
+# For continuous features
+model_gaussian = GaussianNB()
+model_gaussian.fit(X_train, y_train)

-**Key parameters**:
- `alpha`: Smoothing parameter (Laplace/Lidstone smoothing)
- `fit_prior`: Whether to learn class prior probabilities
+# For text/count data (features must be non-negative, e.g. word counts or TF-IDF)
+model_multinomial = MultinomialNB(alpha=1.0)  # alpha is the additive (Laplace) smoothing parameter
+model_multinomial.fit(X_train, y_train)
+```

-## Linear/Quadratic Discriminant Analysis
+## Neural Networks

- **LinearDiscriminantAnalysis**: Linear decision boundary with dimensionality reduction
- **QuadraticDiscriminantAnalysis**: Quadratic decision boundary
+**MLPClassifier / MLPRegressor**
+- Multi-layer perceptron (feedforward neural network)
+- Key parameters:
+  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
+  - `activation`: 'relu', 'tanh', 'logistic'
+  - `solver`: 'adam', 'sgd', 'lbfgs'
+  - `alpha`: L2 regularization parameter
+  - `learning_rate`: 'constant', 'invscaling', or 'adaptive' (only used when `solver='sgd'`)
+- Use when: Complex non-linear patterns, large datasets
+- Requires feature scaling
+- Example:
+```python
+from sklearn.neural_network import MLPClassifier
+from sklearn.preprocessing import StandardScaler

-**Use cases**: When classes have Gaussian distributions, dimensionality reduction, when covariance assumptions hold
+# Scale features first (fit the scaler on training data only, then apply scaler.transform to test data)
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)

-## Gaussian Processes
+model = MLPClassifier(
+    hidden_layer_sizes=(100, 50),
+    activation='relu',
+    solver='adam',
+    alpha=0.0001,
+    max_iter=1000
+)
+model.fit(X_train_scaled, y_train)
+```

- **GaussianProcessClassifier**: Probabilistic classification
- **GaussianProcessRegressor**: Probabilistic regression with uncertainty estimates
+## Algorithm Selection Guide

-**Use cases**: When uncertainty quantification is important, small datasets, smooth function approximation
+### Choose based on:

-**Key parameters**:
- `kernel`: Covariance function (RBF, Matern, RationalQuadratic, etc.)
- `alpha`: Noise level
+**Dataset size:**
+- Small (<1k samples): KNN, SVM, Decision Trees
+- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
+- Large (>100k): SGD, Linear Models, HistGradientBoosting

-**Limitation**: Doesn't scale well to large datasets (O(n³) complexity)
+**Interpretability:**
+- High: Linear Models, Decision Trees
+- Medium: Random Forest (feature importance)
+- Low: SVM with RBF kernel, Neural Networks

-## Stochastic Gradient Descent
+**Accuracy vs Speed:**
+- Fast training: Naive Bayes, Linear Models, KNN
+- High accuracy: Gradient Boosting, Random Forest, Stacking
+- Fast prediction: Linear Models, Naive Bayes
+- Slow prediction: KNN (on large datasets), SVM

- **SGDClassifier**: Linear classifiers with SGD
- **SGDRegressor**: Linear regressors with SGD
+**Feature types:**
+- Continuous: Most algorithms work well
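+- Categorical: HistGradientBoosting (native support); other models need encoding (e.g., OneHotEncoder, OrdinalEncoder)
+- Mixed: Tree ensembles and Gradient Boosting, with categorical columns encoded via ColumnTransformer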
+- Text: Naive Bayes, Linear Models with TF-IDF

-**Use cases**: Very large datasets (>100k samples), online learning, when data doesn't fit in memory
-
-**Key parameters**:
- `loss`: Loss function ('hinge', 'log_loss', 'squared_error', etc.)
- `penalty`: Regularization ('l2', 'l1', 'elasticnet')
- `alpha`: Regularization strength
- `learning_rate`: Learning rate schedule
-
-## Semi-Supervised Learning
-
- **SelfTrainingClassifier**: Self-training with any base classifier
- **LabelPropagation**: Label propagation through graph
- **LabelSpreading**: Label spreading (modified label propagation)
-
-**Use cases**: When labeled data is scarce but unlabeled data is abundant
-
-## Feature Selection
-
- **VarianceThreshold**: Remove low-variance features
- **SelectKBest**: Select K highest scoring features
- **SelectPercentile**: Select top percentile of features
- **RFE**: Recursive feature elimination
- **RFECV**: RFE with cross-validation
- **SelectFromModel**: Select features based on importance
- **SequentialFeatureSelector**: Forward/backward feature selection
-
-**Use cases**: Reducing dimensionality, removing irrelevant features, improving interpretability, reducing overfitting
-
-## Probability Calibration
-
- **CalibratedClassifierCV**: Calibrate classifier probabilities
-
-**Use cases**: When probability estimates are important (not just class predictions), especially with SVM and Naive Bayes
-
-**Methods**:
- `sigmoid`: Platt scaling
- `isotonic`: Isotonic regression (more flexible, needs more data)
-
-## Multi-Output Methods
-
- **MultiOutputClassifier**: Fit one classifier per target
- **MultiOutputRegressor**: Fit one regressor per target
- **ClassifierChain**: Models dependencies between targets
- **RegressorChain**: Regression variant
-
-**Use cases**: Predicting multiple related targets simultaneously
-
-## Specialized Regression
-
- **IsotonicRegression**: Monotonic regression
- **QuantileRegressor**: Quantile regression for prediction intervals
-
-## Algorithm Selection Guidelines
-
-**Start with**:
-1. **Logistic Regression** (classification) or **LinearRegression/Ridge** (regression) as baseline
-2. **RandomForestClassifier/Regressor** for general non-linear problems
-3. **HistGradientBoostingClassifier/Regressor** when best performance is needed
-
-**Consider dataset size**:
- Small (<1k samples): SVM, Gaussian Processes, any algorithm
- Medium (1k-100k): Random Forests, Gradient Boosting, Neural Networks
- Large (>100k): SGD, HistGradientBoosting, LinearSVC
-
-**Consider interpretability needs**:
- High interpretability: Linear models, Decision Trees, Naive Bayes
- Medium: Random Forests (feature importance), Rule extraction
- Low (black box acceptable): Gradient Boosting, Neural Networks, SVM with RBF kernel
-
-**Consider training time**:
- Fast: Linear models, Naive Bayes, Decision Trees
- Medium: Random Forests (parallelizable), SVM (small data)
- Slow: Gradient Boosting, Neural Networks, SVM (large data), Gaussian Processes
+**Common starting points:**
+1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
+2. Random Forest - good default choice
+3. Gradient Boosting - optimize for best accuracy
--- a/scientific-packages/scikit-learn/references/unsupervised_learning.md
+++ b/scientific-packages/scikit-learn/references/unsupervised_learning.md
--- a/scientific-packages/scikit-learn/scripts/classification_pipeline.py
+++ b/scientific-packages/scikit-learn/scripts/classification_pipeline.py
@@ -1,219 +1,257 @@
-#!/usr/bin/env python3
 """
-Complete classification pipeline with preprocessing, training, evaluation, and hyperparameter tuning.
-Demonstrates best practices for scikit-learn workflows.
+Complete classification pipeline example with preprocessing, model training,
+hyperparameter tuning, and evaluation.
 """

 import numpy as np
 import pandas as pd
-from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
+from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
 from sklearn.preprocessing import StandardScaler, OneHotEncoder
 from sklearn.impute import SimpleImputer
 from sklearn.compose import ColumnTransformer
 from sklearn.pipeline import Pipeline
-from sklearn.ensemble import RandomForestClassifier
-from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
-import joblib
+from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import (
+    classification_report, confusion_matrix, roc_auc_score,
+    accuracy_score, precision_score, recall_score, f1_score
+)
+import warnings
+warnings.filterwarnings('ignore')


 def create_preprocessing_pipeline(numeric_features, categorical_features):
    """
-    Create preprocessing pipeline for mixed data types.
+    Create a preprocessing pipeline for mixed data types.

-    Args:
-        numeric_features: List of numeric column names
-        categorical_features: List of categorical column names
+    Parameters:
+    -----------
+    numeric_features : list
+        List of numeric feature column names
+    categorical_features : list
+        List of categorical feature column names

    Returns:
-        ColumnTransformer with appropriate preprocessing for each data type
+    --------
+    ColumnTransformer
+        Preprocessing pipeline
    """
+    # Numeric preprocessing
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

+    # Categorical preprocessing
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
-        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
+        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

+    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
-        ])
+        ]
+    )

    return preprocessor


-def create_full_pipeline(preprocessor, classifier=None):
+def train_and_evaluate_model(X, y, numeric_features, categorical_features,
+                             test_size=0.2, random_state=42):
    """
-    Create complete ML pipeline with preprocessing and classification.
+    Complete pipeline: preprocess, train, tune, and evaluate a classifier.

-    Args:
-        preprocessor: Preprocessing ColumnTransformer
-        classifier: Classifier instance (default: RandomForestClassifier)
+    Parameters:
+    -----------
+    X : DataFrame or array
+        Feature matrix
+    y : Series or array
+        Target variable
+    numeric_features : list
+        List of numeric feature names
+    categorical_features : list
+        List of categorical feature names
+    test_size : float
+        Proportion of data for testing
+    random_state : int
+        Random seed

    Returns:
-        Complete Pipeline
+    --------
+    dict
+        Dictionary containing trained model, predictions, and metrics
    """
-    if classifier is None:
-        classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
+    # Split data with stratification
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=test_size, stratify=y, random_state=random_state
+    )

-    pipeline = Pipeline(steps=[
-        ('preprocessor', preprocessor),
-        ('classifier', classifier)
-    ])
+    print(f"Training set size: {len(X_train)}")
+    print(f"Test set size: {len(X_test)}")
+    print(f"Class distribution in training: {pd.Series(y_train).value_counts().to_dict()}")

-    return pipeline
+    # Create preprocessor
+    preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)

-
-def evaluate_model(pipeline, X_train, y_train, X_test, y_test, cv=5):
-    """
-    Evaluate model using cross-validation and test set.
-
-    Args:
-        pipeline: Trained pipeline
-        X_train, y_train: Training data
-        X_test, y_test: Test data
-        cv: Number of cross-validation folds
-
-    Returns:
-        Dictionary with evaluation results
-    """
-    # Cross-validation on training set
-    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')
-
-    # Test set evaluation
-    y_pred = pipeline.predict(X_test)
-    test_score = pipeline.score(X_test, y_test)
-
-    # Get probabilities if available
-    try:
-        y_proba = pipeline.predict_proba(X_test)
-        if len(np.unique(y_test)) == 2:
-            # Binary classification
-            auc = roc_auc_score(y_test, y_proba[:, 1])
-        else:
-            # Multiclass
-            auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
-    except:
-        auc = None
-
-    results = {
-        'cv_mean': cv_scores.mean(),
-        'cv_std': cv_scores.std(),
-        'test_score': test_score,
-        'auc': auc,
-        'classification_report': classification_report(y_test, y_pred),
-        'confusion_matrix': confusion_matrix(y_test, y_pred)
+    # Define models to compare (the shared preprocessor is cloned by cross_val_score and GridSearchCV before fitting)
+    models = {
+        'Logistic Regression': Pipeline([
+            ('preprocessor', preprocessor),
+            ('classifier', LogisticRegression(max_iter=1000, random_state=random_state))
+        ]),
+        'Random Forest': Pipeline([
+            ('preprocessor', preprocessor),
+            ('classifier', RandomForestClassifier(n_estimators=100, random_state=random_state))
+        ]),
+        'Gradient Boosting': Pipeline([
+            ('preprocessor', preprocessor),
+            ('classifier', GradientBoostingClassifier(n_estimators=100, random_state=random_state))
+        ])
    }

-    return results
+    # Compare models using cross-validation
+    print("\n" + "="*60)
+    print("Model Comparison (5-Fold Cross-Validation)")
+    print("="*60)

+    cv_results = {}
+    for name, model in models.items():
+        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
+        cv_results[name] = scores.mean()
+        print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

-def tune_hyperparameters(pipeline, X_train, y_train, param_grid, cv=5):
-    """
-    Perform hyperparameter tuning using GridSearchCV.
+    # Select best model based on CV
+    best_model_name = max(cv_results, key=cv_results.get)
+    best_model = models[best_model_name]

-    Args:
-        pipeline: Pipeline to tune
-        X_train, y_train: Training data
-        param_grid: Dictionary of parameters to search
-        cv: Number of cross-validation folds
+    print(f"\nBest model: {best_model_name}")
+
+    # Hyperparameter tuning for best model
+    if best_model_name == 'Random Forest':
+        param_grid = {
+            'classifier__n_estimators': [100, 200],
+            'classifier__max_depth': [10, 20, None],
+            'classifier__min_samples_split': [2, 5]
+        }
+    elif best_model_name == 'Gradient Boosting':
+        param_grid = {
+            'classifier__n_estimators': [100, 200],
+            'classifier__learning_rate': [0.01, 0.1],
+            'classifier__max_depth': [3, 5]
+        }
+    else:  # Logistic Regression
+        param_grid = {
+            'classifier__C': [0.1, 1.0, 10.0],
+            'classifier__penalty': ['l2']
+        }
+
+    print("\n" + "="*60)
+    print("Hyperparameter Tuning")
+    print("="*60)

-    Returns:
-        GridSearchCV object with best model
-    """
    grid_search = GridSearchCV(
-        pipeline,
-        param_grid,
-        cv=cv,
-        scoring='f1_weighted',
-        n_jobs=-1,
-        verbose=1
+        best_model, param_grid, cv=5, scoring='accuracy',
+        n_jobs=-1, verbose=0
    )

    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
-    print(f"Best CV score: {grid_search.best_score_:.3f}")
+    print(f"Best CV score: {grid_search.best_score_:.4f}")

-    return grid_search
+    # Evaluate on test set
+    tuned_model = grid_search.best_estimator_
+    y_pred = tuned_model.predict(X_test)
+    y_pred_proba = tuned_model.predict_proba(X_test)

+    print("\n" + "="*60)
+    print("Test Set Evaluation")
+    print("="*60)

-def main():
-    """
-    Example usage of the classification pipeline.
-    """
-    # Load your data here
-    # X, y = load_data()
+    # Calculate metrics
+    accuracy = accuracy_score(y_test, y_pred)
+    precision = precision_score(y_test, y_pred, average='weighted')
+    recall = recall_score(y_test, y_pred, average='weighted')
+    f1 = f1_score(y_test, y_pred, average='weighted')

-    # Example with synthetic data
-    from sklearn.datasets import make_classification
-    X, y = make_classification(
-        n_samples=1000,
-        n_features=20,
-        n_informative=15,
-        n_redundant=5,
-        random_state=42
-    )
+    print(f"Accuracy:  {accuracy:.4f}")
+    print(f"Precision: {precision:.4f}")
+    print(f"Recall:    {recall:.4f}")
+    print(f"F1-Score:  {f1:.4f}")

-    # Convert to DataFrame for demonstration
-    feature_names = [f'feature_{i}' for i in range(X.shape[1])]
-    X = pd.DataFrame(X, columns=feature_names)
+    # ROC AUC (if binary classification)
+    if len(np.unique(y)) == 2:
+        roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
+        print(f"ROC AUC:   {roc_auc:.4f}")

-    # Split features into numeric and categorical (all numeric in this example)
-    numeric_features = feature_names
-    categorical_features = []
-
-    # Split data (use stratify for imbalanced classes)
-    X_train, X_test, y_train, y_test = train_test_split(
-        X, y, test_size=0.2, random_state=42, stratify=y
-    )
-
-    # Create preprocessing pipeline
-    preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)
-
-    # Create full pipeline
-    pipeline = create_full_pipeline(preprocessor)
-
-    # Train model
-    print("Training model...")
-    pipeline.fit(X_train, y_train)
-
-    # Evaluate model
-    print("\nEvaluating model...")
-    results = evaluate_model(pipeline, X_train, y_train, X_test, y_test)
-
-    print(f"CV Accuracy: {results['cv_mean']:.3f} (+/- {results['cv_std']:.3f})")
-    print(f"Test Accuracy: {results['test_score']:.3f}")
-    if results['auc']:
-        print(f"ROC-AUC: {results['auc']:.3f}")
-    print("\nClassification Report:")
-    print(results['classification_report'])
-
-    # Hyperparameter tuning (optional)
-    print("\nTuning hyperparameters...")
-    param_grid = {
-        'classifier__n_estimators': [100, 200],
-        'classifier__max_depth': [10, 20, None],
-        'classifier__min_samples_split': [2, 5]
-    }
-
-    grid_search = tune_hyperparameters(pipeline, X_train, y_train, param_grid)
-
-    # Evaluate best model
-    print("\nEvaluating tuned model...")
-    best_pipeline = grid_search.best_estimator_
-    y_pred = best_pipeline.predict(X_test)
+    print("\n" + "="*60)
+    print("Classification Report")
+    print("="*60)
    print(classification_report(y_test, y_pred))

-    # Save model
-    print("\nSaving model...")
-    joblib.dump(best_pipeline, 'best_model.pkl')
-    print("Model saved as 'best_model.pkl'")
+    print("\n" + "="*60)
+    print("Confusion Matrix")
+    print("="*60)
+    print(confusion_matrix(y_test, y_pred))
+
+    # Feature importance (if available)
+    if hasattr(tuned_model.named_steps['classifier'], 'feature_importances_'):
+        print("\n" + "="*60)
+        print("Top 10 Most Important Features")
+        print("="*60)
+
+        feature_names = tuned_model.named_steps['preprocessor'].get_feature_names_out()
+        importances = tuned_model.named_steps['classifier'].feature_importances_
+
+        feature_importance_df = pd.DataFrame({
+            'feature': feature_names,
+            'importance': importances
+        }).sort_values('importance', ascending=False).head(10)
+
+        print(feature_importance_df.to_string(index=False))
+
+    return {
+        'model': tuned_model,
+        'y_test': y_test,
+        'y_pred': y_pred,
+        'y_pred_proba': y_pred_proba,
+        'metrics': {
+            'accuracy': accuracy,
+            'precision': precision,
+            'recall': recall,
+            'f1': f1
+        }
+    }


+# Example usage
 if __name__ == "__main__":
-    main()
+    # Load example dataset
+    from sklearn.datasets import load_breast_cancer
+
+    # Load data
+    data = load_breast_cancer()
+    X = pd.DataFrame(data.data, columns=data.feature_names)
+    y = data.target
+
+    # For demonstration, treat all features as numeric
+    numeric_features = X.columns.tolist()
+    categorical_features = []
+
+    print("="*60)
+    print("Classification Pipeline Example")
+    print("Dataset: Breast Cancer Wisconsin")
+    print("="*60)
+
+    # Run complete pipeline
+    results = train_and_evaluate_model(
+        X, y, numeric_features, categorical_features,
+        test_size=0.2, random_state=42
+    )
+
+    print("\n" + "="*60)
+    print("Pipeline Complete!")
+    print("="*60)
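The `classifier__` prefix in the script's parameter grids follows the `<step name>__<parameter>` convention that `Pipeline` uses to route parameters to its steps. A minimal sketch of that convention, assuming a simplified two-step pipeline rather than the script's full `ColumnTransformer`; the names `pipe`, `search`, `best_C`, and `test_accuracy` are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step names chosen here ('scaler', 'classifier') become parameter prefixes
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 'classifier__C' sets the C parameter of the step named 'classifier'
param_grid = {"classifier__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["classifier__C"]
test_accuracy = search.score(X_test, y_test)  # uses the refit best pipeline
print(f"Best C: {best_C}, test accuracy: {test_accuracy:.3f}")
```

Calling `pipe.get_params().keys()` lists every valid key, which helps when a grid raises an invalid-parameter error.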
--- a/scientific-packages/scikit-learn/scripts/clustering_analysis.py
+++ b/scientific-packages/scikit-learn/scripts/clustering_analysis.py
@@ -1,291 +1,386 @@
-#!/usr/bin/env python3
 """
-Clustering analysis script with multiple algorithms and evaluation.
-Demonstrates k-means, DBSCAN, and hierarchical clustering with visualization.
+Clustering analysis example with multiple algorithms, evaluation, and visualization.
 """

 import numpy as np
 import pandas as pd
-from sklearn.preprocessing import StandardScaler
-from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
-from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
-from sklearn.decomposition import PCA
 import matplotlib.pyplot as plt
-import seaborn as sns
+from sklearn.preprocessing import StandardScaler
+from sklearn.decomposition import PCA
+from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
+from sklearn.mixture import GaussianMixture
+from sklearn.metrics import (
+    silhouette_score, calinski_harabasz_score, davies_bouldin_score
+)
+import warnings
+warnings.filterwarnings('ignore')


-def scale_data(X):
+def preprocess_for_clustering(X, scale=True, pca_components=None):
    """
-    Scale features using StandardScaler.
-    ALWAYS scale data before clustering!
+    Preprocess data for clustering.

-    Args:
-        X: Feature matrix
+    Parameters:
+    -----------
+    X : array-like
+        Feature matrix
+    scale : bool
+        Whether to standardize features
+    pca_components : int or None
+        Number of PCA components (None to skip PCA)

    Returns:
-        Scaled feature matrix and fitted scaler
+    --------
+    array
+        Preprocessed data
    """
-    scaler = StandardScaler()
-    X_scaled = scaler.fit_transform(X)
-    return X_scaled, scaler
+    X_processed = X.copy()
+
+    if scale:
+        scaler = StandardScaler()
+        X_processed = scaler.fit_transform(X_processed)
+
+    if pca_components is not None:
+        pca = PCA(n_components=pca_components)
+        X_processed = pca.fit_transform(X_processed)
+        print(f"PCA: Explained variance ratio = {pca.explained_variance_ratio_.sum():.3f}")
+
+    return X_processed


-def find_optimal_k(X_scaled, k_range=range(2, 11)):
+def find_optimal_k_kmeans(X, k_range=range(2, 11)):
    """
-    Find optimal number of clusters using elbow method and silhouette score.
+    Find optimal K for K-Means using elbow method and silhouette score.

-    Args:
-        X_scaled: Scaled feature matrix
-        k_range: Range of k values to try
+    Parameters:
+    -----------
+    X : array-like
+        Feature matrix (should be scaled)
+    k_range : range
+        Range of K values to test

    Returns:
-        Dictionary with inertias and silhouette scores
+    --------
+    dict
+        Dictionary with inertia and silhouette scores for each K
    """
    inertias = []
    silhouette_scores = []

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
-        labels = kmeans.fit_predict(X_scaled)
+        labels = kmeans.fit_predict(X)
+
        inertias.append(kmeans.inertia_)
-        silhouette_scores.append(silhouette_score(X_scaled, labels))
+        silhouette_scores.append(silhouette_score(X, labels))
+
+    # Plot results
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
+
+    # Elbow plot
+    ax1.plot(k_range, inertias, 'bo-')
+    ax1.set_xlabel('Number of clusters (K)')
+    ax1.set_ylabel('Inertia')
+    ax1.set_title('Elbow Method')
+    ax1.grid(True)
+
+    # Silhouette plot
+    ax2.plot(k_range, silhouette_scores, 'ro-')
+    ax2.set_xlabel('Number of clusters (K)')
+    ax2.set_ylabel('Silhouette Score')
+    ax2.set_title('Silhouette Analysis')
+    ax2.grid(True)
+
+    plt.tight_layout()
+    plt.savefig('clustering_optimization.png', dpi=300, bbox_inches='tight')
+    print("Saved: clustering_optimization.png")
+    plt.close()
+
+    # Find best K based on silhouette score
+    best_k = k_range[np.argmax(silhouette_scores)]
+    print(f"\nRecommended K based on silhouette score: {best_k}")

    return {
        'k_values': list(k_range),
        'inertias': inertias,
-        'silhouette_scores': silhouette_scores
+        'silhouette_scores': silhouette_scores,
+        'best_k': best_k
    }


-def plot_elbow_silhouette(results):
+def compare_clustering_algorithms(X, n_clusters=3):
    """
-    Plot elbow method and silhouette scores.
+    Compare different clustering algorithms.

-    Args:
-        results: Dictionary from find_optimal_k
-    """
-    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
-
-    # Elbow plot
-    ax1.plot(results['k_values'], results['inertias'], 'bo-')
-    ax1.set_xlabel('Number of clusters (k)')
-    ax1.set_ylabel('Inertia')
-    ax1.set_title('Elbow Method')
-    ax1.grid(True, alpha=0.3)
-
-    # Silhouette plot
-    ax2.plot(results['k_values'], results['silhouette_scores'], 'ro-')
-    ax2.set_xlabel('Number of clusters (k)')
-    ax2.set_ylabel('Silhouette Score')
-    ax2.set_title('Silhouette Score vs k')
-    ax2.grid(True, alpha=0.3)
-
-    plt.tight_layout()
-    plt.savefig('elbow_silhouette.png', dpi=300, bbox_inches='tight')
-    print("Saved elbow and silhouette plots to 'elbow_silhouette.png'")
-    plt.close()
-
-
-def evaluate_clustering(X_scaled, labels, algorithm_name):
-    """
-    Evaluate clustering using multiple metrics.
-
-    Args:
-        X_scaled: Scaled feature matrix
-        labels: Cluster labels
-        algorithm_name: Name of clustering algorithm
+    Parameters:
+    -----------
+    X : array-like
+        Feature matrix (should be scaled)
+    n_clusters : int
+        Number of clusters

    Returns:
-        Dictionary with evaluation metrics
+    --------
+    dict
+        Dictionary with results for each algorithm
    """
-    # Filter out noise points for DBSCAN (-1 labels)
-    mask = labels != -1
-    X_filtered = X_scaled[mask]
-    labels_filtered = labels[mask]
+    print("="*60)
+    print(f"Comparing Clustering Algorithms (n_clusters={n_clusters})")
+    print("="*60)

-    n_clusters = len(set(labels_filtered))
-    n_noise = list(labels).count(-1)
-
-    results = {
-        'algorithm': algorithm_name,
-        'n_clusters': n_clusters,
-        'n_noise': n_noise
+    algorithms = {
+        'K-Means': KMeans(n_clusters=n_clusters, random_state=42, n_init=10),
+        'Agglomerative': AgglomerativeClustering(n_clusters=n_clusters, linkage='ward'),
+        'Gaussian Mixture': GaussianMixture(n_components=n_clusters, random_state=42)
    }

-    # Calculate metrics if we have valid clusters
-    if n_clusters > 1:
-        results['silhouette'] = silhouette_score(X_filtered, labels_filtered)
-        results['davies_bouldin'] = davies_bouldin_score(X_filtered, labels_filtered)
-        results['calinski_harabasz'] = calinski_harabasz_score(X_filtered, labels_filtered)
+    # DBSCAN infers the number of clusters from density, so it is run separately.
+    # eps=0.5 and min_samples=5 are the library defaults and usually need tuning.
+    dbscan = DBSCAN(eps=0.5, min_samples=5)
+    dbscan_labels = dbscan.fit_predict(X)
+
+    results = {}
+
+    for name, algorithm in algorithms.items():
+        labels = algorithm.fit_predict(X)
+
+        # Calculate metrics
+        silhouette = silhouette_score(X, labels)
+        calinski = calinski_harabasz_score(X, labels)
+        davies = davies_bouldin_score(X, labels)
+
+        results[name] = {
+            'labels': labels,
+            'n_clusters': n_clusters,
+            'silhouette': silhouette,
+            'calinski_harabasz': calinski,
+            'davies_bouldin': davies
+        }
+
+        print(f"\n{name}:")
+        print(f"  Silhouette Score:       {silhouette:.4f} (higher is better)")
+        print(f"  Calinski-Harabasz:      {calinski:.4f} (higher is better)")
+        print(f"  Davies-Bouldin:         {davies:.4f} (lower is better)")
+
+    # DBSCAN results
+    n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
+    n_noise = list(dbscan_labels).count(-1)
+
+    if n_clusters_dbscan > 1:
+        # Metrics need at least two clusters; noise points are excluded first
+        mask = dbscan_labels != -1  # True for non-noise points
+        if mask.sum() > 0:
+            silhouette = silhouette_score(X[mask], dbscan_labels[mask])
+            calinski = calinski_harabasz_score(X[mask], dbscan_labels[mask])
+            davies = davies_bouldin_score(X[mask], dbscan_labels[mask])
+
+            results['DBSCAN'] = {
+                'labels': dbscan_labels,
+                'n_clusters': n_clusters_dbscan,
+                'n_noise': n_noise,
+                'silhouette': silhouette,
+                'calinski_harabasz': calinski,
+                'davies_bouldin': davies
+            }
+
+            print("\nDBSCAN:")
+            print(f"  Clusters found:         {n_clusters_dbscan}")
+            print(f"  Noise points:           {n_noise}")
+            print(f"  Silhouette Score:       {silhouette:.4f} (higher is better)")
+            print(f"  Calinski-Harabasz:      {calinski:.4f} (higher is better)")
+            print(f"  Davies-Bouldin:         {davies:.4f} (lower is better)")
    else:
-        results['silhouette'] = None
-        results['davies_bouldin'] = None
-        results['calinski_harabasz'] = None
+        print("\nDBSCAN:")
+        print(f"  Clusters found:         {n_clusters_dbscan}")
+        print(f"  Noise points:           {n_noise}")
+        print("  Note: Insufficient clusters for metric calculation")

    return results


-def perform_kmeans(X_scaled, n_clusters=3):
+def visualize_clusters(X, results, true_labels=None):
    """
-    Perform k-means clustering.
+    Visualize clustering results using PCA for 2D projection.

-    Args:
-        X_scaled: Scaled feature matrix
-        n_clusters: Number of clusters
-
-    Returns:
-        Fitted KMeans model and labels
+    Parameters:
+    -----------
+    X : array-like
+        Feature matrix
+    results : dict
+        Dictionary with clustering results
+    true_labels : array-like or None
+        True labels (if available) for comparison
    """
-    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
-    labels = kmeans.fit_predict(X_scaled)
-    return kmeans, labels
+    # Reduce to 2D using PCA (seeded so the projection is reproducible)
+    pca = PCA(n_components=2, random_state=42)
+    X_2d = pca.fit_transform(X)

+    # Determine number of subplots
+    n_plots = len(results)
+    if true_labels is not None:
+        n_plots += 1

-def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
-    """
-    Perform DBSCAN clustering.
+    n_cols = min(3, n_plots)
+    n_rows = (n_plots + n_cols - 1) // n_cols

-    Args:
-        X_scaled: Scaled feature matrix
-        eps: Maximum distance between neighbors
-        min_samples: Minimum points to form dense region
+    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows))
+    if n_plots == 1:
+        axes = np.array([axes])
+    axes = axes.flatten()

-    Returns:
-        Fitted DBSCAN model and labels
-    """
-    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
-    labels = dbscan.fit_predict(X_scaled)
-    return dbscan, labels
+    plot_idx = 0

+    # Plot true labels if available
+    if true_labels is not None:
+        ax = axes[plot_idx]
+        scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=true_labels, cmap='viridis', alpha=0.6)
+        ax.set_title('True Labels')
+        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
+        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
+        plt.colorbar(scatter, ax=ax)
+        plot_idx += 1

-def perform_hierarchical(X_scaled, n_clusters=3, linkage='ward'):
-    """
-    Perform hierarchical clustering.
+    # Plot clustering results
+    for name, result in results.items():
+        ax = axes[plot_idx]
+        labels = result['labels']

-    Args:
-        X_scaled: Scaled feature matrix
-        n_clusters: Number of clusters
-        linkage: Linkage criterion ('ward', 'complete', 'average', 'single')
+        scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6)

-    Returns:
-        Fitted AgglomerativeClustering model and labels
-    """
-    hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
-    labels = hierarchical.fit_predict(X_scaled)
-    return hierarchical, labels
+        # Highlight noise points for DBSCAN
+        if name == 'DBSCAN' and -1 in labels:
+            noise_mask = labels == -1
+            ax.scatter(X_2d[noise_mask, 0], X_2d[noise_mask, 1],
+                      c='red', marker='x', s=100, label='Noise', alpha=0.8)
+            ax.legend()

+        title = f"{name} (K={result['n_clusters']})"
+        if 'silhouette' in result:
+            title += f"\nSilhouette: {result['silhouette']:.3f}"
+        ax.set_title(title)
+        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
+        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
+        plt.colorbar(scatter, ax=ax)

-def visualize_clusters_2d(X_scaled, labels, algorithm_name, method='pca'):
-    """
-    Visualize clusters in 2D using PCA or t-SNE.
+        plot_idx += 1

-    Args:
-        X_scaled: Scaled feature matrix
-        labels: Cluster labels
-        algorithm_name: Name of algorithm for title
-        method: 'pca' or 'tsne'
-    """
-    # Reduce to 2D
-    if method == 'pca':
-        pca = PCA(n_components=2, random_state=42)
-        X_2d = pca.fit_transform(X_scaled)
-        variance = pca.explained_variance_ratio_
-        xlabel = f'PC1 ({variance[0]:.1%} variance)'
-        ylabel = f'PC2 ({variance[1]:.1%} variance)'
-    else:
-        from sklearn.manifold import TSNE
-        # Use PCA first to speed up t-SNE
-        pca = PCA(n_components=min(50, X_scaled.shape[1]), random_state=42)
-        X_pca = pca.fit_transform(X_scaled)
-        tsne = TSNE(n_components=2, random_state=42, perplexity=30)
-        X_2d = tsne.fit_transform(X_pca)
-        xlabel = 't-SNE 1'
-        ylabel = 't-SNE 2'
+    # Hide unused subplots
+    for idx in range(plot_idx, len(axes)):
+        axes[idx].axis('off')

-    # Plot
-    plt.figure(figsize=(10, 8))
-    scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', alpha=0.6, s=50)
-    plt.colorbar(scatter, label='Cluster')
-    plt.xlabel(xlabel)
-    plt.ylabel(ylabel)
-    plt.title(f'{algorithm_name} Clustering ({method.upper()})')
-    plt.grid(True, alpha=0.3)
-
-    filename = f'{algorithm_name.lower().replace(" ", "_")}_{method}.png'
-    plt.savefig(filename, dpi=300, bbox_inches='tight')
-    print(f"Saved visualization to '{filename}'")
+    plt.tight_layout()
+    plt.savefig('clustering_results.png', dpi=300, bbox_inches='tight')
+    print("\nSaved: clustering_results.png")
    plt.close()


-def main():
+def complete_clustering_analysis(X, true_labels=None, scale=True,
+                                 find_k=True, k_range=range(2, 11), n_clusters=3):
    """
-    Example clustering analysis workflow.
+    Complete clustering analysis workflow.
+
+    Parameters:
+    -----------
+    X : array-like
+        Feature matrix
+    true_labels : array-like or None
+        True labels (for comparison only, not used in clustering)
+    scale : bool
+        Whether to scale features
+    find_k : bool
+        Whether to search for optimal K
+    k_range : range
+        Range of K values to test
+    n_clusters : int
+        Number of clusters for the comparison; overridden by the best K when find_k is True
+
+    Returns:
+    --------
+    dict
+        Dictionary with all analysis results
    """
-    # Load your data here
-    # X = load_data()
+    print("="*60)
+    print("Clustering Analysis")
+    print("="*60)
+    print(f"Data shape: {X.shape}")

-    # Example with synthetic data
-    from sklearn.datasets import make_blobs
-    X, y_true = make_blobs(
-        n_samples=500,
-        n_features=10,
-        centers=4,
-        cluster_std=1.0,
-        random_state=42
-    )
+    # Preprocess data
+    X_processed = preprocess_for_clustering(X, scale=scale)

-    print(f"Dataset shape: {X.shape}")
+    # Find optimal K if requested
+    optimization_results = None
+    if find_k:
+        print("\n" + "="*60)
+        print("Finding Optimal Number of Clusters")
+        print("="*60)
+        optimization_results = find_optimal_k_kmeans(X_processed, k_range=k_range)

-    # Scale data (ALWAYS scale for clustering!)
-    print("\nScaling data...")
-    X_scaled, scaler = scale_data(X)
+        # Use recommended K
+        if optimization_results:
+            n_clusters = optimization_results['best_k']

-    # Find optimal k
-    print("\nFinding optimal number of clusters...")
-    results = find_optimal_k(X_scaled)
-    plot_elbow_silhouette(results)
+    # Compare clustering algorithms
+    comparison_results = compare_clustering_algorithms(X_processed, n_clusters=n_clusters)

-    # Based on elbow/silhouette, choose optimal k
-    optimal_k = 4  # Adjust based on plots
-
-    # Perform k-means
-    print(f"\nPerforming k-means with k={optimal_k}...")
-    kmeans, kmeans_labels = perform_kmeans(X_scaled, n_clusters=optimal_k)
-    kmeans_results = evaluate_clustering(X_scaled, kmeans_labels, 'K-Means')
-
-    # Perform DBSCAN
-    print("\nPerforming DBSCAN...")
-    dbscan, dbscan_labels = perform_dbscan(X_scaled, eps=0.5, min_samples=5)
-    dbscan_results = evaluate_clustering(X_scaled, dbscan_labels, 'DBSCAN')
-
-    # Perform hierarchical clustering
-    print("\nPerforming hierarchical clustering...")
-    hierarchical, hier_labels = perform_hierarchical(X_scaled, n_clusters=optimal_k)
-    hier_results = evaluate_clustering(X_scaled, hier_labels, 'Hierarchical')
-
-    # Print results
+    # Visualize results
    print("\n" + "="*60)
-    print("CLUSTERING RESULTS")
+    print("Visualizing Results")
+    print("="*60)
+    visualize_clusters(X_processed, comparison_results, true_labels=true_labels)
+
+    return {
+        'X_processed': X_processed,
+        'optimization': optimization_results,
+        'comparison': comparison_results
+    }
+
+
+# Example usage
+if __name__ == "__main__":
+    from sklearn.datasets import load_iris, make_blobs
+
+    print("="*60)
+    print("Example 1: Iris Dataset")
    print("="*60)

-    for results in [kmeans_results, dbscan_results, hier_results]:
-        print(f"\n{results['algorithm']}:")
-        print(f"  Clusters: {results['n_clusters']}")
-        if results['n_noise'] > 0:
-            print(f"  Noise points: {results['n_noise']}")
-        if results['silhouette']:
-            print(f"  Silhouette Score: {results['silhouette']:.3f}")
-            print(f"  Davies-Bouldin Index: {results['davies_bouldin']:.3f} (lower is better)")
-            print(f"  Calinski-Harabasz Index: {results['calinski_harabasz']:.1f} (higher is better)")
+    # Load Iris dataset
+    iris = load_iris()
+    X_iris = iris.data
+    y_iris = iris.target

-    # Visualize clusters
-    print("\nCreating visualizations...")
-    visualize_clusters_2d(X_scaled, kmeans_labels, 'K-Means', method='pca')
-    visualize_clusters_2d(X_scaled, dbscan_labels, 'DBSCAN', method='pca')
-    visualize_clusters_2d(X_scaled, hier_labels, 'Hierarchical', method='pca')
+    results_iris = complete_clustering_analysis(
+        X_iris,
+        true_labels=y_iris,
+        scale=True,
+        find_k=True,
+        k_range=range(2, 8),
+        n_clusters=3
+    )

-    print("\nClustering analysis complete!")
+    print("\n" + "="*60)
+    print("Example 2: Synthetic Dataset with Noise")
+    print("="*60)

+    # Create synthetic dataset
+    X_synth, y_synth = make_blobs(
+        n_samples=500, n_features=2, centers=4,
+        cluster_std=0.5, random_state=42
+    )

-if __name__ == "__main__":
-    main()
+    # Add noise points (seeded so the example is reproducible)
+    noise = np.random.default_rng(42).normal(scale=3.0, size=(50, 2))
+    X_synth = np.vstack([X_synth, noise])
+    y_synth_with_noise = np.concatenate([y_synth, np.full(50, -1)])
+
+    results_synth = complete_clustering_analysis(
+        X_synth,
+        true_labels=y_synth_with_noise,
+        scale=True,
+        find_k=True,
+        k_range=range(2, 8),
+        n_clusters=4
+    )
+
+    print("\n" + "="*60)
+    print("Analysis Complete!")
+    print("="*60)