Add more scientific skills

This commit is contained in:
Timothy Kassis
2025-10-19 14:12:02 -07:00
parent 78d5ac2b56
commit 660c8574d0
210 changed files with 88957 additions and 1 deletions


@@ -0,0 +1,728 @@
# Unsupervised Learning in scikit-learn
## Overview
Unsupervised learning discovers patterns in data without labeled targets. Main tasks include clustering (grouping similar samples), dimensionality reduction (reducing feature count), and anomaly detection (finding outliers).
## Clustering Algorithms
### K-Means
Groups data into k clusters by minimizing within-cluster variance.
**Algorithm**:
1. Initialize k centroids (k-means++ initialization recommended)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence
```python
from sklearn.cluster import KMeans
kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # Smart initialization
    n_init=10,         # Number of times to run with different seeds
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```
**Use cases**:
- Customer segmentation
- Image compression
- Data preprocessing (clustering as features)
**Strengths**:
- Fast and scalable
- Simple to understand
- Works well with spherical clusters
**Limitations**:
- Assumes spherical clusters of similar size
- Sensitive to initialization (mitigated by k-means++)
- Must specify k beforehand
- Sensitive to outliers
**Choosing k**: Use the elbow method, silhouette score, or domain knowledge (a silhouette sweep is sketched after the variants list below)
**Variants**:
- **MiniBatchKMeans**: Faster for large datasets, uses mini-batches
- **KMeans with n_init='auto'**: Adaptive number of initializations
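A minimal sketch of a silhouette-based sweep for k, assuming `X` is an already-scaled feature matrix; the candidate range 2 to 10 is arbitrary:
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-Means for each candidate k and keep the silhouette score
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```
Silhouette favors compact, well-separated clusters, so confirm the choice against the elbow plot or domain knowledge.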
### DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Identifies clusters as dense regions separated by sparse areas.
```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(
    eps=0.5,          # Maximum distance between neighbors
    min_samples=5,    # Minimum points to form dense region
    metric='euclidean'
)
labels = dbscan.fit_predict(X)
# -1 indicates noise/outliers
```
**Use cases**:
- Arbitrary cluster shapes
- Outlier detection
- When cluster count is unknown
- Geographic/spatial data
**Strengths**:
- Discovers arbitrary-shaped clusters
- Automatically detects outliers
- Doesn't require specifying number of clusters
- Robust to outliers
**Limitations**:
- Struggles with varying densities
- Sensitive to eps and min_samples parameters
- Not deterministic (border points may vary)
**Parameter tuning**:
- `eps`: Plot the k-distance graph and look for an elbow (sketch after this list)
- `min_samples`: Rule of thumb: 2 * dimensions
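A sketch of the k-distance graph mentioned above, assuming `min_samples=5`; the eps candidate is read off the elbow of the sorted curve:
```python
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

k = 5  # assumed min_samples
# k + 1 neighbors because each point's nearest neighbor is itself
distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])  # distance to each point's k-th neighbor
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}th nearest neighbor')
plt.show()  # choose eps near the elbow of this curve
```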
### HDBSCAN
Hierarchical DBSCAN that handles variable cluster densities.
```python
from sklearn.cluster import HDBSCAN
hdbscan = HDBSCAN(
    min_cluster_size=5,
    min_samples=None,  # Defaults to min_cluster_size
    metric='euclidean'
)
labels = hdbscan.fit_predict(X)
```
**Advantages over DBSCAN**:
- Handles variable density clusters
- More robust parameter selection
- Provides cluster membership probabilities (sketch after the use cases)
- Hierarchical structure
**Use cases**: When DBSCAN struggles with varying densities
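A small sketch of the membership probabilities noted above; the 0.5 cutoff is an arbitrary illustration:
```python
from sklearn.cluster import HDBSCAN

hdb = HDBSCAN(min_cluster_size=5)
labels = hdb.fit_predict(X)
probs = hdb.probabilities_                      # membership strength, 0.0 for noise
weak_members = (labels != -1) & (probs < 0.5)   # clustered points with weak membership
```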
### Hierarchical Clustering
Builds nested cluster hierarchies using an agglomerative (bottom-up) approach.
```python
from sklearn.cluster import AgglomerativeClustering
agg_clust = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',  # 'ward', 'complete', 'average', 'single'
    metric='euclidean'
)
labels = agg_clust.fit_predict(X)
# Visualize with dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage as scipy_linkage
import matplotlib.pyplot as plt
linkage_matrix = scipy_linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
```
**Linkage methods**:
- `ward`: Minimizes variance (only with Euclidean) - **most common**
- `complete`: Maximum distance between clusters
- `average`: Average distance between clusters
- `single`: Minimum distance between clusters
**Use cases**:
- When hierarchical structure is meaningful
- Taxonomy/phylogenetic trees
- When visualization is important (dendrograms)
**Strengths**:
- No need to specify k initially (cut the dendrogram at the desired level; see the distance_threshold sketch after this list)
- Produces hierarchy of clusters
- Deterministic
**Limitations**:
- Computationally expensive (O(n²) to O(n³))
- Not suitable for large datasets
- Cannot undo previous merges
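A sketch of cutting the tree by distance instead of fixing k up front; the threshold value 10.0 is an arbitrary placeholder that depends on the scale of the data:
```python
from sklearn.cluster import AgglomerativeClustering

# n_clusters=None + distance_threshold cuts the dendrogram at a chosen height
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0, linkage='ward')
labels = agg.fit_predict(X)
print("Clusters found:", agg.n_clusters_)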
### Spectral Clustering
Embeds the data using the eigenvectors of a graph Laplacian built from an affinity matrix, then clusters in that low-dimensional space.
```python
from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(
    n_clusters=3,
    affinity='rbf',  # 'rbf', 'nearest_neighbors', 'precomputed'
    gamma=1.0,
    n_neighbors=10,
    random_state=42
)
labels = spectral.fit_predict(X)
```
**Use cases**:
- Non-convex clusters
- Image segmentation
- Graph clustering
- When similarity matrix is available
**Strengths**:
- Handles non-convex clusters
- Works with similarity matrices
- Often better than k-means for complex shapes
**Limitations**:
- Computationally expensive
- Requires specifying number of clusters
- Memory intensive
### Mean Shift
Discovers clusters by iteratively shifting candidate centroids toward the densest regions (modes) of the data.
```python
from sklearn.cluster import MeanShift, estimate_bandwidth
# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)
cluster_centers = mean_shift.cluster_centers_
```
**Use cases**:
- When cluster count is unknown
- Computer vision applications
- Object tracking
**Strengths**:
- Automatically determines number of clusters
- Handles arbitrary shapes
- No assumptions about cluster shape
**Limitations**:
- Computationally expensive
- Very sensitive to bandwidth parameter
- Doesn't scale well
### Affinity Propagation
Uses message-passing between samples to identify exemplars.
```python
from sklearn.cluster import AffinityPropagation
affinity_prop = AffinityPropagation(
    damping=0.5,      # Damping factor (0.5-1.0)
    preference=None,  # Self-preference (controls number of clusters)
    random_state=42
)
labels = affinity_prop.fit_predict(X)
exemplars = affinity_prop.cluster_centers_indices_
```
**Use cases**:
- When number of clusters is unknown
- When exemplars (representative samples) are needed
**Strengths**:
- Automatically determines number of clusters
- Identifies exemplar samples
- No initialization required
**Limitations**:
- Very slow: O(n²t) where t is iterations
- Not suitable for large datasets
- Memory intensive
### Gaussian Mixture Models (GMM)
Probabilistic model assuming the data comes from a mixture of Gaussian distributions.
```python
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X) # Soft clustering
```
**Covariance types**:
- `full`: Each component has its own covariance matrix
- `tied`: All components share same covariance
- `diag`: Diagonal covariance (independent features)
- `spherical`: Spherical covariance (isotropic)
**Use cases**:
- When soft clustering is needed (probabilities)
- When clusters have different shapes/sizes
- Generative modeling
- Density estimation
**Strengths**:
- Provides probabilities (soft clustering)
- Can handle elliptical clusters
- Generative model (can sample new data; see the sampling sketch after the limitations)
- Model selection with BIC/AIC
**Limitations**:
- Assumes Gaussian distributions
- Sensitive to initialization
- Can converge to local optima
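A minimal sketch of the generative use mentioned above, drawing synthetic points from a fitted mixture:
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
# Draw 100 synthetic samples plus the component each one came from
X_new, components = gmm.sample(n_samples=100)
```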
**Model selection**:
```python
from sklearn.mixture import GaussianMixture
import numpy as np
n_components_range = range(2, 10)
bic_scores = []
for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bic_scores.append(gmm.bic(X))
optimal_n = n_components_range[np.argmin(bic_scores)]
```
### BIRCH
Builds a Clustering Feature (CF) tree for memory-efficient processing of large datasets.
```python
from sklearn.cluster import Birch
birch = Birch(
    n_clusters=3,
    threshold=0.5,
    branching_factor=50
)
labels = birch.fit_predict(X)
```
**Use cases**:
- Very large datasets
- Streaming data
- Memory constraints
**Strengths**:
- Memory efficient
- Single pass over data
- Incremental learning via `partial_fit` (streaming sketch below)
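A streaming-style sketch of the incremental use noted above; `np.array_split` stands in for data that arrives or is read in chunks:
```python
from sklearn.cluster import Birch
import numpy as np

birch = Birch(n_clusters=3, threshold=0.5)
for chunk in np.array_split(X, 10):  # stand-in for a real data stream
    birch.partial_fit(chunk)
labels = birch.predict(X)
```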
## Dimensionality Reduction
### Principal Component Analysis (PCA)
Finds orthogonal components that explain maximum variance.
```python
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Specify number of components
pca = PCA(n_components=2, random_state=42)
X_transformed = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
# Or specify variance to retain
pca = PCA(n_components=0.95) # Keep 95% of variance
X_transformed = pca.fit_transform(X)
print(f"Components needed: {pca.n_components_}")
# Visualize explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```
**Use cases**:
- Visualization (reduce to 2-3 dimensions)
- Remove multicollinearity
- Noise reduction
- Speed up training
- Feature extraction
**Strengths**:
- Fast and efficient
- Reduces multicollinearity
- Works well for linear relationships
- Interpretable components
**Limitations**:
- Only linear transformations
- Sensitive to scaling (always standardize first!)
- Components may be hard to interpret
**Variants**:
- **IncrementalPCA**: For datasets that don't fit in memory (sketch after this list)
- **KernelPCA**: Non-linear dimensionality reduction
- **SparsePCA**: Sparse loadings for interpretability
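A minimal IncrementalPCA sketch for the out-of-core case mentioned above; here `np.array_split` stands in for batches read from disk, and each batch must contain at least `n_components` samples:
```python
from sklearn.decomposition import IncrementalPCA
import numpy as np

ipca = IncrementalPCA(n_components=2, batch_size=200)
for chunk in np.array_split(X, 10):  # stand-in for batches that fit in memory
    ipca.partial_fit(chunk)
X_reduced = ipca.transform(X)
```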
### t-SNE
t-Distributed Stochastic Neighbor Embedding for visualization.
```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(
    n_components=2,
    perplexity=30,  # Balance local vs global structure (5-50)
    learning_rate='auto',
    max_iter=1000,  # named n_iter in older scikit-learn releases
    random_state=42
)
X_embedded = tsne.fit_transform(X)
# Visualize (colored by known labels y, if available)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.show()
```
**Use cases**:
- Visualization only (do not use for preprocessing!)
- Exploring high-dimensional data
- Finding clusters visually
**Important notes**:
- **Only for visualization**, not for preprocessing
- Each run produces different results (use random_state for reproducibility)
- Slow for large datasets
- Cannot transform new data (no transform() method)
**Parameter tuning**:
- `perplexity`: 5-50, larger for larger datasets
- Lower perplexity = focus on local structure
- Higher perplexity = focus on global structure
### UMAP
Uniform Manifold Approximation and Projection (requires the third-party umap-learn package); a minimal sketch follows the list below.
**Advantages over t-SNE**:
- Preserves global structure better
- Faster
- Can transform new data
- Can be used for preprocessing (not just visualization)
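A minimal sketch, assuming umap-learn is installed (`pip install umap-learn`); the parameter values shown are common defaults, not recommendations:
```python
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_embedded = reducer.fit_transform(X)
X_new_embedded = reducer.transform(X_new)  # unlike t-SNE, new data can be embedded
```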
### Truncated SVD (LSA)
Similar to PCA but does not center the data, so it works directly with sparse matrices (e.g., TF-IDF).
```python
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
```
**Use cases**:
- Text data (after TF-IDF; see the LSA pipeline sketch after this list)
- Sparse matrices
- Latent Semantic Analysis (LSA)
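A sketch of the LSA workflow on text, assuming `docs` is a list of raw document strings; the vectorizer settings are placeholders:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

lsa = make_pipeline(
    TfidfVectorizer(max_features=10000),              # sparse TF-IDF matrix
    TruncatedSVD(n_components=100, random_state=42),  # reduce to latent topics
    Normalizer(copy=False),                           # common post-step for LSA
)
X_lsa = lsa.fit_transform(docs)
```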
### Non-negative Matrix Factorization (NMF)
Factorizes data into non-negative components.
```python
from sklearn.decomposition import NMF
nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X) # Document-topic matrix
H = nmf.components_ # Topic-word matrix
```
**Use cases**:
- Topic modeling
- Audio source separation
- Image processing
- When non-negativity is important (e.g., counts)
**Strengths**:
- Interpretable components (additive, non-negative)
- Sparse representations
### Independent Component Analysis (ICA)
Separates a multivariate signal into statistically independent components.
```python
from sklearn.decomposition import FastICA
ica = FastICA(n_components=10, random_state=42)
X_independent = ica.fit_transform(X)
```
**Use cases**:
- Blind source separation (toy sketch after this list)
- Signal processing
- Feature extraction when independence is expected
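A self-contained toy sketch of blind source separation: two known signals are mixed, and FastICA recovers them up to scale and ordering:
```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # sine + square wave
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
X_mixed = sources @ mixing.T                            # observed mixed signals
ica = FastICA(n_components=2, random_state=42)
recovered = ica.fit_transform(X_mixed)                  # estimated sources
```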
### Factor Analysis
Models observed variables as linear combinations of latent factors plus noise.
```python
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=5, random_state=42)
X_factors = fa.fit_transform(X)
```
**Use cases**:
- When noise is heteroscedastic
- Latent variable modeling
- Psychology/social science research
**Difference from PCA**: Models noise explicitly, assumes features have independent noise
## Anomaly Detection
### One-Class SVM
Learns a boundary around the normal data.
```python
from sklearn.svm import OneClassSVM
oc_svm = OneClassSVM(
    nu=0.1,        # Upper bound on the expected fraction of outliers
    kernel='rbf',
    gamma='auto'
)
oc_svm.fit(X_train)
predictions = oc_svm.predict(X_test) # 1 for inliers, -1 for outliers
```
**Use cases**:
- Novelty detection
- When only normal data is available for training
### Isolation Forest
Isolates outliers using an ensemble of random trees; anomalies require fewer random splits to isolate.
```python
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
    contamination=0.1,  # Expected proportion of outliers
    random_state=42
)
predictions = iso_forest.fit_predict(X) # 1 for inliers, -1 for outliers
scores = iso_forest.score_samples(X) # Anomaly scores
```
**Use cases**:
- General anomaly detection
- Works well with high-dimensional data
- Fast and scalable
**Strengths**:
- Fast
- Effective in high dimensions
- Low memory requirements
### Local Outlier Factor (LOF)
Detects outliers based on local density deviation.
```python
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
predictions = lof.fit_predict(X) # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_ # Anomaly scores (negative)
```
**Use cases**:
- Finding local outliers
- When global methods fail
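The estimator above is in outlier-detection mode and cannot score unseen data; scikit-learn's `novelty=True` option switches LOF to novelty detection. A minimal sketch, assuming `X_train` contains only normal samples and `X_test` is new data:
```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)                        # fit on clean (outlier-free) data
predictions = lof.predict(X_test)       # 1 for inliers, -1 for novelties
scores = lof.decision_function(X_test)  # higher = more normal
```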
## Clustering Evaluation
### With Ground Truth Labels
When true labels are available (for validation):
**Adjusted Rand Index (ARI)**:
```python
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_true, y_pred)
# Range: [-1, 1], 1 = perfect, 0 = random
```
**Normalized Mutual Information (NMI)**:
```python
from sklearn.metrics import normalized_mutual_info_score
nmi = normalized_mutual_info_score(y_true, y_pred)
# Range: [0, 1], 1 = perfect
```
**V-Measure**:
```python
from sklearn.metrics import v_measure_score
v = v_measure_score(y_true, y_pred)
# Range: [0, 1], harmonic mean of homogeneity and completeness
```
### Without Ground Truth Labels
When true labels are unavailable (unsupervised evaluation):
**Silhouette Score**:
Measures how similar objects are to their own cluster vs other clusters.
```python
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import numpy as np
score = silhouette_score(X, labels)
# Range: [-1, 1], higher is better
# >0.7: Strong structure
# 0.5-0.7: Reasonable structure
# 0.25-0.5: Weak structure
# <0.25: No substantial structure
# Per-sample scores for detailed analysis
sample_scores = silhouette_samples(X, labels)
# Visualize silhouette plot (one horizontal band per cluster)
y_lower = 0
for i in np.unique(labels):
    cluster_scores = np.sort(sample_scores[labels == i])
    plt.barh(np.arange(y_lower, y_lower + len(cluster_scores)), cluster_scores)
    y_lower += len(cluster_scores)
plt.axvline(x=score, color='red', linestyle='--')  # mean silhouette score
plt.show()
```
**Davies-Bouldin Index**:
```python
from sklearn.metrics import davies_bouldin_score
db = davies_bouldin_score(X, labels)
# Lower is better, 0 = perfect
```
**Calinski-Harabasz Index** (Variance Ratio Criterion):
```python
from sklearn.metrics import calinski_harabasz_score
ch = calinski_harabasz_score(X, labels)
# Higher is better
```
**Inertia** (K-Means specific):
```python
inertia = kmeans.inertia_
# Sum of squared distances to nearest cluster center
# Use for elbow method
```
### Elbow Method (K-Means)
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia starts decreasing more slowly
```
## Best Practices
### Clustering Algorithm Selection
**Use K-Means when**:
- Clusters are spherical and similar size
- Speed is important
- Data is not too high-dimensional
**Use DBSCAN when**:
- Arbitrary cluster shapes
- Number of clusters unknown
- Outlier detection needed
**Use Hierarchical when**:
- Hierarchy is meaningful
- Small to medium datasets
- Visualization is important
**Use GMM when**:
- Soft clustering needed
- Clusters have different shapes/sizes
- Probabilistic interpretation needed
**Use Spectral Clustering when**:
- Non-convex clusters
- Have similarity matrix
- Moderate dataset size
### Preprocessing for Clustering
1. **Always scale features**: Use StandardScaler or MinMaxScaler (see the pipeline sketch after this list)
2. **Handle outliers**: Remove or use robust algorithms (DBSCAN, HDBSCAN)
3. **Reduce dimensionality if needed**: PCA for speed, careful with interpretation
4. **Check for categorical variables**: Encode appropriately or use specialized algorithms
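A minimal sketch of point 1, chaining the scaler and the clusterer so scaling is never forgotten; KMeans here is just a placeholder for whichever algorithm is chosen:
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
labels = pipeline.fit_predict(X)  # X is scaled and clustered in one call
```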
### Dimensionality Reduction Guidelines
**For preprocessing/feature extraction**:
- PCA (linear relationships)
- TruncatedSVD (sparse data)
- NMF (non-negative data)
**For visualization only**:
- t-SNE (preserves local structure)
- UMAP (preserves both local and global structure)
**Always**:
- Standardize features before PCA
- Use appropriate n_components (elbow plot, explained variance)
- Don't use t-SNE for anything except visualization
### Common Pitfalls
1. **Not scaling data**: Most distance-based algorithms are sensitive to feature scale
2. **Using t-SNE for preprocessing**: Only for visualization!
3. **Overfitting cluster count**: Too many clusters = overfitting noise
4. **Ignoring outliers**: Can severely affect centroid-based methods
5. **Wrong metric**: Euclidean assumes all features equally important
6. **Not validating results**: Always check with multiple metrics and domain knowledge
7. **PCA without standardization**: Components dominated by high-variance features