Statistical Tests Guide
Interpretation guidelines for common EDA statistical tests.
Normality Tests
Shapiro-Wilk
Use: Small to medium samples (n < 5000)
H0: Data are normally distributed | H1: Data are not normally distributed
Interpretation: p > 0.05 → no evidence against normality (fail to reject H0) | p ≤ 0.05 → reject normality
Note: Very sensitive to sample size; small deviations may be significant in large samples
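As a minimal sketch of the rule above, assuming SciPy and NumPy are available: the helper name `shapiro_check` and the simulated samples are illustrative only, not part of this guide.

```python
import numpy as np
from scipy import stats

def shapiro_check(values, alpha=0.05):
    """Run Shapiro-Wilk and return (statistic, p_value, reject_normality)."""
    statistic, p_value = stats.shapiro(values)
    return float(statistic), float(p_value), bool(p_value <= alpha)

# Simulated data for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

print("normal sample:", shapiro_check(normal_sample))
print("skewed sample:", shapiro_check(skewed_sample))
```

The exponential sample should be rejected decisively. The normal sample will usually, but not always, pass, since about 5% of truly normal samples are rejected at α = 0.05.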
Anderson-Darling
Use: Emphasizes the tails; generally more sensitive than Kolmogorov-Smirnov
Interpretation: Test statistic > critical value → reject normality
Kolmogorov-Smirnov
Use: Large samples or testing against non-normal distributions
Interpretation: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference
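A short sketch of testing against a reference distribution, assuming SciPy's `kstest`. The helper `ks_against` is a hypothetical name. Note that the reference here is fully specified in advance; if its parameters are estimated from the same sample, the standard p-value tends to be too large.

```python
import numpy as np
from scipy import stats

def ks_against(values, dist_name, dist_args=(), alpha=0.05):
    """One-sample K-S test against a named scipy.stats distribution."""
    result = stats.kstest(values, dist_name, args=dist_args)
    differs = bool(result.pvalue <= alpha)
    return float(result.statistic), float(result.pvalue), differs

# Simulated data for illustration only.
rng = np.random.default_rng(1)
uniform_sample = rng.uniform(0.0, 1.0, size=500)

print("vs uniform(0, 1):", ks_against(uniform_sample, "uniform"))
print("vs standard normal:", ks_against(uniform_sample, "norm"))
```

The comparison against the standard normal should be rejected, because a sample confined to [0, 1] cannot match a distribution with half its mass below zero.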
Distribution Characteristics
Skewness
Measures asymmetry:
- ≈ 0: Symmetric
- > 0: Right-skewed (tail right)
- < 0: Left-skewed (tail left)
Magnitude: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high)
Action: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean
Kurtosis
Measures tailedness (excess kurtosis, normal = 0):
- ≈ 0: Normal tails
- > 0: Heavy tails, more outliers
- < 0: Light tails, fewer outliers
Magnitude: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different)
Action: High kurtosis → investigate outliers carefully
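The two measures can be computed directly from central moments. This is a standard-library sketch using the population (biased) estimators; the function names and the sample data are illustrative, and other tools may apply a small-sample bias correction that gives slightly different values.

```python
from statistics import fmean

def central_moment(values, order):
    """Mean of (x - mean) ** order over the sample."""
    mean = fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    """Third standardized moment (population estimator)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def magnitude(value):
    """Apply the guide's thresholds: < 0.5, 0.5-1, >= 1."""
    size = abs(value)
    if size < 0.5:
        return "low"
    return "moderate" if size < 1.0 else "high"

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # one large value on the right
s, k = skewness(data), excess_kurtosis(data)
print(f"skewness={s:.2f} ({magnitude(s)}), excess kurtosis={k:.2f} ({magnitude(k)})")
```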
Correlation
Pearson
Measures: Linear relationship (-1 to +1)
Strength: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong)
Assumptions: Linear, continuous, normal, no outliers, homoscedastic
Use: Expected linear relationship, assumptions met
Spearman
Measures: Monotonic relationship (-1 to +1), rank-based
Advantages: Robust to outliers, no linearity assumption, works with ordinal, no normality required
Use: Outliers present, non-linear monotonic relationship, ordinal data, non-normal
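To illustrate the difference, here is a sketch assuming SciPy's `pearsonr` and `spearmanr`, applied to a made-up monotonic but non-linear relationship (y = exp(x)).

```python
import numpy as np
from scipy import stats

# Illustrative data: strictly increasing, but far from a straight line.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Spearman reaches 1 because the ranks of x and y agree perfectly, while Pearson is noticeably lower because the relationship is not linear.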
Outlier Detection
IQR Method
Bounds: Q1 - 1.5×IQR to Q3 + 1.5×IQR
Characteristics: Simple, robust, works with skewed data
Typical Rates: < 5% (low) | 5-10% (moderate) | > 10% (high, investigate)
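A standard-library sketch of the IQR bounds. The function name `iqr_outliers` and the data are illustrative. Quartile conventions differ between tools; this version uses `statistics.quantiles` with `method="inclusive"`, so bounds may differ slightly from other software.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper, outliers = iqr_outliers(data)
rate = len(outliers) / len(data)

print(lower, upper, outliers)  # 9.375 16.375 [102]
print(f"outlier rate: {rate:.0%}")  # outlier rate: 10%
```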
Z-Score Method
Definition: |z| > 3 where z = (x - μ) / σ
Use: Normal data, n > 30
Avoid: Small samples, skewed data, many outliers (contaminates mean/SD)
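The small-sample caveat can be seen in a short standard-library sketch (function name and data are illustrative). With the sample standard deviation, no point in a sample of size n can have |z| above (n - 1)/√n, which is about 2.85 for n = 10, so the |z| > 3 rule cannot flag anything there.

```python
from statistics import fmean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z| exceeds the threshold (sample SD, n - 1 denominator)."""
    mean = fmean(values)
    sd = stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]

# n = 10: the extreme value inflates the mean and SD and hides itself.
small = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(zscore_outliers(small))  # []

# n = 46: the same extreme value among more typical points is flagged.
larger = small[:-1] * 5 + [102]
print(zscore_outliers(larger))  # [102]
```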
Hypothesis Testing
Significance Levels: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal)
p-value Interpretation: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none)
Key Considerations:
- Statistical ≠ practical significance
- Multiple testing → use correction (Bonferroni, FDR)
- Large samples detect trivial effects
- Always report effect sizes with p-values
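The multiple-testing point can be sketched with the two corrections named above, using only the standard library. The function names and p-values are made up for illustration. Bonferroni multiplies each p-value by the number of tests; the Benjamini-Hochberg procedure (one common FDR method) scales each p-value by m/rank and enforces monotonicity.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.020, 0.041, 0.20]  # illustrative only
alpha = 0.05
n_bonferroni = sum(p <= alpha for p in bonferroni(p_values))
n_bh = sum(p <= alpha for p in benjamini_hochberg(p_values))

print(n_bonferroni, n_bh)  # 2 3
```

All four of the smallest raw p-values fall below 0.05, but only two survive Bonferroni and three survive Benjamini-Hochberg, which is the less conservative of the two.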
Transformations
Right-skewed: Log, sqrt, Box-Cox
Left-skewed: Square, cube, exponential
Heavy tails: Robust scaling, winsorization, log
Non-constant variance: Log, Box-Cox
Common Methods:
- Log: log(x) for positive values, log(x+1) when zeros are present; suits positive skew and multiplicative relationships
- Sqrt: Count data, moderate skew
- Box-Cox: Estimates the optimal power parameter λ automatically (requires strictly positive values)
- Standardization: (x-μ)/σ for scaling to unit variance
- Min-Max: (x-min)/(max-min) for [0,1] scaling
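Several of these methods take only a few lines of standard-library Python. This is a sketch with illustrative function names and data; standardization here uses the population standard deviation, which is one of two common choices. Box-Cox is omitted because it needs a fitted parameter.

```python
import math
from statistics import fmean, pstdev

def log1p_transform(values):
    """log(x + 1); defined for x > -1, so zeros are allowed."""
    return [math.log1p(x) for x in values]

def sqrt_transform(values):
    """Square root; requires non-negative values (e.g. counts)."""
    return [math.sqrt(x) for x in values]

def standardize(values):
    """(x - mean) / sd, giving mean 0 and unit variance."""
    mean, sd = fmean(values), pstdev(values)
    return [(x - mean) / sd for x in values]

def min_max(values):
    """(x - min) / (max - min), mapping the data onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
print([round(v, 2) for v in log1p_transform(data)])
print([round(v, 2) for v in min_max(data)])
```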
Practical Guidelines
Sample Size: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects)
Missing Data: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods)
Reporting: Include test statistic, p-value, CI, effect size, n, assumption checks