Statistical Tests Guide for EDA
This guide provides interpretation guidelines for statistical tests commonly used in exploratory data analysis.
Normality Tests
Shapiro-Wilk Test
Purpose: Test if a sample comes from a normally distributed population
When to use: Best for small to medium sample sizes (n < 5000)
Interpretation:
- Null Hypothesis (H0): The data follows a normal distribution
- Alternative Hypothesis (H1): The data does not follow a normal distribution
- p-value > 0.05: Fail to reject H0 → No significant evidence against normality (this does not prove the data is normal)
- p-value ≤ 0.05: Reject H0 → Data is not normally distributed
Notes:
- Very sensitive to sample size
- Small deviations from normality may be detected as significant in large samples
- Consider practical significance alongside statistical significance
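As a minimal sketch (assuming SciPy is available), the test is a one-liner with scipy.stats.shapiro:

```python
# Sketch: Shapiro-Wilk normality test (assumes SciPy is installed)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # data actually drawn from a normal

stat, p_value = stats.shapiro(sample)
if p_value > 0.05:
    print(f"W = {stat:.3f}, p = {p_value:.3f}: fail to reject H0 (consistent with normality)")
else:
    print(f"W = {stat:.3f}, p = {p_value:.3f}: reject H0 (evidence of non-normality)")
```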
Anderson-Darling Test
Purpose: Test if a sample comes from a specific distribution (typically normal)
When to use: When sensitivity to departures in the tails is important; often competitive with Shapiro-Wilk in power
Interpretation:
- Compares test statistic against critical values at different significance levels
- If test statistic > critical value at given significance level, reject normality
- More weight given to tails of distribution than other tests
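A hedged sketch with SciPy: note that scipy.stats.anderson reports one critical value per significance level rather than a single p-value, so the decision is made per level:

```python
# Sketch: Anderson-Darling test against the normal distribution (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=500)

result = stats.anderson(sample, dist="norm")
# SciPy returns critical values at several significance levels instead of a p-value
for crit, sig in zip(result.critical_values, result.significance_level):
    decision = "reject" if result.statistic > crit else "fail to reject"
    print(f"alpha = {sig:>4.1f}%: critical = {crit:.3f} -> {decision} normality")
```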
Kolmogorov-Smirnov Test
Purpose: Test if a sample comes from a reference distribution
When to use: When you have a large sample or want to test against distributions other than normal
Interpretation:
- p-value > 0.05: Fail to reject H0 → Sample is consistent with the reference distribution
- p-value ≤ 0.05: Reject H0 → Sample distribution differs from the reference distribution
Notes:
- The reference distribution should be fully specified in advance; estimating its parameters from the same sample biases the p-value (the Lilliefors variant corrects for this when testing normality)
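A minimal sketch with SciPy, testing a clearly non-normal sample against a normal reference (the parameter-estimation caveat is noted in the comments):

```python
# Sketch: one-sample Kolmogorov-Smirnov test (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=1000)  # clearly non-normal

# Compare against a normal reference with the sample's own mean and std.
# Caveat: estimating parameters from the same sample biases the p-value;
# the Lilliefors variant corrects for this when testing normality.
stat, p_value = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))
print(f"KS statistic = {stat:.3f}, p = {p_value:.2e}")
```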
Distribution Characteristics
Skewness
Purpose: Measure asymmetry of the distribution
Interpretation:
- Skewness ≈ 0: Symmetric distribution
- Skewness > 0: Right-skewed (tail extends to right, most values on left)
- Skewness < 0: Left-skewed (tail extends to left, most values on right)
Magnitude interpretation:
- |Skewness| < 0.5: Approximately symmetric
- 0.5 ≤ |Skewness| < 1: Moderately skewed
- |Skewness| ≥ 1: Highly skewed
Implications:
- Highly skewed data may require transformation (log, sqrt, Box-Cox)
- Mean is pulled toward tail; median more robust for skewed data
- Many statistical tests assume symmetry/normality
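The magnitude buckets above can be applied directly in code; a sketch assuming SciPy, using a lognormal sample as the right-skewed example:

```python
# Sketch: computing skewness and applying the magnitude buckets above (assumes SciPy)
import numpy as np
from scipy import stats

def skew_label(s: float) -> str:
    a = abs(s)
    if a < 0.5:
        return "approximately symmetric"
    if a < 1.0:
        return "moderately skewed"
    return "highly skewed"

rng = np.random.default_rng(7)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # long right tail

s = stats.skew(right_skewed)
print(f"skewness = {s:.2f} ({skew_label(s)})")
# For right-skewed data the mean is pulled above the median:
print(f"mean = {right_skewed.mean():.2f}, median = {np.median(right_skewed):.2f}")
```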
Kurtosis
Purpose: Measure the tail heaviness (and, loosely, the peakedness) of a distribution
Interpretation (Excess Kurtosis, where normal distribution = 0):
- Kurtosis ≈ 0: Normal tail behavior (mesokurtic)
- Kurtosis > 0: Heavy tails, sharp peak (leptokurtic)
- More outliers than normal distribution
- Higher probability of extreme values
- Kurtosis < 0: Light tails, flat peak (platykurtic)
- Fewer outliers than normal distribution
- More uniform distribution
Magnitude interpretation:
- |Kurtosis| < 0.5: Normal-like tails
- 0.5 ≤ |Kurtosis| < 1: Moderately different tails
- |Kurtosis| ≥ 1: Very different tail behavior from normal
Implications:
- High kurtosis → Be cautious with outliers
- Low kurtosis → Distribution lacks distinct peak
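A sketch assuming SciPy, contrasting a heavy-tailed t distribution with a light-tailed uniform (scipy.stats.kurtosis returns excess kurtosis by default, matching the convention above):

```python
# Sketch: excess kurtosis for heavy- vs light-tailed samples (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
heavy = rng.standard_t(df=3, size=5000)    # t(3): heavy tails -> positive excess kurtosis
light = rng.uniform(-1.0, 1.0, size=5000)  # uniform: light tails -> negative (about -1.2)

# fisher=True (the default) returns excess kurtosis, where normal = 0
print(f"t(3) excess kurtosis:    {stats.kurtosis(heavy):+.2f}")
print(f"uniform excess kurtosis: {stats.kurtosis(light):+.2f}")
```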
Correlation Tests
Pearson Correlation
Purpose: Measure linear relationship between two continuous variables
Range: -1 to +1
Interpretation:
- r = +1: Perfect positive linear relationship
- r = 0: No linear relationship
- r = -1: Perfect negative linear relationship
Strength guidelines:
- |r| < 0.3: Weak correlation
- 0.3 ≤ |r| < 0.5: Moderate correlation
- 0.5 ≤ |r| < 0.7: Strong correlation
- |r| ≥ 0.7: Very strong correlation
Assumptions:
- Linear relationship between variables
- Both variables continuous and normally distributed
- No significant outliers
- Homoscedasticity (constant variance)
When to use: When relationship is expected to be linear and data meets assumptions
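A minimal sketch assuming SciPy, with synthetic data built to have a strong linear relationship:

```python
# Sketch: Pearson correlation on a linear relationship with noise (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=0.5, size=300)  # linear signal plus modest noise

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.2e}")  # |r| >= 0.7 -> "very strong" by the guide above
```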
Spearman Correlation
Purpose: Measure monotonic relationship between two variables (rank-based)
Range: -1 to +1
Interpretation: Same as Pearson, but measures monotonic (not just linear) relationships
Advantages over Pearson:
- Robust to outliers (uses ranks)
- Doesn't assume linear relationship
- Works with ordinal data
- Doesn't require normality assumption
When to use:
- Data has outliers
- Relationship is monotonic but not linear
- Data is ordinal
- Distribution is non-normal
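The monotonic-but-not-linear case is where the two coefficients diverge; a sketch assuming SciPy:

```python
# Sketch: Spearman vs Pearson on a monotonic but non-linear relationship (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.uniform(0.1, 10.0, size=500)
y = np.exp(x)  # strictly increasing, far from linear

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson r  = {r_pearson:.3f}")   # attenuated by the curvature
print(f"Spearman r = {r_spearman:.3f}")  # ranks agree perfectly -> 1.0
```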
Outlier Detection Methods
IQR Method (Interquartile Range)
Definition:
- Lower bound: Q1 - 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
- Values outside these bounds are outliers
Characteristics:
- Simple and interpretable
- Robust to extreme values
- Works well for skewed distributions
- Conservative approach (Tukey's fences)
Interpretation:
- < 5% outliers: Typical for most datasets
- 5-10% outliers: Moderate, investigate causes
- > 10% outliers: High rate, may indicate data quality issues or interesting phenomena
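Tukey's fences need only percentiles, so plain NumPy suffices; a sketch with a few injected extreme values:

```python
# Sketch: Tukey's fences (1.5 x IQR rule) with plain NumPy
import numpy as np

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(50.0, 5.0, size=995),
                       [120.0, 130.0, -10.0, 150.0, 0.0]])  # injected extremes

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
rate = len(outliers) / len(data)
print(f"bounds = [{lower:.1f}, {upper:.1f}]")
print(f"{len(outliers)} outliers ({rate:.1%}) -> typical if below 5%")
```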
Z-Score Method
Definition: Outliers are data points with |z-score| > 3
Formula: z = (x - μ) / σ
Characteristics:
- Assumes normal distribution
- Sensitive to extreme values
- Standard threshold is |z| > 3 (99.7% of data within ±3σ)
When to use:
- Data is approximately normally distributed
- Large sample sizes (n > 30)
When NOT to use:
- Small samples
- Heavily skewed data
- Data with many outliers (contaminates mean and SD)
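A sketch of the z-score rule with plain NumPy (valid only under the normality assumption noted above):

```python
# Sketch: z-score outlier flagging with plain NumPy (assumes roughly normal data)
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(100.0, 15.0, size=10_000)

z = (data - data.mean()) / data.std()  # standardize with sample mean and SD
outliers = data[np.abs(z) > 3]
print(f"{len(outliers)} points beyond |z| > 3 ({len(outliers) / len(data):.2%})")
# For truly normal data, roughly 0.27% of points fall beyond 3 SDs
```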
Hypothesis Testing Guidelines
Significance Levels
- α = 0.05: Standard significance level (5% chance of Type I error)
- α = 0.01: More conservative (1% chance of Type I error)
- α = 0.10: More liberal (10% chance of Type I error)
p-value Interpretation
- p ≤ 0.001: Very strong evidence against H0 (***)
- 0.001 < p ≤ 0.01: Strong evidence against H0 (**)
- 0.01 < p ≤ 0.05: Moderate evidence against H0 (*)
- 0.05 < p ≤ 0.10: Weak evidence against H0
- p > 0.10: Little to no evidence against H0
Important Considerations
- Statistical vs Practical Significance: A small p-value doesn't always mean the effect is important
- Multiple Testing: When performing many tests, use correction methods (Bonferroni, FDR)
- Sample Size: Large samples can detect trivial effects as significant
- Effect Size: Always report and interpret effect sizes alongside p-values
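As a sketch of the simplest correction method mentioned above (the p-values here are made up for illustration), Bonferroni just divides α by the number of tests:

```python
# Sketch: Bonferroni correction for multiple tests (p-values below are made up)
import numpy as np

p_values = np.array([0.001, 0.012, 0.034, 0.20, 0.47])
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value against alpha / m instead of alpha
threshold = alpha / m  # 0.01 here
rejected = p_values < threshold
print(f"adjusted threshold = {threshold}")
print(f"reject H0 for tests: {np.flatnonzero(rejected).tolist()}")
```

Bonferroni is conservative; FDR procedures (e.g. Benjamini-Hochberg) keep more power when many tests are run.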
Data Transformation Strategies
When to Transform
- Right-skewed data: Log, square root, or Box-Cox transformation
- Left-skewed data: Square, cube, or exponential transformation
- Heavy tails/outliers: Robust scaling, winsorization, or log transformation
- Non-constant variance: Log or Box-Cox transformation
Common Transformations
- Log transformation: log(x) or log(x + 1)
  - Best for: Positively skewed data, multiplicative relationships
  - Cannot be used with zero or negative values (log(x + 1) handles zeros)
- Square root transformation: √x
  - Best for: Count data, moderate positive skew
  - Less aggressive than log
- Box-Cox transformation: (x^λ - 1) / λ
  - Best for: Letting the data choose the power λ automatically (fitted by maximum likelihood)
  - Requires strictly positive values
- Standardization: (x - μ) / σ
  - Best for: Putting features on a comparable scale
  - Centers data at 0 with unit variance (does not bound the range)
- Min-Max scaling: (x - min) / (max - min)
  - Best for: Scaling to the [0, 1] range
  - Preserves zero values (when the minimum is 0)
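A sketch assuming SciPy, comparing how much each transformation reduces the skew of a strongly right-skewed sample:

```python
# Sketch: effect of common transformations on a right-skewed sample (assumes SciPy)
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)  # strongly right-skewed, all positive

log_x = np.log(x)                # log transform (requires x > 0)
sqrt_x = np.sqrt(x)              # milder than log
boxcox_x, lam = stats.boxcox(x)  # lambda fitted by maximum likelihood

for name, values in [("raw", x), ("sqrt", sqrt_x), ("log", log_x), ("boxcox", boxcox_x)]:
    print(f"{name:>6}: skew = {stats.skew(values):+.2f}")
```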
Practical Guidelines
Sample Size Considerations
- n < 30: Use non-parametric tests, be cautious with assumptions
- 30 ≤ n < 100: Moderate sample, parametric tests usually acceptable
- n ≥ 100: Large sample, parametric tests fairly robust to moderate assumption violations
- n ≥ 1000: Very large sample, may detect trivial effects as significant
Dealing with Missing Data
- < 5% missing: Usually not a problem, simple methods OK
- 5-10% missing: Use appropriate imputation methods
- > 10% missing: Investigate patterns, consider advanced imputation or modeling missingness
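The thresholds above can drive a quick per-column triage; a sketch assuming pandas, with made-up column names and values:

```python
# Sketch: per-column missing-rate triage (assumes pandas; data below is made up)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 38, 27, 33, 45, 30],
    "income": [50_000, np.nan, np.nan, 72_000, 61_000,
               np.nan, 58_000, 49_000, np.nan, 55_000],
})

missing_rate = df.isna().mean()  # fraction of missing values per column
for col, rate in missing_rate.items():
    if rate < 0.05:
        advice = "usually fine; simple methods OK"
    elif rate <= 0.10:
        advice = "use appropriate imputation"
    else:
        advice = "investigate the missingness pattern first"
    print(f"{col}: {rate:.0%} missing -> {advice}")
```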
Reporting Results
Always include:
- Test statistic value
- p-value
- Confidence interval (when applicable)
- Effect size
- Sample size
- Assumptions checked and violations noted