Improve the EDA skill

Timothy Kassis
2025-11-04 17:25:06 -08:00
parent 1225ddecf1
commit ffad3d81b0
4 changed files with 362 additions and 988 deletions

View File

@@ -1,275 +1,202 @@
---
name: exploratory-data-analysis
description: "Analyze datasets to discover patterns, anomalies, and relationships. Use when exploring data files, generating statistical summaries, checking data quality, or creating visualizations. Supports CSV, Excel, JSON, Parquet, and more."
---

# Exploratory Data Analysis

Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.

**Supported formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle

## Standard Workflow

1. Run statistical analysis:

```bash
python scripts/eda_analyzer.py <data_file> -o <output_dir>
```

2. Generate visualizations:

```bash
python scripts/visualizer.py <data_file> -o <output_dir>
```

3. Read analysis results from `<output_dir>/eda_analysis.json`

4. Create report using `assets/report_template.md` structure

5. Present findings with key insights and visualizations

## Analysis Capabilities

### Statistical Analysis

Run `scripts/eda_analyzer.py` to generate comprehensive analysis:

```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```

Produces `output/eda_analysis.json` containing:
- Dataset shape, types, memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights

### Visualizations

Run `scripts/visualizer.py` to generate plots:

```bash
python scripts/visualizer.py sales_data.csv -o ./output
```

Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:
- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns detected)

### Automated Insights

Access generated insights from the `"insights"` key in the analysis JSON:
- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
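For example, a minimal sketch of surfacing these insights once the analyzer has run; it assumes the layout above, with `"insights"` holding a flat list of strings:

```python
import json

# Load the analyzer output produced by the workflow above
with open("./output/eda_analysis.json") as f:
    results = json.load(f)

# Print each automated insight (assumes a flat list of strings)
for insight in results.get("insights", []):
    print(f"- {insight}")
```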
## Reference Materials

### Statistical Interpretation

See `references/statistical_tests_guide.md` for detailed guidance on:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations

Use when interpreting statistical results or explaining findings.

### Methodology

See `references/eda_best_practices.md` for comprehensive guidance on:
- 6-step EDA process framework
- Univariate, bivariate, and multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences

Use when planning analysis or handling specific scenarios.

## Report Template

Use `assets/report_template.md` to structure findings. The template includes:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices

**To use the template**: fill sections with results from the analysis JSON and embed visualizations using markdown image syntax.

## Example: Complete Analysis

User request: "Explore this sales_data.csv file"

```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output

# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```

```python
# 3. Read results
import json
with open('./output/eda_analysis.json') as f:
    results = json.load(f)

# 4. Build report from assets/report_template.md
#    - Fill sections with results
#    - Embed images: ![Missing Data](./output/eda_visualizations/missing_data.png)
#    - Include insights from results['insights']
#    - Add recommendations
```

## Special Cases

### Dataset Size Strategy

**If < 100 rows**: Note sample size limitations, use non-parametric methods

**If 100-1M rows**: Standard workflow applies

**If > 1M rows**: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
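One way to take that quick sampled pass, sketched with pandas; the 10% fraction and file names are illustrative:

```python
import random
import pandas as pd

# Keep roughly 10% of rows while reading, so the full file never has to
# sit in memory at once; write the sample out for the scripts above.
keep_fraction = 0.10
sample = pd.read_csv(
    "large_dataset.csv",
    skiprows=lambda i: i > 0 and random.random() > keep_fraction,  # i == 0 keeps the header
)
sample.to_csv("large_dataset_sample.csv", index=False)
# Then: python scripts/eda_analyzer.py large_dataset_sample.csv -o ./output
```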
### Data Characteristics

**High-dimensional (>50 columns)**: Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance.

**Time series**: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.

**Imbalanced**: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.

## Output Guidelines

**Format findings as markdown**:
- Use headers, tables, and lists for structure
- Embed visualizations: `![Description](path/to/image.png)`
- Include code blocks for suggested transformations
- Highlight key insights

**Make reports actionable**:
- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to user's technical level

## Error Handling

**Unsupported formats**: Request conversion to supported format (CSV, Excel, JSON, Parquet)

**Files too large**: Recommend sampling or chunked processing
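A minimal chunked-processing sketch with pandas; the chunk size, file name, and per-chunk summary are illustrative:

```python
import pandas as pd

# Stream a large CSV in chunks and accumulate lightweight running totals,
# so the file never has to fit in memory.
rows = 0
missing = None
for chunk in pd.read_csv("huge_file.csv", chunksize=200_000):
    rows += len(chunk)
    counts = chunk.isna().sum()
    missing = counts if missing is None else missing.add(counts, fill_value=0)

print(f"Rows processed: {rows}")
print("Missing values per column:")
print(missing)
```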
**Corrupted data**: Report specific errors, suggest cleaning steps, attempt partial analysis

**Empty columns**: Flag in data quality section, recommend removal or investigation

## Resources

**Scripts** (handle all formats automatically):
- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator

**References** (load as needed):
- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices

**Template**:
- `assets/report_template.md` - Professional report structure

## Key Points

- Run both scripts for complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps

View File

@@ -1,14 +1,10 @@
# EDA Report: [Dataset Name]

**Date**: [Date] | **Analyst**: [Name]

## Executive Summary

[Concise summary of key findings and recommendations]

**Key Findings**:
- [Finding 1]
@@ -23,366 +19,197 @@
## 1. Dataset Overview

**Source**: [Source name] | **Format**: [CSV/Excel/JSON/etc.] | **Period**: [Date range]

**Structure**: [Rows] observations × [Columns] variables | **Memory**: [Size] MB

**Variable Types**:
- Numeric ([Count]): [List names]
- Categorical ([Count]): [List names]
- Datetime ([Count]): [List names]
- Boolean ([Count]): [List names]

---

## 2. Data Quality

**Completeness**: [Percentage]% | **Duplicates**: [Count] ([%]%)

**Missing Data**:

| Column | Missing % | Assessment |
|--------|-----------|------------|
| [Column 1] | [%] | [High/Medium/Low] |
| [Column 2] | [%] | [High/Medium/Low] |

![Missing Data](path/to/missing_data.png)

**Quality Issues**:
- [Issue 1]
- [Issue 2]

---

## 3. Univariate Analysis

### Numeric: [Variable Name]

**Stats**: Mean: [Value] | Median: [Value] | Std: [Value] | Range: [Min]-[Max]

**Distribution**: Skewness: [Value] | Kurtosis: [Value] | Normality: [Yes/No]

**Outliers**: IQR: [Count] ([%]%) | Z-score: [Count] ([%]%)

![Distribution](path/to/distribution.png)

**Insights**: [Key observations]

### Categorical: [Variable Name]

**Stats**: [Count] unique values | Most common: [Value] ([%]%) | Balance: [Balanced/Imbalanced]

| Category | Count | % |
|----------|-------|---|
| [Cat 1] | [Count] | [%] |
| [Cat 2] | [Count] | [%] |

![Distribution](path/to/categorical.png)

**Insights**: [Key observations]

### Temporal: [Variable Name]

**Range**: [Start] to [End] ([Duration]) | **Trend**: [Increasing/Decreasing/Stable] | **Seasonality**: [Yes/No]

![Time Series](path/to/timeseries.png)

**Insights**: [Key observations]

---

## 4. Bivariate Analysis

**Correlation Summary**: [Count] strong positive | [Count] strong negative | [Count] weak/none

![Correlation Heatmap](path/to/correlation_heatmap.png)

**Notable Correlations**:

| Var 1 | Var 2 | Pearson | Spearman | Strength |
|-------|-------|---------|----------|----------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] |

**Insights**: [Multicollinearity issues, feature engineering opportunities]

### Key Relationship: [Var 1] vs [Var 2]

**Type**: [Linear/Non-linear/None] | **r**: [Value] | **p-value**: [Value]

![Scatter Plot](path/to/scatter.png)

**Insights**: [Description and implications]

---

## 5. Multivariate Analysis

![Scatter Matrix](path/to/scatter_matrix.png)

**Patterns**: [Key observations]

**Clustering** (if performed): [Method] | [Count] clusters identified

---

## 6. Outliers

**Overall Rate**: [%]%

| Variable | Outlier % | Method | Action |
|----------|-----------|--------|--------|
| [Var 1] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
| [Var 2] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |

![Box Plots](path/to/boxplots.png)

**Investigation**: [Description of significant outliers, causes, validity]

---

## 7. Key Insights

**Data Quality**:
- [Insight with implication]
- [Insight with implication]

**Statistical Patterns**:
- [Insight with implication]
- [Insight with implication]

**Domain/Research Insights**:
- [Insight with implication]
- [Insight with implication]

**Unexpected Findings**:
- [Finding and significance]

---

## 8. Recommendations

**Data Quality Actions**:
- [ ] [Action - priority]
- [ ] [Action - priority]

**Next Steps**:
- [Step with rationale]
- [Step with rationale]

**Feature Engineering**:
- [Opportunity]
- [Opportunity]

**Modeling Considerations**:
- [Consideration]
- [Consideration]

---

## 9. Limitations

**Data**: [Key limitations]

**Analysis**: [Key limitations]

**Assumptions**: [Key assumptions made]

---

## Appendices

### A: Technical Details

**Environment**: Python with pandas, numpy, scipy, matplotlib, seaborn

**Scripts**: [Repository/location]

### B: Variable Dictionary

| Variable | Type | Description | Unit | Range | Missing % |
|----------|------|-------------|------|-------|-----------|
| [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] |

### C: Statistical Tests

**Normality**:

| Variable | Test | Statistic | p-value | Result |
|----------|------|-----------|---------|--------|
| [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] |

**Correlations**:

| Var 1 | Var 2 | r | p-value | Significant |
|-------|-------|---|---------|-------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] |

### D: Visualizations

1. [Description](path/to/viz1.png)
2. [Description](path/to/viz2.png)

View File

@@ -1,379 +1,125 @@
# EDA Best Practices

Methodologies for conducting thorough exploratory data analysis.

## 6-Step EDA Framework

### 1. Initial Understanding

**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?

**Actions**: Load data, inspect structure, review types, document context

### 2. Quality Assessment

**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy

**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)

### 3. Univariate Analysis

**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers

**Categorical**: Frequency distributions, unique counts, balance, bar charts

**Temporal**: Time range, gaps, trends, seasonality, time series plots
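A minimal pandas sketch of this univariate pass; `df` is an assumed, already-loaded DataFrame:

```python
import pandas as pd

def univariate_profile(df: pd.DataFrame) -> None:
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(include=["object", "category"])

    # Numeric: central tendency, dispersion, shape
    print(numeric.describe().T)
    print("Skewness:\n", numeric.skew(), sep="")
    print("Excess kurtosis:\n", numeric.kurtosis(), sep="")

    # Categorical: frequencies and balance
    for col in categorical.columns:
        print(f"\n{col}:")
        print(df[col].value_counts(normalize=True).head())
```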
### 4. Bivariate Analysis

**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity

**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA

**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
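For the categorical-vs-categorical case, a minimal sketch of the chi-square test and Cramér's V; the column names in the usage comment are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association strength between two categorical variables (0 to 1)."""
    table = pd.crosstab(x, y)
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Usage (illustrative columns):
# cramers_v(df["region"], df["product_category"])
```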
### 5. Multivariate Analysis

**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering

**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
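A minimal PCA sketch with scikit-learn; it assumes a numeric-only DataFrame with no missing values:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_summary(df_numeric: pd.DataFrame, n_components: int = 2):
    # Standardize first so no single variable dominates the components
    scaled = StandardScaler().fit_transform(df_numeric)
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)
    print("Explained variance ratio:", pca.explained_variance_ratio_)
    return components
```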
### 6. Insight Generation

**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications

## Visualization Guidelines

**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar

**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter

## Statistical Analysis Guidelines

**Check Assumptions**: Normality, homoscedasticity, independence, linearity

**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes

**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation

## Documentation Guidelines

**Notes**: Document assumptions, decisions, issues, findings

**Reproducibility**: Use scripts, version control, document sources, set random seeds

**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations

## Common Pitfalls

1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - they may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature (see the sketch after this list)
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
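For pitfall 5, a minimal sketch of p-value correction with statsmodels; the p-values are illustrative:

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from a batch of exploratory tests
p_values = [0.001, 0.02, 0.04, 0.30, 0.75]

# Benjamini-Hochberg FDR; method="bonferroni" is the stricter alternative
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_values, p_adjusted.round(3), reject)))
```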
## Domain-Specific Considerations

**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits

**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection

**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV

**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches

**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability

## Iterative Process

EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis

**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights

**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations

## Communication

**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code

**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations

**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix

## Checklists

**Before**: Understand context, define objectives, identify audience, set up environment

**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously

**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility

View File

@@ -1,252 +1,126 @@
# Statistical Tests Guide

Interpretation guidelines for common EDA statistical tests.

## Normality Tests

### Shapiro-Wilk

**Use**: Small to medium samples (n < 5000)

**H0**: Data is normal | **H1**: Data is not normal

**Interpretation**: p > 0.05 → likely normal | p ≤ 0.05 → not normal

**Note**: Very sensitive to sample size; small deviations may be significant in large samples

### Anderson-Darling

**Use**: More powerful than Shapiro-Wilk, emphasizes tails

**Interpretation**: Test statistic > critical value → reject normality

### Kolmogorov-Smirnov

**Use**: Large samples or testing against non-normal distributions

**Interpretation**: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference
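A minimal scipy sketch of the three tests on one numeric sample; `values` is an illustrative array:

```python
import numpy as np
from scipy import stats

values = np.random.default_rng(0).normal(size=500)  # illustrative sample

w_stat, w_p = stats.shapiro(values)                      # Shapiro-Wilk
ad = stats.anderson(values, dist="norm")                 # Anderson-Darling
ks_stat, ks_p = stats.kstest(values, "norm", args=(values.mean(), values.std()))

print(f"Shapiro-Wilk p={w_p:.3f}, KS p={ks_p:.3f}")
print(f"Anderson-Darling statistic={ad.statistic:.3f}, critical values={ad.critical_values}")
```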
## Distribution Characteristics

### Skewness

**Measures asymmetry**:
- ≈ 0: Symmetric
- \> 0: Right-skewed (tail right)
- < 0: Left-skewed (tail left)

**Magnitude**: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high)

**Action**: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean

### Kurtosis

**Measures tailedness** (excess kurtosis, normal = 0):
- ≈ 0: Normal tails
- \> 0: Heavy tails, more outliers
- < 0: Light tails, fewer outliers

**Magnitude**: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different)

**Action**: High kurtosis → investigate outliers carefully
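A minimal sketch of computing both measures; scipy's `kurtosis` returns excess kurtosis by default, matching the scale above:

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Illustrative right-skewed sample
values = np.random.default_rng(1).lognormal(size=1_000)

print(f"skewness = {skew(values):.2f}")
print(f"excess kurtosis = {kurtosis(values):.2f}")  # Fisher definition: normal = 0
```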
## Correlation

### Pearson

**Measures**: Linear relationship (-1 to +1)

**Strength**: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong)

**Assumptions**: Linear, continuous, normal, no outliers, homoscedastic

**Use**: Expected linear relationship, assumptions met

### Spearman

**Measures**: Monotonic relationship (-1 to +1), rank-based

**Advantages**: Robust to outliers, no linearity assumption, works with ordinal data, no normality required

**Use**: Outliers present, non-linear monotonic relationship, ordinal data, non-normal distribution
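A minimal sketch comparing the two coefficients; `x` and `y` are illustrative arrays with a monotonic but non-linear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = x ** 3 + rng.normal(scale=0.5, size=200)  # monotonic but non-linear

r, p_r = pearsonr(x, y)
rho, p_rho = spearmanr(x, y)
print(f"Pearson r={r:.2f} (p={p_r:.3g}), Spearman rho={rho:.2f} (p={p_rho:.3g})")
```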
## Outlier Detection

### IQR Method

**Bounds**: Q1 - 1.5×IQR to Q3 + 1.5×IQR

**Characteristics**: Simple, robust, works with skewed data

**Typical Rates**: < 5% (normal) | 5-10% (moderate) | > 10% (high, investigate)

### Z-Score Method

**Definition**: |z| > 3 where z = (x - μ) / σ

**Use**: Normal data, n > 30

**Avoid**: Small samples, skewed data, many outliers (they contaminate the mean and SD)
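A minimal sketch of both methods applied to one numeric column; the column name in the usage comment is illustrative:

```python
import pandas as pd

def flag_outliers(s: pd.Series) -> pd.DataFrame:
    # IQR method (Tukey's fences)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

    # Z-score method (assumes approximate normality)
    z = (s - s.mean()) / s.std()
    z_mask = z.abs() > 3

    return pd.DataFrame({"iqr_outlier": iqr_mask, "zscore_outlier": z_mask})

# Usage (illustrative column): flag_outliers(df["revenue"]).mean()  # outlier rates
```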
## Hypothesis Testing

**Significance Levels**: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal)

**p-value Interpretation**: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none)

**Key Considerations**:
- Statistical ≠ practical significance
- Multiple testing → use correction (Bonferroni, FDR)
- Large samples detect trivial effects
- Always report effect sizes with p-values

## Transformations

**Right-skewed**: Log, sqrt, Box-Cox

**Left-skewed**: Square, cube, exponential

**Heavy tails**: Robust scaling, winsorization, log

**Non-constant variance**: Log, Box-Cox

**Common Methods** (sketched below):
- **Log**: log(x+1) for positive skew, multiplicative relationships
- **Sqrt**: Count data, moderate skew
- **Box-Cox**: Automatically finds the optimal transformation (requires positive values)
- **Standardization**: (x-μ)/σ for scaling to unit variance
- **Min-Max**: (x-min)/(max-min) for [0,1] scaling
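A minimal sketch of the common transformations; the data is illustrative and strictly positive, as Box-Cox requires:

```python
import numpy as np
from scipy import stats

# Illustrative positive, right-skewed data
values = np.random.default_rng(3).lognormal(size=1_000)

log_t = np.log1p(values)                               # log(x + 1), tolerates zeros
sqrt_t = np.sqrt(values)                               # milder than log
boxcox_t, lam = stats.boxcox(values)                   # requires strictly positive values
standardized = (values - values.mean()) / values.std()
minmax = (values - values.min()) / (values.max() - values.min())

print(f"Box-Cox lambda={lam:.2f}, "
      f"skew before={stats.skew(values):.2f}, after={stats.skew(boxcox_t):.2f}")
```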
## Practical Guidelines

**Sample Size**: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects)

**Missing Data**: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods)

**Reporting**: Include test statistic, p-value, CI, effect size, n, assumption checks