Improve the EDA skill

Timothy Kassis
2025-11-04 17:25:06 -08:00
parent 1225ddecf1
commit ffad3d81b0
4 changed files with 362 additions and 988 deletions

View File

@@ -1,275 +1,202 @@
---
name: exploratory-data-analysis
description: "EDA toolkit. Analyze CSV/Excel/JSON/Parquet files, statistical summaries, distributions, correlations, outliers, missing data, visualizations, markdown reports, for data profiling and insights."
description: "Analyze datasets to discover patterns, anomalies, and relationships. Use when exploring data files, generating statistical summaries, checking data quality, or creating visualizations. Supports CSV, Excel, JSON, Parquet, and more."
---
# Exploratory Data Analysis
## Overview
EDA is a process for discovering patterns, anomalies, and relationships in data. Analyze CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outliers, and visualizations. All outputs are markdown-formatted for integration into workflows.
**Supported formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
## Standard Workflow
1. Run statistical analysis:
```bash
python scripts/eda_analyzer.py <data_file> -o <output_dir>
```
2. Generate visualizations:
```bash
python scripts/visualizer.py <data_file> -o <output_dir>
```
3. Read analysis results from `<output_dir>/eda_analysis.json`
4. Create report using `assets/report_template.md` structure
5. Present findings with key insights and visualizations
## Analysis Capabilities
### Statistical Analysis
Run `scripts/eda_analyzer.py` to generate comprehensive analysis:
```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```
Produces `output/eda_analysis.json` containing:
- Dataset shape, types, memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights
### Visualizations
Run `scripts/visualizer.py` to generate plots:
```bash
python scripts/visualizer.py sales_data.csv -o ./output
```
Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:
- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns detected)
### Automated Insights
Access generated insights from the `"insights"` key in the analysis JSON:
- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
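For example, a minimal sketch of pulling these insights into a markdown report section (this assumes `"insights"` is a flat list of strings; verify against the actual JSON produced by `eda_analyzer.py`):
```python
import json
with open("./output/eda_analysis.json") as f:
    analysis = json.load(f)
# Format each automated insight as a markdown bullet for the report
bullets = [f"- {insight}" for insight in analysis.get("insights", [])]
print("## Key Insights\n" + "\n".join(bullets))
```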
## Reference Materials
### Statistical Interpretation
See `references/statistical_tests_guide.md` for detailed guidance on:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations
Use when interpreting statistical results or explaining findings.
### Methodology
See `references/eda_best_practices.md` for comprehensive guidance on:
- 6-step EDA process framework
- Univariate, bivariate, multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences
Use when planning analysis or handling specific scenarios.
## Creating Analysis Reports
Use `assets/report_template.md` to structure findings. Template includes:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices
Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
## Example: Complete Analysis
User request: "Explore this sales_data.csv file"
```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output
# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```
```python
# 3. Read results
import json
with open('./output/eda_analysis.json') as f:
results = json.load(f)
# 4. Build report from assets/report_template.md
# - Fill sections with results
# - Embed images: ![Missing Data](./output/eda_visualizations/missing_data.png)
# - Include insights from results['insights']
# - Add recommendations
```
## Special Cases
### Dataset Size Strategy
**If < 100 rows**: Note sample size limitations, use non-parametric methods
**If 100-1M rows**: Standard workflow applies
**If > 1M rows**: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
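A minimal sampling sketch for the >1M-row case (the file name, sample size, and seed are placeholders; adjust to the dataset):
```python
import pandas as pd
df = pd.read_csv("large_dataset.csv")              # placeholder path; use chunked reads if it cannot fit in memory
sample = df.sample(n=100_000, random_state=42)     # fixed seed so the sample is reproducible
sample.to_csv("large_dataset_sample.csv", index=False)
# Then run: python scripts/eda_analyzer.py large_dataset_sample.csv -o ./output
```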
### Data Characteristics
**High-dimensional (>50 columns)**: Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance.
**Time series**: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.
**Imbalanced**: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.
## Output Guidelines
**Format findings as markdown**:
- Use headers, tables, and lists for structure
- Embed visualizations: `![Description](path/to/image.png)`
- Include code blocks for suggested transformations
- Highlight key insights
**Make reports actionable**:
- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to user's technical level
## Error Handling
**Unsupported formats**: Request conversion to supported format (CSV, Excel, JSON, Parquet)
**Files too large**: Recommend sampling or chunked processing
**Corrupted data**: Report specific errors, suggest cleaning steps, attempt partial analysis
**Empty columns**: Flag in data quality section, recommend removal or investigation
## Resources
**Scripts** (handle all formats automatically):
- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator
**References** (load as needed):
- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices
**Template**:
- `assets/report_template.md` - Professional report structure
## Key Points
- Run both scripts for complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps

View File

@@ -1,14 +1,10 @@
# EDA Report: [Dataset Name]
**Date**: [Date] | **Analyst**: [Name]
---
## Executive Summary
[Concise summary of key findings and recommendations]
**Key Findings**:
- [Finding 1]
@@ -23,366 +19,197 @@
## 1. Dataset Overview
**Source**: [Source name] | **Format**: [CSV/Excel/JSON/etc.] | **Period**: [Date range]
**Structure**: [Rows] observations × [Columns] variables | **Memory**: [Size] MB
**Variable Types**:
- Numeric ([Count]): [List names]
- Categorical ([Count]): [List names]
- Datetime ([Count]): [List names]
- Boolean ([Count]): [List names]
---
## 2. Data Quality
**Completeness**: [Percentage]% | **Duplicates**: [Count] ([%]%)
**Missing Data**:
| Column | Missing % | Assessment |
|--------|-----------|------------|
| [Column 1] | [%] | [High/Medium/Low] |
| [Column 2] | [%] | [High/Medium/Low] |
![Missing Data](path/to/missing_data.png)
**Quality Issues**:
- [Issue 1]
- [Issue 2]
---
## 3. Univariate Analysis
### Numeric: [Variable Name]
**Stats**: Mean: [Value] | Median: [Value] | Std: [Value] | Range: [[Min]-[Max]]
**Distribution**: Skewness: [Value] | Kurtosis: [Value] | Normality: [Yes/No]
**Outliers**: IQR: [Count] ([%]%) | Z-score: [Count] ([%]%)
![Distribution](path/to/distribution.png)
**Insights**: [Key observations]
### Categorical: [Variable Name]
**Stats**: [Count] unique values | Most common: [Value] ([%]%) | Balance: [Balanced/Imbalanced]
**Top Categories**:
| Category | Count | % |
|----------|-------|---|
| [Cat 1] | [Count] | [%] |
| [Cat 2] | [Count] | [%] |
| [Cat 3] | [Count] | [%] |
![Distribution](path/to/categorical.png)
**Insights**: [Key observations]
### Temporal: [Variable Name]
**Range**: [Start] to [End] ([Duration]) | **Trend**: [Increasing/Decreasing/Stable] | **Seasonality**: [Yes/No]
![Time Series](path/to/timeseries.png)
**Insights**: [Key observations]
---
## 4. Bivariate Analysis
**Correlation Summary**: [Count] strong positive | [Count] strong negative | [Count] weak/none
![Correlation Heatmap](path/to/correlation_heatmap.png)
**Notable Correlations**:
| Var 1 | Var 2 | Pearson | Spearman | Strength |
|-------|-------|---------|----------|----------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] |
**Insights**: [Multicollinearity issues, feature engineering opportunities]
### Key Relationship: [Var 1] vs [Var 2]
**Type**: [Linear/Non-linear/None] | **r**: [Value] | **p-value**: [Value]
![Scatter Plot](path/to/scatter.png)
**Insights**: [Description and implications]
---
## 5. Multivariate Analysis
![Scatter Matrix](path/to/scatter_matrix.png)
**Patterns**: [Key observations]
**Clustering** (if performed): [Method] | [Count] clusters identified
---
## 6. Outliers
**Overall Rate**: [%]%
| Variable | Outlier % | Method | Action |
|----------|-----------|--------|--------|
| [Var 1] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
| [Var 2] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
![Box Plots](path/to/boxplots.png)
**Investigation**: [Description of significant outliers, causes, validity]
---
## 7. Key Insights
**Data Quality**:
- [Insight with implication]
- [Insight with implication]
**Statistical Patterns**:
- [Insight with implication]
- [Insight with implication]
**Domain/Research Insights**:
- [Insight with implication]
- [Insight with implication]
**Unexpected Findings**:
- [Finding and significance]
---
## 8. Recommendations
**Data Quality Actions**:
- [ ] [Action - priority]
- [ ] [Action - priority]
**Next Steps**:
- [Step with rationale]
- [Step with rationale]
**Feature Engineering**:
- [Opportunity]
- [Opportunity]
**Modeling Considerations**:
- [Consideration]
- [Consideration]
---
## 9. Limitations
**Data**: [Key limitations]
**Analysis**: [Key limitations]
**Assumptions**: [Key assumptions made]
---
## Appendices
### A: Technical Details
**Environment**: Python with pandas, numpy, scipy, matplotlib, seaborn
**Scripts**: [Repository/location]
### B: Variable Dictionary
| Variable | Type | Description | Unit | Range | Missing % |
|----------|------|-------------|------|-------|-----------|
| [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] |
| [Var 2] | [Type] | [Description] | [Unit] | [Range] | [%] |
### C: Statistical Tests
**Normality**:
| Variable | Test | Statistic | p-value | Result |
|----------|------|-----------|---------|--------|
| [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] |
**Correlations**:
| Var 1 | Var 2 | r | p-value | Significant |
|-------|-------|---|---------|-------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] |
### D: Visualizations
1. [Description](path/to/viz1.png)
2. [Description](path/to/viz2.png)

View File

@@ -1,379 +1,125 @@
# EDA Best Practices
Methodologies for conducting thorough exploratory data analysis.
## 6-Step EDA Framework
### 1. Initial Understanding
**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?
**Actions**: Load data, inspect structure, review types, document context
### 2. Quality Assessment
**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy
**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)
### 3. Univariate Analysis
**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers
**Categorical**: Frequency distributions, unique counts, balance, bar charts
**Temporal**: Time range, gaps, trends, seasonality, time series plots
### 4. Bivariate Analysis
**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity
**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA
**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
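For example, a minimal sketch of the chi-square / Cramér's V check for two categorical columns (the DataFrame and column names are placeholders):
```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association strength between two categorical variables (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
# Example call (placeholder names): cramers_v(df["region"], df["segment"])
```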
### 5. Multivariate Analysis
**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering
**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
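A hedged PCA sketch for gauging how many components capture most of the variance (the file path is a placeholder; scikit-learn is assumed to be available):
```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("data.csv")                     # placeholder path
numeric = df.select_dtypes("number").dropna()    # PCA needs complete numeric rows
pca = PCA().fit(StandardScaler().fit_transform(numeric))
cumulative = pca.explained_variance_ratio_.cumsum()
print("components for 90% variance:", int((cumulative < 0.90).sum()) + 1)
```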
### 6. Insight Generation
**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications
## Best Practices
### Visualization Guidelines
**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar
**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter
### Statistical Analysis Guidelines
**Check Assumptions**: Normality, homoscedasticity, independence, linearity
**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes
**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation
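A minimal sketch of the parametric/non-parametric choice when comparing two groups (synthetic data; the p > 0.05 normality cutoff is a common convention, not a rule):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 80)   # synthetic placeholder groups
group_b = rng.normal(11, 2, 75)
normal = stats.shapiro(group_a).pvalue > 0.05 and stats.shapiro(group_b).pvalue > 0.05
result = stats.ttest_ind(group_a, group_b) if normal else stats.mannwhitneyu(group_a, group_b)
print(result)
```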
### Documentation Guidelines
**Notes**: Document assumptions, decisions, issues, findings
**Reproducibility**: Use scripts, version control, document sources, set random seeds
**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations
## Common Pitfalls
1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
## Domain-Specific Considerations
**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits
**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection
**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV
**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches
**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability
## Iterative Process
EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis
**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights
**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations
## Communication
**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code
**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations
**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix
## Checklists
**Before**: Understand context, define objectives, identify audience, set up environment
**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously
**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility

View File

@@ -1,252 +1,126 @@
# Statistical Tests Guide
Interpretation guidelines for common EDA statistical tests.
## Normality Tests
### Shapiro-Wilk
**Use**: Small to medium samples (n < 5000)
**H0**: Data is normal | **H1**: Data is not normal
**Interpretation**: p > 0.05 → likely normal | p ≤ 0.05 → not normal
**Note**: Very sensitive to sample size; small deviations may be significant in large samples
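A minimal sketch with `scipy.stats.shapiro` on synthetic data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)   # synthetic, roughly normal data
stat, p = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p:.3f}")            # p > 0.05 -> no evidence against normality
```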
### Anderson-Darling
**Use**: More powerful than Shapiro-Wilk, emphasizes tails
**Interpretation**: Test statistic > critical value → reject normality
### Kolmogorov-Smirnov
**Use**: Large samples or testing against non-normal distributions
**Interpretation**: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference
## Distribution Characteristics
### Skewness
**Measures asymmetry**:
- ≈ 0: Symmetric
- \> 0: Right-skewed (tail right)
- < 0: Left-skewed (tail left)
**Magnitude**: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high)
**Action**: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean
### Kurtosis
**Measures tailedness** (excess kurtosis, normal = 0):
- ≈ 0: Normal tails
- \> 0: Heavy tails, more outliers
- < 0: Light tails, fewer outliers
**Magnitude**: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different)
**Action**: High kurtosis → investigate outliers carefully
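For example, a quick check of both measures on synthetic right-skewed data (`scipy.stats.kurtosis` reports excess kurtosis by default):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
values = rng.lognormal(mean=0.0, sigma=0.75, size=1000)   # synthetic right-skewed data
print("skewness:", round(stats.skew(values), 2))
print("excess kurtosis:", round(stats.kurtosis(values), 2))
print("skewness after log:", round(stats.skew(np.log(values)), 2))
```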
## Correlation
### Pearson
**Measures**: Linear relationship (-1 to +1)
**Strength**: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong)
**Assumptions**: Linear, continuous, normal, no outliers, homoscedastic
**Use**: Expected linear relationship, assumptions met
### Spearman
**Measures**: Monotonic relationship (-1 to +1), rank-based
**Advantages**: Robust to outliers, no linearity assumption, works with ordinal, no normality required
**Use**: Outliers present, non-linear monotonic relationship, ordinal data, non-normal
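A minimal sketch computing both coefficients on synthetic data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2 * x + rng.normal(scale=0.5, size=300)   # roughly linear synthetic relationship
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f} (p = {p_r:.3g}) | Spearman rho = {rho:.2f} (p = {p_rho:.3g})")
```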
## Outlier Detection
### IQR Method
**Bounds**: Q1 - 1.5×IQR to Q3 + 1.5×IQR
**Characteristics**: Simple, robust, works with skewed data
**Typical Rates**: < 5% (normal) | 5-10% (moderate) | > 10% (high, investigate)
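A minimal sketch of Tukey's fences on synthetic data:
```python
import numpy as np
import pandas as pd
rng = np.random.default_rng(3)
s = pd.Series(np.append(rng.normal(100, 10, 500), [210, 250]))  # synthetic data plus two extremes
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(f"{len(outliers)} outliers ({len(outliers) / len(s):.1%})")
```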
### Z-Score Method
**Definition**: |z| > 3 where z = (x - μ) / σ
**Use**: Normal data, n > 30
**Avoid**: Small samples, skewed data, many outliers (contaminates mean/SD)
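A minimal z-score sketch on synthetic, roughly normal data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(4)
values = np.append(rng.normal(0, 1, 1000), [5.5, -6.0])  # synthetic data plus two extremes
z = stats.zscore(values)              # (x - mean) / std
outliers = values[np.abs(z) > 3]      # standard |z| > 3 threshold
print(f"{outliers.size} outliers flagged")
```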
## Hypothesis Testing
**Significance Levels**: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal)
**p-value Interpretation**: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none)
**Key Considerations**:
- Statistical ≠ practical significance
- Multiple testing → use correction (Bonferroni, FDR)
- Large samples detect trivial effects
- Always report effect sizes with p-values
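For the multiple-testing point above, a minimal correction sketch (assumes `statsmodels` is available; the p-values are placeholders):
```python
from statsmodels.stats.multitest import multipletests
p_values = [0.001, 0.012, 0.049, 0.20, 0.04]   # placeholder p-values from several exploratory tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_adjusted.round(3), reject)))
```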
## Transformations
**Right-skewed**: Log, sqrt, Box-Cox
**Left-skewed**: Square, cube, exponential
**Heavy tails**: Robust scaling, winsorization, log
**Non-constant variance**: Log, Box-Cox
**Common Methods**:
- **Log**: log(x+1) for positive skew, multiplicative relationships
- **Sqrt**: Count data, moderate skew
- **Box-Cox**: Auto-finds optimal (requires positive values)
- **Standardization**: (x-μ)/σ for scaling to unit variance
- **Min-Max**: (x-min)/(max-min) for [0,1] scaling
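A minimal sketch comparing two of the methods above on synthetic right-skewed data (Box-Cox requires strictly positive values):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(5)
x = rng.lognormal(mean=1.0, sigma=0.8, size=1000)   # synthetic right-skewed data
log_x = np.log1p(x)                                 # log(x + 1), safe when zeros are present
bc_x, lam = stats.boxcox(x)                         # also returns the fitted lambda
print(f"skew raw={stats.skew(x):.2f}  log={stats.skew(log_x):.2f}  box-cox={stats.skew(bc_x):.2f} (lambda={lam:.2f})")
```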
## Practical Guidelines
**Sample Size**: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects)
**Missing Data**: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods)
**Reporting**: Include test statistic, p-value, CI, effect size, n, assumption checks