Mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git (synced 2026-01-26 16:58:56 +08:00)

Commit: Improve the EDA skill
@@ -1,275 +1,202 @@
---
name: exploratory-data-analysis
description: "Analyze datasets to discover patterns, anomalies, and relationships. Use when exploring data files, generating statistical summaries, checking data quality, or creating visualizations. Supports CSV, Excel, JSON, Parquet, and more."
---

# Exploratory Data Analysis

Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.

**Supported formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
## Standard Workflow

1. Run statistical analysis:

```bash
python scripts/eda_analyzer.py <data_file> -o <output_dir>
```

2. Generate visualizations:

```bash
python scripts/visualizer.py <data_file> -o <output_dir>
```

3. Read analysis results from `<output_dir>/eda_analysis.json`
4. Create report using `assets/report_template.md` structure
5. Present findings with key insights and visualizations
## Analysis Capabilities

### Statistical Analysis

Run `scripts/eda_analyzer.py` to generate comprehensive analysis:

```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```

Produces `output/eda_analysis.json` containing:

- Dataset shape, types, memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights

### Visualizations

Run `scripts/visualizer.py` to generate plots:

```bash
python scripts/visualizer.py sales_data.csv -o ./output
```

Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:

- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns detected)

### Automated Insights

Access generated insights from the `"insights"` key in the analysis JSON:

- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
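As a quick illustration, here is a minimal sketch of pulling those insights out programmatically. It assumes only what is documented above: the analyzer writes `eda_analysis.json` with a top-level `"insights"` list.

```python
import json

# Load the analyzer output produced by scripts/eda_analyzer.py
with open("./output/eda_analysis.json") as f:
    analysis = json.load(f)

# Print each automated insight on its own line
for insight in analysis.get("insights", []):
    print(f"- {insight}")
```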
## Reference Materials

### Statistical Interpretation

See `references/statistical_tests_guide.md` for detailed guidance on:

- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations

Use when interpreting statistical results or explaining findings.

### Methodology

See `references/eda_best_practices.md` for comprehensive guidance on:

- 6-step EDA process framework
- Univariate, bivariate, multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences

Use when planning analysis or handling specific scenarios.

## Report Template

Use `assets/report_template.md` to structure findings. Template includes:

- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices

Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
## Example: Complete Analysis

User request: "Explore this sales_data.csv file"

```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output

# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```

```python
# 3. Read results
import json
with open('./output/eda_analysis.json') as f:
    results = json.load(f)

# 4. Build report from assets/report_template.md
# - Fill sections with results
# - Embed images: ![description](./output/eda_visualizations/<plot>.png)
# - Include insights from results['insights']
# - Add recommendations
```
## Special Cases

### Dataset Size Strategy

**If < 100 rows**: Note sample size limitations, use non-parametric methods

**If 100-1M rows**: Standard workflow applies

**If > 1M rows**: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
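For the large-data case, one possible approach (a sketch, not part of the bundled scripts; file names are illustrative) is to draw a reproducible random sample before running the analyzer:

```python
import pandas as pd

# Draw an at-most-100k-row sample with a fixed seed so the exploration is reproducible
df = pd.read_csv("sales_data.csv")
sample = df.sample(n=min(100_000, len(df)), random_state=42)
sample.to_csv("sales_data_sample.csv", index=False)
# Then run: python scripts/eda_analyzer.py sales_data_sample.csv -o ./output
```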
### Data Characteristics

**High-dimensional (>50 columns)**: Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance.

**Time series**: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.

**Imbalanced**: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.
## Output Guidelines

**Format findings as markdown**:

- Use headers, tables, and lists for structure
- Embed visualizations: `![description](path/to/image.png)`
- Include code blocks for suggested transformations
- Highlight key insights

**Make reports actionable**:

- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to user's technical level
## Error Handling

**Unsupported formats**: Request conversion to a supported format (CSV, Excel, JSON, Parquet)

**Files too large**: Recommend sampling or chunked processing

**Corrupted data**: Report specific errors, suggest cleaning steps, attempt partial analysis

**Empty columns**: Flag in data quality section, recommend removal or investigation
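For files too large to load at once, a chunked pass with pandas is one possible fallback. This is a sketch only; the file name and chunk size are illustrative:

```python
import pandas as pd

# Accumulate lightweight summaries chunk by chunk instead of loading everything
total_rows = 0
missing = None
for chunk in pd.read_csv("huge_file.csv", chunksize=250_000):
    total_rows += len(chunk)
    counts = chunk.isna().sum()
    missing = counts if missing is None else missing + counts

print(f"rows: {total_rows}")
print((missing / total_rows).sort_values(ascending=False).head(10))  # worst missing-data columns
```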
## Resources

**Scripts** (handle all formats automatically):

- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator

**References** (load as needed):

- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices

**Template**:

- `assets/report_template.md` - Professional report structure

## Key Points

- Run both scripts for complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps
@@ -1,14 +1,10 @@
# EDA Report: [Dataset Name]

**Date**: [Date] | **Analyst**: [Name]

## Executive Summary

[Concise summary of key findings and recommendations]

**Key Findings**:
- [Finding 1]
@@ -23,366 +19,197 @@
## 1. Dataset Overview

**Source**: [Source name] | **Format**: [CSV/Excel/JSON/etc.] | **Period**: [Date range]

**Structure**: [Rows] observations × [Columns] variables | **Memory**: [Size] MB

**Variable Types**:
- Numeric ([Count]): [List names]
- Categorical ([Count]): [List names]
- Datetime ([Count]): [List names]
- Boolean ([Count]): [List names]

---

## 2. Data Quality

**Completeness**: [Percentage]% | **Duplicates**: [Count] ([%]%)

**Missing Data**:

| Column | Missing % | Assessment |
|--------|-----------|------------|
| [Column 1] | [%] | [High/Medium/Low] |
| [Column 2] | [%] | [High/Medium/Low] |

![Missing data patterns](path/to/missing_data.png)

**Quality Issues**:
- [Issue 1]
- [Issue 2]

---
## 3. Univariate Analysis

### Numeric: [Variable Name]

**Stats**: Mean: [Value] | Median: [Value] | Std: [Value] | Range: [[Min]-[Max]]

**Distribution**: Skewness: [Value] | Kurtosis: [Value] | Normality: [Yes/No]

**Outliers**: IQR: [Count] ([%]%) | Z-score: [Count] ([%]%)

![Distribution of [Variable]](path/to/distribution.png)

**Insights**: [Key observations]

### Categorical: [Variable Name]

**Stats**: [Count] unique values | Most common: [Value] ([%]%) | Balance: [Balanced/Imbalanced]

| Category | Count | % |
|----------|-------|---|
| [Cat 1] | [Count] | [%] |
| [Cat 2] | [Count] | [%] |

![Distribution of [Variable]](path/to/categorical.png)

**Insights**: [Key observations]

### Temporal: [Variable Name]

**Range**: [Start] to [End] ([Duration]) | **Trend**: [Increasing/Decreasing/Stable] | **Seasonality**: [Yes/No]

![Time Series of [Variable]](path/to/timeseries.png)

**Insights**: [Key observations]

---
## 4. Bivariate Analysis

**Correlation Summary**: [Count] strong positive | [Count] strong negative | [Count] weak/none

![Correlation heatmap](path/to/correlation_heatmap.png)

**Notable Correlations**:

| Var 1 | Var 2 | Pearson | Spearman | Strength |
|-------|-------|---------|----------|----------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] |

**Insights**: [Multicollinearity issues, feature engineering opportunities]

### Key Relationship: [Var 1] vs [Var 2]

**Type**: [Linear/Non-linear/None] | **r**: [Value] | **p-value**: [Value]

![Scatter plot of [Var 1] vs [Var 2]](path/to/scatter.png)

**Insights**: [Description and implications]

---

## 5. Multivariate Analysis

![Scatter matrix](path/to/scatter_matrix.png)

**Patterns**: [Key observations]

**Clustering** (if performed): [Method] | [Count] clusters identified

---
## 6. Outliers

**Overall Rate**: [%]%

| Variable | Outlier % | Method | Action |
|----------|-----------|--------|--------|
| [Var 1] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
| [Var 2] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |

![Box plots with outliers](path/to/boxplots.png)

**Investigation**: [Description of significant outliers, causes, validity]

---

## 7. Key Insights

**Data Quality**:
- [Insight with implication]
- [Insight with implication]

**Statistical Patterns**:
- [Insight with implication]
- [Insight with implication]

**Domain/Research Insights**:
- [Insight with implication]
- [Insight with implication]

**Unexpected Findings**:
- [Finding and significance]

---
## 8. Recommendations

**Data Quality Actions**:
- [ ] [Action - priority]
- [ ] [Action - priority]

**Next Steps**:
- [Step with rationale]
- [Step with rationale]

**Feature Engineering**:
- [Opportunity]
- [Opportunity]

**Modeling Considerations**:
- [Consideration]
- [Consideration]

---

## 9. Limitations

**Data**: [Key limitations]

**Analysis**: [Key limitations]

**Assumptions**: [Key assumptions made]

---

## Appendices

### A: Technical Details

**Environment**: Python with pandas, numpy, scipy, matplotlib, seaborn

**Scripts**: [Repository/location]

### B: Variable Dictionary

| Variable | Type | Description | Unit | Range | Missing % |
|----------|------|-------------|------|-------|-----------|
| [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] |

### C: Statistical Tests

**Normality**:

| Variable | Test | Statistic | p-value | Result |
|----------|------|-----------|---------|--------|
| [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] |

**Correlations**:

| Var 1 | Var 2 | r | p-value | Significant |
|-------|-------|---|---------|-------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] |

### D: Visualizations

1. [Description](path/to/viz1.png)
2. [Description](path/to/viz2.png)
@@ -1,379 +1,125 @@
# EDA Best Practices

Methodologies for conducting thorough exploratory data analysis.

## 6-Step EDA Framework

### 1. Initial Understanding

**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?

**Actions**: Load data, inspect structure, review types, document context
### 2. Quality Assessment

**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy

**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)
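The checks above map onto a few pandas one-liners; this is a quick sketch (the file name is illustrative, not tied to the skill's scripts):

```python
import pandas as pd

df = pd.read_csv("data.csv")

print(df.isna().mean().sort_values(ascending=False))  # missing-data rate per column
print(f"duplicate rows: {df.duplicated().sum()}")      # exact duplicates
print(df.nunique().sort_values())                      # constant / low-cardinality columns
print(df.describe())                                   # spot impossible values via min/max
```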
### 3. Univariate Analysis

**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers

**Categorical**: Frequency distributions, unique counts, balance, bar charts

**Temporal**: Time range, gaps, trends, seasonality, time series plots
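The numeric and categorical summaries come straight from pandas; a small sketch with generic file and column handling:

```python
import pandas as pd

df = pd.read_csv("data.csv")
numeric = df.select_dtypes("number")

print(numeric.describe())   # central tendency and dispersion
print(numeric.skew())       # asymmetry per column
print(numeric.kurtosis())   # tail heaviness per column

for col in df.select_dtypes("object"):
    print(df[col].value_counts(normalize=True).head())  # category balance
```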
### 4. Bivariate Analysis

**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity

**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA

**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
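A compact sketch of the three cases with scipy and pandas (column names such as `x`, `y`, `segment`, and `region` are placeholders):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv").dropna(subset=["x", "y", "segment", "region"])

# Numeric vs numeric: Pearson and Spearman coefficients
r, p = stats.pearsonr(df["x"], df["y"])
rho, p_s = stats.spearmanr(df["x"], df["y"])

# Numeric vs categorical: compare group means with one-way ANOVA
groups = [g["y"].values for _, g in df.groupby("segment")]
f_stat, p_anova = stats.f_oneway(*groups)

# Categorical vs categorical: chi-square on a contingency table, plus Cramér's V
table = pd.crosstab(df["segment"], df["region"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)
n = table.values.sum()
cramers_v = (chi2 / (n * (min(table.shape) - 1))) ** 0.5
```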
### 5. Multivariate Analysis

**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering

**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
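As one example of the dimensionality question, a quick PCA pass shows how much variance a few components capture. This is a sketch using scikit-learn (scaling first so no single feature dominates; file name is illustrative):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")
X = df.select_dtypes("number").dropna()

pca = PCA(n_components=min(5, X.shape[1]))
pca.fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance explained
```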
### 6. Insight Generation

**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications
## Visualization Guidelines

**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar

**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter
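These chart types map directly onto seaborn calls; a brief sketch that assumes an illustrative numeric column `amount` and categorical column `segment`:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")

sns.histplot(df["amount"], kde=True)             # distribution
plt.figure()
sns.boxplot(data=df, x="segment", y="amount")    # comparison across categories
plt.figure()
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")  # relationships
plt.show()
```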
## Statistical Analysis Guidelines

**Check Assumptions**: Normality, homoscedasticity, independence, linearity

**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes

**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation
## Documentation Guidelines

**Notes**: Document assumptions, decisions, issues, findings

**Reproducibility**: Use scripts, version control, document sources, set random seeds

**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations
## Common Pitfalls

1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature (see the sketch after this list)
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
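For pitfall 5, one way to apply a correction across a batch of exploratory p-values is statsmodels' `multipletests`; the p-values below are illustrative:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.02, 0.04, 0.30, 0.70]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_values, p_adjusted, reject)))  # Benjamini-Hochberg adjusted results
```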
## Domain-Specific Considerations

**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits

**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection

**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV

**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches

**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability
## Iterative Process
|
## Iterative Process
|
||||||
|
|
||||||
EDA is not linear - iterate and refine:
|
EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis
|
||||||
|
|
||||||
1. **Initial exploration** → Identify questions
|
**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights
|
||||||
2. **Focused analysis** → Answer specific questions
|
|
||||||
3. **New insights** → Generate new questions
|
|
||||||
4. **Deeper investigation** → Refine understanding
|
|
||||||
5. **Synthesis** → Integrate findings
|
|
||||||
|
|
||||||
### When to Stop
|
**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations
|
||||||
|
|
||||||
You've done enough EDA when:
|
## Communication
|
||||||
- ✅ You understand the data structure and quality
|
|
||||||
- ✅ You've characterized key variables
|
|
||||||
- ✅ You've identified important relationships
|
|
||||||
- ✅ You've documented limitations
|
|
||||||
- ✅ You can answer your research questions
|
|
||||||
- ✅ You have actionable insights
|
|
||||||
|
|
||||||
### Moving Forward
|
**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code
|
||||||
|
|
||||||
After EDA, you should have:
|
**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations
|
||||||
- Clear understanding of data
|
|
||||||
- List of quality issues and how to handle them
|
|
||||||
- Insights about relationships and patterns
|
|
||||||
- Hypotheses to test
|
|
||||||
- Ideas for feature engineering
|
|
||||||
- Recommendations for next steps
|
|
||||||
|
|
||||||
## Communication Tips
|
**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix
|
||||||
|
|
||||||
### For Technical Audiences
|
## Checklists
|
||||||
- Include methodological details
|
|
||||||
- Show statistical test results
|
|
||||||
- Discuss assumptions and limitations
|
|
||||||
- Provide reproducible code
|
|
||||||
- Reference relevant literature
|
|
||||||
|
|
||||||
### For Non-Technical Audiences
|
**Before**: Understand context, define objectives, identify audience, set up environment
|
||||||
- Focus on insights, not methods
|
|
||||||
- Use clear visualizations
|
|
||||||
- Avoid jargon
|
|
||||||
- Provide context and implications
|
|
||||||
- Make recommendations concrete
|
|
||||||
|
|
||||||
### Report Structure
|
**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously
|
||||||
1. **Executive Summary**: Key findings and recommendations
|
|
||||||
2. **Data Overview**: Source, structure, limitations
|
|
||||||
3. **Analysis**: Findings organized by theme
|
|
||||||
4. **Insights**: Patterns, anomalies, implications
|
|
||||||
5. **Recommendations**: Next steps and actions
|
|
||||||
6. **Appendix**: Technical details, full statistics
|
|
||||||
|
|
||||||
## Useful Checklists
|
**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility
|
||||||
|
|
||||||
### Before Starting
|
|
||||||
- [ ] Understand business/research context
|
|
||||||
- [ ] Define analysis objectives
|
|
||||||
- [ ] Identify stakeholders and audience
|
|
||||||
- [ ] Secure necessary permissions
|
|
||||||
- [ ] Set up reproducible environment
|
|
||||||
|
|
||||||
### During Analysis
|
|
||||||
- [ ] Load and inspect data structure
|
|
||||||
- [ ] Assess data quality
|
|
||||||
- [ ] Analyze univariate distributions
|
|
||||||
- [ ] Explore bivariate relationships
|
|
||||||
- [ ] Investigate multivariate patterns
|
|
||||||
- [ ] Generate and validate insights
|
|
||||||
- [ ] Document findings continuously
|
|
||||||
|
|
||||||
### Before Concluding
|
|
||||||
- [ ] Verify all findings
|
|
||||||
- [ ] Check for alternative explanations
|
|
||||||
- [ ] Document limitations
|
|
||||||
- [ ] Prepare clear visualizations
|
|
||||||
- [ ] Write actionable recommendations
|
|
||||||
- [ ] Review with domain experts
|
|
||||||
- [ ] Ensure reproducibility
|
|
||||||
|
|
||||||
## Tools and Libraries
|
|
||||||
|
|
||||||
### Python Ecosystem
|
|
||||||
- **pandas**: Data manipulation
|
|
||||||
- **numpy**: Numerical operations
|
|
||||||
- **matplotlib/seaborn**: Visualization
|
|
||||||
- **scipy**: Statistical tests
|
|
||||||
- **scikit-learn**: ML preprocessing
|
|
||||||
- **plotly**: Interactive visualizations
|
|
||||||
|
|
||||||
### Best Tool Practices
|
|
||||||
- Use appropriate tool for task
|
|
||||||
- Leverage vectorization
|
|
||||||
- Chain operations efficiently
|
|
||||||
- Handle missing data properly
|
|
||||||
- Validate results independently
|
|
||||||
- Document custom functions
|
|
||||||
|
|
||||||
## Further Resources

- **Books**:
  - "Exploratory Data Analysis" by John Tukey
  - "The Art of Statistics" by David Spiegelhalter
- **Guidelines**:
  - ASA Statistical Significance Statement
  - FAIR data principles
- **Communities**:
  - Cross Validated (Stack Exchange)
  - /r/datascience
  - Local data science meetups
@@ -1,252 +1,126 @@
# Statistical Tests Guide for EDA

# Statistical Tests Guide

This guide provides interpretation guidelines for statistical tests commonly used in exploratory data analysis.

Interpretation guidelines for common EDA statistical tests.
## Normality Tests

### Shapiro-Wilk Test

### Shapiro-Wilk
**Purpose**: Test if a sample comes from a normally distributed population

**Use**: Small to medium samples (n < 5000)

**When to use**: Best for small to medium sample sizes (n < 5000)

**H0**: Data is normal | **H1**: Data is not normal

**Interpretation**:

**Interpretation**: p > 0.05 → likely normal | p ≤ 0.05 → not normal

- **Null Hypothesis (H0)**: The data follows a normal distribution
- **Alternative Hypothesis (H1)**: The data does not follow a normal distribution
- **p-value > 0.05**: Fail to reject H0 → Data is likely normally distributed
- **p-value ≤ 0.05**: Reject H0 → Data is not normally distributed

**Notes**:

**Note**: Very sensitive to sample size; small deviations may be significant in large samples

- Very sensitive to sample size
- Small deviations from normality may be detected as significant in large samples
- Consider practical significance alongside statistical significance
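
A minimal sketch of running the test with SciPy (the sample here is synthetic, standing in for a numeric column):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=500)  # synthetic stand-in for a column

stat, p_value = stats.shapiro(sample)
print(f"W = {stat:.4f}, p = {p_value:.4f}")

if p_value > 0.05:
    print("No evidence against normality (fail to reject H0)")
else:
    print("Data deviates from normality (reject H0)")
```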
### Anderson-Darling Test

### Anderson-Darling

**Purpose**: Test if a sample comes from a specific distribution (typically normal)

**Use**: More powerful than Shapiro-Wilk, emphasizes tails

**When to use**: More powerful than Shapiro-Wilk for detecting departures from normality

**Interpretation**: Test statistic > critical value → reject normality

**Interpretation**:

### Kolmogorov-Smirnov

- Compares test statistic against critical values at different significance levels
- If test statistic > critical value at given significance level, reject normality
- More weight given to tails of distribution than other tests

### Kolmogorov-Smirnov Test

**Use**: Large samples or testing against non-normal distributions

**Purpose**: Test if a sample comes from a reference distribution

**Interpretation**: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference

**When to use**: When you have a large sample or want to test against distributions other than normal

**Interpretation**:

- **p-value > 0.05**: Sample distribution matches reference distribution
- **p-value ≤ 0.05**: Sample distribution differs from reference distribution
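
Both tests are available in `scipy.stats`; a sketch on a deliberately skewed synthetic sample (note that standardizing with parameters estimated from the same sample makes the KS p-value only approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2_000)  # deliberately skewed sample

# Anderson-Darling: compare the statistic against the critical values
ad = stats.anderson(x, dist="norm")
for crit, sig in zip(ad.critical_values, ad.significance_level):
    decision = "reject" if ad.statistic > crit else "fail to reject"
    print(f"alpha={sig}%: statistic={ad.statistic:.3f} vs critical={crit:.3f} -> {decision} normality")

# Kolmogorov-Smirnov against a standard normal after standardizing the sample
z = (x - x.mean()) / x.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")
print(f"KS statistic = {ks_stat:.4f}, p = {ks_p:.4g}")
```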
## Distribution Characteristics

### Skewness

**Purpose**: Measure asymmetry of the distribution

**Measures asymmetry**:

- ≈ 0: Symmetric
- \> 0: Right-skewed (tail right)
- < 0: Left-skewed (tail left)

**Interpretation**:

**Magnitude**: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high)

- **Skewness ≈ 0**: Symmetric distribution
- **Skewness > 0**: Right-skewed (tail extends to right, most values on left)
- **Skewness < 0**: Left-skewed (tail extends to left, most values on right)

**Magnitude interpretation**:

**Action**: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean

- **|Skewness| < 0.5**: Approximately symmetric
- **0.5 ≤ |Skewness| < 1**: Moderately skewed
- **|Skewness| ≥ 1**: Highly skewed

**Implications**:

- Highly skewed data may require transformation (log, sqrt, Box-Cox)
- Mean is pulled toward tail; median more robust for skewed data
- Many statistical tests assume symmetry/normality
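
Skewness (and the excess kurtosis discussed next) are one-liners in pandas; a sketch on a synthetic right-skewed column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.8, size=5_000)})  # synthetic, right-skewed

skew = df["income"].skew()   # sample skewness
kurt = df["income"].kurt()   # excess kurtosis (≈ 0 for a normal distribution)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")

# If |skewness| >= 1, a variance-stabilizing transform often helps
if abs(skew) >= 1:
    df["log_income"] = np.log1p(df["income"])
    print(f"after log1p: skewness = {df['log_income'].skew():.2f}")
```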
### Kurtosis

**Purpose**: Measure tailedness and peak of distribution

**Measures tailedness** (excess kurtosis, normal = 0):

- ≈ 0: Normal tails
- \> 0: Heavy tails, more outliers
- < 0: Light tails, fewer outliers

**Interpretation** (Excess Kurtosis, where normal distribution = 0):

**Magnitude**: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different)

- **Kurtosis ≈ 0**: Normal tail behavior (mesokurtic)
- **Kurtosis > 0**: Heavy tails, sharp peak (leptokurtic)
  - More outliers than normal distribution
  - Higher probability of extreme values
- **Kurtosis < 0**: Light tails, flat peak (platykurtic)
  - Fewer outliers than normal distribution
  - More uniform distribution

**Magnitude interpretation**:

**Action**: High kurtosis → investigate outliers carefully

- **|Kurtosis| < 0.5**: Normal-like tails
- **0.5 ≤ |Kurtosis| < 1**: Moderately different tails
- **|Kurtosis| ≥ 1**: Very different tail behavior from normal

**Implications**:

## Correlation

- High kurtosis → Be cautious with outliers
- Low kurtosis → Distribution lacks distinct peak
## Correlation Tests

### Pearson

### Pearson Correlation

**Measures**: Linear relationship (-1 to +1)

**Purpose**: Measure linear relationship between two continuous variables

**Strength**: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong)

**Range**: -1 to +1

**Assumptions**: Linear, continuous, normal, no outliers, homoscedastic

**Interpretation**:

**Use**: Expected linear relationship, assumptions met

- **r = +1**: Perfect positive linear relationship
- **r = 0**: No linear relationship
- **r = -1**: Perfect negative linear relationship

**Strength guidelines**:

### Spearman

- **|r| < 0.3**: Weak correlation
- **0.3 ≤ |r| < 0.5**: Moderate correlation
- **0.5 ≤ |r| < 0.7**: Strong correlation
- **|r| ≥ 0.7**: Very strong correlation

**Assumptions**:

**Measures**: Monotonic relationship (-1 to +1), rank-based

- Linear relationship between variables
- Both variables continuous and normally distributed
- No significant outliers
- Homoscedasticity (constant variance)

**When to use**: When relationship is expected to be linear and data meets assumptions

**Advantages**: Robust to outliers, no linearity assumption, works with ordinal, no normality required

### Spearman Correlation

**Use**: Outliers present, non-linear monotonic relationship, ordinal data, non-normal

**Purpose**: Measure monotonic relationship between two variables (rank-based)

## Outlier Detection

**Range**: -1 to +1

### IQR Method

**Interpretation**: Same as Pearson, but measures monotonic (not just linear) relationships

**Bounds**: Q1 - 1.5×IQR to Q3 + 1.5×IQR

**Advantages over Pearson**:

**Characteristics**: Simple, robust, works with skewed data

- Robust to outliers (uses ranks)
- Doesn't assume linear relationship
- Works with ordinal data
- Doesn't require normality assumption

**When to use**:

**Typical Rates**: < 5% (normal) | 5-10% (moderate) | > 10% (high, investigate)

- Data has outliers
- Relationship is monotonic but not linear
- Data is ordinal
- Distribution is non-normal
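
A sketch comparing the two coefficients on synthetic data with a monotonic but non-linear relationship (variable names are made up):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=300)
y = np.exp(x) + rng.normal(scale=0.1, size=300)  # monotonic but non-linear

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)
print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3g})")

# Pairwise correlation matrices on a DataFrame
df = pd.DataFrame({"x": x, "y": y})
print(df.corr(method="pearson"))
print(df.corr(method="spearman"))
```

Spearman typically reports a stronger association here because it only requires monotonicity, not linearity.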
## Outlier Detection Methods

### IQR Method (Interquartile Range)

**Definition**:

- Lower bound: Q1 - 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
- Values outside these bounds are outliers

**Characteristics**:

- Simple and interpretable
- Robust to extreme values
- Works well for skewed distributions
- Conservative approach (Tukey's fences)

**Interpretation**:

- **< 5% outliers**: Typical for most datasets
- **5-10% outliers**: Moderate, investigate causes
- **> 10% outliers**: High rate, may indicate data quality issues or interesting phenomena
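
A minimal sketch of Tukey's fences with pandas (the series is synthetic, with a few planted extremes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series(np.concatenate([rng.normal(100, 15, 1_000), [400, 450, -50]]))  # planted outliers

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(f"bounds: [{lower:.1f}, {upper:.1f}]  outliers: {len(outliers)} ({len(outliers) / len(s):.1%})")
```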
### Z-Score Method

**Definition**: Outliers are data points with |z-score| > 3

**Definition**: |z| > 3 where z = (x - μ) / σ

**Formula**: z = (x - μ) / σ

**Use**: Normal data, n > 30

**Characteristics**:

**Avoid**: Small samples, skewed data, many outliers (contaminates mean/SD)

- Assumes normal distribution
- Sensitive to extreme values
- Standard threshold is |z| > 3 (99.7% of data within ±3σ)

**When to use**:

## Hypothesis Testing

- Data is approximately normally distributed
- Large sample sizes (n > 30)

**When NOT to use**:

**Significance Levels**: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal)

- Small samples
- Heavily skewed data
- Data with many outliers (contaminates mean and SD)
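
A sketch using `scipy.stats.zscore` on a roughly normal synthetic sample with two planted extremes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 5_000), [6.5, -7.2]])  # two planted extremes

z = stats.zscore(x, ddof=1)
outlier_mask = np.abs(z) > 3
print(f"flagged {outlier_mask.sum()} of {len(x)} points")
print(x[outlier_mask])
```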
## Hypothesis Testing Guidelines

**p-value Interpretation**: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none)

### Significance Levels

**Key Considerations**:

- Statistical ≠ practical significance
- Multiple testing → use correction (Bonferroni, FDR)
- Large samples detect trivial effects
- Always report effect sizes with p-values

- **α = 0.05**: Standard significance level (5% chance of Type I error)

## Transformations

- **α = 0.01**: More conservative (1% chance of Type I error)
- **α = 0.10**: More liberal (10% chance of Type I error)

### p-value Interpretation

**Right-skewed**: Log, sqrt, Box-Cox

- **p ≤ 0.001**: Very strong evidence against H0 (***)

**Left-skewed**: Square, cube, exponential

- **0.001 < p ≤ 0.01**: Strong evidence against H0 (**)
- **0.01 < p ≤ 0.05**: Moderate evidence against H0 (*)
- **0.05 < p ≤ 0.10**: Weak evidence against H0
- **p > 0.10**: Little to no evidence against H0

### Important Considerations

**Heavy tails**: Robust scaling, winsorization, log

1. **Statistical vs Practical Significance**: A small p-value doesn't always mean the effect is important

**Non-constant variance**: Log, Box-Cox

2. **Multiple Testing**: When performing many tests, use correction methods (Bonferroni, FDR)
3. **Sample Size**: Large samples can detect trivial effects as significant
4. **Effect Size**: Always report and interpret effect sizes alongside p-values
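
For the multiple-testing point, `statsmodels` provides standard corrections; a sketch with made-up p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from screening several variables against a target
p_values = np.array([0.001, 0.008, 0.012, 0.049, 0.21, 0.56])

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni:", np.round(p_bonf, 3), reject_bonf)
print("Benjamini-Hochberg:", np.round(p_fdr, 3), reject_fdr)
```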
## Data Transformation Strategies

**Common Methods**:

- **Log**: log(x+1) for positive skew, multiplicative relationships

### When to Transform

- **Sqrt**: Count data, moderate skew
- **Box-Cox**: Auto-finds optimal (requires positive values)
- **Right-skewed data**: Log, square root, or Box-Cox transformation
- **Standardization**: (x-μ)/σ for scaling to unit variance
- **Left-skewed data**: Square, cube, or exponential transformation
- **Min-Max**: (x-min)/(max-min) for [0,1] scaling
- **Heavy tails/outliers**: Robust scaling, winsorization, or log transformation
- **Non-constant variance**: Log or Box-Cox transformation

### Common Transformations

1. **Log transformation**: log(x) or log(x + 1)
   - Best for: Positively skewed data, multiplicative relationships
   - Cannot use with zero or negative values
2. **Square root transformation**: √x
   - Best for: Count data, moderate positive skew
   - Less aggressive than log
3. **Box-Cox transformation**: (x^λ - 1) / λ
   - Best for: Automatically finding the optimal transformation
   - Requires positive values
4. **Standardization**: (x - μ) / σ
   - Best for: Putting features on a comparable scale
   - Centers data at 0 with unit variance
5. **Min-Max scaling**: (x - min) / (max - min)
   - Best for: Scaling to [0, 1] range
   - Preserves zero values
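
A sketch applying the common transforms with NumPy, SciPy, and scikit-learn (the `x` column is synthetic and strictly positive, which Box-Cox requires):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(11)
df = pd.DataFrame({"x": rng.lognormal(mean=2.0, sigma=0.7, size=1_000)})  # positive, right-skewed

df["log_x"] = np.log1p(df["x"])               # log(x + 1): safe when zeros are present
df["sqrt_x"] = np.sqrt(df["x"])               # milder than log
df["boxcox_x"], lam = stats.boxcox(df["x"])   # requires strictly positive values
print(f"Box-Cox lambda = {lam:.2f}")

df["z_x"] = StandardScaler().fit_transform(df[["x"]]).ravel()     # mean 0, unit variance
df["minmax_x"] = MinMaxScaler().fit_transform(df[["x"]]).ravel()  # scaled to [0, 1]
```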
## Practical Guidelines

### Sample Size Considerations

**Sample Size**: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects)

- **n < 30**: Use non-parametric tests, be cautious with assumptions

**Missing Data**: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods)

- **30 ≤ n < 100**: Moderate sample, parametric tests usually acceptable
- **n ≥ 100**: Large sample, parametric tests robust to violations
- **n ≥ 1000**: Very large sample, may detect trivial effects as significant

### Dealing with Missing Data

**Reporting**: Include test statistic, p-value, CI, effect size, n, assumption checks

- **< 5% missing**: Usually not a problem, simple methods OK
- **5-10% missing**: Use appropriate imputation methods
- **> 10% missing**: Investigate patterns, consider advanced imputation or modeling missingness

### Reporting Results

Always include:

1. Test statistic value
2. p-value
3. Confidence interval (when applicable)
4. Effect size
5. Sample size
6. Assumptions checked and violations noted
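
The missing-data thresholds above can be turned into a quick per-column report; a sketch assuming a hypothetical `data.csv`:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset with gaps

missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)

for col, pct in missing_pct.items():
    if pct == 0:
        continue
    if pct < 5:
        strategy = "simple fill (median/mode) or drop affected rows"
    elif pct <= 10:
        strategy = "impute (e.g. KNN or model-based)"
    else:
        strategy = "investigate the missingness pattern before imputing"
    print(f"{col}: {pct:.1f}% missing -> {strategy}")
```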