diff --git a/scientific-thinking/exploratory-data-analysis/SKILL.md b/scientific-thinking/exploratory-data-analysis/SKILL.md
index c26ff15..950d851 100644
--- a/scientific-thinking/exploratory-data-analysis/SKILL.md
+++ b/scientific-thinking/exploratory-data-analysis/SKILL.md
@@ -1,275 +1,202 @@
 ---
 name: exploratory-data-analysis
-description: "EDA toolkit. Analyze CSV/Excel/JSON/Parquet files, statistical summaries, distributions, correlations, outliers, missing data, visualizations, markdown reports, for data profiling and insights."
+description: "Analyze datasets to discover patterns, anomalies, and relationships. Use when exploring data files, generating statistical summaries, checking data quality, or creating visualizations. Supports CSV, Excel, JSON, Parquet, and more."
 ---
 
 # Exploratory Data Analysis
 
-## Overview
+Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.
 
-EDA is a process for discovering patterns, anomalies, and relationships in data. Analyze CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outliers, and visualizations. All outputs are markdown-formatted for integration into workflows.
+**Supported formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
 
-## When to Use This Skill
-
-This skill should be used when:
-- User provides a data file and requests analysis or exploration
-- User asks to "explore this dataset", "analyze this data", or "what's in this file?"
-- User needs statistical summaries, distributions, or correlations
-- User requests data visualizations or insights
-- User wants to understand data quality issues or patterns
-- User mentions EDA, exploratory analysis, or data profiling
-
-**Supported file formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
-
-## Quick Start Workflow
-
-1. **Receive data file** from user
-2. **Run comprehensive analysis** using `scripts/eda_analyzer.py`
-3. **Generate visualizations** using `scripts/visualizer.py`
-4. **Create markdown report** using insights and the `assets/report_template.md` template
-5. **Present findings** to user with key insights highlighted
-
-## Core Capabilities
-
-### 1. Comprehensive Data Analysis
-
-Execute full statistical analysis using the `eda_analyzer.py` script:
+## Standard Workflow
 
+1. Run statistical analysis:
 ```bash
-python scripts/eda_analyzer.py <data_file> -o <output_dir>
+python scripts/eda_analyzer.py <data_file> -o <output_dir>
 ```
 
-**What it provides**:
-- Auto-detection and loading of file formats
-- Basic dataset information (shape, types, memory usage)
-- Missing data analysis (patterns, percentages)
-- Summary statistics for numeric and categorical variables
-- Outlier detection using IQR and Z-score methods
-- Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
-- Correlation analysis (Pearson and Spearman)
-- Data quality assessment (completeness, duplicates, issues)
-- Automated insight generation
-
-**Output**: JSON file containing all analysis results at `<output_dir>/eda_analysis.json`
-
-### 2. Comprehensive Visualizations
-
-Generate complete visualization suite using the `visualizer.py` script:
-
+2. Generate visualizations:
 ```bash
-python scripts/visualizer.py <data_file> -o <output_dir>
+python scripts/visualizer.py <data_file> -o <output_dir>
 ```
 
-**Generated visualizations**:
-- **Missing data patterns**: Heatmap and bar chart showing missing data
-- **Distribution plots**: Histograms with KDE overlays for all numeric variables
-- **Box plots with violin plots**: Outlier detection visualizations
-- **Correlation heatmap**: Both Pearson and Spearman correlation matrices
-- **Scatter matrix**: Pairwise relationships between numeric variables
-- **Categorical analysis**: Bar charts for top categories
-- **Time series plots**: Temporal trends with trend lines (if datetime columns exist)
+3. Read analysis results from `<output_dir>/eda_analysis.json`
 
-**Output**: High-quality PNG files saved to `<output_dir>/eda_visualizations/`
+4. Create report using `assets/report_template.md` structure
 
-All visualizations are production-ready with:
-- 300 DPI resolution
-- Clear titles and labels
-- Statistical annotations
-- Professional styling using seaborn
+5. Present findings with key insights and visualizations
 
-### 3. Automated Insight Generation
+## Analysis Capabilities
 
-The analyzer automatically generates actionable insights including:
+### Statistical Analysis
 
-- **Data scale insights**: Dataset size considerations for processing
-- **Missing data alerts**: Warnings when missing data exceeds thresholds
-- **Correlation discoveries**: Strong relationships identified for feature engineering
-- **Outlier warnings**: Variables with high outlier rates flagged
-- **Distribution assessments**: Skewness issues requiring transformations
-- **Duplicate alerts**: Duplicate row detection
-- **Imbalance warnings**: Categorical variable imbalance detection
+Run `scripts/eda_analyzer.py` to generate comprehensive analysis:
 
-Access insights from the analysis results JSON under the `"insights"` key.
+```bash
+python scripts/eda_analyzer.py sales_data.csv -o ./output
+```
 
-### 4. 
Statistical Interpretation +Produces `output/eda_analysis.json` containing: +- Dataset shape, types, memory usage +- Missing data patterns and percentages +- Summary statistics (numeric and categorical) +- Outlier detection (IQR and Z-score methods) +- Distribution analysis with normality tests +- Correlation matrices (Pearson and Spearman) +- Data quality metrics (completeness, duplicates) +- Automated insights -For detailed interpretation of statistical tests and measures, reference: +### Visualizations -**`references/statistical_tests_guide.md`** - Comprehensive guide covering: +Run `scripts/visualizer.py` to generate plots: + +```bash +python scripts/visualizer.py sales_data.csv -o ./output +``` + +Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`: +- Missing data heatmaps and bar charts +- Distribution plots (histograms with KDE) +- Box plots and violin plots for outliers +- Correlation heatmaps +- Scatter matrices for numeric relationships +- Categorical bar charts +- Time series plots (if datetime columns detected) + +### Automated Insights + +Access generated insights from the `"insights"` key in the analysis JSON: +- Dataset size considerations +- Missing data warnings (when exceeding thresholds) +- Strong correlations for feature engineering +- High outlier rate flags +- Skewness requiring transformations +- Duplicate detection +- Categorical imbalance warnings + +## Reference Materials + +### Statistical Interpretation + +See `references/statistical_tests_guide.md` for detailed guidance on: - Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov) - Distribution characteristics (skewness, kurtosis) -- Correlation tests (Pearson, Spearman) -- Outlier detection methods (IQR, Z-score) -- Hypothesis testing guidelines -- Data transformation strategies +- Correlation methods (Pearson, Spearman) +- Outlier detection (IQR, Z-score) +- Hypothesis testing and data transformations -Load this reference when needing to interpret specific statistical tests or explain results to users. +Use when interpreting statistical results or explaining findings. -### 5. Best Practices Guidance +### Methodology -For methodological guidance, reference: +See `references/eda_best_practices.md` for comprehensive guidance on: +- 6-step EDA process framework +- Univariate, bivariate, multivariate analysis approaches +- Visualization and statistical analysis guidelines +- Common pitfalls and domain-specific considerations +- Communication strategies for different audiences -**`references/eda_best_practices.md`** - Detailed best practices including: -- EDA process framework (6-step methodology) -- Univariate, bivariate, and multivariate analysis approaches -- Visualization guidelines -- Statistical analysis guidelines -- Common pitfalls to avoid -- Domain-specific considerations -- Communication tips for technical and non-technical audiences +Use when planning analysis or handling specific scenarios. -Load this reference when planning analysis approach or needing guidance on specific EDA scenarios. +## Report Template -## Creating Analysis Reports - -Use the provided template to structure comprehensive EDA reports: - -**`assets/report_template.md`** - Professional report template with sections for: +Use `assets/report_template.md` to structure findings. 
Template includes: - Executive summary - Dataset overview - Data quality assessment - Univariate, bivariate, and multivariate analysis - Outlier analysis -- Key insights and findings -- Recommendations +- Key insights and recommendations - Limitations and appendices -**To use the template**: -1. Copy the template content -2. Fill in sections with analysis results from JSON output -3. Embed visualization images using markdown syntax -4. Populate insights and recommendations -5. Save as markdown for user consumption +Fill sections with analysis JSON results and embed visualizations using markdown image syntax. -## Typical Workflow Example +## Example: Complete Analysis -When user provides a data file: +User request: "Explore this sales_data.csv file" -``` -User: "Can you explore this sales_data.csv file and tell me what you find?" +```bash +# 1. Run analysis +python scripts/eda_analyzer.py sales_data.csv -o ./output -1. Run analysis: - python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output - -2. Generate visualizations: - python scripts/visualizer.py sales_data.csv -o ./analysis_output - -3. Read analysis results: - Read ./analysis_output/eda_analysis.json - -4. Create markdown report using template: - - Copy assets/report_template.md structure - - Fill in sections with analysis results - - Reference visualizations from ./analysis_output/eda_visualizations/ - - Include automated insights from JSON - -5. Present to user: - - Show key insights prominently - - Highlight data quality issues - - Provide visualizations inline - - Make actionable recommendations - - Save complete report as .md file +# 2. Generate visualizations +python scripts/visualizer.py sales_data.csv -o ./output ``` -## Advanced Analysis Scenarios +```python +# 3. Read results +import json +with open('./output/eda_analysis.json') as f: + results = json.load(f) -### Large Datasets (>1M rows) -- Run analysis on sampled data first for quick exploration -- Note sample size in report -- Recommend distributed computing for full analysis +# 4. 
Build report from assets/report_template.md +# - Fill sections with results +# - Embed images: ![Missing Data](./output/eda_visualizations/missing_data.png) +# - Include insights from results['insights'] +# - Add recommendations +``` -### High-Dimensional Data (>50 columns) -- Focus on most important variables first -- Consider PCA or feature selection -- Generate correlation analysis to identify variable groups -- Reference `eda_best_practices.md` section on high-dimensional data +## Special Cases -### Time Series Data -- Ensure datetime columns are properly detected -- Time series visualizations will be automatically generated -- Consider temporal patterns, trends, and seasonality -- Reference `eda_best_practices.md` section on time series +### Dataset Size Strategy -### Imbalanced Data -- Categorical analysis will flag imbalances -- Report class distributions prominently -- Recommend stratified sampling if needed +**If < 100 rows**: Note sample size limitations, use non-parametric methods -### Small Sample Sizes (<100 rows) -- Non-parametric methods automatically used where appropriate -- Be conservative in statistical conclusions -- Note sample size limitations in report +**If 100-1M rows**: Standard workflow applies -## Output Best Practices +**If > 1M rows**: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis -**Always output as markdown**: -- Structure findings using markdown headers, tables, and lists -- Embed visualizations using `![Description](path/to/image.png)` syntax -- Use tables for statistical summaries -- Include code blocks for any suggested transformations -- Highlight key insights with bold or bullet points +### Data Characteristics -**Ensure reports are actionable**: -- Provide clear recommendations based on findings -- Flag data quality issues that need attention -- Suggest next steps for modeling or further analysis -- Identify feature engineering opportunities +**High-dimensional (>50 columns)**: Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance. -**Make insights accessible**: -- Explain statistical concepts in plain language -- Use reference guides to provide detailed interpretations -- Include both technical details and executive summary +**Time series**: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns. + +**Imbalanced**: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed. 
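+
+For the >1M-row case under Dataset Size Strategy above, a minimal sampling sketch (assuming pandas is installed; the file name and 100,000-row cap are illustrative, not part of the bundled scripts):
+
+```python
+import pandas as pd
+
+# Draw a reproducible sample for quick exploration of very large files.
+df = pd.read_csv("large_data.csv")  # illustrative file name
+sample = df.sample(n=100_000, random_state=42) if len(df) > 100_000 else df
+sample.to_csv("large_data_sample.csv", index=False)
+
+# Run the analyzer on the sample and note the sample size in the report:
+#   python scripts/eda_analyzer.py large_data_sample.csv -o ./output
+```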
+ +## Output Guidelines + +**Format findings as markdown**: +- Use headers, tables, and lists for structure +- Embed visualizations: `![Description](path/to/image.png)` +- Include code blocks for suggested transformations +- Highlight key insights + +**Make reports actionable**: +- Provide clear recommendations +- Flag data quality issues requiring attention +- Suggest next steps (modeling, feature engineering, further analysis) - Tailor communication to user's technical level -## Handling Edge Cases +## Error Handling -**Unsupported file formats**: -- Request user to convert to supported format -- Suggest using pandas-compatible formats +**Unsupported formats**: Request conversion to supported format (CSV, Excel, JSON, Parquet) -**Files too large to load**: -- Recommend sampling approach -- Suggest chunked processing -- Consider alternative tools for big data +**Files too large**: Recommend sampling or chunked processing -**Corrupted or malformed data**: -- Report specific errors encountered -- Suggest data cleaning steps -- Try to salvage partial analysis if possible +**Corrupted data**: Report specific errors, suggest cleaning steps, attempt partial analysis -**All missing data in columns**: -- Flag completely empty columns -- Recommend removal or investigation -- Document in data quality section +**Empty columns**: Flag in data quality section, recommend removal or investigation -## Resources Summary +## Resources -### scripts/ -- **`eda_analyzer.py`**: Main analysis engine - comprehensive statistical analysis -- **`visualizer.py`**: Visualization generator - creates all chart types +**Scripts** (handle all formats automatically): +- `scripts/eda_analyzer.py` - Statistical analysis engine +- `scripts/visualizer.py` - Visualization generator -Both scripts are fully executable and handle multiple file formats automatically. +**References** (load as needed): +- `references/statistical_tests_guide.md` - Test interpretation and methodology +- `references/eda_best_practices.md` - EDA process and best practices -### references/ -- **`statistical_tests_guide.md`**: Statistical test interpretation and methodology -- **`eda_best_practices.md`**: Comprehensive EDA methodology and best practices +**Template**: +- `assets/report_template.md` - Professional report structure -Load these references as needed to inform analysis approach and interpretation. +## Key Points -### assets/ -- **`report_template.md`**: Professional markdown report template - -Use this template structure for creating consistent, comprehensive EDA reports. - -## Key Reminders - -1. **Always generate markdown output** for textual results -2. **Run both scripts** (analyzer and visualizer) for complete analysis -3. **Use the template** to structure comprehensive reports -4. **Include visualizations** by referencing generated PNG files -5. **Provide actionable insights** - don't just present statistics -6. **Interpret findings** using reference guides -7. **Document limitations** and data quality issues -8. **Make recommendations** for next steps - -This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication. 
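+
+For the "files too large" case under Error Handling, a minimal chunked-profiling sketch (assuming pandas; the path and chunk size are illustrative):
+
+```python
+import pandas as pd
+
+# Profile a file that is too large to load at once by streaming it in chunks.
+total_rows = 0
+missing = None
+for chunk in pd.read_csv("huge_data.csv", chunksize=500_000):  # illustrative path
+    total_rows += len(chunk)
+    counts = chunk.isna().sum()
+    missing = counts if missing is None else missing + counts
+
+print(f"rows: {total_rows}")
+print((missing / total_rows).sort_values(ascending=False).head(10))  # worst columns
+```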
+- Run both scripts for complete analysis +- Structure reports using the template +- Provide actionable insights, not just statistics +- Use reference guides for detailed interpretations +- Document data quality issues and limitations +- Make clear recommendations for next steps diff --git a/scientific-thinking/exploratory-data-analysis/assets/report_template.md b/scientific-thinking/exploratory-data-analysis/assets/report_template.md index 865316a..32fb0a9 100644 --- a/scientific-thinking/exploratory-data-analysis/assets/report_template.md +++ b/scientific-thinking/exploratory-data-analysis/assets/report_template.md @@ -1,14 +1,10 @@ -# Exploratory Data Analysis Report +# EDA Report: [Dataset Name] -**Dataset**: [Dataset Name] -**Analysis Date**: [Date] -**Analyst**: [Name] - ---- +**Date**: [Date] | **Analyst**: [Name] ## Executive Summary -[2-3 paragraph summary of key findings, major insights, and recommendations] +[Concise summary of key findings and recommendations] **Key Findings**: - [Finding 1] @@ -23,366 +19,197 @@ ## 1. Dataset Overview -### 1.1 Data Source -- **Source**: [Source name and location] -- **Collection Period**: [Date range] -- **Last Updated**: [Date] -- **Format**: [CSV, Excel, JSON, etc.] +**Source**: [Source name] | **Format**: [CSV/Excel/JSON/etc.] | **Period**: [Date range] -### 1.2 Data Structure -- **Observations (Rows)**: [Number] -- **Variables (Columns)**: [Number] -- **Memory Usage**: [Size in MB] +**Structure**: [Rows] observations × [Columns] variables | **Memory**: [Size] MB -### 1.3 Variable Types -- **Numeric Variables** ([Count]): [List column names] -- **Categorical Variables** ([Count]): [List column names] -- **Datetime Variables** ([Count]): [List column names] -- **Boolean Variables** ([Count]): [List column names] +**Variable Types**: +- Numeric ([Count]): [List names] +- Categorical ([Count]): [List names] +- Datetime ([Count]): [List names] +- Boolean ([Count]): [List names] --- -## 2. Data Quality Assessment +## 2. Data Quality -### 2.1 Completeness +**Completeness**: [Percentage]% | **Duplicates**: [Count] ([%]%) -**Overall Data Completeness**: [Percentage]% +**Missing Data**: +| Column | Missing % | Assessment | +|--------|-----------|------------| +| [Column 1] | [%] | [High/Medium/Low] | +| [Column 2] | [%] | [High/Medium/Low] | -**Missing Data Summary**: -| Column | Missing Count | Missing % | Assessment | -|--------|--------------|-----------|------------| -| [Column 1] | [Count] | [%] | [High/Medium/Low] | -| [Column 2] | [Count] | [%] | [High/Medium/Low] | +![Missing Data](path/to/missing_data.png) -**Missing Data Pattern**: [Description of patterns, if any] - -**Visualization**: ![Missing Data](path/to/missing_data.png) - -### 2.2 Duplicates - -- **Duplicate Rows**: [Count] ([Percentage]%) -- **Action Required**: [Yes/No - describe if needed] - -### 2.3 Data Quality Issues - -[List any identified issues] -- [ ] Issue 1: [Description] -- [ ] Issue 2: [Description] -- [ ] Issue 3: [Description] +**Quality Issues**: +- [Issue 1] +- [Issue 2] --- ## 3. 
Univariate Analysis -### 3.1 Numeric Variables +### Numeric: [Variable Name] -[For each key numeric variable:] +**Stats**: Mean: [Value] | Median: [Value] | Std: [Value] | Range: [[Min]-[Max]] -#### [Variable Name] +**Distribution**: Skewness: [Value] | Kurtosis: [Value] | Normality: [Yes/No] -**Summary Statistics**: -- **Mean**: [Value] -- **Median**: [Value] -- **Std Dev**: [Value] -- **Min**: [Value] -- **Max**: [Value] -- **Range**: [Value] -- **IQR**: [Value] +**Outliers**: IQR: [Count] ([%]%) | Z-score: [Count] ([%]%) -**Distribution Characteristics**: -- **Skewness**: [Value] - [Interpretation] -- **Kurtosis**: [Value] - [Interpretation] -- **Normality**: [Normal/Not Normal based on tests] +![Distribution](path/to/distribution.png) -**Outliers**: -- **IQR Method**: [Count] outliers ([Percentage]%) -- **Z-Score Method**: [Count] outliers ([Percentage]%) +**Insights**: [Key observations] -**Visualization**: ![Distribution of [Variable]](path/to/distribution.png) +### Categorical: [Variable Name] -**Insights**: -- [Key insight 1] -- [Key insight 2] +**Stats**: [Count] unique values | Most common: [Value] ([%]%) | Balance: [Balanced/Imbalanced] ---- - -### 3.2 Categorical Variables - -[For each key categorical variable:] - -#### [Variable Name] - -**Summary**: -- **Unique Values**: [Count] -- **Most Common**: [Value] ([Percentage]%) -- **Least Common**: [Value] ([Percentage]%) -- **Balance**: [Balanced/Imbalanced] - -**Top Categories**: -| Category | Count | Percentage | -|----------|-------|------------| +| Category | Count | % | +|----------|-------|---| | [Cat 1] | [Count] | [%] | | [Cat 2] | [Count] | [%] | -| [Cat 3] | [Count] | [%] | -**Visualization**: ![Distribution of [Variable]](path/to/categorical.png) +![Distribution](path/to/categorical.png) -**Insights**: -- [Key insight 1] -- [Key insight 2] +**Insights**: [Key observations] ---- +### Temporal: [Variable Name] -### 3.3 Temporal Variables +**Range**: [Start] to [End] ([Duration]) | **Trend**: [Increasing/Decreasing/Stable] | **Seasonality**: [Yes/No] -[If datetime columns exist:] +![Time Series](path/to/timeseries.png) -#### [Variable Name] - -**Time Range**: [Start Date] to [End Date] -**Duration**: [Time span] -**Temporal Coverage**: [Complete/Gaps identified] - -**Temporal Patterns**: -- **Trend**: [Increasing/Decreasing/Stable] -- **Seasonality**: [Yes/No - describe if present] -- **Gaps**: [List any gaps in timeline] - -**Visualization**: ![Time Series of [Variable]](path/to/timeseries.png) - -**Insights**: -- [Key insight 1] -- [Key insight 2] +**Insights**: [Key observations] --- ## 4. 
Bivariate Analysis -### 4.1 Correlation Analysis - -**Overall Correlation Structure**: -- **Strong Positive Correlations**: [Count] -- **Strong Negative Correlations**: [Count] -- **Weak/No Correlations**: [Count] - -**Correlation Matrix**: +**Correlation Summary**: [Count] strong positive | [Count] strong negative | [Count] weak/none ![Correlation Heatmap](path/to/correlation_heatmap.png) **Notable Correlations**: -| Variable 1 | Variable 2 | Pearson r | Spearman ρ | Strength | Interpretation | -|-----------|-----------|-----------|------------|----------|----------------| -| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] | [Interpretation] | -| [Var 1] | [Var 3] | [Value] | [Value] | [Strong/Moderate/Weak] | [Interpretation] | +| Var 1 | Var 2 | Pearson | Spearman | Strength | +|-------|-------|---------|----------|----------| +| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] | -**Insights**: -- [Key insight about correlations] -- [Potential multicollinearity issues] -- [Feature engineering opportunities] +**Insights**: [Multicollinearity issues, feature engineering opportunities] ---- +### Key Relationship: [Var 1] vs [Var 2] -### 4.2 Key Relationships +**Type**: [Linear/Non-linear/None] | **r**: [Value] | **p-value**: [Value] -[For important variable pairs:] +![Scatter Plot](path/to/scatter.png) -#### [Variable 1] vs [Variable 2] - -**Relationship Type**: [Linear/Non-linear/None] -**Correlation**: [Value] -**Statistical Test**: [Test name, p-value] - -**Visualization**: ![Scatter Plot](path/to/scatter.png) - -**Insights**: -- [Description of relationship] -- [Implications] +**Insights**: [Description and implications] --- ## 5. Multivariate Analysis -### 5.1 Scatter Matrix - ![Scatter Matrix](path/to/scatter_matrix.png) -**Observations**: -- [Pattern 1] -- [Pattern 2] -- [Pattern 3] +**Patterns**: [Key observations] -### 5.2 Clustering Patterns - -[If clustering analysis performed:] - -**Method**: [Method used] -**Number of Clusters**: [Count] - -**Cluster Characteristics**: -- **Cluster 1**: [Description] -- **Cluster 2**: [Description] - -**Visualization**: [Link to visualization] +**Clustering** (if performed): [Method] | [Count] clusters identified --- -## 6. Outlier Analysis +## 6. Outliers -### 6.1 Outlier Summary +**Overall Rate**: [%]% -**Overall Outlier Rate**: [Percentage]% +| Variable | Outlier % | Method | Action | +|----------|-----------|--------|--------| +| [Var 1] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] | +| [Var 2] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] | -**Variables with High Outlier Rates**: -| Variable | Outlier Count | Outlier % | Method | Action | -|----------|--------------|-----------|--------|--------| -| [Var 1] | [Count] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] | -| [Var 2] | [Count] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] | +![Box Plots](path/to/boxplots.png) -**Visualization**: ![Box Plots](path/to/boxplots.png) - -### 6.2 Outlier Investigation - -[For significant outliers:] - -#### [Variable Name] - -**Outlier Characteristics**: -- [Description of outliers] -- [Potential causes] -- [Validity assessment] - -**Recommendation**: [Keep/Remove/Transform/Investigate further] +**Investigation**: [Description of significant outliers, causes, validity] --- -## 7. Key Insights and Findings +## 7. Key Insights -### 7.1 Data Quality Insights +**Data Quality**: +- [Insight with implication] +- [Insight with implication] -1. **[Insight 1]**: [Description and implication] -2. 
**[Insight 2]**: [Description and implication] -3. **[Insight 3]**: [Description and implication] +**Statistical Patterns**: +- [Insight with implication] +- [Insight with implication] -### 7.2 Statistical Insights +**Domain/Research Insights**: +- [Insight with implication] +- [Insight with implication] -1. **[Insight 1]**: [Description and implication] -2. **[Insight 2]**: [Description and implication] -3. **[Insight 3]**: [Description and implication] - -### 7.3 Business/Research Insights - -1. **[Insight 1]**: [Description and implication] -2. **[Insight 2]**: [Description and implication] -3. **[Insight 3]**: [Description and implication] - -### 7.4 Unexpected Findings - -1. **[Finding 1]**: [Description and significance] -2. **[Finding 2]**: [Description and significance] +**Unexpected Findings**: +- [Finding and significance] --- ## 8. Recommendations -### 8.1 Data Quality Actions +**Data Quality Actions**: +- [ ] [Action - priority] +- [ ] [Action - priority] -- [ ] **[Action 1]**: [Description and priority] -- [ ] **[Action 2]**: [Description and priority] -- [ ] **[Action 3]**: [Description and priority] +**Next Steps**: +- [Step with rationale] +- [Step with rationale] -### 8.2 Analysis Next Steps +**Feature Engineering**: +- [Opportunity] +- [Opportunity] -1. **[Step 1]**: [Description and rationale] -2. **[Step 2]**: [Description and rationale] -3. **[Step 3]**: [Description and rationale] - -### 8.3 Feature Engineering Opportunities - -- **[Opportunity 1]**: [Description] -- **[Opportunity 2]**: [Description] -- **[Opportunity 3]**: [Description] - -### 8.4 Modeling Considerations - -- **[Consideration 1]**: [Description] -- **[Consideration 2]**: [Description] -- **[Consideration 3]**: [Description] +**Modeling Considerations**: +- [Consideration] +- [Consideration] --- -## 9. Limitations and Caveats +## 9. Limitations -### 9.1 Data Limitations +**Data**: [Key limitations] -- [Limitation 1] -- [Limitation 2] -- [Limitation 3] +**Analysis**: [Key limitations] -### 9.2 Analysis Limitations - -- [Limitation 1] -- [Limitation 2] -- [Limitation 3] - -### 9.3 Assumptions Made - -- [Assumption 1] -- [Assumption 2] -- [Assumption 3] +**Assumptions**: [Key assumptions made] --- -## 10. 
Appendices +## Appendices -### Appendix A: Technical Details +### A: Technical Details -**Software Environment**: -- Python: [Version] -- Key Libraries: pandas ([Version]), numpy ([Version]), scipy ([Version]), matplotlib ([Version]) +**Environment**: Python with pandas, numpy, scipy, matplotlib, seaborn -**Analysis Scripts**: [Link to repository or location] +**Scripts**: [Repository/location] -### Appendix B: Variable Dictionary +### B: Variable Dictionary -| Variable Name | Type | Description | Unit | Valid Range | Missing % | -|--------------|------|-------------|------|-------------|-----------| +| Variable | Type | Description | Unit | Range | Missing % | +|----------|------|-------------|------|-------|-----------| | [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] | -| [Var 2] | [Type] | [Description] | [Unit] | [Range] | [%] | -### Appendix C: Statistical Test Results +### C: Statistical Tests -[Detailed statistical test outputs] - -**Normality Tests**: +**Normality**: | Variable | Test | Statistic | p-value | Result | |----------|------|-----------|---------|--------| | [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] | -**Correlation Tests**: -| Var 1 | Var 2 | Coefficient | p-value | Significance | -|-------|-------|-------------|---------|--------------| +**Correlations**: +| Var 1 | Var 2 | r | p-value | Significant | +|-------|-------|---|---------|-------------| | [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] | -### Appendix D: Full Visualization Gallery +### D: Visualizations -[Links to all generated visualizations] - -1. [Visualization 1 description](path/to/viz1.png) -2. [Visualization 2 description](path/to/viz2.png) -3. [Visualization 3 description](path/to/viz3.png) - ---- - -## Contact Information - -**Analyst**: [Name] -**Email**: [Email] -**Date**: [Date] -**Version**: [Version number] - ---- - -**Document History**: -| Version | Date | Changes | Author | -|---------|------|---------|--------| -| 1.0 | [Date] | Initial analysis | [Name] | +1. [Description](path/to/viz1.png) +2. [Description](path/to/viz2.png) diff --git a/scientific-thinking/exploratory-data-analysis/references/eda_best_practices.md b/scientific-thinking/exploratory-data-analysis/references/eda_best_practices.md index 1699073..529e50d 100644 --- a/scientific-thinking/exploratory-data-analysis/references/eda_best_practices.md +++ b/scientific-thinking/exploratory-data-analysis/references/eda_best_practices.md @@ -1,379 +1,125 @@ -# Exploratory Data Analysis Best Practices +# EDA Best Practices -This guide provides best practices and methodologies for conducting thorough exploratory data analysis. +Methodologies for conducting thorough exploratory data analysis. -## EDA Process Framework +## 6-Step EDA Framework -### 1. Initial Data Understanding +### 1. Initial Understanding -**Objectives**: -- Understand data structure and format -- Identify data types and schema -- Get familiar with domain context - -**Key Questions**: +**Questions**: - What does each column represent? -- What is the unit of observation? -- What is the time period covered? +- What is the unit of observation and time period? - What is the data collection methodology? -- Are there any known data quality issues? +- Are there known quality issues? -**Actions**: -- Load and inspect first/last rows -- Check data dimensions (rows × columns) -- Review column names and types -- Document data source and context +**Actions**: Load data, inspect structure, review types, document context -### 2. 
Data Quality Assessment +### 2. Quality Assessment -**Objectives**: -- Identify data quality issues -- Assess data completeness and reliability -- Document data limitations - -**Key Checks**: -- **Missing data**: Patterns, extent, randomness -- **Duplicates**: Exact and near-duplicates -- **Outliers**: Valid extremes vs. data errors -- **Consistency**: Cross-field validation -- **Accuracy**: Domain knowledge validation +**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy **Red Flags**: -- High missing data rate (>20%) +- Missing data >20% - Unexpected duplicates -- Constant or near-constant columns -- Impossible values (negative ages, dates in future) -- High cardinality in ID-like columns +- Constant columns +- Impossible values (negative ages, future dates) - Suspicious patterns (too many round numbers) ### 3. Univariate Analysis -**Objectives**: -- Understand individual variable distributions -- Identify anomalies and patterns -- Determine variable characteristics +**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers -**For Numeric Variables**: -- Central tendency (mean, median, mode) -- Dispersion (range, variance, std, IQR) -- Shape (skewness, kurtosis) -- Distribution visualization (histogram, KDE, box plot) -- Outlier detection +**Categorical**: Frequency distributions, unique counts, balance, bar charts -**For Categorical Variables**: -- Frequency distributions -- Unique value counts -- Most/least common categories -- Category balance/imbalance -- Bar charts and count plots - -**For Temporal Variables**: -- Time range coverage -- Gaps in timeline -- Temporal patterns (trends, seasonality) -- Time series plots +**Temporal**: Time range, gaps, trends, seasonality, time series plots ### 4. Bivariate Analysis -**Objectives**: -- Understand relationships between variables -- Identify correlations and dependencies -- Find potential predictors +**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity -**Numeric vs Numeric**: -- Scatter plots -- Correlation coefficients (Pearson, Spearman) -- Line of best fit -- Detect non-linear relationships +**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA -**Numeric vs Categorical**: -- Group statistics (mean, median by category) -- Box plots by category -- Distribution plots by category -- Statistical tests (t-test, ANOVA) - -**Categorical vs Categorical**: -- Cross-tabulation / contingency tables -- Stacked bar charts -- Chi-square tests -- Cramér's V for association strength +**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V ### 5. Multivariate Analysis -**Objectives**: -- Understand complex interactions -- Identify patterns across multiple variables -- Explore dimensionality +**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering -**Techniques**: -- Correlation matrices and heatmaps -- Pair plots / scatter matrices -- Parallel coordinates plots -- Principal Component Analysis (PCA) -- Clustering analysis - -**Key Questions**: -- Are there groups of correlated features? -- Can we reduce dimensionality? -- Are there natural clusters? -- Do patterns change when conditioning on other variables? +**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns? ### 6. 
Insight Generation -**Objectives**: -- Synthesize findings into actionable insights -- Formulate hypotheses -- Identify next steps +**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications -**What to Look For**: -- Unexpected patterns or anomalies -- Strong relationships or correlations -- Data quality issues requiring attention -- Feature engineering opportunities -- Business or research implications +## Visualization Guidelines -## Best Practices +**Chart Selection**: +- Distribution: Histogram, KDE, box/violin plots +- Relationships: Scatter, line, heatmap +- Composition: Stacked bar +- Comparison: Bar, grouped bar -### Visualization Guidelines +**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter -1. **Choose appropriate chart types**: - - Distribution: Histogram, KDE, box plot, violin plot - - Relationships: Scatter plot, line plot, heatmap - - Composition: Stacked bar, pie chart (use sparingly) - - Comparison: Bar chart, grouped bar chart +## Statistical Analysis Guidelines -2. **Make visualizations clear and informative**: - - Always label axes with units - - Add descriptive titles - - Use color purposefully - - Include legends when needed - - Choose appropriate scales - - Avoid chart junk +**Check Assumptions**: Normality, homoscedasticity, independence, linearity -3. **Use multiple views**: - - Show data from different angles - - Combine complementary visualizations - - Use small multiples for faceting +**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes -### Statistical Analysis Guidelines +**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation -1. **Check assumptions**: - - Test for normality before parametric tests - - Check for homoscedasticity - - Verify independence of observations - - Assess linearity for linear models +## Documentation Guidelines -2. **Use appropriate methods**: - - Parametric tests when assumptions met - - Non-parametric alternatives when violated - - Robust methods for outlier-prone data - - Effect sizes alongside p-values +**Notes**: Document assumptions, decisions, issues, findings -3. **Consider context**: - - Statistical significance ≠ practical significance - - Domain knowledge trumps statistical patterns - - Correlation ≠ causation - - Sample size affects what you can detect +**Reproducibility**: Use scripts, version control, document sources, set random seeds -### Documentation Guidelines +**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations -1. **Keep detailed notes**: - - Document assumptions and decisions - - Record data issues discovered - - Note interesting findings - - Track questions that arise +## Common Pitfalls -2. **Create reproducible analysis**: - - Use scripts, not manual Excel operations - - Version control your code - - Document data sources and versions - - Include random seeds for reproducibility - -3. **Summarize findings**: - - Write clear summaries - - Use visualizations to support points - - Highlight key insights - - Provide recommendations - -## Common Pitfalls to Avoid - -### 1. Confirmation Bias -- **Problem**: Looking only for evidence supporting preconceptions -- **Solution**: Actively seek disconfirming evidence, use blind analysis - -### 2. 
Ignoring Data Quality -- **Problem**: Proceeding with analysis despite known data issues -- **Solution**: Address quality issues first, document limitations - -### 3. Over-reliance on Automation -- **Problem**: Running analyses without understanding or verifying results -- **Solution**: Manually inspect subsets, verify automated findings - -### 4. Neglecting Outliers -- **Problem**: Removing outliers without investigation -- **Solution**: Always investigate outliers - they may contain important information - -### 5. Multiple Testing Without Correction -- **Problem**: Running many tests increases false positive rate -- **Solution**: Use correction methods (Bonferroni, FDR) or be explicit about exploratory nature - -### 6. Mistaking Association for Causation -- **Problem**: Inferring causation from correlation -- **Solution**: Use careful language, acknowledge alternative explanations - -### 7. Cherry-picking Results -- **Problem**: Reporting only interesting/significant findings -- **Solution**: Report complete analysis, including negative results - -### 8. Ignoring Sample Size -- **Problem**: Not considering how sample size affects conclusions -- **Solution**: Report effect sizes, confidence intervals, and sample sizes +1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis +2. **Ignoring Quality**: Address issues first, document limitations +3. **Over-automation**: Manually inspect subsets, verify results +4. **Neglecting Outliers**: Investigate before removing - may be informative +5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature +6. **Association ≠ Causation**: Use careful language, acknowledge alternatives +7. **Cherry-picking**: Report complete analysis, including negative results +8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes ## Domain-Specific Considerations -### Time Series Data -- Check for stationarity -- Identify trends and seasonality -- Look for autocorrelation -- Handle missing time points -- Consider temporal splits for validation +**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits -### High-Dimensional Data -- Start with dimensionality reduction -- Focus on feature importance -- Be cautious of curse of dimensionality -- Use regularization in modeling -- Consider domain knowledge for feature selection +**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection -### Imbalanced Data -- Report class distributions -- Use appropriate metrics (not just accuracy) -- Consider resampling techniques -- Stratify sampling and cross-validation -- Be aware of biases in learning +**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV -### Small Sample Sizes -- Use non-parametric methods -- Be conservative with conclusions -- Report confidence intervals -- Consider Bayesian approaches -- Acknowledge limitations +**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches -### Big Data -- Sample intelligently for exploration -- Use efficient data structures -- Leverage parallel/distributed computing -- Be aware computational complexity -- Consider scalability in methods +**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability ## Iterative Process -EDA is not linear - iterate and refine: +EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis -1. 
**Initial exploration** → Identify questions -2. **Focused analysis** → Answer specific questions -3. **New insights** → Generate new questions -4. **Deeper investigation** → Refine understanding -5. **Synthesis** → Integrate findings +**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights -### When to Stop +**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations -You've done enough EDA when: -- ✅ You understand the data structure and quality -- ✅ You've characterized key variables -- ✅ You've identified important relationships -- ✅ You've documented limitations -- ✅ You can answer your research questions -- ✅ You have actionable insights +## Communication -### Moving Forward +**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code -After EDA, you should have: -- Clear understanding of data -- List of quality issues and how to handle them -- Insights about relationships and patterns -- Hypotheses to test -- Ideas for feature engineering -- Recommendations for next steps +**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations -## Communication Tips +**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix -### For Technical Audiences -- Include methodological details -- Show statistical test results -- Discuss assumptions and limitations -- Provide reproducible code -- Reference relevant literature +## Checklists -### For Non-Technical Audiences -- Focus on insights, not methods -- Use clear visualizations -- Avoid jargon -- Provide context and implications -- Make recommendations concrete +**Before**: Understand context, define objectives, identify audience, set up environment -### Report Structure -1. **Executive Summary**: Key findings and recommendations -2. **Data Overview**: Source, structure, limitations -3. **Analysis**: Findings organized by theme -4. **Insights**: Patterns, anomalies, implications -5. **Recommendations**: Next steps and actions -6. 
**Appendix**: Technical details, full statistics +**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously -## Useful Checklists - -### Before Starting -- [ ] Understand business/research context -- [ ] Define analysis objectives -- [ ] Identify stakeholders and audience -- [ ] Secure necessary permissions -- [ ] Set up reproducible environment - -### During Analysis -- [ ] Load and inspect data structure -- [ ] Assess data quality -- [ ] Analyze univariate distributions -- [ ] Explore bivariate relationships -- [ ] Investigate multivariate patterns -- [ ] Generate and validate insights -- [ ] Document findings continuously - -### Before Concluding -- [ ] Verify all findings -- [ ] Check for alternative explanations -- [ ] Document limitations -- [ ] Prepare clear visualizations -- [ ] Write actionable recommendations -- [ ] Review with domain experts -- [ ] Ensure reproducibility - -## Tools and Libraries - -### Python Ecosystem -- **pandas**: Data manipulation -- **numpy**: Numerical operations -- **matplotlib/seaborn**: Visualization -- **scipy**: Statistical tests -- **scikit-learn**: ML preprocessing -- **plotly**: Interactive visualizations - -### Best Tool Practices -- Use appropriate tool for task -- Leverage vectorization -- Chain operations efficiently -- Handle missing data properly -- Validate results independently -- Document custom functions - -## Further Resources - -- **Books**: - - "Exploratory Data Analysis" by John Tukey - - "The Art of Statistics" by David Spiegelhalter -- **Guidelines**: - - ASA Statistical Significance Statement - - FAIR data principles -- **Communities**: - - Cross Validated (Stack Exchange) - - /r/datascience - - Local data science meetups +**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility diff --git a/scientific-thinking/exploratory-data-analysis/references/statistical_tests_guide.md b/scientific-thinking/exploratory-data-analysis/references/statistical_tests_guide.md index 9aa0992..557c465 100644 --- a/scientific-thinking/exploratory-data-analysis/references/statistical_tests_guide.md +++ b/scientific-thinking/exploratory-data-analysis/references/statistical_tests_guide.md @@ -1,252 +1,126 @@ -# Statistical Tests Guide for EDA +# Statistical Tests Guide -This guide provides interpretation guidelines for statistical tests commonly used in exploratory data analysis. +Interpretation guidelines for common EDA statistical tests. 
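+
+The measures covered below can be computed directly with scipy; a minimal sketch (assuming numpy and scipy are installed, with `values` standing in for any numeric column):
+
+```python
+import numpy as np
+from scipy import stats
+
+rng = np.random.default_rng(42)
+values = rng.lognormal(size=500)  # stand-in for a numeric column
+
+w, p = stats.shapiro(values)  # Shapiro-Wilk normality test
+print(f"Shapiro-Wilk: W={w:.3f}, p={p:.4f}")
+print(f"skewness={stats.skew(values):.2f}, excess kurtosis={stats.kurtosis(values):.2f}")
+
+r_p, p_p = stats.pearsonr(values, np.log(values))   # linear association
+r_s, p_s = stats.spearmanr(values, np.log(values))  # monotonic association
+print(f"Pearson r={r_p:.2f} (p={p_p:.3g}), Spearman rho={r_s:.2f} (p={p_s:.3g})")
+```
+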
## Normality Tests -### Shapiro-Wilk Test +### Shapiro-Wilk -**Purpose**: Test if a sample comes from a normally distributed population +**Use**: Small to medium samples (n < 5000) -**When to use**: Best for small to medium sample sizes (n < 5000) +**H0**: Data is normal | **H1**: Data is not normal -**Interpretation**: -- **Null Hypothesis (H0)**: The data follows a normal distribution -- **Alternative Hypothesis (H1)**: The data does not follow a normal distribution -- **p-value > 0.05**: Fail to reject H0 → Data is likely normally distributed -- **p-value ≤ 0.05**: Reject H0 → Data is not normally distributed +**Interpretation**: p > 0.05 → likely normal | p ≤ 0.05 → not normal -**Notes**: -- Very sensitive to sample size -- Small deviations from normality may be detected as significant in large samples -- Consider practical significance alongside statistical significance +**Note**: Very sensitive to sample size; small deviations may be significant in large samples -### Anderson-Darling Test +### Anderson-Darling -**Purpose**: Test if a sample comes from a specific distribution (typically normal) +**Use**: More powerful than Shapiro-Wilk, emphasizes tails -**When to use**: More powerful than Shapiro-Wilk for detecting departures from normality +**Interpretation**: Test statistic > critical value → reject normality -**Interpretation**: -- Compares test statistic against critical values at different significance levels -- If test statistic > critical value at given significance level, reject normality -- More weight given to tails of distribution than other tests +### Kolmogorov-Smirnov -### Kolmogorov-Smirnov Test +**Use**: Large samples or testing against non-normal distributions -**Purpose**: Test if a sample comes from a reference distribution - -**When to use**: When you have a large sample or want to test against distributions other than normal - -**Interpretation**: -- **p-value > 0.05**: Sample distribution matches reference distribution -- **p-value ≤ 0.05**: Sample distribution differs from reference distribution +**Interpretation**: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference ## Distribution Characteristics ### Skewness -**Purpose**: Measure asymmetry of the distribution +**Measures asymmetry**: +- ≈ 0: Symmetric +- \> 0: Right-skewed (tail right) +- < 0: Left-skewed (tail left) -**Interpretation**: -- **Skewness ≈ 0**: Symmetric distribution -- **Skewness > 0**: Right-skewed (tail extends to right, most values on left) -- **Skewness < 0**: Left-skewed (tail extends to left, most values on right) +**Magnitude**: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high) -**Magnitude interpretation**: -- **|Skewness| < 0.5**: Approximately symmetric -- **0.5 ≤ |Skewness| < 1**: Moderately skewed -- **|Skewness| ≥ 1**: Highly skewed - -**Implications**: -- Highly skewed data may require transformation (log, sqrt, Box-Cox) -- Mean is pulled toward tail; median more robust for skewed data -- Many statistical tests assume symmetry/normality +**Action**: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean ### Kurtosis -**Purpose**: Measure tailedness and peak of distribution +**Measures tailedness** (excess kurtosis, normal = 0): +- ≈ 0: Normal tails +- \> 0: Heavy tails, more outliers +- < 0: Light tails, fewer outliers -**Interpretation** (Excess Kurtosis, where normal distribution = 0): -- **Kurtosis ≈ 0**: Normal tail behavior (mesokurtic) -- **Kurtosis > 0**: Heavy tails, sharp peak (leptokurtic) - - More outliers than normal 
distribution - - Higher probability of extreme values -- **Kurtosis < 0**: Light tails, flat peak (platykurtic) - - Fewer outliers than normal distribution - - More uniform distribution +**Magnitude**: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different) -**Magnitude interpretation**: -- **|Kurtosis| < 0.5**: Normal-like tails -- **0.5 ≤ |Kurtosis| < 1**: Moderately different tails -- **|Kurtosis| ≥ 1**: Very different tail behavior from normal +**Action**: High kurtosis → investigate outliers carefully -**Implications**: -- High kurtosis → Be cautious with outliers -- Low kurtosis → Distribution lacks distinct peak +## Correlation -## Correlation Tests +### Pearson -### Pearson Correlation +**Measures**: Linear relationship (-1 to +1) -**Purpose**: Measure linear relationship between two continuous variables +**Strength**: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong) -**Range**: -1 to +1 +**Assumptions**: Linear, continuous, normal, no outliers, homoscedastic -**Interpretation**: -- **r = +1**: Perfect positive linear relationship -- **r = 0**: No linear relationship -- **r = -1**: Perfect negative linear relationship +**Use**: Expected linear relationship, assumptions met -**Strength guidelines**: -- **|r| < 0.3**: Weak correlation -- **0.3 ≤ |r| < 0.5**: Moderate correlation -- **0.5 ≤ |r| < 0.7**: Strong correlation -- **|r| ≥ 0.7**: Very strong correlation +### Spearman -**Assumptions**: -- Linear relationship between variables -- Both variables continuous and normally distributed -- No significant outliers -- Homoscedasticity (constant variance) +**Measures**: Monotonic relationship (-1 to +1), rank-based -**When to use**: When relationship is expected to be linear and data meets assumptions +**Advantages**: Robust to outliers, no linearity assumption, works with ordinal, no normality required -### Spearman Correlation +**Use**: Outliers present, non-linear monotonic relationship, ordinal data, non-normal -**Purpose**: Measure monotonic relationship between two variables (rank-based) +## Outlier Detection -**Range**: -1 to +1 +### IQR Method -**Interpretation**: Same as Pearson, but measures monotonic (not just linear) relationships +**Bounds**: Q1 - 1.5×IQR to Q3 + 1.5×IQR -**Advantages over Pearson**: -- Robust to outliers (uses ranks) -- Doesn't assume linear relationship -- Works with ordinal data -- Doesn't require normality assumption +**Characteristics**: Simple, robust, works with skewed data -**When to use**: -- Data has outliers -- Relationship is monotonic but not linear -- Data is ordinal -- Distribution is non-normal - -## Outlier Detection Methods - -### IQR Method (Interquartile Range) - -**Definition**: -- Lower bound: Q1 - 1.5 × IQR -- Upper bound: Q3 + 1.5 × IQR -- Values outside these bounds are outliers - -**Characteristics**: -- Simple and interpretable -- Robust to extreme values -- Works well for skewed distributions -- Conservative approach (Tukey's fences) - -**Interpretation**: -- **< 5% outliers**: Typical for most datasets -- **5-10% outliers**: Moderate, investigate causes -- **> 10% outliers**: High rate, may indicate data quality issues or interesting phenomena +**Typical Rates**: < 5% (normal) | 5-10% (moderate) | > 10% (high, investigate) ### Z-Score Method -**Definition**: Outliers are data points with |z-score| > 3 +**Definition**: |z| > 3 where z = (x - μ) / σ -**Formula**: z = (x - μ) / σ +**Use**: Normal data, n > 30 -**Characteristics**: -- Assumes normal distribution -- Sensitive to extreme values -- 
Standard threshold is |z| > 3 (99.7% of data within ±3σ) +**Avoid**: Small samples, skewed data, many outliers (contaminates mean/SD) -**When to use**: -- Data is approximately normally distributed -- Large sample sizes (n > 30) +## Hypothesis Testing -**When NOT to use**: -- Small samples -- Heavily skewed data -- Data with many outliers (contaminates mean and SD) +**Significance Levels**: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal) -## Hypothesis Testing Guidelines +**p-value Interpretation**: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none) -### Significance Levels +**Key Considerations**: +- Statistical ≠ practical significance +- Multiple testing → use correction (Bonferroni, FDR) +- Large samples detect trivial effects +- Always report effect sizes with p-values -- **α = 0.05**: Standard significance level (5% chance of Type I error) -- **α = 0.01**: More conservative (1% chance of Type I error) -- **α = 0.10**: More liberal (10% chance of Type I error) +## Transformations -### p-value Interpretation +**Right-skewed**: Log, sqrt, Box-Cox -- **p ≤ 0.001**: Very strong evidence against H0 (***) -- **0.001 < p ≤ 0.01**: Strong evidence against H0 (**) -- **0.01 < p ≤ 0.05**: Moderate evidence against H0 (*) -- **0.05 < p ≤ 0.10**: Weak evidence against H0 -- **p > 0.10**: Little to no evidence against H0 +**Left-skewed**: Square, cube, exponential -### Important Considerations +**Heavy tails**: Robust scaling, winsorization, log -1. **Statistical vs Practical Significance**: A small p-value doesn't always mean the effect is important -2. **Multiple Testing**: When performing many tests, use correction methods (Bonferroni, FDR) -3. **Sample Size**: Large samples can detect trivial effects as significant -4. **Effect Size**: Always report and interpret effect sizes alongside p-values +**Non-constant variance**: Log, Box-Cox -## Data Transformation Strategies - -### When to Transform - -- **Right-skewed data**: Log, square root, or Box-Cox transformation -- **Left-skewed data**: Square, cube, or exponential transformation -- **Heavy tails/outliers**: Robust scaling, winsorization, or log transformation -- **Non-constant variance**: Log or Box-Cox transformation - -### Common Transformations - -1. **Log transformation**: log(x) or log(x + 1) - - Best for: Positive skewed data, multiplicative relationships - - Cannot use with zero or negative values - -2. **Square root transformation**: √x - - Best for: Count data, moderate positive skew - - Less aggressive than log - -3. **Box-Cox transformation**: (x^λ - 1) / λ - - Best for: Automatically finds optimal transformation - - Requires positive values - -4. **Standardization**: (x - μ) / σ - - Best for: Scaling features to same range - - Centers data at 0 with unit variance - -5. 
**Min-Max scaling**: (x - min) / (max - min) - - Best for: Scaling to [0, 1] range - - Preserves zero values +**Common Methods**: +- **Log**: log(x+1) for positive skew, multiplicative relationships +- **Sqrt**: Count data, moderate skew +- **Box-Cox**: Auto-finds optimal (requires positive values) +- **Standardization**: (x-μ)/σ for scaling to unit variance +- **Min-Max**: (x-min)/(max-min) for [0,1] scaling ## Practical Guidelines -### Sample Size Considerations +**Sample Size**: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects) -- **n < 30**: Use non-parametric tests, be cautious with assumptions -- **30 ≤ n < 100**: Moderate sample, parametric tests usually acceptable -- **n ≥ 100**: Large sample, parametric tests robust to violations -- **n ≥ 1000**: Very large sample, may detect trivial effects as significant +**Missing Data**: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods) -### Dealing with Missing Data - -- **< 5% missing**: Usually not a problem, simple methods OK -- **5-10% missing**: Use appropriate imputation methods -- **> 10% missing**: Investigate patterns, consider advanced imputation or modeling missingness - -### Reporting Results - -Always include: -1. Test statistic value -2. p-value -3. Confidence interval (when applicable) -4. Effect size -5. Sample size -6. Assumptions checked and violations noted +**Reporting**: Include test statistic, p-value, CI, effect size, n, assumption checks
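+
+The outlier rules and transformations above map directly onto a few lines of pandas/scipy; a minimal sketch (the series values are illustrative):
+
+```python
+import pandas as pd
+from scipy import stats
+
+s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 60], name="amount")  # illustrative data
+
+# IQR fences: Q1 - 1.5*IQR to Q3 + 1.5*IQR
+q1, q3 = s.quantile([0.25, 0.75])
+iqr = q3 - q1
+iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
+
+# Z-score rule: flag |z| > 3 (assumes roughly normal data, n > 30)
+z = (s - s.mean()) / s.std()
+z_outliers = s[z.abs() > 3]
+
+# Box-Cox transform for right-skewed, strictly positive data
+transformed, lmbda = stats.boxcox(s)
+print(len(iqr_outliers), len(z_outliers), round(lmbda, 2))
+```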