Improve the EDA skill

Timothy Kassis
2025-11-04 17:25:06 -08:00
parent 1225ddecf1
commit ffad3d81b0
4 changed files with 362 additions and 988 deletions

View File

@@ -1,275 +1,202 @@
---
name: exploratory-data-analysis
description: "EDA toolkit. Analyze CSV/Excel/JSON/Parquet files, statistical summaries, distributions, correlations, outliers, missing data, visualizations, markdown reports, for data profiling and insights."
description: "Analyze datasets to discover patterns, anomalies, and relationships. Use when exploring data files, generating statistical summaries, checking data quality, or creating visualizations. Supports CSV, Excel, JSON, Parquet, and more."
---
# Exploratory Data Analysis
## Overview
EDA is a process for discovering patterns, anomalies, and relationships in data. Analyze CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outliers, and visualizations. All outputs are markdown-formatted for integration into workflows.
**Supported formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
## Standard Workflow
1. Run statistical analysis:
```bash
python scripts/eda_analyzer.py <data_file> -o <output_dir>
```
2. Generate visualizations:
```bash
python scripts/visualizer.py <data_file> -o <output_dir>
```
3. Read analysis results from `<output_dir>/eda_analysis.json`
4. Create report using `assets/report_template.md` structure
5. Present findings with key insights and visualizations
## Analysis Capabilities
### Statistical Analysis
Run `scripts/eda_analyzer.py` to generate comprehensive analysis:
```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```
Produces `output/eda_analysis.json` containing:
- Dataset shape, types, memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights
### Visualizations
Run `scripts/visualizer.py` to generate plots:
```bash
python scripts/visualizer.py sales_data.csv -o ./output
```
Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:
- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns detected)
### Automated Insights
Access generated insights from the `"insights"` key in the analysis JSON:
- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
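For example, a minimal sketch of pulling these insights into a markdown report section (this assumes `"insights"` is a flat list of strings; verify against the actual JSON produced by `eda_analyzer.py`):
```python
import json
with open("./output/eda_analysis.json") as f:
    analysis = json.load(f)
# Format each automated insight as a markdown bullet for the report
bullets = [f"- {insight}" for insight in analysis.get("insights", [])]
print("## Key Insights\n" + "\n".join(bullets))
```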
## Reference Materials
### Statistical Interpretation
See `references/statistical_tests_guide.md` for detailed guidance on:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations
Use when interpreting statistical results or explaining findings.
### Methodology
See `references/eda_best_practices.md` for comprehensive guidance on:
- 6-step EDA process framework
- Univariate, bivariate, multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences
Use when planning analysis or handling specific scenarios.
## Creating Analysis Reports
Use `assets/report_template.md` to structure findings. Template includes:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices
Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
## Example: Complete Analysis
User request: "Explore this sales_data.csv file"
```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output
# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```
```python
# 3. Read results
import json
with open('./output/eda_analysis.json') as f:
results = json.load(f)
# 4. Build report from assets/report_template.md
# - Fill sections with results
# - Embed images: ![Missing Data](./output/eda_visualizations/missing_data.png)
# - Include insights from results['insights']
# - Add recommendations
```
## Special Cases
### Dataset Size Strategy
**If < 100 rows**: Note sample size limitations, use non-parametric methods
**If 100-1M rows**: Standard workflow applies
**If > 1M rows**: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
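A minimal sampling sketch for the >1M-row case (the file name, sample size, and seed are placeholders; adjust to the dataset):
```python
import pandas as pd
df = pd.read_csv("large_dataset.csv")              # placeholder path; use chunked reads if it cannot fit in memory
sample = df.sample(n=100_000, random_state=42)     # fixed seed so the sample is reproducible
sample.to_csv("large_dataset_sample.csv", index=False)
# Then run: python scripts/eda_analyzer.py large_dataset_sample.csv -o ./output
```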
### Data Characteristics
**High-dimensional (>50 columns)**: Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance.
**Time series**: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.
**Imbalanced**: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.
## Output Guidelines
**Format findings as markdown**:
- Use headers, tables, and lists for structure
- Embed visualizations: `![Description](path/to/image.png)`
- Include code blocks for suggested transformations
- Highlight key insights
**Make reports actionable**:
- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to user's technical level
## Error Handling
**Unsupported formats**: Request conversion to supported format (CSV, Excel, JSON, Parquet)
**Files too large**: Recommend sampling or chunked processing
**Corrupted data**: Report specific errors, suggest cleaning steps, attempt partial analysis
**Empty columns**: Flag in data quality section, recommend removal or investigation
## Resources
**Scripts** (handle all formats automatically):
- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator
**References** (load as needed):
- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices
**Template**:
- `assets/report_template.md` - Professional report structure
## Key Points
- Run both scripts for complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps

View File

@@ -1,14 +1,10 @@
# EDA Report: [Dataset Name]
**Date**: [Date] | **Analyst**: [Name]
---
## Executive Summary
[Concise summary of key findings and recommendations]
**Key Findings**:
- [Finding 1]
@@ -23,366 +19,197 @@
## 1. Dataset Overview
**Source**: [Source name] | **Format**: [CSV/Excel/JSON/etc.] | **Period**: [Date range]
**Structure**: [Rows] observations × [Columns] variables | **Memory**: [Size] MB
**Variable Types**:
- Numeric ([Count]): [List names]
- Categorical ([Count]): [List names]
- Datetime ([Count]): [List names]
- Boolean ([Count]): [List names]
---
## 2. Data Quality
**Completeness**: [Percentage]% | **Duplicates**: [Count] ([%]%)
**Missing Data**:
| Column | Missing % | Assessment |
|--------|-----------|------------|
| [Column 1] | [%] | [High/Medium/Low] |
| [Column 2] | [%] | [High/Medium/Low] |
![Missing Data](path/to/missing_data.png)
**Quality Issues**:
- [Issue 1]
- [Issue 2]
---
## 3. Univariate Analysis
### Numeric: [Variable Name]
**Stats**: Mean: [Value] | Median: [Value] | Std: [Value] | Range: [[Min]-[Max]]
**Distribution**: Skewness: [Value] | Kurtosis: [Value] | Normality: [Yes/No]
**Outliers**: IQR: [Count] ([%]%) | Z-score: [Count] ([%]%)
![Distribution](path/to/distribution.png)
**Insights**: [Key observations]
### Categorical: [Variable Name]
**Stats**: [Count] unique values | Most common: [Value] ([%]%) | Balance: [Balanced/Imbalanced]
**Top Categories**:
| Category | Count | % |
|----------|-------|---|
| [Cat 1] | [Count] | [%] |
| [Cat 2] | [Count] | [%] |
| [Cat 3] | [Count] | [%] |
![Distribution](path/to/categorical.png)
**Insights**: [Key observations]
### Temporal: [Variable Name]
**Range**: [Start] to [End] ([Duration]) | **Trend**: [Increasing/Decreasing/Stable] | **Seasonality**: [Yes/No]
![Time Series](path/to/timeseries.png)
**Insights**: [Key observations]
---
## 4. Bivariate Analysis
**Correlation Summary**: [Count] strong positive | [Count] strong negative | [Count] weak/none
![Correlation Heatmap](path/to/correlation_heatmap.png)
**Notable Correlations**:
| Var 1 | Var 2 | Pearson | Spearman | Strength |
|-------|-------|---------|----------|----------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] |
**Insights**: [Multicollinearity issues, feature engineering opportunities]
### Key Relationship: [Var 1] vs [Var 2]
**Type**: [Linear/Non-linear/None] | **r**: [Value] | **p-value**: [Value]
![Scatter Plot](path/to/scatter.png)
**Insights**: [Description and implications]
---
## 5. Multivariate Analysis
![Scatter Matrix](path/to/scatter_matrix.png)
**Patterns**: [Key observations]
**Clustering** (if performed): [Method] | [Count] clusters identified
---
## 6. Outliers
**Overall Rate**: [%]%
| Variable | Outlier % | Method | Action |
|----------|-----------|--------|--------|
| [Var 1] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
| [Var 2] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
![Box Plots](path/to/boxplots.png)
**Investigation**: [Description of significant outliers, causes, validity]
---
## 7. Key Insights
**Data Quality**:
- [Insight with implication]
- [Insight with implication]
**Statistical Patterns**:
- [Insight with implication]
- [Insight with implication]
**Domain/Research Insights**:
- [Insight with implication]
- [Insight with implication]
**Unexpected Findings**:
- [Finding and significance]
---
## 8. Recommendations
**Data Quality Actions**:
- [ ] [Action - priority]
- [ ] [Action - priority]
**Next Steps**:
- [Step with rationale]
- [Step with rationale]
**Feature Engineering**:
- [Opportunity]
- [Opportunity]
**Modeling Considerations**:
- [Consideration]
- [Consideration]
---
## 9. Limitations
**Data**: [Key limitations]
**Analysis**: [Key limitations]
**Assumptions**: [Key assumptions made]
---
## Appendices
### A: Technical Details
**Environment**: Python with pandas, numpy, scipy, matplotlib, seaborn
**Scripts**: [Repository/location]
### B: Variable Dictionary
| Variable | Type | Description | Unit | Range | Missing % |
|----------|------|-------------|------|-------|-----------|
| [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] |
| [Var 2] | [Type] | [Description] | [Unit] | [Range] | [%] |
### C: Statistical Tests
**Normality**:
| Variable | Test | Statistic | p-value | Result |
|----------|------|-----------|---------|--------|
| [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] |
**Correlations**:
| Var 1 | Var 2 | r | p-value | Significant |
|-------|-------|---|---------|-------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] |
### D: Visualizations
1. [Description](path/to/viz1.png)
2. [Description](path/to/viz2.png)

View File

@@ -1,379 +1,125 @@
# EDA Best Practices
Methodologies for conducting thorough exploratory data analysis.
## 6-Step EDA Framework
### 1. Initial Understanding
**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?
**Actions**: Load data, inspect structure, review types, document context
### 2. Quality Assessment
**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy
**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)
### 3. Univariate Analysis
**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers
**Categorical**: Frequency distributions, unique counts, balance, bar charts
**Temporal**: Time range, gaps, trends, seasonality, time series plots
### 4. Bivariate Analysis
**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity
**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA
**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
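For example, a minimal sketch of the chi-square / Cramér's V check for two categorical columns (the DataFrame and column names are placeholders):
```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association strength between two categorical variables (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
# Example call (placeholder names): cramers_v(df["region"], df["segment"])
```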
### 5. Multivariate Analysis
**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering
**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
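A hedged PCA sketch for gauging how many components capture most of the variance (the file path is a placeholder; scikit-learn is assumed to be available):
```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("data.csv")                     # placeholder path
numeric = df.select_dtypes("number").dropna()    # PCA needs complete numeric rows
pca = PCA().fit(StandardScaler().fit_transform(numeric))
cumulative = pca.explained_variance_ratio_.cumsum()
print("components for 90% variance:", int((cumulative < 0.90).sum()) + 1)
```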
### 6. Insight Generation
**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications
## Best Practices
### Visualization Guidelines
**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar
**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter
### Statistical Analysis Guidelines
**Check Assumptions**: Normality, homoscedasticity, independence, linearity
**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes
**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation
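A minimal sketch of the parametric/non-parametric choice when comparing two groups (synthetic data; the p > 0.05 normality cutoff is a common convention, not a rule):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 80)   # synthetic placeholder groups
group_b = rng.normal(11, 2, 75)
normal = stats.shapiro(group_a).pvalue > 0.05 and stats.shapiro(group_b).pvalue > 0.05
result = stats.ttest_ind(group_a, group_b) if normal else stats.mannwhitneyu(group_a, group_b)
print(result)
```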
### Documentation Guidelines
**Notes**: Document assumptions, decisions, issues, findings
**Reproducibility**: Use scripts, version control, document sources, set random seeds
**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations
## Common Pitfalls
1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
## Domain-Specific Considerations
**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits
**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection
**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV
**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches
**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability
## Iterative Process
EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis
**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights
**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations
## Communication
**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code
**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations
**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix
## Checklists
**Before**: Understand context, define objectives, identify audience, set up environment
**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously
**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility

View File

@@ -1,252 +1,126 @@
# Statistical Tests Guide
Interpretation guidelines for common EDA statistical tests.
## Normality Tests
### Shapiro-Wilk
**Use**: Small to medium samples (n < 5000)
**H0**: Data is normal | **H1**: Data is not normal
**Interpretation**: p > 0.05 → likely normal | p ≤ 0.05 → not normal
**Note**: Very sensitive to sample size; small deviations may be significant in large samples
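A minimal sketch with `scipy.stats.shapiro` on synthetic data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)   # synthetic, roughly normal data
stat, p = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p:.3f}")            # p > 0.05 -> no evidence against normality
```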
### Anderson-Darling
**Use**: More powerful than Shapiro-Wilk, emphasizes tails
**Interpretation**: Test statistic > critical value → reject normality
### Kolmogorov-Smirnov
**Use**: Large samples or testing against non-normal distributions
**Interpretation**: p > 0.05 → matches reference | p ≤ 0.05 → differs from reference
## Distribution Characteristics
### Skewness
**Measures asymmetry**:
- ≈ 0: Symmetric
- \> 0: Right-skewed (tail right)
- < 0: Left-skewed (tail left)
**Magnitude**: |s| < 0.5 (symmetric) | 0.5-1 (moderate) | ≥ 1 (high)
**Action**: High skew → consider transformation (log, sqrt, Box-Cox); use median over mean
### Kurtosis
**Measures tailedness** (excess kurtosis, normal = 0):
- ≈ 0: Normal tails
- \> 0: Heavy tails, more outliers
- < 0: Light tails, fewer outliers
**Magnitude**: |k| < 0.5 (normal) | 0.5-1 (moderate) | ≥ 1 (very different)
**Action**: High kurtosis → investigate outliers carefully
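For example, a quick check of both measures on synthetic right-skewed data (`scipy.stats.kurtosis` reports excess kurtosis by default):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
values = rng.lognormal(mean=0.0, sigma=0.75, size=1000)   # synthetic right-skewed data
print("skewness:", round(stats.skew(values), 2))
print("excess kurtosis:", round(stats.kurtosis(values), 2))
print("skewness after log:", round(stats.skew(np.log(values)), 2))
```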
## Correlation
### Pearson
**Measures**: Linear relationship (-1 to +1)
**Strength**: |r| < 0.3 (weak) | 0.3-0.5 (moderate) | 0.5-0.7 (strong) | ≥ 0.7 (very strong)
**Assumptions**: Linear, continuous, normal, no outliers, homoscedastic
**Use**: Expected linear relationship, assumptions met
### Spearman
**Measures**: Monotonic relationship (-1 to +1), rank-based
**Advantages**: Robust to outliers, no linearity assumption, works with ordinal, no normality required
**Use**: Outliers present, non-linear monotonic relationship, ordinal data, non-normal
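A minimal sketch computing both coefficients on synthetic data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2 * x + rng.normal(scale=0.5, size=300)   # roughly linear synthetic relationship
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f} (p = {p_r:.3g}) | Spearman rho = {rho:.2f} (p = {p_rho:.3g})")
```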
## Outlier Detection
### IQR Method
**Bounds**: Q1 - 1.5×IQR to Q3 + 1.5×IQR
**Characteristics**: Simple, robust, works with skewed data
**Typical Rates**: < 5% (normal) | 5-10% (moderate) | > 10% (high, investigate)
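A minimal sketch of Tukey's fences on synthetic data:
```python
import numpy as np
import pandas as pd
rng = np.random.default_rng(3)
s = pd.Series(np.append(rng.normal(100, 10, 500), [210, 250]))  # synthetic data plus two extremes
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(f"{len(outliers)} outliers ({len(outliers) / len(s):.1%})")
```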
### Z-Score Method
**Definition**: |z| > 3 where z = (x - μ) / σ
**Use**: Normal data, n > 30
**Avoid**: Small samples, skewed data, many outliers (contaminates mean/SD)
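A minimal z-score sketch on synthetic, roughly normal data:
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(4)
values = np.append(rng.normal(0, 1, 1000), [5.5, -6.0])  # synthetic data plus two extremes
z = stats.zscore(values)              # (x - mean) / std
outliers = values[np.abs(z) > 3]      # standard |z| > 3 threshold
print(f"{outliers.size} outliers flagged")
```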
## Hypothesis Testing
**Significance Levels**: α = 0.05 (standard) | 0.01 (conservative) | 0.10 (liberal)
**p-value Interpretation**: ≤ 0.001 (***) | ≤ 0.01 (**) | ≤ 0.05 (*) | ≤ 0.10 (weak) | > 0.10 (none)
**Key Considerations**:
- Statistical ≠ practical significance
- Multiple testing → use correction (Bonferroni, FDR)
- Large samples detect trivial effects
- Always report effect sizes with p-values
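For the multiple-testing point above, a minimal correction sketch (assumes `statsmodels` is available; the p-values are placeholders):
```python
from statsmodels.stats.multitest import multipletests
p_values = [0.001, 0.012, 0.049, 0.20, 0.04]   # placeholder p-values from several exploratory tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_adjusted.round(3), reject)))
```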
## Transformations
**Right-skewed**: Log, sqrt, Box-Cox
**Left-skewed**: Square, cube, exponential
**Heavy tails**: Robust scaling, winsorization, log
**Non-constant variance**: Log, Box-Cox
**Common Methods**:
- **Log**: log(x+1) for positive skew, multiplicative relationships
- **Sqrt**: Count data, moderate skew
- **Box-Cox**: Auto-finds optimal (requires positive values)
- **Standardization**: (x-μ)/σ for scaling to unit variance
- **Min-Max**: (x-min)/(max-min) for [0,1] scaling
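A minimal sketch comparing two of the methods above on synthetic right-skewed data (Box-Cox requires strictly positive values):
```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(5)
x = rng.lognormal(mean=1.0, sigma=0.8, size=1000)   # synthetic right-skewed data
log_x = np.log1p(x)                                 # log(x + 1), safe when zeros are present
bc_x, lam = stats.boxcox(x)                         # also returns the fitted lambda
print(f"skew raw={stats.skew(x):.2f}  log={stats.skew(log_x):.2f}  box-cox={stats.skew(bc_x):.2f} (lambda={lam:.2f})")
```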
## Practical Guidelines
**Sample Size**: n < 30 (non-parametric, cautious) | 30-100 (parametric OK) | ≥ 100 (robust) | ≥ 1000 (may detect trivial effects)
**Missing Data**: < 5% (simple methods) | 5-10% (imputation) | > 10% (investigate patterns, advanced methods)
**Reporting**: Include test statistic, p-value, CI, effect size, n, assumption checks