Timothy Kassis
2025-10-19 16:53:31 -07:00
parent 3b85528b90
commit c627714209
8 changed files with 2236 additions and 1 deletion

View File

@@ -7,7 +7,7 @@
},
"metadata": {
"description": "Claude scientific skills from K-Dense Inc",
"version": "1.2.0"
"version": "1.3.0"
},
"plugins": [
{
@@ -73,6 +73,7 @@
"source": "./",
"strict": false,
"skills": [
"./scientific-thinking/exploratory-data-analysis",
"./scientific-thinking/hypothesis-generation",
"./scientific-thinking/scientific-critical-thinking",
"./scientific-thinking/statistical-analysis",

View File

@@ -68,6 +68,7 @@ A comprehensive collection of ready-to-use scientific skills for Claude, curated
### Scientific Thinking & Analysis
- **Exploratory Data Analysis** - Comprehensive EDA toolkit with automated statistics, visualizations, and insights for any tabular dataset
- **Hypothesis Generation** - Structured frameworks for generating and evaluating scientific hypotheses
- **Scientific Critical Thinking** - Tools and approaches for rigorous scientific reasoning and evaluation
- **Scientific Visualization** - Best practices and templates for creating publication-quality scientific figures

View File

@@ -0,0 +1,275 @@
---
name: exploratory-data-analysis
description: Comprehensive exploratory data analysis toolkit for data scientists. Use when users request data exploration, analysis of datasets, statistical summaries, data visualizations, or insights from data files. Handles multiple file formats (CSV, Excel, JSON, Parquet, etc.) and generates detailed markdown reports with statistics, visualizations, and automated insights. This skill should be used when analyzing any tabular data to understand patterns, distributions, correlations, outliers, and data quality issues.
---
# Exploratory Data Analysis
## Overview
Perform comprehensive exploratory data analysis on datasets of any format. This skill acts as a proficient data scientist, automatically analyzing data to generate meaningful summaries, advanced statistics, visualizations, and actionable insights. All textual outputs are generated as markdown for seamless integration into workflows.
## When to Use This Skill
Invoke this skill when:
- User provides a data file and requests analysis or exploration
- User asks to "explore this dataset", "analyze this data", or "what's in this file?"
- User needs statistical summaries, distributions, or correlations
- User requests data visualizations or insights
- User wants to understand data quality issues or patterns
- User mentions EDA, exploratory analysis, or data profiling
**Supported file formats**: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
## Quick Start Workflow
1. **Receive data file** from user
2. **Run comprehensive analysis** using `scripts/eda_analyzer.py`
3. **Generate visualizations** using `scripts/visualizer.py`
4. **Create markdown report** using insights and the `assets/report_template.md` template
5. **Present findings** to user with key insights highlighted
## Core Capabilities
### 1. Comprehensive Data Analysis
Execute full statistical analysis using the `eda_analyzer.py` script:
```bash
python scripts/eda_analyzer.py <data_file_path> -o <output_directory>
```
**What it provides**:
- Auto-detection and loading of file formats
- Basic dataset information (shape, types, memory usage)
- Missing data analysis (patterns, percentages)
- Summary statistics for numeric and categorical variables
- Outlier detection using IQR and Z-score methods
- Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
- Correlation analysis (Pearson and Spearman)
- Data quality assessment (completeness, duplicates, issues)
- Automated insight generation
**Output**: JSON file containing all analysis results at `<output_directory>/eda_analysis.json`
### 2. Comprehensive Visualizations
Generate complete visualization suite using the `visualizer.py` script:
```bash
python scripts/visualizer.py <data_file_path> -o <output_directory>
```
**Generated visualizations**:
- **Missing data patterns**: Heatmap and bar chart showing missing data
- **Distribution plots**: Histograms with KDE overlays for all numeric variables
- **Box plots with violin plots**: Outlier detection visualizations
- **Correlation heatmap**: Both Pearson and Spearman correlation matrices
- **Scatter matrix**: Pairwise relationships between numeric variables
- **Categorical analysis**: Bar charts for top categories
- **Time series plots**: Temporal trends with trend lines (if datetime columns exist)
**Output**: High-quality PNG files saved to `<output_directory>/eda_visualizations/`
All visualizations are production-ready with:
- 300 DPI resolution
- Clear titles and labels
- Statistical annotations
- Professional styling using seaborn
### 3. Automated Insight Generation
The analyzer automatically generates actionable insights including:
- **Data scale insights**: Dataset size considerations for processing
- **Missing data alerts**: Warnings when missing data exceeds thresholds
- **Correlation discoveries**: Strong relationships identified for feature engineering
- **Outlier warnings**: Variables with high outlier rates flagged
- **Distribution assessments**: Skewness issues requiring transformations
- **Duplicate alerts**: Duplicate row detection
- **Imbalance warnings**: Categorical variable imbalance detection
Access insights from the analysis results JSON under the `"insights"` key.
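For example, a minimal sketch that loads the results and prints the insights (assumes the analyzer was run with `-o ./analysis_output`; adjust the path to your output directory):
```python
import json
from pathlib import Path

# Path produced by eda_analyzer.py; adjust if a different -o directory was used
results_path = Path("./analysis_output/eda_analysis.json")

with results_path.open() as f:
    results = json.load(f)

# The analyzer stores automated insights as a list of strings under "insights"
for insight in results.get("insights", []):
    print(f"- {insight}")
```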
### 4. Statistical Interpretation
For detailed interpretation of statistical tests and measures, reference:
**`references/statistical_tests_guide.md`** - Comprehensive guide covering:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation tests (Pearson, Spearman)
- Outlier detection methods (IQR, Z-score)
- Hypothesis testing guidelines
- Data transformation strategies
Load this reference when interpreting specific statistical tests or explaining results to users.
### 5. Best Practices Guidance
For methodological guidance, reference:
**`references/eda_best_practices.md`** - Detailed best practices including:
- EDA process framework (6-step methodology)
- Univariate, bivariate, and multivariate analysis approaches
- Visualization guidelines
- Statistical analysis guidelines
- Common pitfalls to avoid
- Domain-specific considerations
- Communication tips for technical and non-technical audiences
Load this reference when planning the analysis approach or when guidance on specific EDA scenarios is needed.
## Creating Analysis Reports
Use the provided template to structure comprehensive EDA reports:
**`assets/report_template.md`** - Professional report template with sections for:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and findings
- Recommendations
- Limitations and appendices
**To use the template**:
1. Copy the template content
2. Fill in sections with analysis results from JSON output
3. Embed visualization images using markdown syntax
4. Populate insights and recommendations
5. Save as markdown for user consumption
## Typical Workflow Example
When user provides a data file:
```
User: "Can you explore this sales_data.csv file and tell me what you find?"
1. Run analysis:
python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output
2. Generate visualizations:
python scripts/visualizer.py sales_data.csv -o ./analysis_output
3. Read analysis results:
Read ./analysis_output/eda_analysis.json
4. Create markdown report using template:
- Copy assets/report_template.md structure
- Fill in sections with analysis results
- Reference visualizations from ./analysis_output/eda_visualizations/
- Include automated insights from JSON
5. Present to user:
- Show key insights prominently
- Highlight data quality issues
- Provide visualizations inline
- Make actionable recommendations
- Save complete report as .md file
```
## Advanced Analysis Scenarios
### Large Datasets (>1M rows)
- Run analysis on sampled data first for quick exploration (see the sketch below)
- Note sample size in report
- Recommend distributed computing for full analysis
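A minimal sampling sketch with pandas (the file name and 100,000-row sample size are illustrative; replace them with your own):
```python
import pandas as pd

# Illustrative values: replace the path and sample size with your own
df = pd.read_csv("sales_data.csv")
sample = df.sample(n=min(100_000, len(df)), random_state=42)  # fixed seed for reproducibility

# Save the sample so eda_analyzer.py / visualizer.py can be run against it
sample.to_csv("sales_data_sample.csv", index=False)
print(f"Sampled {len(sample):,} of {len(df):,} rows")
```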
### High-Dimensional Data (>50 columns)
- Focus on most important variables first
- Consider PCA or feature selection
- Generate correlation analysis to identify variable groups
- Reference `eda_best_practices.md` section on high-dimensional data
### Time Series Data
- Ensure datetime columns are properly detected
- Time series visualizations will be automatically generated
- Consider temporal patterns, trends, and seasonality
- Reference `eda_best_practices.md` section on time series
### Imbalanced Data
- Categorical analysis will flag imbalances
- Report class distributions prominently
- Recommend stratified sampling if needed
### Small Sample Sizes (<100 rows)
- Non-parametric methods automatically used where appropriate
- Be conservative in statistical conclusions
- Note sample size limitations in report
## Output Best Practices
**Always output as markdown**:
- Structure findings using markdown headers, tables, and lists
- Embed visualizations using `![Description](path/to/image.png)` syntax
- Use tables for statistical summaries
- Include code blocks for any suggested transformations
- Highlight key insights with bold or bullet points
**Ensure reports are actionable**:
- Provide clear recommendations based on findings
- Flag data quality issues that need attention
- Suggest next steps for modeling or further analysis
- Identify feature engineering opportunities
**Make insights accessible**:
- Explain statistical concepts in plain language
- Use reference guides to provide detailed interpretations
- Include both technical details and executive summary
- Tailor communication to user's technical level
## Handling Edge Cases
**Unsupported file formats**:
- Request user to convert to supported format
- Suggest using pandas-compatible formats
**Files too large to load**:
- Recommend sampling approach
- Suggest chunked processing
- Consider alternative tools for big data
**Corrupted or malformed data**:
- Report specific errors encountered
- Suggest data cleaning steps
- Try to salvage partial analysis if possible
**Columns with all values missing**:
- Flag completely empty columns
- Recommend removal or investigation
- Document in data quality section
## Resources Summary
### scripts/
- **`eda_analyzer.py`**: Main analysis engine - comprehensive statistical analysis
- **`visualizer.py`**: Visualization generator - creates all chart types
Both scripts are fully executable and handle multiple file formats automatically.
### references/
- **`statistical_tests_guide.md`**: Statistical test interpretation and methodology
- **`eda_best_practices.md`**: Comprehensive EDA methodology and best practices
Load these references as needed to inform analysis approach and interpretation.
### assets/
- **`report_template.md`**: Professional markdown report template
Use this template structure for creating consistent, comprehensive EDA reports.
## Key Reminders
1. **Always generate markdown output** for textual results
2. **Run both scripts** (analyzer and visualizer) for complete analysis
3. **Use the template** to structure comprehensive reports
4. **Include visualizations** by referencing generated PNG files
5. **Provide actionable insights** - don't just present statistics
6. **Interpret findings** using reference guides
7. **Document limitations** and data quality issues
8. **Make recommendations** for next steps
This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.

View File

@@ -0,0 +1,388 @@
# Exploratory Data Analysis Report
**Dataset**: [Dataset Name]
**Analysis Date**: [Date]
**Analyst**: [Name]
---
## Executive Summary
[2-3 paragraph summary of key findings, major insights, and recommendations]
**Key Findings**:
- [Finding 1]
- [Finding 2]
- [Finding 3]
**Recommendations**:
- [Recommendation 1]
- [Recommendation 2]
---
## 1. Dataset Overview
### 1.1 Data Source
- **Source**: [Source name and location]
- **Collection Period**: [Date range]
- **Last Updated**: [Date]
- **Format**: [CSV, Excel, JSON, etc.]
### 1.2 Data Structure
- **Observations (Rows)**: [Number]
- **Variables (Columns)**: [Number]
- **Memory Usage**: [Size in MB]
### 1.3 Variable Types
- **Numeric Variables** ([Count]): [List column names]
- **Categorical Variables** ([Count]): [List column names]
- **Datetime Variables** ([Count]): [List column names]
- **Boolean Variables** ([Count]): [List column names]
---
## 2. Data Quality Assessment
### 2.1 Completeness
**Overall Data Completeness**: [Percentage]%
**Missing Data Summary**:
| Column | Missing Count | Missing % | Assessment |
|--------|--------------|-----------|------------|
| [Column 1] | [Count] | [%] | [High/Medium/Low] |
| [Column 2] | [Count] | [%] | [High/Medium/Low] |
**Missing Data Pattern**: [Description of patterns, if any]
**Visualization**: ![Missing Data](path/to/missing_data.png)
### 2.2 Duplicates
- **Duplicate Rows**: [Count] ([Percentage]%)
- **Action Required**: [Yes/No - describe if needed]
### 2.3 Data Quality Issues
[List any identified issues]
- [ ] Issue 1: [Description]
- [ ] Issue 2: [Description]
- [ ] Issue 3: [Description]
---
## 3. Univariate Analysis
### 3.1 Numeric Variables
[For each key numeric variable:]
#### [Variable Name]
**Summary Statistics**:
- **Mean**: [Value]
- **Median**: [Value]
- **Std Dev**: [Value]
- **Min**: [Value]
- **Max**: [Value]
- **Range**: [Value]
- **IQR**: [Value]
**Distribution Characteristics**:
- **Skewness**: [Value] - [Interpretation]
- **Kurtosis**: [Value] - [Interpretation]
- **Normality**: [Normal/Not Normal based on tests]
**Outliers**:
- **IQR Method**: [Count] outliers ([Percentage]%)
- **Z-Score Method**: [Count] outliers ([Percentage]%)
**Visualization**: ![Distribution of [Variable]](path/to/distribution.png)
**Insights**:
- [Key insight 1]
- [Key insight 2]
---
### 3.2 Categorical Variables
[For each key categorical variable:]
#### [Variable Name]
**Summary**:
- **Unique Values**: [Count]
- **Most Common**: [Value] ([Percentage]%)
- **Least Common**: [Value] ([Percentage]%)
- **Balance**: [Balanced/Imbalanced]
**Top Categories**:
| Category | Count | Percentage |
|----------|-------|------------|
| [Cat 1] | [Count] | [%] |
| [Cat 2] | [Count] | [%] |
| [Cat 3] | [Count] | [%] |
**Visualization**: ![Distribution of [Variable]](path/to/categorical.png)
**Insights**:
- [Key insight 1]
- [Key insight 2]
---
### 3.3 Temporal Variables
[If datetime columns exist:]
#### [Variable Name]
**Time Range**: [Start Date] to [End Date]
**Duration**: [Time span]
**Temporal Coverage**: [Complete/Gaps identified]
**Temporal Patterns**:
- **Trend**: [Increasing/Decreasing/Stable]
- **Seasonality**: [Yes/No - describe if present]
- **Gaps**: [List any gaps in timeline]
**Visualization**: ![Time Series of [Variable]](path/to/timeseries.png)
**Insights**:
- [Key insight 1]
- [Key insight 2]
---
## 4. Bivariate Analysis
### 4.1 Correlation Analysis
**Overall Correlation Structure**:
- **Strong Positive Correlations**: [Count]
- **Strong Negative Correlations**: [Count]
- **Weak/No Correlations**: [Count]
**Correlation Matrix**:
![Correlation Heatmap](path/to/correlation_heatmap.png)
**Notable Correlations**:
| Variable 1 | Variable 2 | Pearson r | Spearman ρ | Strength | Interpretation |
|-----------|-----------|-----------|------------|----------|----------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Strong/Moderate/Weak] | [Interpretation] |
| [Var 1] | [Var 3] | [Value] | [Value] | [Strong/Moderate/Weak] | [Interpretation] |
**Insights**:
- [Key insight about correlations]
- [Potential multicollinearity issues]
- [Feature engineering opportunities]
---
### 4.2 Key Relationships
[For important variable pairs:]
#### [Variable 1] vs [Variable 2]
**Relationship Type**: [Linear/Non-linear/None]
**Correlation**: [Value]
**Statistical Test**: [Test name, p-value]
**Visualization**: ![Scatter Plot](path/to/scatter.png)
**Insights**:
- [Description of relationship]
- [Implications]
---
## 5. Multivariate Analysis
### 5.1 Scatter Matrix
![Scatter Matrix](path/to/scatter_matrix.png)
**Observations**:
- [Pattern 1]
- [Pattern 2]
- [Pattern 3]
### 5.2 Clustering Patterns
[If clustering analysis performed:]
**Method**: [Method used]
**Number of Clusters**: [Count]
**Cluster Characteristics**:
- **Cluster 1**: [Description]
- **Cluster 2**: [Description]
**Visualization**: [Link to visualization]
---
## 6. Outlier Analysis
### 6.1 Outlier Summary
**Overall Outlier Rate**: [Percentage]%
**Variables with High Outlier Rates**:
| Variable | Outlier Count | Outlier % | Method | Action |
|----------|--------------|-----------|--------|--------|
| [Var 1] | [Count] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
| [Var 2] | [Count] | [%] | [IQR/Z-score] | [Keep/Investigate/Remove] |
**Visualization**: ![Box Plots](path/to/boxplots.png)
### 6.2 Outlier Investigation
[For significant outliers:]
#### [Variable Name]
**Outlier Characteristics**:
- [Description of outliers]
- [Potential causes]
- [Validity assessment]
**Recommendation**: [Keep/Remove/Transform/Investigate further]
---
## 7. Key Insights and Findings
### 7.1 Data Quality Insights
1. **[Insight 1]**: [Description and implication]
2. **[Insight 2]**: [Description and implication]
3. **[Insight 3]**: [Description and implication]
### 7.2 Statistical Insights
1. **[Insight 1]**: [Description and implication]
2. **[Insight 2]**: [Description and implication]
3. **[Insight 3]**: [Description and implication]
### 7.3 Business/Research Insights
1. **[Insight 1]**: [Description and implication]
2. **[Insight 2]**: [Description and implication]
3. **[Insight 3]**: [Description and implication]
### 7.4 Unexpected Findings
1. **[Finding 1]**: [Description and significance]
2. **[Finding 2]**: [Description and significance]
---
## 8. Recommendations
### 8.1 Data Quality Actions
- [ ] **[Action 1]**: [Description and priority]
- [ ] **[Action 2]**: [Description and priority]
- [ ] **[Action 3]**: [Description and priority]
### 8.2 Analysis Next Steps
1. **[Step 1]**: [Description and rationale]
2. **[Step 2]**: [Description and rationale]
3. **[Step 3]**: [Description and rationale]
### 8.3 Feature Engineering Opportunities
- **[Opportunity 1]**: [Description]
- **[Opportunity 2]**: [Description]
- **[Opportunity 3]**: [Description]
### 8.4 Modeling Considerations
- **[Consideration 1]**: [Description]
- **[Consideration 2]**: [Description]
- **[Consideration 3]**: [Description]
---
## 9. Limitations and Caveats
### 9.1 Data Limitations
- [Limitation 1]
- [Limitation 2]
- [Limitation 3]
### 9.2 Analysis Limitations
- [Limitation 1]
- [Limitation 2]
- [Limitation 3]
### 9.3 Assumptions Made
- [Assumption 1]
- [Assumption 2]
- [Assumption 3]
---
## 10. Appendices
### Appendix A: Technical Details
**Software Environment**:
- Python: [Version]
- Key Libraries: pandas ([Version]), numpy ([Version]), scipy ([Version]), matplotlib ([Version])
**Analysis Scripts**: [Link to repository or location]
### Appendix B: Variable Dictionary
| Variable Name | Type | Description | Unit | Valid Range | Missing % |
|--------------|------|-------------|------|-------------|-----------|
| [Var 1] | [Type] | [Description] | [Unit] | [Range] | [%] |
| [Var 2] | [Type] | [Description] | [Unit] | [Range] | [%] |
### Appendix C: Statistical Test Results
[Detailed statistical test outputs]
**Normality Tests**:
| Variable | Test | Statistic | p-value | Result |
|----------|------|-----------|---------|--------|
| [Var 1] | Shapiro-Wilk | [Value] | [Value] | [Normal/Non-normal] |
**Correlation Tests**:
| Var 1 | Var 2 | Coefficient | p-value | Significance |
|-------|-------|-------------|---------|--------------|
| [Var 1] | [Var 2] | [Value] | [Value] | [Yes/No] |
### Appendix D: Full Visualization Gallery
[Links to all generated visualizations]
1. [Visualization 1 description](path/to/viz1.png)
2. [Visualization 2 description](path/to/viz2.png)
3. [Visualization 3 description](path/to/viz3.png)
---
## Contact Information
**Analyst**: [Name]
**Email**: [Email]
**Date**: [Date]
**Version**: [Version number]
---
**Document History**:
| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | [Date] | Initial analysis | [Name] |

View File

@@ -0,0 +1,379 @@
# Exploratory Data Analysis Best Practices
This guide provides best practices and methodologies for conducting thorough exploratory data analysis.
## EDA Process Framework
### 1. Initial Data Understanding
**Objectives**:
- Understand data structure and format
- Identify data types and schema
- Get familiar with domain context
**Key Questions**:
- What does each column represent?
- What is the unit of observation?
- What is the time period covered?
- What is the data collection methodology?
- Are there any known data quality issues?
**Actions**:
- Load and inspect first/last rows
- Check data dimensions (rows × columns)
- Review column names and types
- Document data source and context
### 2. Data Quality Assessment
**Objectives**:
- Identify data quality issues
- Assess data completeness and reliability
- Document data limitations
**Key Checks**:
- **Missing data**: Patterns, extent, randomness
- **Duplicates**: Exact and near-duplicates
- **Outliers**: Valid extremes vs. data errors
- **Consistency**: Cross-field validation
- **Accuracy**: Domain knowledge validation
**Red Flags**:
- High missing data rate (>20%)
- Unexpected duplicates
- Constant or near-constant columns
- Impossible values (negative ages, dates in the future)
- High cardinality in ID-like columns
- Suspicious patterns (too many round numbers)
### 3. Univariate Analysis
**Objectives**:
- Understand individual variable distributions
- Identify anomalies and patterns
- Determine variable characteristics
**For Numeric Variables**:
- Central tendency (mean, median, mode)
- Dispersion (range, variance, std, IQR)
- Shape (skewness, kurtosis)
- Distribution visualization (histogram, KDE, box plot)
- Outlier detection
**For Categorical Variables**:
- Frequency distributions
- Unique value counts
- Most/least common categories
- Category balance/imbalance
- Bar charts and count plots
**For Temporal Variables**:
- Time range coverage
- Gaps in timeline
- Temporal patterns (trends, seasonality)
- Time series plots
### 4. Bivariate Analysis
**Objectives**:
- Understand relationships between variables
- Identify correlations and dependencies
- Find potential predictors
**Numeric vs Numeric**:
- Scatter plots
- Correlation coefficients (Pearson, Spearman)
- Line of best fit
- Detect non-linear relationships
**Numeric vs Categorical**:
- Group statistics (mean, median by category)
- Box plots by category
- Distribution plots by category
- Statistical tests (t-test, ANOVA)
**Categorical vs Categorical**:
- Cross-tabulation / contingency tables
- Stacked bar charts
- Chi-square tests
- Cramér's V for association strength (see the sketch below)
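As an illustration, a minimal sketch of the tests mentioned above using scipy and pandas (the input file and the column names `price`, `segment`, and `region` are hypothetical):
```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")  # hypothetical input file

# Numeric vs categorical: compare group means with one-way ANOVA
groups = [g["price"].dropna() for _, g in df.groupby("segment")]
f_stat, p_anova = stats.f_oneway(*groups)

# Categorical vs categorical: chi-square test on a contingency table
table = pd.crosstab(df["segment"], df["region"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Cramér's V for association strength
n = table.values.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"ANOVA p={p_anova:.4f}, chi-square p={p_chi2:.4f}, Cramér's V={cramers_v:.3f}")
```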
### 5. Multivariate Analysis
**Objectives**:
- Understand complex interactions
- Identify patterns across multiple variables
- Explore dimensionality
**Techniques**:
- Correlation matrices and heatmaps
- Pair plots / scatter matrices
- Parallel coordinates plots
- Principal Component Analysis (PCA), sketched below
- Clustering analysis
**Key Questions**:
- Are there groups of correlated features?
- Can we reduce dimensionality?
- Are there natural clusters?
- Do patterns change when conditioning on other variables?
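For instance, a minimal PCA sketch with scikit-learn (assumes numeric columns only; the input file and the 95% variance threshold are illustrative choices):
```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")  # hypothetical input file
X = df.select_dtypes(include="number").dropna()

# Standardize first so each feature contributes equally
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance (illustrative threshold)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features reduced to {pca.n_components_} components")
print("Explained variance ratios:", pca.explained_variance_ratio_.round(3))
```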
### 6. Insight Generation
**Objectives**:
- Synthesize findings into actionable insights
- Formulate hypotheses
- Identify next steps
**What to Look For**:
- Unexpected patterns or anomalies
- Strong relationships or correlations
- Data quality issues requiring attention
- Feature engineering opportunities
- Business or research implications
## Best Practices
### Visualization Guidelines
1. **Choose appropriate chart types**:
- Distribution: Histogram, KDE, box plot, violin plot
- Relationships: Scatter plot, line plot, heatmap
- Composition: Stacked bar, pie chart (use sparingly)
- Comparison: Bar chart, grouped bar chart
2. **Make visualizations clear and informative**:
- Always label axes with units
- Add descriptive titles
- Use color purposefully
- Include legends when needed
- Choose appropriate scales
- Avoid chart junk
3. **Use multiple views**:
- Show data from different angles
- Combine complementary visualizations
- Use small multiples for faceting
### Statistical Analysis Guidelines
1. **Check assumptions**:
- Test for normality before parametric tests
- Check for homoscedasticity
- Verify independence of observations
- Assess linearity for linear models
2. **Use appropriate methods**:
- Parametric tests when assumptions met
- Non-parametric alternatives when violated
- Robust methods for outlier-prone data
- Effect sizes alongside p-values
3. **Consider context**:
- Statistical significance ≠ practical significance
- Domain knowledge trumps statistical patterns
- Correlation ≠ causation
- Sample size affects what you can detect
### Documentation Guidelines
1. **Keep detailed notes**:
- Document assumptions and decisions
- Record data issues discovered
- Note interesting findings
- Track questions that arise
2. **Create reproducible analysis**:
- Use scripts, not manual Excel operations
- Version control your code
- Document data sources and versions
- Include random seeds for reproducibility
3. **Summarize findings**:
- Write clear summaries
- Use visualizations to support points
- Highlight key insights
- Provide recommendations
## Common Pitfalls to Avoid
### 1. Confirmation Bias
- **Problem**: Looking only for evidence supporting preconceptions
- **Solution**: Actively seek disconfirming evidence, use blind analysis
### 2. Ignoring Data Quality
- **Problem**: Proceeding with analysis despite known data issues
- **Solution**: Address quality issues first, document limitations
### 3. Over-reliance on Automation
- **Problem**: Running analyses without understanding or verifying results
- **Solution**: Manually inspect subsets, verify automated findings
### 4. Neglecting Outliers
- **Problem**: Removing outliers without investigation
- **Solution**: Always investigate outliers - they may contain important information
### 5. Multiple Testing Without Correction
- **Problem**: Running many tests increases false positive rate
- **Solution**: Use correction methods (Bonferroni, FDR) or be explicit about exploratory nature
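A minimal sketch of p-value adjustment (assumes statsmodels is available; the p-values are illustrative placeholders):
```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from a batch of exploratory tests
p_values = [0.001, 0.02, 0.04, 0.30, 0.76]

# Benjamini-Hochberg FDR control; method="bonferroni" is the more conservative option
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for p, p_adj, rej in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant after correction: {rej}")
```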
### 6. Mistaking Association for Causation
- **Problem**: Inferring causation from correlation
- **Solution**: Use careful language, acknowledge alternative explanations
### 7. Cherry-picking Results
- **Problem**: Reporting only interesting/significant findings
- **Solution**: Report complete analysis, including negative results
### 8. Ignoring Sample Size
- **Problem**: Not considering how sample size affects conclusions
- **Solution**: Report effect sizes, confidence intervals, and sample sizes
## Domain-Specific Considerations
### Time Series Data
- Check for stationarity (see the sketch below)
- Identify trends and seasonality
- Look for autocorrelation
- Handle missing time points
- Consider temporal splits for validation
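As an example, a minimal stationarity check with the augmented Dickey-Fuller test (assumes statsmodels is available; the file and column names are hypothetical):
```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("timeseries.csv", parse_dates=["date"])  # hypothetical file and date column
series = df.set_index("date")["value"].dropna()            # hypothetical value column

# Augmented Dickey-Fuller: H0 = the series has a unit root (non-stationary)
adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.4f}")
print("Likely stationary" if p_value <= 0.05 else "Likely non-stationary (consider differencing)")
```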
### High-Dimensional Data
- Start with dimensionality reduction
- Focus on feature importance
- Be cautious of the curse of dimensionality
- Use regularization in modeling
- Consider domain knowledge for feature selection
### Imbalanced Data
- Report class distributions
- Use appropriate metrics (not just accuracy)
- Consider resampling techniques
- Stratify sampling and cross-validation
- Be aware of biases in learning
### Small Sample Sizes
- Use non-parametric methods
- Be conservative with conclusions
- Report confidence intervals
- Consider Bayesian approaches
- Acknowledge limitations
### Big Data
- Sample intelligently for exploration
- Use efficient data structures
- Leverage parallel/distributed computing
- Be aware of computational complexity
- Consider scalability in methods
## Iterative Process
EDA is not linear - iterate and refine:
1. **Initial exploration** → Identify questions
2. **Focused analysis** → Answer specific questions
3. **New insights** → Generate new questions
4. **Deeper investigation** → Refine understanding
5. **Synthesis** → Integrate findings
### When to Stop
You've done enough EDA when:
- ✅ You understand the data structure and quality
- ✅ You've characterized key variables
- ✅ You've identified important relationships
- ✅ You've documented limitations
- ✅ You can answer your research questions
- ✅ You have actionable insights
### Moving Forward
After EDA, you should have:
- Clear understanding of data
- List of quality issues and how to handle them
- Insights about relationships and patterns
- Hypotheses to test
- Ideas for feature engineering
- Recommendations for next steps
## Communication Tips
### For Technical Audiences
- Include methodological details
- Show statistical test results
- Discuss assumptions and limitations
- Provide reproducible code
- Reference relevant literature
### For Non-Technical Audiences
- Focus on insights, not methods
- Use clear visualizations
- Avoid jargon
- Provide context and implications
- Make recommendations concrete
### Report Structure
1. **Executive Summary**: Key findings and recommendations
2. **Data Overview**: Source, structure, limitations
3. **Analysis**: Findings organized by theme
4. **Insights**: Patterns, anomalies, implications
5. **Recommendations**: Next steps and actions
6. **Appendix**: Technical details, full statistics
## Useful Checklists
### Before Starting
- [ ] Understand business/research context
- [ ] Define analysis objectives
- [ ] Identify stakeholders and audience
- [ ] Secure necessary permissions
- [ ] Set up reproducible environment
### During Analysis
- [ ] Load and inspect data structure
- [ ] Assess data quality
- [ ] Analyze univariate distributions
- [ ] Explore bivariate relationships
- [ ] Investigate multivariate patterns
- [ ] Generate and validate insights
- [ ] Document findings continuously
### Before Concluding
- [ ] Verify all findings
- [ ] Check for alternative explanations
- [ ] Document limitations
- [ ] Prepare clear visualizations
- [ ] Write actionable recommendations
- [ ] Review with domain experts
- [ ] Ensure reproducibility
## Tools and Libraries
### Python Ecosystem
- **pandas**: Data manipulation
- **numpy**: Numerical operations
- **matplotlib/seaborn**: Visualization
- **scipy**: Statistical tests
- **scikit-learn**: ML preprocessing
- **plotly**: Interactive visualizations
### Best Tool Practices
- Use appropriate tool for task
- Leverage vectorization
- Chain operations efficiently
- Handle missing data properly
- Validate results independently
- Document custom functions
## Further Resources
- **Books**:
- "Exploratory Data Analysis" by John Tukey
- "The Art of Statistics" by David Spiegelhalter
- **Guidelines**:
- ASA Statistical Significance Statement
- FAIR data principles
- **Communities**:
- Cross Validated (Stack Exchange)
- /r/datascience
- Local data science meetups

View File

@@ -0,0 +1,252 @@
# Statistical Tests Guide for EDA
This guide provides interpretation guidelines for statistical tests commonly used in exploratory data analysis.
## Normality Tests
### Shapiro-Wilk Test
**Purpose**: Test if a sample comes from a normally distributed population
**When to use**: Best for small to medium sample sizes (n < 5000)
**Interpretation**:
- **Null Hypothesis (H0)**: The data follows a normal distribution
- **Alternative Hypothesis (H1)**: The data does not follow a normal distribution
- **p-value > 0.05**: Fail to reject H0 → Data is likely normally distributed
- **p-value ≤ 0.05**: Reject H0 → Data is not normally distributed
**Notes**:
- Very sensitive to sample size
- Small deviations from normality may be detected as significant in large samples
- Consider practical significance alongside statistical significance
### Anderson-Darling Test
**Purpose**: Test if a sample comes from a specific distribution (typically normal)
**When to use**: More powerful than Shapiro-Wilk for detecting departures from normality
**Interpretation**:
- Compares the test statistic against critical values at several significance levels
- If the test statistic exceeds the critical value at a given significance level, reject normality
- Gives more weight to the tails of the distribution than other normality tests
### Kolmogorov-Smirnov Test
**Purpose**: Test if a sample comes from a reference distribution
**When to use**: When you have a large sample or want to test against distributions other than normal
**Interpretation**:
- **p-value > 0.05**: Sample distribution matches reference distribution
- **p-value ≤ 0.05**: Sample distribution differs from reference distribution
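A minimal sketch of running the three tests described above with scipy on an illustrative right-skewed sample:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)  # illustrative right-skewed sample

# Shapiro-Wilk (recommended for n < 5000)
sw_stat, sw_p = stats.shapiro(x)

# Anderson-Darling: compare the statistic against the tabulated critical values
ad = stats.anderson(x, dist="norm")

# Kolmogorov-Smirnov against a normal with the sample's mean and std
# (note: estimating the parameters from the same sample makes this p-value approximate)
ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"Shapiro-Wilk p={sw_p:.4f}, KS p={ks_p:.4f}")
print(f"Anderson-Darling statistic={ad.statistic:.3f}, 5% critical value={ad.critical_values[2]:.3f}")
```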
## Distribution Characteristics
### Skewness
**Purpose**: Measure asymmetry of the distribution
**Interpretation**:
- **Skewness ≈ 0**: Symmetric distribution
- **Skewness > 0**: Right-skewed (tail extends to right, most values on left)
- **Skewness < 0**: Left-skewed (tail extends to left, most values on right)
**Magnitude interpretation**:
- **|Skewness| < 0.5**: Approximately symmetric
- **0.5 ≤ |Skewness| < 1**: Moderately skewed
- **|Skewness| ≥ 1**: Highly skewed
**Implications**:
- Highly skewed data may require transformation (log, sqrt, Box-Cox)
- The mean is pulled toward the tail; the median is more robust for skewed data
- Many statistical tests assume symmetry/normality
### Kurtosis
**Purpose**: Measure tailedness and peak of distribution
**Interpretation** (Excess Kurtosis, where normal distribution = 0):
- **Kurtosis ≈ 0**: Normal tail behavior (mesokurtic)
- **Kurtosis > 0**: Heavy tails, sharp peak (leptokurtic)
- More outliers than normal distribution
- Higher probability of extreme values
- **Kurtosis < 0**: Light tails, flat peak (platykurtic)
- Fewer outliers than normal distribution
- More uniform distribution
**Magnitude interpretation**:
- **|Kurtosis| < 0.5**: Normal-like tails
- **0.5 ≤ |Kurtosis| < 1**: Moderately different tails
- **|Kurtosis| ≥ 1**: Very different tail behavior from normal
**Implications**:
- High kurtosis → Be cautious with outliers
- Low kurtosis → Distribution lacks distinct peak
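A minimal sketch computing both measures with pandas on an illustrative sample:
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=2.0, size=1_000))  # illustrative right-skewed sample

skew = s.skew()       # sample skewness
kurt = s.kurtosis()   # excess kurtosis (normal distribution = 0), as interpreted above

print(f"Skewness={skew:.2f} ({'highly skewed' if abs(skew) >= 1 else 'moderate or symmetric'})")
print(f"Excess kurtosis={kurt:.2f} ({'heavy tails' if kurt > 0.5 else 'normal-like or light tails'})")
```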
## Correlation Tests
### Pearson Correlation
**Purpose**: Measure linear relationship between two continuous variables
**Range**: -1 to +1
**Interpretation**:
- **r = +1**: Perfect positive linear relationship
- **r = 0**: No linear relationship
- **r = -1**: Perfect negative linear relationship
**Strength guidelines**:
- **|r| < 0.3**: Weak correlation
- **0.3 ≤ |r| < 0.5**: Moderate correlation
- **0.5 ≤ |r| < 0.7**: Strong correlation
- **|r| ≥ 0.7**: Very strong correlation
**Assumptions**:
- Linear relationship between variables
- Both variables continuous and normally distributed
- No significant outliers
- Homoscedasticity (constant variance)
**When to use**: When relationship is expected to be linear and data meets assumptions
### Spearman Correlation
**Purpose**: Measure monotonic relationship between two variables (rank-based)
**Range**: -1 to +1
**Interpretation**: Same as Pearson, but measures monotonic (not just linear) relationships
**Advantages over Pearson**:
- Robust to outliers (uses ranks)
- Doesn't assume linear relationship
- Works with ordinal data
- Doesn't require normality assumption
**When to use**:
- Data has outliers
- Relationship is monotonic but not linear
- Data is ordinal
- Distribution is non-normal
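A minimal sketch comparing the two coefficients with scipy on an illustrative monotonic but non-linear relationship:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=300)
y = x**3 + rng.normal(scale=20, size=300)  # monotonic but non-linear relationship

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

# Spearman should be closer to 1 here because the relationship is monotonic, not linear
print(f"Pearson r={pearson_r:.3f} (p={pearson_p:.2e})")
print(f"Spearman rho={spearman_rho:.3f} (p={spearman_p:.2e})")
```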
## Outlier Detection Methods
### IQR Method (Interquartile Range)
**Definition**:
- Lower bound: Q1 - 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
- Values outside these bounds are outliers
**Characteristics**:
- Simple and interpretable
- Robust to extreme values
- Works well for skewed distributions
- Conservative approach (Tukey's fences)
**Interpretation**:
- **< 5% outliers**: Typical for most datasets
- **5-10% outliers**: Moderate, investigate causes
- **> 10% outliers**: High rate, may indicate data quality issues or interesting phenomena
### Z-Score Method
**Definition**: Outliers are data points with |z-score| > 3
**Formula**: z = (x - μ) / σ
**Characteristics**:
- Assumes normal distribution
- Sensitive to extreme values
- Standard threshold is |z| > 3 (99.7% of data within ±3σ)
**When to use**:
- Data is approximately normally distributed
- Large sample sizes (n > 30)
**When NOT to use**:
- Small samples
- Heavily skewed data
- Data with many outliers (contaminates mean and SD)
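A minimal sketch applying both detection methods to a single numeric column (the input file and column name are hypothetical):
```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")   # hypothetical input file
x = df["amount"].dropna()      # hypothetical numeric column

# IQR method (Tukey's fences)
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Z-score method (assumes roughly normal data)
z = np.abs(stats.zscore(x))
z_outliers = x[z > 3]

print(f"IQR method: {len(iqr_outliers)} outliers ({len(iqr_outliers) / len(x):.1%})")
print(f"Z-score method: {len(z_outliers)} outliers ({len(z_outliers) / len(x):.1%})")
```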
## Hypothesis Testing Guidelines
### Significance Levels
- **α = 0.05**: Standard significance level (5% chance of Type I error)
- **α = 0.01**: More conservative (1% chance of Type I error)
- **α = 0.10**: More liberal (10% chance of Type I error)
### p-value Interpretation
- **p ≤ 0.001**: Very strong evidence against H0 (***)
- **0.001 < p ≤ 0.01**: Strong evidence against H0 (**)
- **0.01 < p ≤ 0.05**: Moderate evidence against H0 (*)
- **0.05 < p ≤ 0.10**: Weak evidence against H0
- **p > 0.10**: Little to no evidence against H0
### Important Considerations
1. **Statistical vs Practical Significance**: A small p-value doesn't always mean the effect is important
2. **Multiple Testing**: When performing many tests, use correction methods (Bonferroni, FDR)
3. **Sample Size**: Large samples can detect trivial effects as significant
4. **Effect Size**: Always report and interpret effect sizes alongside p-values
## Data Transformation Strategies
### When to Transform
- **Right-skewed data**: Log, square root, or Box-Cox transformation
- **Left-skewed data**: Square, cube, or exponential transformation
- **Heavy tails/outliers**: Robust scaling, winsorization, or log transformation
- **Non-constant variance**: Log or Box-Cox transformation
### Common Transformations
1. **Log transformation**: log(x) or log(x + 1)
- Best for: Positively skewed data, multiplicative relationships
- log(x) cannot be used with zero or negative values; use log(x + 1) when zeros are present
2. **Square root transformation**: √x
- Best for: Count data, moderate positive skew
- Less aggressive than log
3. **Box-Cox transformation**: (x^λ - 1) / λ
- Best for: Automatically finds optimal transformation
- Requires positive values
4. **Standardization**: (x - μ) / σ
- Best for: Scaling features to same range
- Centers data at 0 with unit variance
5. **Min-Max scaling**: (x - min) / (max - min)
- Best for: Scaling to [0, 1] range
- Preserves zero values
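A minimal sketch of these transformations (assumes a strictly positive, right-skewed column; the input file and column name are hypothetical):
```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")            # hypothetical input file
x = df["revenue"].dropna()              # hypothetical positive, right-skewed column

log_x = np.log1p(x)                     # log(x + 1): safe when zeros are present
sqrt_x = np.sqrt(x)                     # milder than log; suited to count data
boxcox_x, lam = stats.boxcox(x[x > 0])  # Box-Cox requires strictly positive values
z_x = (x - x.mean()) / x.std()          # standardization
minmax_x = (x - x.min()) / (x.max() - x.min())  # min-max scaling to [0, 1]

print(f"Original skew={x.skew():.2f}, log skew={log_x.skew():.2f}, Box-Cox lambda={lam:.2f}")
```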
## Practical Guidelines
### Sample Size Considerations
- **n < 30**: Use non-parametric tests, be cautious with assumptions
- **30 ≤ n < 100**: Moderate sample, parametric tests usually acceptable
- **n ≥ 100**: Large sample, parametric tests robust to violations
- **n ≥ 1000**: Very large sample, may detect trivial effects as significant
### Dealing with Missing Data
- **< 5% missing**: Usually not a problem, simple methods OK
- **5-10% missing**: Use appropriate imputation methods
- **> 10% missing**: Investigate patterns, consider advanced imputation or modeling missingness
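For example, a minimal sketch of simple imputation choices (the input file and column names are hypothetical; the thresholds mirror the guidance above):
```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).head())

# < 5% missing: simple imputation is usually acceptable
df["age"] = df["age"].fillna(df["age"].median())            # hypothetical numeric column
df["region"] = df["region"].fillna(df["region"].mode()[0])  # hypothetical categorical column

# > 10% missing: inspect the pattern before imputing; dropping rows may bias the data
high_missing = missing_pct[missing_pct > 10].index.tolist()
print("Columns needing closer investigation:", high_missing)
```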
### Reporting Results
Always include:
1. Test statistic value
2. p-value
3. Confidence interval (when applicable)
4. Effect size
5. Sample size
6. Assumptions checked and violations noted

View File

@@ -0,0 +1,458 @@
#!/usr/bin/env python3
"""
Exploratory Data Analysis Analyzer
Comprehensive data analysis tool that handles multiple file formats and generates
detailed statistical analysis, insights, and data quality reports.
"""
import os
import sys
import json
import argparse
from pathlib import Path
from typing import Dict, List, Any, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import normaltest, shapiro, kstest, anderson
class EDAAnalyzer:
"""Main EDA analysis engine"""
def __init__(self, file_path: str, output_dir: Optional[str] = None):
self.file_path = Path(file_path)
self.output_dir = Path(output_dir) if output_dir else self.file_path.parent
self.output_dir.mkdir(parents=True, exist_ok=True)
self.df = None
self.analysis_results = {}
def load_data(self) -> pd.DataFrame:
"""Auto-detect file type and load data"""
file_ext = self.file_path.suffix.lower()
try:
if file_ext == '.csv':
self.df = pd.read_csv(self.file_path)
elif file_ext in ['.xlsx', '.xls']:
self.df = pd.read_excel(self.file_path)
elif file_ext == '.json':
self.df = pd.read_json(self.file_path)
elif file_ext == '.parquet':
self.df = pd.read_parquet(self.file_path)
elif file_ext == '.tsv':
self.df = pd.read_csv(self.file_path, sep='\t')
elif file_ext == '.feather':
self.df = pd.read_feather(self.file_path)
elif file_ext == '.h5' or file_ext == '.hdf5':
self.df = pd.read_hdf(self.file_path)
elif file_ext == '.pkl' or file_ext == '.pickle':
self.df = pd.read_pickle(self.file_path)
else:
raise ValueError(f"Unsupported file format: {file_ext}")
print(f"✅ Successfully loaded {file_ext} file with shape {self.df.shape}")
return self.df
except Exception as e:
print(f"❌ Error loading file: {str(e)}")
sys.exit(1)
def basic_info(self) -> Dict[str, Any]:
"""Generate basic dataset information"""
info = {
'rows': len(self.df),
'columns': len(self.df.columns),
'column_names': list(self.df.columns),
'dtypes': self.df.dtypes.astype(str).to_dict(),
'memory_usage_mb': self.df.memory_usage(deep=True).sum() / 1024**2
}
# Categorize columns by type
info['numeric_columns'] = list(self.df.select_dtypes(include=[np.number]).columns)
info['categorical_columns'] = list(self.df.select_dtypes(include=['object', 'category']).columns)
info['datetime_columns'] = list(self.df.select_dtypes(include=['datetime64']).columns)
info['boolean_columns'] = list(self.df.select_dtypes(include=['bool']).columns)
self.analysis_results['basic_info'] = info
return info
def missing_data_analysis(self) -> Dict[str, Any]:
"""Analyze missing data patterns"""
missing_counts = self.df.isnull().sum()
missing_pct = (missing_counts / len(self.df) * 100).round(2)
missing_info = {
'total_missing_cells': int(self.df.isnull().sum().sum()),
'missing_percentage': round(self.df.isnull().sum().sum() / (self.df.shape[0] * self.df.shape[1]) * 100, 2),
'columns_with_missing': {}
}
for col in self.df.columns:
if missing_counts[col] > 0:
missing_info['columns_with_missing'][col] = {
'count': int(missing_counts[col]),
'percentage': float(missing_pct[col])
}
self.analysis_results['missing_data'] = missing_info
return missing_info
def summary_statistics(self) -> Dict[str, Any]:
"""Generate comprehensive summary statistics"""
stats_dict = {}
# Numeric columns
if len(self.df.select_dtypes(include=[np.number]).columns) > 0:
numeric_stats = self.df.describe().to_dict()
stats_dict['numeric'] = numeric_stats
# Additional statistics
for col in self.df.select_dtypes(include=[np.number]).columns:
if col not in stats_dict:
stats_dict[col] = {}
data = self.df[col].dropna()
if len(data) > 0:
stats_dict[col].update({
'skewness': float(data.skew()),
'kurtosis': float(data.kurtosis()),
'variance': float(data.var()),
'range': float(data.max() - data.min()),
'iqr': float(data.quantile(0.75) - data.quantile(0.25)),
'cv': float(data.std() / data.mean()) if data.mean() != 0 else np.nan
})
# Categorical columns
categorical_stats = {}
for col in self.df.select_dtypes(include=['object', 'category']).columns:
categorical_stats[col] = {
'unique_values': int(self.df[col].nunique()),
'most_common': self.df[col].mode().iloc[0] if len(self.df[col].mode()) > 0 else None,
'most_common_freq': int(self.df[col].value_counts().iloc[0]) if len(self.df[col].value_counts()) > 0 else 0,
'most_common_pct': float(self.df[col].value_counts(normalize=True).iloc[0] * 100) if len(self.df[col].value_counts()) > 0 else 0
}
if categorical_stats:
stats_dict['categorical'] = categorical_stats
self.analysis_results['summary_statistics'] = stats_dict
return stats_dict
def outlier_detection(self) -> Dict[str, Any]:
"""Detect outliers using multiple methods"""
outliers = {}
for col in self.df.select_dtypes(include=[np.number]).columns:
data = self.df[col].dropna()
if len(data) == 0:
continue
outliers[col] = {}
# IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
iqr_outliers = data[(data < lower_bound) | (data > upper_bound)]
outliers[col]['iqr_method'] = {
'count': len(iqr_outliers),
'percentage': round(len(iqr_outliers) / len(data) * 100, 2),
'lower_bound': float(lower_bound),
'upper_bound': float(upper_bound)
}
# Z-score method (|z| > 3)
if len(data) > 2:
z_scores = np.abs(stats.zscore(data))
z_outliers = data[z_scores > 3]
outliers[col]['zscore_method'] = {
'count': len(z_outliers),
'percentage': round(len(z_outliers) / len(data) * 100, 2)
}
self.analysis_results['outliers'] = outliers
return outliers
def distribution_analysis(self) -> Dict[str, Any]:
"""Analyze distributions and test for normality"""
distributions = {}
for col in self.df.select_dtypes(include=[np.number]).columns:
data = self.df[col].dropna()
if len(data) < 8: # Need at least 8 samples for tests
continue
distributions[col] = {}
# Shapiro-Wilk test (best for n < 5000)
if len(data) < 5000:
try:
stat, p_value = shapiro(data)
distributions[col]['shapiro_wilk'] = {
'statistic': float(stat),
'p_value': float(p_value),
'is_normal': p_value > 0.05
}
except:
pass
# Anderson-Darling test
try:
result = anderson(data)
distributions[col]['anderson_darling'] = {
'statistic': float(result.statistic),
'critical_values': result.critical_values.tolist(),
'significance_levels': result.significance_level.tolist()
}
except:
pass
# Distribution characteristics
distributions[col]['characteristics'] = {
'skewness': float(data.skew()),
'skewness_interpretation': self._interpret_skewness(data.skew()),
'kurtosis': float(data.kurtosis()),
'kurtosis_interpretation': self._interpret_kurtosis(data.kurtosis())
}
self.analysis_results['distributions'] = distributions
return distributions
def correlation_analysis(self) -> Dict[str, Any]:
"""Analyze correlations between numeric variables"""
numeric_df = self.df.select_dtypes(include=[np.number])
if len(numeric_df.columns) < 2:
return {}
correlations = {}
# Pearson correlation
pearson_corr = numeric_df.corr(method='pearson')
correlations['pearson'] = pearson_corr.to_dict()
# Spearman correlation (rank-based, robust to outliers)
spearman_corr = numeric_df.corr(method='spearman')
correlations['spearman'] = spearman_corr.to_dict()
# Find strong correlations (|r| > 0.7)
strong_correlations = []
for i in range(len(pearson_corr.columns)):
for j in range(i + 1, len(pearson_corr.columns)):
col1 = pearson_corr.columns[i]
col2 = pearson_corr.columns[j]
corr_value = pearson_corr.iloc[i, j]
if abs(corr_value) > 0.7:
strong_correlations.append({
'variable1': col1,
'variable2': col2,
'correlation': float(corr_value),
'strength': self._interpret_correlation(corr_value)
})
correlations['strong_correlations'] = strong_correlations
self.analysis_results['correlations'] = correlations
return correlations
def data_quality_assessment(self) -> Dict[str, Any]:
"""Assess overall data quality"""
quality = {
'completeness': {
'score': round((1 - self.df.isnull().sum().sum() / (self.df.shape[0] * self.df.shape[1])) * 100, 2),
'interpretation': ''
},
'duplicates': {
'count': int(self.df.duplicated().sum()),
'percentage': round(self.df.duplicated().sum() / len(self.df) * 100, 2)
},
'issues': []
}
# Completeness interpretation
if quality['completeness']['score'] > 95:
quality['completeness']['interpretation'] = 'Excellent'
elif quality['completeness']['score'] > 90:
quality['completeness']['interpretation'] = 'Good'
elif quality['completeness']['score'] > 80:
quality['completeness']['interpretation'] = 'Fair'
else:
quality['completeness']['interpretation'] = 'Poor'
# Identify potential issues
if quality['duplicates']['count'] > 0:
quality['issues'].append(f"Found {quality['duplicates']['count']} duplicate rows")
if quality['completeness']['score'] < 90:
quality['issues'].append("Missing data exceeds 10% threshold")
# Check for constant columns
constant_cols = [col for col in self.df.columns if self.df[col].nunique() == 1]
if constant_cols:
quality['issues'].append(f"Constant columns detected: {', '.join(constant_cols)}")
quality['constant_columns'] = constant_cols
# Check for high cardinality
high_cardinality_cols = []
for col in self.df.select_dtypes(include=['object']).columns:
if self.df[col].nunique() > len(self.df) * 0.9:
high_cardinality_cols.append(col)
if high_cardinality_cols:
quality['issues'].append(f"High cardinality columns (>90% unique): {', '.join(high_cardinality_cols)}")
quality['high_cardinality_columns'] = high_cardinality_cols
self.analysis_results['data_quality'] = quality
return quality
def generate_insights(self) -> List[str]:
"""Generate automated insights from the analysis"""
insights = []
# Dataset size insights
info = self.analysis_results.get('basic_info', {})
if info.get('rows', 0) > 1000000:
insights.append(f"📊 Large dataset with {info['rows']:,} rows - consider sampling for faster iteration")
# Missing data insights
missing = self.analysis_results.get('missing_data', {})
if missing.get('missing_percentage', 0) > 20:
insights.append(f"⚠️ Significant missing data ({missing['missing_percentage']}%) - imputation or removal may be needed")
# Correlation insights
correlations = self.analysis_results.get('correlations', {})
strong_corrs = correlations.get('strong_correlations', [])
if len(strong_corrs) > 0:
insights.append(f"🔗 Found {len(strong_corrs)} strong correlations - potential for feature engineering or multicollinearity")
# Outlier insights
outliers = self.analysis_results.get('outliers', {})
high_outlier_cols = [col for col, data in outliers.items()
if data.get('iqr_method', {}).get('percentage', 0) > 5]
if high_outlier_cols:
insights.append(f"🎯 Columns with high outlier rates (>5%): {', '.join(high_outlier_cols)}")
# Distribution insights
distributions = self.analysis_results.get('distributions', {})
skewed_cols = [col for col, data in distributions.items()
if abs(data.get('characteristics', {}).get('skewness', 0)) > 1]
if skewed_cols:
insights.append(f"📈 Highly skewed distributions detected in: {', '.join(skewed_cols)} - consider transformations")
# Data quality insights
quality = self.analysis_results.get('data_quality', {})
if quality.get('duplicates', {}).get('count', 0) > 0:
insights.append(f"🔄 {quality['duplicates']['count']} duplicate rows found - consider deduplication")
# Categorical insights
stats = self.analysis_results.get('summary_statistics', {})
categorical = stats.get('categorical', {})
imbalanced_cols = [col for col, data in categorical.items()
if data.get('most_common_pct', 0) > 90]
if imbalanced_cols:
insights.append(f"⚖️ Highly imbalanced categorical variables: {', '.join(imbalanced_cols)}")
self.analysis_results['insights'] = insights
return insights
def _interpret_skewness(self, skew: float) -> str:
"""Interpret skewness value"""
if abs(skew) < 0.5:
return "Approximately symmetric"
elif skew > 0.5:
return "Right-skewed (positive skew)"
else:
return "Left-skewed (negative skew)"
def _interpret_kurtosis(self, kurt: float) -> str:
"""Interpret kurtosis value"""
if abs(kurt) < 0.5:
return "Mesokurtic (normal-like tails)"
elif kurt > 0.5:
return "Leptokurtic (heavy tails)"
else:
return "Platykurtic (light tails)"
def _interpret_correlation(self, corr: float) -> str:
"""Interpret correlation strength"""
abs_corr = abs(corr)
if abs_corr > 0.9:
return "Very strong"
elif abs_corr > 0.7:
return "Strong"
elif abs_corr > 0.5:
return "Moderate"
elif abs_corr > 0.3:
return "Weak"
else:
return "Very weak"
def run_full_analysis(self) -> Dict[str, Any]:
"""Run complete EDA analysis"""
print("🔍 Starting comprehensive EDA analysis...")
self.load_data()
print("📊 Analyzing basic information...")
self.basic_info()
print("🔎 Analyzing missing data...")
self.missing_data_analysis()
print("📈 Computing summary statistics...")
self.summary_statistics()
print("🎯 Detecting outliers...")
self.outlier_detection()
print("📉 Analyzing distributions...")
self.distribution_analysis()
print("🔗 Computing correlations...")
self.correlation_analysis()
print("✅ Assessing data quality...")
self.data_quality_assessment()
print("💡 Generating insights...")
self.generate_insights()
print("✨ Analysis complete!")
return self.analysis_results
def save_results(self, format='json') -> str:
"""Save analysis results to file"""
output_file = self.output_dir / f"eda_analysis.{format}"
if format == 'json':
with open(output_file, 'w') as f:
json.dump(self.analysis_results, f, indent=2, default=str)
print(f"💾 Results saved to: {output_file}")
return str(output_file)
def main():
parser = argparse.ArgumentParser(description='Perform comprehensive exploratory data analysis')
parser.add_argument('file_path', help='Path to data file')
parser.add_argument('-o', '--output', help='Output directory for results', default=None)
parser.add_argument('-f', '--format', choices=['json'], default='json', help='Output format')
args = parser.parse_args()
analyzer = EDAAnalyzer(args.file_path, args.output)
analyzer.run_full_analysis()
analyzer.save_results(format=args.format)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,481 @@
#!/usr/bin/env python3
"""
EDA Visualizer
Generate comprehensive visualizations for exploratory data analysis including
distribution plots, correlation heatmaps, time series, and categorical analyses.
"""
import os
import sys
import argparse
from pathlib import Path
from typing import List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.figure import Figure
from matplotlib.gridspec import GridSpec
class EDAVisualizer:
"""Generate comprehensive EDA visualizations"""
def __init__(self, file_path: str, output_dir: Optional[str] = None):
self.file_path = Path(file_path)
self.output_dir = Path(output_dir) if output_dir else self.file_path.parent / "eda_visualizations"
self.output_dir.mkdir(parents=True, exist_ok=True)
self.df = None
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.bbox'] = 'tight'
def load_data(self) -> pd.DataFrame:
"""Auto-detect file type and load data"""
file_ext = self.file_path.suffix.lower()
try:
if file_ext == '.csv':
self.df = pd.read_csv(self.file_path)
elif file_ext in ['.xlsx', '.xls']:
self.df = pd.read_excel(self.file_path)
elif file_ext == '.json':
self.df = pd.read_json(self.file_path)
elif file_ext == '.parquet':
self.df = pd.read_parquet(self.file_path)
elif file_ext == '.tsv':
self.df = pd.read_csv(self.file_path, sep='\t')
elif file_ext == '.feather':
self.df = pd.read_feather(self.file_path)
elif file_ext == '.h5' or file_ext == '.hdf5':
self.df = pd.read_hdf(self.file_path)
elif file_ext == '.pkl' or file_ext == '.pickle':
self.df = pd.read_pickle(self.file_path)
else:
raise ValueError(f"Unsupported file format: {file_ext}")
print(f"✅ Successfully loaded {file_ext} file with shape {self.df.shape}")
return self.df
except Exception as e:
print(f"❌ Error loading file: {str(e)}")
sys.exit(1)
def plot_missing_data(self) -> str:
"""Visualize missing data patterns"""
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# Missing data heatmap
if self.df.isnull().sum().sum() > 0:
# Only plot columns with missing data
missing_cols = self.df.columns[self.df.isnull().any()].tolist()
if missing_cols:
sns.heatmap(self.df[missing_cols].isnull(), cbar=True, yticklabels=False,
cmap='viridis', ax=axes[0])
axes[0].set_title('Missing Data Pattern', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Columns')
# Missing data bar chart
missing_pct = (self.df[missing_cols].isnull().sum() / len(self.df) * 100).sort_values(ascending=True)
missing_pct.plot(kind='barh', ax=axes[1], color='coral')
axes[1].set_title('Missing Data Percentage by Column', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Missing %')
axes[1].set_ylabel('Columns')
for i, v in enumerate(missing_pct):
axes[1].text(v + 0.5, i, f'{v:.1f}%', va='center')
else:
axes[0].text(0.5, 0.5, 'No missing data detected', ha='center', va='center',
transform=axes[0].transAxes, fontsize=14)
axes[0].axis('off')
axes[1].axis('off')
else:
axes[0].text(0.5, 0.5, 'No missing data detected', ha='center', va='center',
transform=axes[0].transAxes, fontsize=14)
axes[0].axis('off')
axes[1].axis('off')
plt.tight_layout()
output_path = self.output_dir / "missing_data.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Missing data visualization saved: {output_path}")
return str(output_path)
def plot_distributions(self) -> str:
"""Plot distributions for all numeric columns"""
numeric_cols = self.df.select_dtypes(include=[np.number]).columns.tolist()
if not numeric_cols:
print("⚠️ No numeric columns found for distribution plots")
return ""
# Limit to first 20 columns if too many
if len(numeric_cols) > 20:
print(f"⚠️ Too many numeric columns ({len(numeric_cols)}), plotting first 20")
numeric_cols = numeric_cols[:20]
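        # Lay the histograms out in a grid of at most 3 columns; ceiling division gives the row count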
n_cols = min(3, len(numeric_cols))
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows))
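        # plt.subplots returns a scalar, 1-D, or 2-D Axes container depending on the grid; normalize to 2-D for uniform indexing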
if n_rows == 1 and n_cols == 1:
axes = np.array([[axes]])
elif n_rows == 1 or n_cols == 1:
axes = axes.reshape(n_rows, n_cols)
for idx, col in enumerate(numeric_cols):
row = idx // n_cols
col_idx = idx % n_cols
ax = axes[row, col_idx]
data = self.df[col].dropna()
# Create histogram with KDE
ax.hist(data, bins=30, alpha=0.6, color='skyblue', edgecolor='black', density=True)
# Add KDE line
try:
data.plot(kind='kde', ax=ax, color='red', linewidth=2)
            except Exception:
                # KDE estimation can fail on constant or empty columns; keep just the histogram
                pass
ax.set_title(f'{col}', fontsize=10, fontweight='bold')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
# Add statistics box
stats_text = f'Mean: {data.mean():.2f}\nMedian: {data.median():.2f}\nStd: {data.std():.2f}'
ax.text(0.98, 0.98, stats_text, transform=ax.transAxes,
verticalalignment='top', horizontalalignment='right',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
fontsize=8)
# Hide empty subplots
for idx in range(len(numeric_cols), n_rows * n_cols):
row = idx // n_cols
col_idx = idx % n_cols
axes[row, col_idx].axis('off')
plt.suptitle('Distribution Analysis', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
output_path = self.output_dir / "distributions.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Distribution plots saved: {output_path}")
return str(output_path)
def plot_boxplots(self) -> str:
"""Create box plots for numeric columns to show outliers"""
numeric_cols = self.df.select_dtypes(include=[np.number]).columns.tolist()
if not numeric_cols:
print("⚠️ No numeric columns found for box plots")
return ""
# Limit to first 20 columns if too many
if len(numeric_cols) > 20:
numeric_cols = numeric_cols[:20]
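        # Same 3-column grid layout and axes normalization as plot_distributions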
n_cols = min(3, len(numeric_cols))
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows))
if n_rows == 1 and n_cols == 1:
axes = np.array([[axes]])
elif n_rows == 1 or n_cols == 1:
axes = axes.reshape(n_rows, n_cols)
for idx, col in enumerate(numeric_cols):
row = idx // n_cols
col_idx = idx % n_cols
ax = axes[row, col_idx]
data = self.df[col].dropna()
            # Violin plot shows the distribution shape; overlay a narrow box plot for quartiles and outliers
            ax.violinplot([data], positions=[0], widths=0.7, showmeans=True, showextrema=True)
ax.boxplot([data], positions=[0], widths=0.3, patch_artist=True,
boxprops=dict(facecolor='lightblue', alpha=0.7))
ax.set_title(f'{col}', fontsize=10, fontweight='bold')
ax.set_ylabel('Value')
ax.set_xticks([])
# Hide empty subplots
for idx in range(len(numeric_cols), n_rows * n_cols):
row = idx // n_cols
col_idx = idx % n_cols
axes[row, col_idx].axis('off')
plt.suptitle('Box Plots with Violin Plots (Outlier Detection)', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
output_path = self.output_dir / "boxplots.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Box plots saved: {output_path}")
return str(output_path)
def plot_correlation_heatmap(self) -> str:
"""Create correlation heatmap for numeric variables"""
numeric_df = self.df.select_dtypes(include=[np.number])
if len(numeric_df.columns) < 2:
print("⚠️ Need at least 2 numeric columns for correlation heatmap")
return ""
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
# Pearson correlation
corr_pearson = numeric_df.corr(method='pearson')
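        # Mask the upper triangle so each variable pair appears only once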
mask = np.triu(np.ones_like(corr_pearson, dtype=bool))
sns.heatmap(corr_pearson, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
ax=axes[0])
axes[0].set_title('Pearson Correlation Matrix', fontsize=14, fontweight='bold')
# Spearman correlation
corr_spearman = numeric_df.corr(method='spearman')
mask = np.triu(np.ones_like(corr_spearman, dtype=bool))
sns.heatmap(corr_spearman, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
ax=axes[1])
axes[1].set_title('Spearman Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
output_path = self.output_dir / "correlation_heatmap.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Correlation heatmap saved: {output_path}")
return str(output_path)
def plot_scatter_matrix(self) -> str:
"""Create scatter plot matrix for numeric variables"""
numeric_df = self.df.select_dtypes(include=[np.number])
if len(numeric_df.columns) < 2:
print("⚠️ Need at least 2 numeric columns for scatter matrix")
return ""
# Limit to first 6 columns if too many (scatter matrix gets too large)
if len(numeric_df.columns) > 6:
print(f"⚠️ Too many columns for scatter matrix, using first 6")
numeric_df = numeric_df.iloc[:, :6]
        # pd.plotting.scatter_matrix creates its own figure, so no extra plt.figure() call is needed
        # (hist_kwds is only used with diagonal='hist', so it is omitted here)
        pd.plotting.scatter_matrix(numeric_df, alpha=0.6, figsize=(15, 15), diagonal='kde')
plt.suptitle('Scatter Plot Matrix', fontsize=16, fontweight='bold', y=1.00)
output_path = self.output_dir / "scatter_matrix.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Scatter matrix saved: {output_path}")
return str(output_path)
def plot_categorical_analysis(self) -> str:
"""Analyze and visualize categorical variables"""
categorical_cols = self.df.select_dtypes(include=['object', 'category']).columns.tolist()
if not categorical_cols:
print("⚠️ No categorical columns found")
return ""
# Limit to first 12 columns if too many
if len(categorical_cols) > 12:
print(f"⚠️ Too many categorical columns ({len(categorical_cols)}), plotting first 12")
categorical_cols = categorical_cols[:12]
n_cols = min(3, len(categorical_cols))
n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 5 * n_rows))
if n_rows == 1 and n_cols == 1:
axes = np.array([[axes]])
elif n_rows == 1 or n_cols == 1:
axes = axes.reshape(n_rows, n_cols)
for idx, col in enumerate(categorical_cols):
row = idx // n_cols
col_idx = idx % n_cols
ax = axes[row, col_idx]
# Get top 10 categories
value_counts = self.df[col].value_counts().head(10)
# Create bar chart
value_counts.plot(kind='barh', ax=ax, color='steelblue')
ax.set_title(f'{col} (Top 10)', fontsize=11, fontweight='bold')
ax.set_xlabel('Count')
ax.set_ylabel('')
# Add value labels
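            # Offset labels by 1% of the largest count so they sit just past the bar ends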
for i, v in enumerate(value_counts):
ax.text(v + max(value_counts) * 0.01, i, str(v), va='center')
# Hide empty subplots
for idx in range(len(categorical_cols), n_rows * n_cols):
row = idx // n_cols
col_idx = idx % n_cols
axes[row, col_idx].axis('off')
plt.suptitle('Categorical Variable Analysis', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
output_path = self.output_dir / "categorical_analysis.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Categorical analysis saved: {output_path}")
return str(output_path)
def plot_time_series(self) -> str:
"""Create time series visualizations if datetime columns exist"""
datetime_cols = self.df.select_dtypes(include=['datetime64']).columns.tolist()
# Also check for columns that might be dates but stored as strings
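        # Note: parsing only the first 100 values keeps this check fast but may occasionally misclassify string columns as dates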
for col in self.df.columns:
if self.df[col].dtype == 'object':
try:
pd.to_datetime(self.df[col].head(100))
datetime_cols.append(col)
                except Exception:
                    # Column does not parse as dates; skip it
                    pass
if not datetime_cols:
print("⚠️ No datetime columns found for time series analysis")
return ""
        # Use the first candidate datetime column as the time axis
date_col = datetime_cols[0]
df_temp = self.df.copy()
        if df_temp[date_col].dtype == 'object':
            # Coerce unparseable entries to NaT rather than failing on a partially parseable column
            df_temp[date_col] = pd.to_datetime(df_temp[date_col], errors='coerce')
df_temp = df_temp.sort_values(date_col)
# Get numeric columns
numeric_cols = df_temp.select_dtypes(include=[np.number]).columns.tolist()
if not numeric_cols:
print("⚠️ No numeric columns found for time series plots")
return ""
# Limit to first 6 numeric columns
if len(numeric_cols) > 6:
numeric_cols = numeric_cols[:6]
n_rows = len(numeric_cols)
fig, axes = plt.subplots(n_rows, 1, figsize=(14, 4 * n_rows))
if n_rows == 1:
axes = [axes]
for idx, col in enumerate(numeric_cols):
ax = axes[idx]
# Plot time series
ax.plot(df_temp[date_col], df_temp[col], linewidth=1, alpha=0.8)
ax.set_title(f'{col} over Time', fontsize=12, fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel(col)
ax.grid(True, alpha=0.3)
# Add trend line
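            # Fit a first-degree polynomial against the row index as a simple linear trend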
try:
z = np.polyfit(range(len(df_temp)), df_temp[col].fillna(df_temp[col].mean()), 1)
p = np.poly1d(z)
ax.plot(df_temp[date_col], p(range(len(df_temp))), "r--", linewidth=2, alpha=0.8, label='Trend')
ax.legend()
            except Exception:
                # Trend fit can fail (e.g. all-NaN column); show the raw series without it
                pass
plt.suptitle('Time Series Analysis', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
output_path = self.output_dir / "time_series.png"
plt.savefig(output_path)
plt.close()
print(f"✅ Time series plots saved: {output_path}")
return str(output_path)
def generate_all_visualizations(self) -> List[str]:
"""Generate all visualizations"""
print("🎨 Starting visualization generation...")
self.load_data()
generated_files = []
print("📊 Creating missing data visualization...")
missing_plot = self.plot_missing_data()
if missing_plot:
generated_files.append(missing_plot)
print("📈 Creating distribution plots...")
dist_plot = self.plot_distributions()
if dist_plot:
generated_files.append(dist_plot)
print("📦 Creating box plots...")
box_plot = self.plot_boxplots()
if box_plot:
generated_files.append(box_plot)
print("🔥 Creating correlation heatmap...")
corr_plot = self.plot_correlation_heatmap()
if corr_plot:
generated_files.append(corr_plot)
print("🔢 Creating scatter matrix...")
scatter_plot = self.plot_scatter_matrix()
if scatter_plot:
generated_files.append(scatter_plot)
print("📊 Creating categorical analysis...")
cat_plot = self.plot_categorical_analysis()
if cat_plot:
generated_files.append(cat_plot)
print("⏱️ Creating time series plots...")
ts_plot = self.plot_time_series()
if ts_plot:
generated_files.append(ts_plot)
print(f"✨ Generated {len(generated_files)} visualizations!")
print(f"📁 Saved to: {self.output_dir}")
return generated_files
def main():
parser = argparse.ArgumentParser(description='Generate comprehensive EDA visualizations')
parser.add_argument('file_path', help='Path to data file')
parser.add_argument('-o', '--output', help='Output directory for visualizations', default=None)
args = parser.parse_args()
visualizer = EDAVisualizer(args.file_path, args.output)
visualizer.generate_all_visualizations()
if __name__ == '__main__':
main()
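# Example invocation (assuming this script is saved as visualizer.py; file and output paths are illustrative):
#   python visualizer.py my_data.csv -o ./eda_visualizations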