10 KiB
name, description
| name | description |
|---|---|
| exploratory-data-analysis | Comprehensive exploratory data analysis toolkit for data scientists. Use when users request data exploration, analysis of datasets, statistical summaries, data visualizations, or insights from data files. Handles multiple file formats (CSV, Excel, JSON, Parquet, etc.) and generates detailed markdown reports with statistics, visualizations, and automated insights. This skill should be used when analyzing any tabular data to understand patterns, distributions, correlations, outliers, and data quality issues. |
Exploratory Data Analysis
Overview
Perform comprehensive exploratory data analysis on datasets of any format. This skill acts as a proficient data scientist, automatically analyzing data to generate meaningful summaries, advanced statistics, visualizations, and actionable insights. All textual outputs are generated as markdown for seamless integration into workflows.
When to Use This Skill
Invoke this skill when:
- User provides a data file and requests analysis or exploration
- User asks to "explore this dataset", "analyze this data", or "what's in this file?"
- User needs statistical summaries, distributions, or correlations
- User requests data visualizations or insights
- User wants to understand data quality issues or patterns
- User mentions EDA, exploratory analysis, or data profiling
Supported file formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
Quick Start Workflow
- Receive data file from user
- Run comprehensive analysis using
scripts/eda_analyzer.py - Generate visualizations using
scripts/visualizer.py - Create markdown report using insights and the
assets/report_template.mdtemplate - Present findings to user with key insights highlighted
Core Capabilities
1. Comprehensive Data Analysis
Execute full statistical analysis using the eda_analyzer.py script:
python scripts/eda_analyzer.py <data_file_path> -o <output_directory>
What it provides:
- Auto-detection and loading of file formats
- Basic dataset information (shape, types, memory usage)
- Missing data analysis (patterns, percentages)
- Summary statistics for numeric and categorical variables
- Outlier detection using IQR and Z-score methods
- Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
- Correlation analysis (Pearson and Spearman)
- Data quality assessment (completeness, duplicates, issues)
- Automated insight generation
Output: JSON file containing all analysis results at <output_directory>/eda_analysis.json
2. Comprehensive Visualizations
Generate complete visualization suite using the visualizer.py script:
python scripts/visualizer.py <data_file_path> -o <output_directory>
Generated visualizations:
- Missing data patterns: Heatmap and bar chart showing missing data
- Distribution plots: Histograms with KDE overlays for all numeric variables
- Box plots with violin plots: Outlier detection visualizations
- Correlation heatmap: Both Pearson and Spearman correlation matrices
- Scatter matrix: Pairwise relationships between numeric variables
- Categorical analysis: Bar charts for top categories
- Time series plots: Temporal trends with trend lines (if datetime columns exist)
Output: High-quality PNG files saved to <output_directory>/eda_visualizations/
All visualizations are production-ready with:
- 300 DPI resolution
- Clear titles and labels
- Statistical annotations
- Professional styling using seaborn
3. Automated Insight Generation
The analyzer automatically generates actionable insights including:
- Data scale insights: Dataset size considerations for processing
- Missing data alerts: Warnings when missing data exceeds thresholds
- Correlation discoveries: Strong relationships identified for feature engineering
- Outlier warnings: Variables with high outlier rates flagged
- Distribution assessments: Skewness issues requiring transformations
- Duplicate alerts: Duplicate row detection
- Imbalance warnings: Categorical variable imbalance detection
Access insights from the analysis results JSON under the "insights" key.
4. Statistical Interpretation
For detailed interpretation of statistical tests and measures, reference:
references/statistical_tests_guide.md - Comprehensive guide covering:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation tests (Pearson, Spearman)
- Outlier detection methods (IQR, Z-score)
- Hypothesis testing guidelines
- Data transformation strategies
Load this reference when needing to interpret specific statistical tests or explain results to users.
5. Best Practices Guidance
For methodological guidance, reference:
references/eda_best_practices.md - Detailed best practices including:
- EDA process framework (6-step methodology)
- Univariate, bivariate, and multivariate analysis approaches
- Visualization guidelines
- Statistical analysis guidelines
- Common pitfalls to avoid
- Domain-specific considerations
- Communication tips for technical and non-technical audiences
Load this reference when planning analysis approach or needing guidance on specific EDA scenarios.
Creating Analysis Reports
Use the provided template to structure comprehensive EDA reports:
assets/report_template.md - Professional report template with sections for:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and findings
- Recommendations
- Limitations and appendices
To use the template:
- Copy the template content
- Fill in sections with analysis results from JSON output
- Embed visualization images using markdown syntax
- Populate insights and recommendations
- Save as markdown for user consumption
Typical Workflow Example
When user provides a data file:
User: "Can you explore this sales_data.csv file and tell me what you find?"
1. Run analysis:
python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output
2. Generate visualizations:
python scripts/visualizer.py sales_data.csv -o ./analysis_output
3. Read analysis results:
Read ./analysis_output/eda_analysis.json
4. Create markdown report using template:
- Copy assets/report_template.md structure
- Fill in sections with analysis results
- Reference visualizations from ./analysis_output/eda_visualizations/
- Include automated insights from JSON
5. Present to user:
- Show key insights prominently
- Highlight data quality issues
- Provide visualizations inline
- Make actionable recommendations
- Save complete report as .md file
Advanced Analysis Scenarios
Large Datasets (>1M rows)
- Run analysis on sampled data first for quick exploration
- Note sample size in report
- Recommend distributed computing for full analysis
High-Dimensional Data (>50 columns)
- Focus on most important variables first
- Consider PCA or feature selection
- Generate correlation analysis to identify variable groups
- Reference
eda_best_practices.mdsection on high-dimensional data
Time Series Data
- Ensure datetime columns are properly detected
- Time series visualizations will be automatically generated
- Consider temporal patterns, trends, and seasonality
- Reference
eda_best_practices.mdsection on time series
Imbalanced Data
- Categorical analysis will flag imbalances
- Report class distributions prominently
- Recommend stratified sampling if needed
Small Sample Sizes (<100 rows)
- Non-parametric methods automatically used where appropriate
- Be conservative in statistical conclusions
- Note sample size limitations in report
Output Best Practices
Always output as markdown:
- Structure findings using markdown headers, tables, and lists
- Embed visualizations using
syntax - Use tables for statistical summaries
- Include code blocks for any suggested transformations
- Highlight key insights with bold or bullet points
Ensure reports are actionable:
- Provide clear recommendations based on findings
- Flag data quality issues that need attention
- Suggest next steps for modeling or further analysis
- Identify feature engineering opportunities
Make insights accessible:
- Explain statistical concepts in plain language
- Use reference guides to provide detailed interpretations
- Include both technical details and executive summary
- Tailor communication to user's technical level
Handling Edge Cases
Unsupported file formats:
- Request user to convert to supported format
- Suggest using pandas-compatible formats
Files too large to load:
- Recommend sampling approach
- Suggest chunked processing
- Consider alternative tools for big data
Corrupted or malformed data:
- Report specific errors encountered
- Suggest data cleaning steps
- Try to salvage partial analysis if possible
All missing data in columns:
- Flag completely empty columns
- Recommend removal or investigation
- Document in data quality section
Resources Summary
scripts/
eda_analyzer.py: Main analysis engine - comprehensive statistical analysisvisualizer.py: Visualization generator - creates all chart types
Both scripts are fully executable and handle multiple file formats automatically.
references/
statistical_tests_guide.md: Statistical test interpretation and methodologyeda_best_practices.md: Comprehensive EDA methodology and best practices
Load these references as needed to inform analysis approach and interpretation.
assets/
report_template.md: Professional markdown report template
Use this template structure for creating consistent, comprehensive EDA reports.
Key Reminders
- Always generate markdown output for textual results
- Run both scripts (analyzer and visualizer) for complete analysis
- Use the template to structure comprehensive reports
- Include visualizations by referencing generated PNG files
- Provide actionable insights - don't just present statistics
- Interpret findings using reference guides
- Document limitations and data quality issues
- Make recommendations for next steps
This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.