mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-01-26 16:58:56 +08:00

Timothy Kassis c627714209 Add EDA

2025-10-19 16:53:31 -07:00

name	description
exploratory-data-analysis	Comprehensive exploratory data analysis toolkit for data scientists. Use when users request data exploration, analysis of datasets, statistical summaries, data visualizations, or insights from data files. Handles multiple file formats (CSV, Excel, JSON, Parquet, etc.) and generates detailed markdown reports with statistics, visualizations, and automated insights. This skill should be used when analyzing any tabular data to understand patterns, distributions, correlations, outliers, and data quality issues.

Exploratory Data Analysis

Overview

Perform comprehensive exploratory data analysis on tabular datasets in any of the supported file formats. This skill acts as a proficient data scientist, automatically analyzing data to generate meaningful summaries, advanced statistics, visualizations, and actionable insights. All textual outputs are generated as markdown for seamless integration into workflows.

When to Use This Skill

Invoke this skill when:

User provides a data file and requests analysis or exploration
User asks to "explore this dataset", "analyze this data", or "what's in this file?"
User needs statistical summaries, distributions, or correlations
User requests data visualizations or insights
User wants to understand data quality issues or patterns
User mentions EDA, exploratory analysis, or data profiling

Supported file formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
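The formats listed above each have a corresponding pandas reader, so loading can be dispatched on the file extension. The following is a minimal sketch, assuming pandas is installed; `READERS`, `reader_for`, and `load_table` are hypothetical names, and this is not necessarily how scripts/eda_analyzer.py detects formats.

```python
from pathlib import Path

# Hypothetical mapping: file extension -> (pandas reader name, extra keyword arguments).
READERS = {
    ".csv": ("read_csv", {}),
    ".tsv": ("read_csv", {"sep": "\t"}),
    ".xlsx": ("read_excel", {}),
    ".xls": ("read_excel", {}),
    ".json": ("read_json", {}),
    ".parquet": ("read_parquet", {}),
    ".feather": ("read_feather", {}),
    ".h5": ("read_hdf", {}),    # read_hdf needs a key if the file holds several datasets
    ".hdf5": ("read_hdf", {}),
    ".pkl": ("read_pickle", {}),     # only unpickle files from a trusted source
    ".pickle": ("read_pickle", {}),
}


def reader_for(path):
    """Return the (reader name, kwargs) pair for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")
    return READERS[suffix]


def load_table(path):
    """Load a file into a DataFrame using the reader chosen by reader_for."""
    import pandas as pd  # imported lazily so reader_for works without pandas

    name, kwargs = reader_for(path)
    return getattr(pd, name)(path, **kwargs)
```

Raising on an unknown extension, rather than guessing, matches the edge-case guidance of asking the user to convert the file to a supported format.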

Quick Start Workflow

1. Receive data file from user
2. Run comprehensive analysis using scripts/eda_analyzer.py
3. Generate visualizations using scripts/visualizer.py
4. Create markdown report using insights and the assets/report_template.md template
5. Present findings to user with key insights highlighted

Core Capabilities

1. Comprehensive Data Analysis

Execute full statistical analysis using the eda_analyzer.py script:

python scripts/eda_analyzer.py <data_file_path> -o <output_directory>

What it provides:

Auto-detection and loading of file formats
Basic dataset information (shape, types, memory usage)
Missing data analysis (patterns, percentages)
Summary statistics for numeric and categorical variables
Outlier detection using IQR and Z-score methods
Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
Correlation analysis (Pearson and Spearman)
Data quality assessment (completeness, duplicates, issues)
Automated insight generation

Output: JSON file containing all analysis results at <output_directory>/eda_analysis.json
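To make the two outlier rules named above concrete, here is a minimal standard-library sketch. The function names, the 1.5 IQR multiplier, and the z-score threshold of 3 are common conventions assumed for illustration; the actual thresholds used by eda_analyzer.py are not stated here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]


values = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(values))      # [95]
print(zscore_outliers(values))   # []
```

On this nine-value sample the IQR rule flags 95 while the z-score rule with a threshold of 3 does not, because the extreme value inflates the standard deviation it is measured against. This is one reason to report both methods, especially for small samples.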

2. Comprehensive Visualizations

Generate complete visualization suite using the visualizer.py script:

python scripts/visualizer.py <data_file_path> -o <output_directory>

Generated visualizations:

Missing data patterns: Heatmap and bar chart showing missing data
Distribution plots: Histograms with KDE overlays for all numeric variables
Box plots with violin plots: Outlier detection visualizations
Correlation heatmaps: Pearson and Spearman correlation matrices
Scatter matrix: Pairwise relationships between numeric variables
Categorical analysis: Bar charts for top categories
Time series plots: Temporal trends with trend lines (if datetime columns exist)

Output: High-quality PNG files saved to <output_directory>/eda_visualizations/

All visualizations are production-ready with:

300 DPI resolution
Clear titles and labels
Statistical annotations
Professional styling using seaborn

3. Automated Insight Generation

The analyzer automatically generates actionable insights including:

Data scale insights: Dataset size considerations for processing
Missing data alerts: Warnings when missing data exceeds thresholds
Correlation discoveries: Strong relationships identified for feature engineering
Outlier warnings: Variables with high outlier rates flagged
Distribution assessments: Skewness issues requiring transformations
Duplicate alerts: Duplicate row detection
Imbalance warnings: Categorical variable imbalance detection

Access insights from the analysis results JSON under the "insights" key.
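Reading the insights back might look like the sketch below. The file name eda_analysis.json and the "insights" key come from this document; the helper name `load_insights` is hypothetical, and the shape of the value stored under "insights" is not specified here, so the code returns it unchanged.

```python
import json
from pathlib import Path


def load_insights(output_dir):
    """Return whatever is stored under the "insights" key of eda_analysis.json."""
    results_path = Path(output_dir) / "eda_analysis.json"
    with results_path.open(encoding="utf-8") as fh:
        results = json.load(fh)
    # Fall back to an empty list if the key is absent (an assumption, not documented behavior).
    return results.get("insights", [])
```

For example, `load_insights("./analysis_output")` would read the results written by the analyzer when it was run with `-o ./analysis_output`.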

4. Statistical Interpretation

For detailed interpretation of statistical tests and measures, reference:

references/statistical_tests_guide.md - Comprehensive guide covering:

Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
Distribution characteristics (skewness, kurtosis)
Correlation tests (Pearson, Spearman)
Outlier detection methods (IQR, Z-score)
Hypothesis testing guidelines
Data transformation strategies

Load this reference when needing to interpret specific statistical tests or explain results to users.

5. Best Practices Guidance

For methodological guidance, reference:

references/eda_best_practices.md - Detailed best practices including:

EDA process framework (6-step methodology)
Univariate, bivariate, and multivariate analysis approaches
Visualization guidelines
Statistical analysis guidelines
Common pitfalls to avoid
Domain-specific considerations
Communication tips for technical and non-technical audiences

Load this reference when planning analysis approach or needing guidance on specific EDA scenarios.

Creating Analysis Reports

Use the provided template to structure comprehensive EDA reports:

assets/report_template.md - Professional report template with sections for:

Executive summary
Dataset overview
Data quality assessment
Univariate, bivariate, and multivariate analysis
Outlier analysis
Key insights and findings
Recommendations
Limitations and appendices

To use the template:

1. Copy the template content
2. Fill in sections with analysis results from JSON output
3. Embed visualization images using markdown syntax
4. Populate insights and recommendations
5. Save as markdown for user consumption

Typical Workflow Example

When a user provides a data file:

User: "Can you explore this sales_data.csv file and tell me what you find?"

1. Run analysis:
   python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output

2. Generate visualizations:
   python scripts/visualizer.py sales_data.csv -o ./analysis_output

3. Read analysis results:
   Read ./analysis_output/eda_analysis.json

4. Create markdown report using template:
   - Copy assets/report_template.md structure
   - Fill in sections with analysis results
   - Reference visualizations from ./analysis_output/eda_visualizations/
   - Include automated insights from JSON

5. Present to user:
   - Show key insights prominently
   - Highlight data quality issues
   - Provide visualizations inline
   - Make actionable recommendations
   - Save complete report as .md file
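Steps 1 and 2 of the workflow above can also be driven from Python. This is a minimal sketch that assumes it is run from the skill's root directory, so that the relative paths scripts/eda_analyzer.py and scripts/visualizer.py resolve; `build_commands` and `run_eda` are hypothetical helper names.

```python
import subprocess
import sys


def build_commands(data_file, output_dir):
    """Build the two command lines from the workflow: analyzer first, then visualizer."""
    return [
        [sys.executable, "scripts/eda_analyzer.py", data_file, "-o", output_dir],
        [sys.executable, "scripts/visualizer.py", data_file, "-o", output_dir],
    ]


def run_eda(data_file, output_dir):
    """Run both scripts in order, stopping at the first one that exits with an error."""
    for cmd in build_commands(data_file, output_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling `run_eda("sales_data.csv", "./analysis_output")` mirrors the two shell commands shown above. Using `sys.executable` instead of a bare `python` keeps both scripts on the same interpreter as the caller.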

Advanced Analysis Scenarios

Large Datasets (>1M rows)

Run analysis on sampled data first for quick exploration
Note sample size in report
Recommend distributed computing for full analysis
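One way to produce such a sample without loading the whole file is reservoir sampling, which draws a uniform random sample of k rows in a single pass. The sketch below handles CSV input only and uses just the standard library; `sample_csv_rows` is a hypothetical helper, not part of the skill's scripts.

```python
import csv
import random


def sample_csv_rows(src_path, dst_path, k, seed=0):
    """Write the header plus a uniform random sample of up to k data rows to dst_path."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)       # fill the reservoir with the first k rows
            else:
                j = rng.randint(0, i)       # inclusive on both ends
                if j < k:
                    reservoir[j] = row      # replace with probability k / (i + 1)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

The returned count is the sample size to note in the report. The sampled file can then be passed to the analyzer and visualizer like any other CSV.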

High-Dimensional Data (>50 columns)

Focus on most important variables first
Consider PCA or feature selection
Generate correlation analysis to identify variable groups
Reference eda_best_practices.md section on high-dimensional data

Time Series Data

Ensure datetime columns are properly detected
Time series visualizations will be automatically generated
Consider temporal patterns, trends, and seasonality
Reference eda_best_practices.md section on time series

Imbalanced Data

Categorical analysis will flag imbalances
Report class distributions prominently
Recommend stratified sampling if needed
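A class distribution for the report can be computed in a few lines. This sketch is an illustration, not the analyzer's own imbalance check; `class_balance` is a hypothetical helper, and the majority-to-minority ratio is one of several possible imbalance measures.

```python
from collections import Counter


def class_balance(labels):
    """Return (share of each class, majority/minority count ratio) for a list of labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    total = sum(counts.values())
    # most_common() orders classes from most to least frequent
    shares = {label: n / total for label, n in counts.most_common()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


shares, ratio = class_balance(["a"] * 9 + ["b"])
print(shares)  # {'a': 0.9, 'b': 0.1}
print(ratio)   # 9.0
```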

Small Sample Sizes (<100 rows)

Non-parametric methods are used automatically where appropriate
Be conservative in statistical conclusions
Note sample size limitations in report

Output Best Practices

Always output as markdown:

Structure findings using markdown headers, tables, and lists
Embed visualizations using ![Description](path/to/image.png) syntax
Use tables for statistical summaries
Include code blocks for any suggested transformations
Highlight key insights with bold or bullet points
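The image-embedding step can be scripted. The sketch below lists every PNG in a visualization directory as a markdown image line; `image_embeds` is a hypothetical helper, and deriving the description from the file name is an assumption, since the names of the generated files are not listed here.

```python
from pathlib import Path


def image_embeds(viz_dir):
    """Return one markdown image line per PNG in viz_dir, sorted by file name."""
    lines = []
    for png in sorted(Path(viz_dir).glob("*.png")):
        # Turn a file name such as "missing_data.png" into "Missing data"
        description = png.stem.replace("_", " ").capitalize()
        lines.append(f"![{description}]({png.as_posix()})")
    return lines
```

For example, `"\n\n".join(image_embeds("./analysis_output/eda_visualizations"))` yields a block that can be pasted into the report. The paths are emitted as given, so they should be relative to wherever the report is saved.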

Ensure reports are actionable:

Provide clear recommendations based on findings
Flag data quality issues that need attention
Suggest next steps for modeling or further analysis
Identify feature engineering opportunities

Make insights accessible:

Explain statistical concepts in plain language
Use reference guides to provide detailed interpretations
Include both technical details and executive summary
Tailor communication to user's technical level

Handling Edge Cases

Unsupported file formats:

Ask the user to convert the file to a supported format
Suggest using pandas-compatible formats

Files too large to load:

Recommend sampling approach
Suggest chunked processing
Consider alternative tools for big data

Corrupted or malformed data:

Report specific errors encountered
Suggest data cleaning steps
Try to salvage partial analysis if possible

Columns with all values missing:

Flag completely empty columns
Recommend removal or investigation
Document in data quality section
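Completely empty columns can be detected in a single pass. This is a minimal sketch for CSV input using only the standard library; `empty_columns` is a hypothetical helper, and the set of strings treated as missing is an assumption that should be adjusted to the dataset.

```python
import csv


def empty_columns(csv_path, missing_tokens=frozenset({"", "NA", "NaN"})):
    """Return the sorted names of columns whose every value is a missing token."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        candidates = set(reader.fieldnames or [])
        for row in reader:
            # Keep only the columns that are still missing in this row
            candidates = {
                col for col in candidates
                if (row.get(col) or "").strip() in missing_tokens
            }
            if not candidates:
                break  # every column has at least one real value
    return sorted(candidates)
```

Note that a file with a header but no data rows reports every column as empty, which is worth checking separately before recommending removal.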

Resources Summary

scripts/

eda_analyzer.py: Main analysis engine - comprehensive statistical analysis
visualizer.py: Visualization generator - creates all chart types

Both scripts are fully executable and handle multiple file formats automatically.

references/

statistical_tests_guide.md: Statistical test interpretation and methodology
eda_best_practices.md: Comprehensive EDA methodology and best practices

Load these references as needed to inform analysis approach and interpretation.

assets/

report_template.md: Professional markdown report template

Use this template structure for creating consistent, comprehensive EDA reports.

Key Reminders

Always generate markdown output for textual results
Run both scripts (analyzer and visualizer) for complete analysis
Use the template to structure comprehensive reports
Include visualizations by referencing generated PNG files
Provide actionable insights - don't just present statistics
Interpret findings using reference guides
Document limitations and data quality issues
Make recommendations for next steps

This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.
