Files
claude-scientific-skills/scientific-thinking/exploratory-data-analysis/assets/report_template.md
Timothy Kassis c627714209 Add EDA
2025-10-19 16:53:31 -07:00

8.6 KiB
Raw Blame History

Exploratory Data Analysis Report

Dataset: [Dataset Name] Analysis Date: [Date] Analyst: [Name]


Executive Summary

[2-3 paragraph summary of key findings, major insights, and recommendations]

Key Findings:

  • [Finding 1]
  • [Finding 2]
  • [Finding 3]

Recommendations:

  • [Recommendation 1]
  • [Recommendation 2]

1. Dataset Overview

1.1 Data Source

  • Source: [Source name and location]
  • Collection Period: [Date range]
  • Last Updated: [Date]
  • Format: [CSV, Excel, JSON, etc.]

1.2 Data Structure

  • Observations (Rows): [Number]
  • Variables (Columns): [Number]
  • Memory Usage: [Size in MB]

1.3 Variable Types

  • Numeric Variables ([Count]): [List column names]
  • Categorical Variables ([Count]): [List column names]
  • Datetime Variables ([Count]): [List column names]
  • Boolean Variables ([Count]): [List column names]

2. Data Quality Assessment

2.1 Completeness

Overall Data Completeness: [Percentage]%

Missing Data Summary:

Column Missing Count Missing % Assessment
[Column 1] [Count] [%] [High/Medium/Low]
[Column 2] [Count] [%] [High/Medium/Low]

Missing Data Pattern: [Description of patterns, if any]

Visualization: Missing Data

2.2 Duplicates

  • Duplicate Rows: [Count] ([Percentage]%)
  • Action Required: [Yes/No - describe if needed]

2.3 Data Quality Issues

[List any identified issues]

  • Issue 1: [Description]
  • Issue 2: [Description]
  • Issue 3: [Description]

3. Univariate Analysis

3.1 Numeric Variables

[For each key numeric variable:]

[Variable Name]

Summary Statistics:

  • Mean: [Value]
  • Median: [Value]
  • Std Dev: [Value]
  • Min: [Value]
  • Max: [Value]
  • Range: [Value]
  • IQR: [Value]

Distribution Characteristics:

  • Skewness: [Value] - [Interpretation]
  • Kurtosis: [Value] - [Interpretation]
  • Normality: [Normal/Not Normal based on tests]

Outliers:

  • IQR Method: [Count] outliers ([Percentage]%)
  • Z-Score Method: [Count] outliers ([Percentage]%)

Visualization: Distribution of [Variable]

Insights:

  • [Key insight 1]
  • [Key insight 2]

3.2 Categorical Variables

[For each key categorical variable:]

[Variable Name]

Summary:

  • Unique Values: [Count]
  • Most Common: [Value] ([Percentage]%)
  • Least Common: [Value] ([Percentage]%)
  • Balance: [Balanced/Imbalanced]

Top Categories:

Category Count Percentage
[Cat 1] [Count] [%]
[Cat 2] [Count] [%]
[Cat 3] [Count] [%]

Visualization: Distribution of [Variable]

Insights:

  • [Key insight 1]
  • [Key insight 2]

3.3 Temporal Variables

[If datetime columns exist:]

[Variable Name]

Time Range: [Start Date] to [End Date] Duration: [Time span] Temporal Coverage: [Complete/Gaps identified]

Temporal Patterns:

  • Trend: [Increasing/Decreasing/Stable]
  • Seasonality: [Yes/No - describe if present]
  • Gaps: [List any gaps in timeline]

Visualization: Time Series of [Variable]

Insights:

  • [Key insight 1]
  • [Key insight 2]

4. Bivariate Analysis

4.1 Correlation Analysis

Overall Correlation Structure:

  • Strong Positive Correlations: [Count]
  • Strong Negative Correlations: [Count]
  • Weak/No Correlations: [Count]

Correlation Matrix:

Correlation Heatmap

Notable Correlations:

Variable 1 Variable 2 Pearson r Spearman ρ Strength Interpretation
[Var 1] [Var 2] [Value] [Value] [Strong/Moderate/Weak] [Interpretation]
[Var 1] [Var 3] [Value] [Value] [Strong/Moderate/Weak] [Interpretation]

Insights:

  • [Key insight about correlations]
  • [Potential multicollinearity issues]
  • [Feature engineering opportunities]

4.2 Key Relationships

[For important variable pairs:]

[Variable 1] vs [Variable 2]

Relationship Type: [Linear/Non-linear/None] Correlation: [Value] Statistical Test: [Test name, p-value]

Visualization: Scatter Plot

Insights:

  • [Description of relationship]
  • [Implications]

5. Multivariate Analysis

5.1 Scatter Matrix

Scatter Matrix

Observations:

  • [Pattern 1]
  • [Pattern 2]
  • [Pattern 3]

5.2 Clustering Patterns

[If clustering analysis performed:]

Method: [Method used] Number of Clusters: [Count]

Cluster Characteristics:

  • Cluster 1: [Description]
  • Cluster 2: [Description]

Visualization: [Link to visualization]


6. Outlier Analysis

6.1 Outlier Summary

Overall Outlier Rate: [Percentage]%

Variables with High Outlier Rates:

Variable Outlier Count Outlier % Method Action
[Var 1] [Count] [%] [IQR/Z-score] [Keep/Investigate/Remove]
[Var 2] [Count] [%] [IQR/Z-score] [Keep/Investigate/Remove]

Visualization: Box Plots

6.2 Outlier Investigation

[For significant outliers:]

[Variable Name]

Outlier Characteristics:

  • [Description of outliers]
  • [Potential causes]
  • [Validity assessment]

Recommendation: [Keep/Remove/Transform/Investigate further]


7. Key Insights and Findings

7.1 Data Quality Insights

  1. [Insight 1]: [Description and implication]
  2. [Insight 2]: [Description and implication]
  3. [Insight 3]: [Description and implication]

7.2 Statistical Insights

  1. [Insight 1]: [Description and implication]
  2. [Insight 2]: [Description and implication]
  3. [Insight 3]: [Description and implication]

7.3 Business/Research Insights

  1. [Insight 1]: [Description and implication]
  2. [Insight 2]: [Description and implication]
  3. [Insight 3]: [Description and implication]

7.4 Unexpected Findings

  1. [Finding 1]: [Description and significance]
  2. [Finding 2]: [Description and significance]

8. Recommendations

8.1 Data Quality Actions

  • [Action 1]: [Description and priority]
  • [Action 2]: [Description and priority]
  • [Action 3]: [Description and priority]

8.2 Analysis Next Steps

  1. [Step 1]: [Description and rationale]
  2. [Step 2]: [Description and rationale]
  3. [Step 3]: [Description and rationale]

8.3 Feature Engineering Opportunities

  • [Opportunity 1]: [Description]
  • [Opportunity 2]: [Description]
  • [Opportunity 3]: [Description]

8.4 Modeling Considerations

  • [Consideration 1]: [Description]
  • [Consideration 2]: [Description]
  • [Consideration 3]: [Description]

9. Limitations and Caveats

9.1 Data Limitations

  • [Limitation 1]
  • [Limitation 2]
  • [Limitation 3]

9.2 Analysis Limitations

  • [Limitation 1]
  • [Limitation 2]
  • [Limitation 3]

9.3 Assumptions Made

  • [Assumption 1]
  • [Assumption 2]
  • [Assumption 3]

10. Appendices

Appendix A: Technical Details

Software Environment:

  • Python: [Version]
  • Key Libraries: pandas ([Version]), numpy ([Version]), scipy ([Version]), matplotlib ([Version])

Analysis Scripts: [Link to repository or location]

Appendix B: Variable Dictionary

Variable Name Type Description Unit Valid Range Missing %
[Var 1] [Type] [Description] [Unit] [Range] [%]
[Var 2] [Type] [Description] [Unit] [Range] [%]

Appendix C: Statistical Test Results

[Detailed statistical test outputs]

Normality Tests:

Variable Test Statistic p-value Result
[Var 1] Shapiro-Wilk [Value] [Value] [Normal/Non-normal]

Correlation Tests:

Var 1 Var 2 Coefficient p-value Significance
[Var 1] [Var 2] [Value] [Value] [Yes/No]

[Links to all generated visualizations]

  1. Visualization 1 description
  2. Visualization 2 description
  3. Visualization 3 description

Contact Information

Analyst: [Name] Email: [Email] Date: [Date] Version: [Version number]


Document History:

Version Date Changes Author
1.0 [Date] Initial analysis [Name]