Files
claude-scientific-skills/scientific-thinking/exploratory-data-analysis/references/eda_best_practices.md
Timothy Kassis c627714209 Add EDA
2025-10-19 16:53:31 -07:00

11 KiB
Raw Blame History

Exploratory Data Analysis Best Practices

This guide provides best practices and methodologies for conducting thorough exploratory data analysis.

EDA Process Framework

1. Initial Data Understanding

Objectives:

  • Understand data structure and format
  • Identify data types and schema
  • Get familiar with domain context

Key Questions:

  • What does each column represent?
  • What is the unit of observation?
  • What is the time period covered?
  • What is the data collection methodology?
  • Are there any known data quality issues?

Actions:

  • Load and inspect first/last rows
  • Check data dimensions (rows × columns)
  • Review column names and types
  • Document data source and context

2. Data Quality Assessment

Objectives:

  • Identify data quality issues
  • Assess data completeness and reliability
  • Document data limitations

Key Checks:

  • Missing data: Patterns, extent, randomness
  • Duplicates: Exact and near-duplicates
  • Outliers: Valid extremes vs. data errors
  • Consistency: Cross-field validation
  • Accuracy: Domain knowledge validation

Red Flags:

  • High missing data rate (>20%)
  • Unexpected duplicates
  • Constant or near-constant columns
  • Impossible values (negative ages, dates in future)
  • High cardinality in ID-like columns
  • Suspicious patterns (too many round numbers)

3. Univariate Analysis

Objectives:

  • Understand individual variable distributions
  • Identify anomalies and patterns
  • Determine variable characteristics

For Numeric Variables:

  • Central tendency (mean, median, mode)
  • Dispersion (range, variance, std, IQR)
  • Shape (skewness, kurtosis)
  • Distribution visualization (histogram, KDE, box plot)
  • Outlier detection

For Categorical Variables:

  • Frequency distributions
  • Unique value counts
  • Most/least common categories
  • Category balance/imbalance
  • Bar charts and count plots

For Temporal Variables:

  • Time range coverage
  • Gaps in timeline
  • Temporal patterns (trends, seasonality)
  • Time series plots

4. Bivariate Analysis

Objectives:

  • Understand relationships between variables
  • Identify correlations and dependencies
  • Find potential predictors

Numeric vs Numeric:

  • Scatter plots
  • Correlation coefficients (Pearson, Spearman)
  • Line of best fit
  • Detect non-linear relationships

Numeric vs Categorical:

  • Group statistics (mean, median by category)
  • Box plots by category
  • Distribution plots by category
  • Statistical tests (t-test, ANOVA)

Categorical vs Categorical:

  • Cross-tabulation / contingency tables
  • Stacked bar charts
  • Chi-square tests
  • Cramér's V for association strength

5. Multivariate Analysis

Objectives:

  • Understand complex interactions
  • Identify patterns across multiple variables
  • Explore dimensionality

Techniques:

  • Correlation matrices and heatmaps
  • Pair plots / scatter matrices
  • Parallel coordinates plots
  • Principal Component Analysis (PCA)
  • Clustering analysis

Key Questions:

  • Are there groups of correlated features?
  • Can we reduce dimensionality?
  • Are there natural clusters?
  • Do patterns change when conditioning on other variables?

6. Insight Generation

Objectives:

  • Synthesize findings into actionable insights
  • Formulate hypotheses
  • Identify next steps

What to Look For:

  • Unexpected patterns or anomalies
  • Strong relationships or correlations
  • Data quality issues requiring attention
  • Feature engineering opportunities
  • Business or research implications

Best Practices

Visualization Guidelines

  1. Choose appropriate chart types:

    • Distribution: Histogram, KDE, box plot, violin plot
    • Relationships: Scatter plot, line plot, heatmap
    • Composition: Stacked bar, pie chart (use sparingly)
    • Comparison: Bar chart, grouped bar chart
  2. Make visualizations clear and informative:

    • Always label axes with units
    • Add descriptive titles
    • Use color purposefully
    • Include legends when needed
    • Choose appropriate scales
    • Avoid chart junk
  3. Use multiple views:

    • Show data from different angles
    • Combine complementary visualizations
    • Use small multiples for faceting

Statistical Analysis Guidelines

  1. Check assumptions:

    • Test for normality before parametric tests
    • Check for homoscedasticity
    • Verify independence of observations
    • Assess linearity for linear models
  2. Use appropriate methods:

    • Parametric tests when assumptions met
    • Non-parametric alternatives when violated
    • Robust methods for outlier-prone data
    • Effect sizes alongside p-values
  3. Consider context:

    • Statistical significance ≠ practical significance
    • Domain knowledge trumps statistical patterns
    • Correlation ≠ causation
    • Sample size affects what you can detect

Documentation Guidelines

  1. Keep detailed notes:

    • Document assumptions and decisions
    • Record data issues discovered
    • Note interesting findings
    • Track questions that arise
  2. Create reproducible analysis:

    • Use scripts, not manual Excel operations
    • Version control your code
    • Document data sources and versions
    • Include random seeds for reproducibility
  3. Summarize findings:

    • Write clear summaries
    • Use visualizations to support points
    • Highlight key insights
    • Provide recommendations

Common Pitfalls to Avoid

1. Confirmation Bias

  • Problem: Looking only for evidence supporting preconceptions
  • Solution: Actively seek disconfirming evidence, use blind analysis

2. Ignoring Data Quality

  • Problem: Proceeding with analysis despite known data issues
  • Solution: Address quality issues first, document limitations

3. Over-reliance on Automation

  • Problem: Running analyses without understanding or verifying results
  • Solution: Manually inspect subsets, verify automated findings

4. Neglecting Outliers

  • Problem: Removing outliers without investigation
  • Solution: Always investigate outliers - they may contain important information

5. Multiple Testing Without Correction

  • Problem: Running many tests increases false positive rate
  • Solution: Use correction methods (Bonferroni, FDR) or be explicit about exploratory nature

6. Mistaking Association for Causation

  • Problem: Inferring causation from correlation
  • Solution: Use careful language, acknowledge alternative explanations

7. Cherry-picking Results

  • Problem: Reporting only interesting/significant findings
  • Solution: Report complete analysis, including negative results

8. Ignoring Sample Size

  • Problem: Not considering how sample size affects conclusions
  • Solution: Report effect sizes, confidence intervals, and sample sizes

Domain-Specific Considerations

Time Series Data

  • Check for stationarity
  • Identify trends and seasonality
  • Look for autocorrelation
  • Handle missing time points
  • Consider temporal splits for validation

High-Dimensional Data

  • Start with dimensionality reduction
  • Focus on feature importance
  • Be cautious of curse of dimensionality
  • Use regularization in modeling
  • Consider domain knowledge for feature selection

Imbalanced Data

  • Report class distributions
  • Use appropriate metrics (not just accuracy)
  • Consider resampling techniques
  • Stratify sampling and cross-validation
  • Be aware of biases in learning

Small Sample Sizes

  • Use non-parametric methods
  • Be conservative with conclusions
  • Report confidence intervals
  • Consider Bayesian approaches
  • Acknowledge limitations

Big Data

  • Sample intelligently for exploration
  • Use efficient data structures
  • Leverage parallel/distributed computing
  • Be aware computational complexity
  • Consider scalability in methods

Iterative Process

EDA is not linear - iterate and refine:

  1. Initial exploration → Identify questions
  2. Focused analysis → Answer specific questions
  3. New insights → Generate new questions
  4. Deeper investigation → Refine understanding
  5. Synthesis → Integrate findings

When to Stop

You've done enough EDA when:

  • You understand the data structure and quality
  • You've characterized key variables
  • You've identified important relationships
  • You've documented limitations
  • You can answer your research questions
  • You have actionable insights

Moving Forward

After EDA, you should have:

  • Clear understanding of data
  • List of quality issues and how to handle them
  • Insights about relationships and patterns
  • Hypotheses to test
  • Ideas for feature engineering
  • Recommendations for next steps

Communication Tips

For Technical Audiences

  • Include methodological details
  • Show statistical test results
  • Discuss assumptions and limitations
  • Provide reproducible code
  • Reference relevant literature

For Non-Technical Audiences

  • Focus on insights, not methods
  • Use clear visualizations
  • Avoid jargon
  • Provide context and implications
  • Make recommendations concrete

Report Structure

  1. Executive Summary: Key findings and recommendations
  2. Data Overview: Source, structure, limitations
  3. Analysis: Findings organized by theme
  4. Insights: Patterns, anomalies, implications
  5. Recommendations: Next steps and actions
  6. Appendix: Technical details, full statistics

Useful Checklists

Before Starting

  • Understand business/research context
  • Define analysis objectives
  • Identify stakeholders and audience
  • Secure necessary permissions
  • Set up reproducible environment

During Analysis

  • Load and inspect data structure
  • Assess data quality
  • Analyze univariate distributions
  • Explore bivariate relationships
  • Investigate multivariate patterns
  • Generate and validate insights
  • Document findings continuously

Before Concluding

  • Verify all findings
  • Check for alternative explanations
  • Document limitations
  • Prepare clear visualizations
  • Write actionable recommendations
  • Review with domain experts
  • Ensure reproducibility

Tools and Libraries

Python Ecosystem

  • pandas: Data manipulation
  • numpy: Numerical operations
  • matplotlib/seaborn: Visualization
  • scipy: Statistical tests
  • scikit-learn: ML preprocessing
  • plotly: Interactive visualizations

Best Tool Practices

  • Use appropriate tool for task
  • Leverage vectorization
  • Chain operations efficiently
  • Handle missing data properly
  • Validate results independently
  • Document custom functions

Further Resources

  • Books:
    • "Exploratory Data Analysis" by John Tukey
    • "The Art of Statistics" by David Spiegelhalter
  • Guidelines:
    • ASA Statistical Significance Statement
    • FAIR data principles
  • Communities:
    • Cross Validated (Stack Exchange)
    • /r/datascience
    • Local data science meetups