Improve the EDA skill

Timothy Kassis
2025-11-04 17:25:06 -08:00
parent 1225ddecf1
commit ffad3d81b0
4 changed files with 362 additions and 988 deletions


# EDA Best Practices
Methodologies for conducting thorough exploratory data analysis.
## 6-Step EDA Framework
### 1. Initial Understanding
**Objectives**:
- Understand data structure and format
- Identify data types and schema
- Get familiar with domain context
**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?
**Actions**: Load data, inspect structure, review types, document context
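A minimal pandas sketch of these first-look actions (the file path and CSV format are assumptions):
```python
import pandas as pd

# Load the dataset and inspect its basic structure (path is hypothetical)
df = pd.read_csv("data.csv")

print(df.shape)               # dimensions: rows x columns
print(df.head())              # first rows
print(df.tail())              # last rows
print(df.dtypes)              # column names and types
df.info(memory_usage="deep")  # non-null counts and memory footprint
```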
### 2. Quality Assessment
**Objectives**:
- Identify data quality issues
- Assess data completeness and reliability
- Document data limitations
**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy
**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)
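A sketch of how these checks and red flags might be screened, continuing with `df` from the loading sketch above; the `age` column is a hypothetical example of an impossible-value check:
```python
# Missing data: extent per column, flagging anything above the 20% threshold
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0.20])

# Exact duplicate rows
print("duplicate rows:", df.duplicated().sum())

# Constant or near-constant columns
nunique = df.nunique(dropna=False)
print("constant columns:", list(nunique[nunique <= 1].index))

# Impossible values (column name is an assumption)
if "age" in df.columns:
    print("negative ages:", (df["age"] < 0).sum())
```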
### 3. Univariate Analysis
**Objectives**:
- Understand individual variable distributions
- Identify anomalies and patterns
- Determine variable characteristics
**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers
**Categorical**: Frequency distributions, unique counts, balance, bar charts
**Temporal**: Time range, gaps, trends, seasonality, time series plots
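A sketch of univariate summaries on the same `df`; `price`, `category`, and `date` are placeholder column names:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Numeric: central tendency, dispersion, and shape
print(df["price"].describe())
print("skew:", df["price"].skew(), "kurtosis:", df["price"].kurt())
sns.histplot(df["price"].dropna(), kde=True)
plt.show()

# Categorical: frequency distribution and balance
print(df["category"].value_counts(normalize=True))

# Temporal: range covered and the largest gap between observations
dates = pd.to_datetime(df["date"]).sort_values()
print(dates.min(), "to", dates.max())
print("largest gap:", dates.diff().max())
```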
### 4. Bivariate Analysis
**Objectives**:
- Understand relationships between variables
- Identify correlations and dependencies
- Find potential predictors
**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity
**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA
**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
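A sketch of these bivariate checks with pandas and scipy; `price`, `size`, `category`, and `region` are placeholder columns on the same `df`:
```python
import pandas as pd
from scipy import stats

# Numeric vs numeric: linear and rank correlations
print(df["price"].corr(df["size"], method="pearson"))
print(df["price"].corr(df["size"], method="spearman"))

# Numeric vs categorical: group statistics and a one-way ANOVA
print(df.groupby("category")["price"].agg(["mean", "median", "count"]))
groups = [values.dropna() for _, values in df.groupby("category")["price"]]
print(stats.f_oneway(*groups))

# Categorical vs categorical: contingency table and chi-square test
table = pd.crosstab(df["category"], df["region"])
chi2, p, dof, _ = stats.chi2_contingency(table)
print("chi2:", chi2, "p:", p)
```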
### 5. Multivariate Analysis
**Objectives**:
- Understand complex interactions
- Identify patterns across multiple variables
- Explore dimensionality
**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering
**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
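A sketch of a correlation heatmap and PCA over the numeric columns of `df` (standardizing first, since PCA is scale-sensitive):
```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Correlation matrix across numeric columns, shown as a heatmap
numeric = df.select_dtypes("number").dropna()
sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm")
plt.show()

# PCA: how much variance do the first two components capture?
X = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```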
### 6. Insight Generation
**Objectives**:
- Synthesize findings into actionable insights
- Formulate hypotheses
- Identify next steps
**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications
## Best Practices
### Visualization Guidelines
**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar
**Best Practices**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter
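One way these guidelines might look in practice with seaborn and matplotlib; column names and units are placeholders:
```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Distribution view with labelled axes and a descriptive title
sns.histplot(df["price"].dropna(), kde=True, ax=axes[0])
axes[0].set(title="Price distribution", xlabel="Price (USD)", ylabel="Count")

# Relationship view, using colour purposefully to encode a category
sns.scatterplot(data=df, x="size", y="price", hue="category", ax=axes[1])
axes[1].set(title="Price vs size", xlabel="Size (m²)", ylabel="Price (USD)")

plt.tight_layout()
plt.show()
```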
### Statistical Analysis Guidelines
**Check Assumptions**: Normality, homoscedasticity, independence, linearity
**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes
**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation
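A sketch of assumption checks and the parametric/non-parametric choice when comparing two groups; `group` and `value` are placeholder columns:
```python
from scipy import stats

a = df.loc[df["group"] == "A", "value"].dropna()
b = df.loc[df["group"] == "B", "value"].dropna()

# Assumption checks: normality (Shapiro-Wilk) and equal variances (Levene)
print(stats.shapiro(a), stats.shapiro(b))
print(stats.levene(a, b))

# Parametric test (Welch's t-test) and a non-parametric alternative
print(stats.ttest_ind(a, b, equal_var=False))
print(stats.mannwhitneyu(a, b))
```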
### Documentation Guidelines
**Notes**: Document assumptions, decisions, issues, findings
**Reproducibility**: Use scripts, version control, document sources, set random seeds
**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations
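A minimal sketch of seeding at the top of an analysis script so results can be reproduced:
```python
import random
import numpy as np

SEED = 42  # record the seed alongside data sources and versions
random.seed(SEED)
np.random.seed(SEED)
# Also pass SEED as random_state to library calls (e.g. scikit-learn estimators)
```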
## Common Pitfalls
1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR) or note exploratory nature (see the sketch after this list)
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
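For the multiple-testing pitfall, a sketch of p-value correction with statsmodels (the p-values are illustrative):
```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# p-values from a batch of exploratory tests (illustrative numbers)
pvals = np.array([0.001, 0.012, 0.034, 0.041, 0.20, 0.56])

# Benjamini-Hochberg FDR correction; method="bonferroni" is the stricter option
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(adjusted)
print(reject)
```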
## Domain-Specific Considerations
**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits
**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection
**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV
**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches
**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability
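For the imbalanced-data considerations above, a self-contained sketch on a synthetic dataset; the model and metric are illustrative choices:
```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset, for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Report the class distribution before doing anything else
print(pd.Series(y).value_counts(normalize=True))

# Stratified CV with a metric that stays informative under imbalance (not accuracy)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean())
```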
## Iterative Process
EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis
**Done When**: Structure and quality understood, key variables characterized, relationships identified, limitations documented, research questions answered, actionable insights in hand
**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations
## Communication
**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code
**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations
**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix
## Checklists
**Before**: Understand context, define objectives, identify audience, set up environment
**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously
**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility