mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
Improve the EDA skill

# EDA Best Practices

Methodologies and best practices for conducting thorough exploratory data analysis (EDA).

## 6-Step EDA Framework

### 1. Initial Understanding

**Objectives**:
- Understand data structure and format
- Identify data types and schema
- Get familiar with domain context

**Questions**:
- What does each column represent?
- What is the unit of observation and time period?
- What is the data collection methodology?
- Are there known quality issues?

**Actions**: Load data, inspect structure, review types, document context
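
A minimal first-pass sketch with pandas (`data.csv` is a placeholder path):

```python
import pandas as pd

df = pd.read_csv("data.csv")   # placeholder path

print(df.shape)                # dimensions (rows x columns)
print(df.dtypes)               # column names and types
print(df.head())               # first rows
print(df.tail())               # last rows
df.info(memory_usage="deep")   # structure, non-null counts, memory footprint
```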

### 2. Quality Assessment

**Objectives**:
- Identify data quality issues
- Assess data completeness and reliability
- Document data limitations

**Check**: Missing data patterns, duplicates, outliers, consistency, accuracy

**Red Flags**:
- Missing data >20%
- Unexpected duplicates
- Constant columns
- Impossible values (negative ages, future dates)
- Suspicious patterns (too many round numbers)
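
A quick pass over these red flags, sketched with pandas (the 20% threshold mirrors the list above; `df` is assumed from the loading step):

```python
# Share of missing values per column, flagging anything above 20%
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0.20])

# Exact duplicate rows
print(f"duplicate rows: {df.duplicated().sum()}")

# Constant columns (one unique value, counting NaN)
constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print(f"constant columns: {constant}")
```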

### 3. Univariate Analysis

**Objectives**:
- Understand individual variable distributions
- Identify anomalies and patterns
- Determine variable characteristics

**Numeric**: Central tendency, dispersion, shape (skewness, kurtosis), distribution plots, outliers

**Categorical**: Frequency distributions, unique counts, balance, bar charts

**Temporal**: Time range, gaps, trends, seasonality, time series plots
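
A compact univariate pass with pandas and seaborn (`amount` and `category` are placeholder columns):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Numeric: central tendency, dispersion, and shape
print(df["amount"].describe())
print(f"skew={df['amount'].skew():.2f}, kurtosis={df['amount'].kurtosis():.2f}")
sns.histplot(df["amount"], kde=True)   # distribution plot
plt.show()

# Categorical: frequencies and balance
print(df["category"].value_counts(normalize=True))
```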

### 4. Bivariate Analysis

**Objectives**:
- Understand relationships between variables
- Identify correlations and dependencies
- Find potential predictors

**Numeric vs Numeric**: Scatter plots, correlations (Pearson, Spearman), detect non-linearity

**Numeric vs Categorical**: Group statistics, box plots by category, t-test/ANOVA

**Categorical vs Categorical**: Cross-tabs, stacked bars, chi-square, Cramér's V
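
One worked test per pairing, sketched with scipy (column names are placeholders):

```python
import pandas as pd
from scipy import stats

# Numeric vs numeric: Spearman is robust to non-linear monotone relationships
rho, p = stats.spearmanr(df["x"], df["y"])

# Numeric vs categorical: compare group means with one-way ANOVA
groups = [g["amount"].dropna() for _, g in df.groupby("category")]
f_stat, p_anova = stats.f_oneway(*groups)

# Categorical vs categorical: chi-square on a contingency table
table = pd.crosstab(df["category"], df["segment"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
```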

### 5. Multivariate Analysis

**Objectives**:
- Understand complex interactions
- Identify patterns across multiple variables
- Explore dimensionality

**Techniques**: Correlation matrices, pair plots, parallel coordinates, PCA, clustering

**Questions**: Groups of correlated features? Reduce dimensionality? Natural clusters? Conditional patterns?
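
A sketch of two of these techniques, a correlation heatmap and PCA (standardizing before PCA is assumed):

```python
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Correlation heatmap over the numeric columns
num = df.select_dtypes("number")
sns.heatmap(num.corr(), annot=True, cmap="coolwarm")

# PCA: variance captured by the first two components
X = StandardScaler().fit_transform(num.dropna())
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```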

### 6. Insight Generation

**Objectives**:
- Synthesize findings into actionable insights
- Formulate hypotheses
- Identify next steps

**Look for**: Unexpected patterns, strong correlations, quality issues, feature engineering opportunities, domain implications

## Best Practices

### Visualization Guidelines

**Chart Selection**:
- Distribution: Histogram, KDE, box/violin plots
- Relationships: Scatter, line, heatmap
- Composition: Stacked bar
- Comparison: Bar, grouped bar

**Clarity**: Label axes with units, descriptive titles, purposeful color, appropriate scales, avoid clutter
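
A small sketch of matching chart type to question with seaborn (`amount` and `score` are placeholder columns):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["amount"], ax=axes[0])                        # distribution
sns.scatterplot(data=df, x="amount", y="score", ax=axes[1])   # relationship
axes[0].set_xlabel("Amount (USD)")                            # label axes with units
axes[0].set_title("Amount distribution")                      # descriptive title
plt.tight_layout()
plt.show()
```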

### Statistical Analysis Guidelines

**Check Assumptions**: Normality, homoscedasticity, independence, linearity

**Method Selection**: Parametric when assumptions met, non-parametric otherwise, report effect sizes

**Context Matters**: Statistical ≠ practical significance, domain knowledge trumps statistics, correlation ≠ causation
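
A sketch of assumption-driven test selection with scipy (`group` and `amount` are placeholder columns; the 0.05 normality gate is illustrative):

```python
from scipy import stats

a = df.loc[df["group"] == "A", "amount"].dropna()
b = df.loc[df["group"] == "B", "amount"].dropna()

# Check normality first (Shapiro-Wilk), then pick the test accordingly
normal = all(stats.shapiro(x).pvalue > 0.05 for x in (a, b))
if normal:
    result = stats.ttest_ind(a, b)       # parametric
else:
    result = stats.mannwhitneyu(a, b)    # non-parametric alternative
print(result)
```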

### Documentation Guidelines

**Notes**: Document assumptions, decisions, issues, findings

**Reproducibility**: Use scripts, version control, document sources, set random seeds

**Reporting**: Clear summaries, supporting visualizations, highlighted insights, actionable recommendations
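
Seed-setting for reproducibility, as a minimal sketch:

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)                   # Python's built-in RNG
np.random.seed(SEED)                # legacy NumPy global RNG
rng = np.random.default_rng(SEED)   # preferred: an explicit NumPy generator
```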

## Common Pitfalls

1. **Confirmation Bias**: Seek disconfirming evidence, use blind analysis
2. **Ignoring Quality**: Address issues first, document limitations
3. **Over-automation**: Manually inspect subsets, verify results
4. **Neglecting Outliers**: Investigate before removing - may be informative
5. **Multiple Testing**: Use correction (Bonferroni, FDR; see the sketch after this list) or note exploratory nature
6. **Association ≠ Causation**: Use careful language, acknowledge alternatives
7. **Cherry-picking**: Report complete analysis, including negative results
8. **Ignoring Sample Size**: Report effect sizes, CIs, and sample sizes
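
For pitfall 5, a sketch of FDR correction, assuming statsmodels is available (the p-values are illustrative):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.01, 0.03, 0.04, 0.20]   # illustrative p-values from many tests
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(zip(p_adj.round(3), reject)))
```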

## Domain-Specific Considerations

**Time Series**: Check stationarity, identify trends/seasonality, autocorrelation, temporal splits

**High-Dimensional**: Dimensionality reduction, feature importance, regularization, domain-guided selection

**Imbalanced**: Report distributions, appropriate metrics, resampling, stratified CV (see the sketch after this list)

**Small Samples**: Non-parametric methods, conservative conclusions, CIs, Bayesian approaches

**Big Data**: Intelligent sampling, efficient structures, parallel computing, scalability
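
For the imbalanced-data point, a stratified-split sketch with scikit-learn (a feature frame `X` and label Series `y` are assumed):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

print(y.value_counts(normalize=True))   # report the class distribution first

# Stratify so every split preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```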

## Iterative Process

EDA is iterative: Explore → Questions → Focused Analysis → Insights → New Questions → Deeper Investigation → Synthesis

**Done When**: Understand structure/quality, characterized variables, identified relationships, documented limitations, answered questions, have actionable insights

**Deliverables**: Data understanding, quality issue list, relationship insights, hypotheses, feature ideas, recommendations

## Communication

**Technical Audiences**: Methodological details, statistical tests, assumptions, reproducible code

**Non-Technical Audiences**: Focus on insights, clear visualizations, avoid jargon, concrete recommendations

**Report Structure**: Executive summary → Data overview → Analysis → Insights → Recommendations → Appendix

## Checklists

**Before**: Understand context, define objectives, identify audience, set up environment

**During**: Inspect structure, assess quality, analyze distributions, explore relationships, document continuously

**After**: Verify findings, check alternatives, document limitations, prepare visualizations, ensure reproducibility