mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
Add EDA
This commit is contained in:
@@ -0,0 +1,379 @@
|
||||
# Exploratory Data Analysis Best Practices
|
||||
|
||||
This guide provides best practices and methodologies for conducting thorough exploratory data analysis.
|
||||
|
||||
## EDA Process Framework
|
||||
|
||||
### 1. Initial Data Understanding
|
||||
|
||||
**Objectives**:
|
||||
- Understand data structure and format
|
||||
- Identify data types and schema
|
||||
- Get familiar with domain context
|
||||
|
||||
**Key Questions**:
|
||||
- What does each column represent?
|
||||
- What is the unit of observation?
|
||||
- What is the time period covered?
|
||||
- What is the data collection methodology?
|
||||
- Are there any known data quality issues?
|
||||
|
||||
**Actions**:
|
||||
- Load and inspect first/last rows
|
||||
- Check data dimensions (rows × columns)
|
||||
- Review column names and types
|
||||
- Document data source and context
|
||||
|
||||
### 2. Data Quality Assessment
|
||||
|
||||
**Objectives**:
|
||||
- Identify data quality issues
|
||||
- Assess data completeness and reliability
|
||||
- Document data limitations
|
||||
|
||||
**Key Checks**:
|
||||
- **Missing data**: Patterns, extent, randomness
|
||||
- **Duplicates**: Exact and near-duplicates
|
||||
- **Outliers**: Valid extremes vs. data errors
|
||||
- **Consistency**: Cross-field validation
|
||||
- **Accuracy**: Domain knowledge validation
|
||||
|
||||
**Red Flags**:
|
||||
- High missing data rate (>20%)
|
||||
- Unexpected duplicates
|
||||
- Constant or near-constant columns
|
||||
- Impossible values (negative ages, dates in future)
|
||||
- High cardinality in ID-like columns
|
||||
- Suspicious patterns (too many round numbers)
|
||||
|
||||
### 3. Univariate Analysis
|
||||
|
||||
**Objectives**:
|
||||
- Understand individual variable distributions
|
||||
- Identify anomalies and patterns
|
||||
- Determine variable characteristics
|
||||
|
||||
**For Numeric Variables**:
|
||||
- Central tendency (mean, median, mode)
|
||||
- Dispersion (range, variance, std, IQR)
|
||||
- Shape (skewness, kurtosis)
|
||||
- Distribution visualization (histogram, KDE, box plot)
|
||||
- Outlier detection
|
||||
|
||||
**For Categorical Variables**:
|
||||
- Frequency distributions
|
||||
- Unique value counts
|
||||
- Most/least common categories
|
||||
- Category balance/imbalance
|
||||
- Bar charts and count plots
|
||||
|
||||
**For Temporal Variables**:
|
||||
- Time range coverage
|
||||
- Gaps in timeline
|
||||
- Temporal patterns (trends, seasonality)
|
||||
- Time series plots
|
||||
|
||||
### 4. Bivariate Analysis
|
||||
|
||||
**Objectives**:
|
||||
- Understand relationships between variables
|
||||
- Identify correlations and dependencies
|
||||
- Find potential predictors
|
||||
|
||||
**Numeric vs Numeric**:
|
||||
- Scatter plots
|
||||
- Correlation coefficients (Pearson, Spearman)
|
||||
- Line of best fit
|
||||
- Detect non-linear relationships
|
||||
|
||||
**Numeric vs Categorical**:
|
||||
- Group statistics (mean, median by category)
|
||||
- Box plots by category
|
||||
- Distribution plots by category
|
||||
- Statistical tests (t-test, ANOVA)
|
||||
|
||||
**Categorical vs Categorical**:
|
||||
- Cross-tabulation / contingency tables
|
||||
- Stacked bar charts
|
||||
- Chi-square tests
|
||||
- Cramér's V for association strength
|
||||
|
||||
### 5. Multivariate Analysis
|
||||
|
||||
**Objectives**:
|
||||
- Understand complex interactions
|
||||
- Identify patterns across multiple variables
|
||||
- Explore dimensionality
|
||||
|
||||
**Techniques**:
|
||||
- Correlation matrices and heatmaps
|
||||
- Pair plots / scatter matrices
|
||||
- Parallel coordinates plots
|
||||
- Principal Component Analysis (PCA)
|
||||
- Clustering analysis
|
||||
|
||||
**Key Questions**:
|
||||
- Are there groups of correlated features?
|
||||
- Can we reduce dimensionality?
|
||||
- Are there natural clusters?
|
||||
- Do patterns change when conditioning on other variables?
|
||||
|
||||
### 6. Insight Generation
|
||||
|
||||
**Objectives**:
|
||||
- Synthesize findings into actionable insights
|
||||
- Formulate hypotheses
|
||||
- Identify next steps
|
||||
|
||||
**What to Look For**:
|
||||
- Unexpected patterns or anomalies
|
||||
- Strong relationships or correlations
|
||||
- Data quality issues requiring attention
|
||||
- Feature engineering opportunities
|
||||
- Business or research implications
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Visualization Guidelines
|
||||
|
||||
1. **Choose appropriate chart types**:
|
||||
- Distribution: Histogram, KDE, box plot, violin plot
|
||||
- Relationships: Scatter plot, line plot, heatmap
|
||||
- Composition: Stacked bar, pie chart (use sparingly)
|
||||
- Comparison: Bar chart, grouped bar chart
|
||||
|
||||
2. **Make visualizations clear and informative**:
|
||||
- Always label axes with units
|
||||
- Add descriptive titles
|
||||
- Use color purposefully
|
||||
- Include legends when needed
|
||||
- Choose appropriate scales
|
||||
- Avoid chart junk
|
||||
|
||||
3. **Use multiple views**:
|
||||
- Show data from different angles
|
||||
- Combine complementary visualizations
|
||||
- Use small multiples for faceting
|
||||
|
||||
### Statistical Analysis Guidelines
|
||||
|
||||
1. **Check assumptions**:
|
||||
- Test for normality before parametric tests
|
||||
- Check for homoscedasticity
|
||||
- Verify independence of observations
|
||||
- Assess linearity for linear models
|
||||
|
||||
2. **Use appropriate methods**:
|
||||
- Parametric tests when assumptions met
|
||||
- Non-parametric alternatives when violated
|
||||
- Robust methods for outlier-prone data
|
||||
- Effect sizes alongside p-values
|
||||
|
||||
3. **Consider context**:
|
||||
- Statistical significance ≠ practical significance
|
||||
- Domain knowledge trumps statistical patterns
|
||||
- Correlation ≠ causation
|
||||
- Sample size affects what you can detect
|
||||
|
||||
### Documentation Guidelines
|
||||
|
||||
1. **Keep detailed notes**:
|
||||
- Document assumptions and decisions
|
||||
- Record data issues discovered
|
||||
- Note interesting findings
|
||||
- Track questions that arise
|
||||
|
||||
2. **Create reproducible analysis**:
|
||||
- Use scripts, not manual Excel operations
|
||||
- Version control your code
|
||||
- Document data sources and versions
|
||||
- Include random seeds for reproducibility
|
||||
|
||||
3. **Summarize findings**:
|
||||
- Write clear summaries
|
||||
- Use visualizations to support points
|
||||
- Highlight key insights
|
||||
- Provide recommendations
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
### 1. Confirmation Bias
|
||||
- **Problem**: Looking only for evidence supporting preconceptions
|
||||
- **Solution**: Actively seek disconfirming evidence, use blind analysis
|
||||
|
||||
### 2. Ignoring Data Quality
|
||||
- **Problem**: Proceeding with analysis despite known data issues
|
||||
- **Solution**: Address quality issues first, document limitations
|
||||
|
||||
### 3. Over-reliance on Automation
|
||||
- **Problem**: Running analyses without understanding or verifying results
|
||||
- **Solution**: Manually inspect subsets, verify automated findings
|
||||
|
||||
### 4. Neglecting Outliers
|
||||
- **Problem**: Removing outliers without investigation
|
||||
- **Solution**: Always investigate outliers - they may contain important information
|
||||
|
||||
### 5. Multiple Testing Without Correction
|
||||
- **Problem**: Running many tests increases false positive rate
|
||||
- **Solution**: Use correction methods (Bonferroni, FDR) or be explicit about exploratory nature
|
||||
|
||||
### 6. Mistaking Association for Causation
|
||||
- **Problem**: Inferring causation from correlation
|
||||
- **Solution**: Use careful language, acknowledge alternative explanations
|
||||
|
||||
### 7. Cherry-picking Results
|
||||
- **Problem**: Reporting only interesting/significant findings
|
||||
- **Solution**: Report complete analysis, including negative results
|
||||
|
||||
### 8. Ignoring Sample Size
|
||||
- **Problem**: Not considering how sample size affects conclusions
|
||||
- **Solution**: Report effect sizes, confidence intervals, and sample sizes
|
||||
|
||||
## Domain-Specific Considerations
|
||||
|
||||
### Time Series Data
|
||||
- Check for stationarity
|
||||
- Identify trends and seasonality
|
||||
- Look for autocorrelation
|
||||
- Handle missing time points
|
||||
- Consider temporal splits for validation
|
||||
|
||||
### High-Dimensional Data
|
||||
- Start with dimensionality reduction
|
||||
- Focus on feature importance
|
||||
- Be cautious of curse of dimensionality
|
||||
- Use regularization in modeling
|
||||
- Consider domain knowledge for feature selection
|
||||
|
||||
### Imbalanced Data
|
||||
- Report class distributions
|
||||
- Use appropriate metrics (not just accuracy)
|
||||
- Consider resampling techniques
|
||||
- Stratify sampling and cross-validation
|
||||
- Be aware of biases in learning
|
||||
|
||||
### Small Sample Sizes
|
||||
- Use non-parametric methods
|
||||
- Be conservative with conclusions
|
||||
- Report confidence intervals
|
||||
- Consider Bayesian approaches
|
||||
- Acknowledge limitations
|
||||
|
||||
### Big Data
|
||||
- Sample intelligently for exploration
|
||||
- Use efficient data structures
|
||||
- Leverage parallel/distributed computing
|
||||
- Be aware computational complexity
|
||||
- Consider scalability in methods
|
||||
|
||||
## Iterative Process
|
||||
|
||||
EDA is not linear - iterate and refine:
|
||||
|
||||
1. **Initial exploration** → Identify questions
|
||||
2. **Focused analysis** → Answer specific questions
|
||||
3. **New insights** → Generate new questions
|
||||
4. **Deeper investigation** → Refine understanding
|
||||
5. **Synthesis** → Integrate findings
|
||||
|
||||
### When to Stop
|
||||
|
||||
You've done enough EDA when:
|
||||
- ✅ You understand the data structure and quality
|
||||
- ✅ You've characterized key variables
|
||||
- ✅ You've identified important relationships
|
||||
- ✅ You've documented limitations
|
||||
- ✅ You can answer your research questions
|
||||
- ✅ You have actionable insights
|
||||
|
||||
### Moving Forward
|
||||
|
||||
After EDA, you should have:
|
||||
- Clear understanding of data
|
||||
- List of quality issues and how to handle them
|
||||
- Insights about relationships and patterns
|
||||
- Hypotheses to test
|
||||
- Ideas for feature engineering
|
||||
- Recommendations for next steps
|
||||
|
||||
## Communication Tips
|
||||
|
||||
### For Technical Audiences
|
||||
- Include methodological details
|
||||
- Show statistical test results
|
||||
- Discuss assumptions and limitations
|
||||
- Provide reproducible code
|
||||
- Reference relevant literature
|
||||
|
||||
### For Non-Technical Audiences
|
||||
- Focus on insights, not methods
|
||||
- Use clear visualizations
|
||||
- Avoid jargon
|
||||
- Provide context and implications
|
||||
- Make recommendations concrete
|
||||
|
||||
### Report Structure
|
||||
1. **Executive Summary**: Key findings and recommendations
|
||||
2. **Data Overview**: Source, structure, limitations
|
||||
3. **Analysis**: Findings organized by theme
|
||||
4. **Insights**: Patterns, anomalies, implications
|
||||
5. **Recommendations**: Next steps and actions
|
||||
6. **Appendix**: Technical details, full statistics
|
||||
|
||||
## Useful Checklists
|
||||
|
||||
### Before Starting
|
||||
- [ ] Understand business/research context
|
||||
- [ ] Define analysis objectives
|
||||
- [ ] Identify stakeholders and audience
|
||||
- [ ] Secure necessary permissions
|
||||
- [ ] Set up reproducible environment
|
||||
|
||||
### During Analysis
|
||||
- [ ] Load and inspect data structure
|
||||
- [ ] Assess data quality
|
||||
- [ ] Analyze univariate distributions
|
||||
- [ ] Explore bivariate relationships
|
||||
- [ ] Investigate multivariate patterns
|
||||
- [ ] Generate and validate insights
|
||||
- [ ] Document findings continuously
|
||||
|
||||
### Before Concluding
|
||||
- [ ] Verify all findings
|
||||
- [ ] Check for alternative explanations
|
||||
- [ ] Document limitations
|
||||
- [ ] Prepare clear visualizations
|
||||
- [ ] Write actionable recommendations
|
||||
- [ ] Review with domain experts
|
||||
- [ ] Ensure reproducibility
|
||||
|
||||
## Tools and Libraries
|
||||
|
||||
### Python Ecosystem
|
||||
- **pandas**: Data manipulation
|
||||
- **numpy**: Numerical operations
|
||||
- **matplotlib/seaborn**: Visualization
|
||||
- **scipy**: Statistical tests
|
||||
- **scikit-learn**: ML preprocessing
|
||||
- **plotly**: Interactive visualizations
|
||||
|
||||
### Best Tool Practices
|
||||
- Use appropriate tool for task
|
||||
- Leverage vectorization
|
||||
- Chain operations efficiently
|
||||
- Handle missing data properly
|
||||
- Validate results independently
|
||||
- Document custom functions
|
||||
|
||||
## Further Resources
|
||||
|
||||
- **Books**:
|
||||
- "Exploratory Data Analysis" by John Tukey
|
||||
- "The Art of Statistics" by David Spiegelhalter
|
||||
- **Guidelines**:
|
||||
- ASA Statistical Significance Statement
|
||||
- FAIR data principles
|
||||
- **Communities**:
|
||||
- Cross Validated (Stack Exchange)
|
||||
- /r/datascience
|
||||
- Local data science meetups
|
||||
Reference in New Issue
Block a user