# Exploratory Data Analysis Best Practices
This guide provides best practices and methodologies for conducting thorough exploratory data analysis.
## EDA Process Framework
### 1. Initial Data Understanding
**Objectives**:
- Understand data structure and format
- Identify data types and schema
- Get familiar with domain context
**Key Questions**:
- What does each column represent?
- What is the unit of observation?
- What is the time period covered?
- What is the data collection methodology?
- Are there any known data quality issues?
**Actions**:
- Load and inspect first/last rows
- Check data dimensions (rows × columns)
- Review column names and types
- Document data source and context (a first-look sketch follows this list)
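
A minimal first-look sketch in pandas (the file name `data.csv` is a placeholder for your actual source):

```python
import pandas as pd

# Placeholder path; substitute your actual data source.
df = pd.read_csv("data.csv")

# Inspect first/last rows and overall dimensions.
print(df.head())
print(df.tail())
print(f"{df.shape[0]} rows x {df.shape[1]} columns")

# Review column names, dtypes, non-null counts, and memory usage.
df.info()
```
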
### 2. Data Quality Assessment
**Objectives**:
- Identify data quality issues
- Assess data completeness and reliability
- Document data limitations
**Key Checks**:
- **Missing data**: Patterns, extent, randomness
- **Duplicates**: Exact and near-duplicates
- **Outliers**: Valid extremes vs. data errors
- **Consistency**: Cross-field validation
- **Accuracy**: Domain knowledge validation
**Red Flags**:
- High missing data rate (>20%)
- Unexpected duplicates
- Constant or near-constant columns
- Impossible values (negative ages, dates in the future)
- High cardinality in ID-like columns
- Suspicious patterns, such as too many round numbers (a screening sketch follows this list)
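
A quick screening sketch for several of these checks, assuming `df` is a pandas DataFrame (the `quality_report` helper is illustrative, not a library function):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> None:
    # Missing data: extent per column (patterns and randomness need closer inspection).
    missing = df.isna().mean().sort_values(ascending=False)
    print("Missing rate per column:")
    print(missing[missing > 0])

    # Exact duplicate rows.
    print("Duplicate rows:", df.duplicated().sum())

    # Constant or near-constant columns.
    for col in df.columns:
        if df[col].nunique(dropna=False) <= 1:
            print("Constant column:", col)
```
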
### 3. Univariate Analysis
**Objectives**:
- Understand individual variable distributions
- Identify anomalies and patterns
- Determine variable characteristics
**For Numeric Variables**:
- Central tendency (mean, median, mode)
- Dispersion (range, variance, std, IQR)
- Shape (skewness, kurtosis)
- Distribution visualization (histogram, KDE, box plot)
- Outlier detection
**For Categorical Variables**:
- Frequency distributions
- Unique value counts
- Most/least common categories
- Category balance/imbalance
- Bar charts and count plots
**For Temporal Variables**:
- Time range coverage
- Gaps in timeline
- Temporal patterns (trends, seasonality)
- Time series plots (a univariate sketch follows this list)
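
A univariate sketch covering the numeric and categorical cases; the synthetic `amount` and `category` columns stand in for real data so the example runs as-is:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=0.8, size=1_000),
    "category": rng.choice(["A", "B", "C"], size=1_000, p=[0.6, 0.3, 0.1]),
})

# Numeric: central tendency, dispersion, and shape.
s = df["amount"]
print(s.describe())                              # count, mean, std, quartiles
print("skew:", s.skew(), "kurtosis:", s.kurt())
s.plot.hist(bins=50, title="amount distribution")
plt.show()

# Categorical: frequencies and balance.
print(df["category"].value_counts(normalize=True))
```
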
### 4. Bivariate Analysis
**Objectives**:
- Understand relationships between variables
- Identify correlations and dependencies
- Find potential predictors
**Numeric vs Numeric**:
- Scatter plots
- Correlation coefficients (Pearson, Spearman)
- Line of best fit
- Detect non-linear relationships
**Numeric vs Categorical**:
- Group statistics (mean, median by category)
- Box plots by category
- Distribution plots by category
- Statistical tests (t-test, ANOVA)
**Categorical vs Categorical**:
- Cross-tabulation / contingency tables
- Stacked bar charts
- Chi-square tests
- Cramér's V for association strength (see the sketch after this list)
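
One sketch per pairing, on synthetic data so it runs as written:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500),
                   "group": rng.choice(["A", "B"], size=500)})
df["y"] = 2 * df["x"] + rng.normal(size=500)

# Numeric vs numeric: linear and rank correlation.
print("Pearson:", df["x"].corr(df["y"]))
print("Spearman:", df["x"].corr(df["y"], method="spearman"))

# Numeric vs categorical: group statistics and Welch's t-test.
print(df.groupby("group")["y"].agg(["mean", "median"]))
a = df.loc[df["group"] == "A", "y"]
b = df.loc[df["group"] == "B", "y"]
print(stats.ttest_ind(a, b, equal_var=False))

# Categorical vs categorical: contingency table and chi-square test.
df["flag"] = df["y"] > df["y"].median()
chi2, p, dof, _ = stats.chi2_contingency(pd.crosstab(df["group"], df["flag"]))
print(f"chi2={chi2:.2f}, p={p:.3f}")
```
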
### 5. Multivariate Analysis
**Objectives**:
- Understand complex interactions
- Identify patterns across multiple variables
- Explore dimensionality
**Techniques**:
- Correlation matrices and heatmaps
- Pair plots / scatter matrices
- Parallel coordinates plots
- Principal Component Analysis (PCA)
- Clustering analysis
**Key Questions**:
- Are there groups of correlated features?
- Can we reduce dimensionality?
- Are there natural clusters?
- Do patterns change when conditioning on other variables? (A sketch of the first techniques follows.)
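
A sketch of the heatmap and PCA steps on standardized synthetic features:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=[f"f{i}" for i in range(5)])

# Correlation heatmap: spot groups of correlated features.
sns.heatmap(X.corr(), annot=True, cmap="coolwarm", center=0)
plt.show()

# PCA on standardized features: can a few components carry most of the variance?
pca = PCA().fit(StandardScaler().fit_transform(X))
print("Explained variance ratio:", pca.explained_variance_ratio_)
```
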
### 6. Insight Generation
**Objectives**:
- Synthesize findings into actionable insights
- Formulate hypotheses
- Identify next steps
**What to Look For**:
- Unexpected patterns or anomalies
- Strong relationships or correlations
- Data quality issues requiring attention
- Feature engineering opportunities
- Business or research implications
## Best Practices
### Visualization Guidelines
1. **Choose appropriate chart types**:
- Distribution: Histogram, KDE, box plot, violin plot
- Relationships: Scatter plot, line plot, heatmap
- Composition: Stacked bar, pie chart (use sparingly)
- Comparison: Bar chart, grouped bar chart
2. **Make visualizations clear and informative**:
- Always label axes with units
- Add descriptive titles
- Use color purposefully
- Include legends when needed
- Choose appropriate scales
- Avoid chart junk
3. **Use multiple views**:
- Show data from different angles
- Combine complementary visualizations
- Use small multiples for faceting (see the sketch below)
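
A small-multiples sketch using seaborn's built-in `tips` dataset (fetching it requires network access the first time):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One labeled panel per category, with shared scales across panels.
tips = sns.load_dataset("tips")
g = sns.displot(tips, x="total_bill", col="day", bins=20, height=3)
g.set_axis_labels("Total bill (USD)", "Count")
plt.show()
```
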
### Statistical Analysis Guidelines
1. **Check assumptions**:
- Test for normality before parametric tests
- Check for homoscedasticity
- Verify independence of observations
- Assess linearity for linear models
2. **Use appropriate methods**:
- Parametric tests when assumptions met
- Non-parametric alternatives when violated
- Robust methods for outlier-prone data
- Effect sizes alongside p-values
3. **Consider context**:
- Statistical significance ≠ practical significance
- Domain knowledge trumps statistical patterns
- Correlation ≠ causation
- Sample size affects what you can detect (an assumption-check sketch follows)
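
An illustrative assumption-check sketch; the p > 0.05 gate is a simplification, so treat it as a heuristic rather than a rule:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b = rng.normal(size=40), rng.normal(loc=0.5, size=40)

# Test normality before choosing a parametric test.
if stats.shapiro(a).pvalue > 0.05 and stats.shapiro(b).pvalue > 0.05:
    result = stats.ttest_ind(a, b)        # parametric
else:
    result = stats.mannwhitneyu(a, b)     # non-parametric alternative
print(result)

# Report an effect size (Cohen's d) alongside the p-value.
d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print("Cohen's d:", d)
```
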
### Documentation Guidelines
1. **Keep detailed notes**:
- Document assumptions and decisions
- Record data issues discovered
- Note interesting findings
- Track questions that arise
2. **Create reproducible analysis**:
- Use scripts, not manual Excel operations
- Version control your code
- Document data sources and versions
- Include random seeds for reproducibility
3. **Summarize findings**:
- Write clear summaries
- Use visualizations to support points
- Highlight key insights
- Provide recommendations
## Common Pitfalls to Avoid
### 1. Confirmation Bias
- **Problem**: Looking only for evidence supporting preconceptions
- **Solution**: Actively seek disconfirming evidence, use blind analysis
### 2. Ignoring Data Quality
- **Problem**: Proceeding with analysis despite known data issues
- **Solution**: Address quality issues first, document limitations
### 3. Over-reliance on Automation
- **Problem**: Running analyses without understanding or verifying results
- **Solution**: Manually inspect subsets, verify automated findings
### 4. Neglecting Outliers
- **Problem**: Removing outliers without investigation
- **Solution**: Always investigate outliers; they may contain important information
### 5. Multiple Testing Without Correction
- **Problem**: Running many tests increases false positive rate
- **Solution**: Use correction methods (Bonferroni, FDR) or be explicit about the exploratory nature of the analysis (see the sketch below)
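
A correction sketch using statsmodels; the 20 uniform p-values are stand-ins for real test results:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
pvals = rng.uniform(size=20)  # placeholder p-values from 20 tests

# Benjamini-Hochberg FDR control; method="bonferroni" is the stricter choice.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("Rejected after correction:", reject.sum())
```
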
### 6. Mistaking Association for Causation
- **Problem**: Inferring causation from correlation
- **Solution**: Use careful language, acknowledge alternative explanations
### 7. Cherry-picking Results
- **Problem**: Reporting only interesting/significant findings
- **Solution**: Report complete analysis, including negative results
### 8. Ignoring Sample Size
- **Problem**: Not considering how sample size affects conclusions
- **Solution**: Report effect sizes, confidence intervals, and sample sizes
## Domain-Specific Considerations
### Time Series Data
- Check for stationarity
- Identify trends and seasonality
- Look for autocorrelation
- Handle missing time points
- Consider temporal splits for validation (a stationarity-check sketch follows this list)
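
A stationarity-check sketch with the augmented Dickey-Fuller test; a random walk stands in for real data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
y = pd.Series(rng.normal(size=500).cumsum())  # random walk: non-stationary

stat, pvalue, *_ = adfuller(y)
print(f"ADF p-value, levels: {pvalue:.3f}")       # large: cannot reject a unit root
stat, pvalue, *_ = adfuller(y.diff().dropna())
print(f"ADF p-value, differenced: {pvalue:.3f}")  # small: looks stationary
```
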
### High-Dimensional Data
- Start with dimensionality reduction
- Focus on feature importance
- Be cautious of the curse of dimensionality
- Use regularization in modeling
- Consider domain knowledge for feature selection
### Imbalanced Data
- Report class distributions
- Use appropriate metrics (not just accuracy)
- Consider resampling techniques
- Stratify sampling and cross-validation
- Be aware that imbalance biases many learners toward the majority class (see the sketch below)
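
A sketch of reporting the class distribution and preserving it with a stratified split; the ~5% positive rate is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(1_000, 3))
y = (rng.uniform(size=1_000) < 0.05).astype(int)  # ~5% positive class

print("Positive rate:", y.mean())
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("Train/test positive rates:", y_tr.mean(), y_te.mean())
```
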
### Small Sample Sizes
- Use non-parametric methods
- Be conservative with conclusions
- Report confidence intervals (see the bootstrap sketch below)
- Consider Bayesian approaches
- Acknowledge limitations
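
A non-parametric bootstrap sketch for a confidence interval on a small sample:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, size=15)  # a small sample

# Bootstrap 95% CI for the mean: resample with replacement many times.
boots = [rng.choice(x, size=x.size, replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"mean={x.mean():.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```
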
### Big Data
- Sample intelligently for exploration
- Use efficient data structures
- Leverage parallel/distributed computing
- Be aware of computational complexity
- Consider scalability in methods (a sampling sketch follows this list)
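
A memory-bounded sampling sketch; `big.csv` is a placeholder, and chunked reading avoids loading the full table at once:

```python
import pandas as pd

# Draw a reproducible 1% sample, chunk by chunk.
chunks = pd.read_csv("big.csv", chunksize=100_000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=0) for chunk in chunks)
print(sample.shape)
```
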
## Iterative Process
EDA is not linear; iterate and refine:
1. **Initial exploration** → Identify questions
2. **Focused analysis** → Answer specific questions
3. **New insights** → Generate new questions
4. **Deeper investigation** → Refine understanding
5. **Synthesis** → Integrate findings
### When to Stop
You've done enough EDA when:
- ✅ You understand the data structure and quality
- ✅ You've characterized key variables
- ✅ You've identified important relationships
- ✅ You've documented limitations
- ✅ You can answer your research questions
- ✅ You have actionable insights
### Moving Forward
After EDA, you should have:
- Clear understanding of data
- List of quality issues and how to handle them
- Insights about relationships and patterns
- Hypotheses to test
- Ideas for feature engineering
- Recommendations for next steps
## Communication Tips
### For Technical Audiences
- Include methodological details
- Show statistical test results
- Discuss assumptions and limitations
- Provide reproducible code
- Reference relevant literature
### For Non-Technical Audiences
- Focus on insights, not methods
- Use clear visualizations
- Avoid jargon
- Provide context and implications
- Make recommendations concrete
### Report Structure
1. **Executive Summary**: Key findings and recommendations
2. **Data Overview**: Source, structure, limitations
3. **Analysis**: Findings organized by theme
4. **Insights**: Patterns, anomalies, implications
5. **Recommendations**: Next steps and actions
6. **Appendix**: Technical details, full statistics
## Useful Checklists
### Before Starting
- [ ] Understand business/research context
- [ ] Define analysis objectives
- [ ] Identify stakeholders and audience
- [ ] Secure necessary permissions
- [ ] Set up reproducible environment
### During Analysis
- [ ] Load and inspect data structure
- [ ] Assess data quality
- [ ] Analyze univariate distributions
- [ ] Explore bivariate relationships
- [ ] Investigate multivariate patterns
- [ ] Generate and validate insights
- [ ] Document findings continuously
### Before Concluding
- [ ] Verify all findings
- [ ] Check for alternative explanations
- [ ] Document limitations
- [ ] Prepare clear visualizations
- [ ] Write actionable recommendations
- [ ] Review with domain experts
- [ ] Ensure reproducibility
## Tools and Libraries
### Python Ecosystem
- **pandas**: Data manipulation
- **numpy**: Numerical operations
- **matplotlib/seaborn**: Visualization
- **scipy**: Statistical tests
- **scikit-learn**: ML preprocessing
- **plotly**: Interactive visualizations
### Best Tool Practices
- Use the appropriate tool for each task
- Leverage vectorization
- Chain operations efficiently
- Handle missing data properly
- Validate results independently
- Document custom functions
## Further Resources
- **Books**:
- "Exploratory Data Analysis" by John Tukey
- "The Art of Statistics" by David Spiegelhalter
- **Guidelines**:
- ASA Statistical Significance Statement
- FAIR data principles
- **Communities**:
- Cross Validated (Stack Exchange)
- /r/datascience
- Local data science meetups