# ML Conference Writing Style Guide

Comprehensive writing guide for NeurIPS, ICML, ICLR, CVPR, ECCV, ICCV, and other major machine learning and computer vision conferences.

**Last Updated**: 2024

---

## Overview

ML conferences prioritize **novelty**, **rigorous empirical evaluation**, and **reproducibility**. Papers are evaluated on clear contributions, strong baselines, comprehensive ablations, and honest discussion of limitations.

### Key Philosophy

> "Show, don't tell—your experiments should demonstrate your claims, not just your prose."

**Primary Goal**: Advance the state of the art with novel methods validated through rigorous experimentation.

---

## Audience and Tone

### Target Reader

- ML researchers and practitioners
- Experts in the specific subfield
- Familiar with recent literature
- Expect technical depth and precision

### Tone Characteristics

| Characteristic | Description |
|---------------|-------------|
| **Technical** | Dense with methodology details |
| **Precise** | Exact terminology, no ambiguity |
| **Empirical** | Claims backed by experiments |
| **Direct** | State contributions clearly |
| **Honest** | Acknowledge limitations |

### Voice

- **First person plural ("we")**: "We propose..." "Our method..."
- **Active voice**: "We introduce a novel architecture..."
- **Confident but measured**: Strong claims require strong evidence

---

## Abstract

### Style Requirements

- **Dense and numbers-focused**
- **150-250 words** (varies by venue)
- **Key results upfront**: Include specific metrics
- **Flowing paragraph** (no structured subsections)

### Abstract Structure

1. **Problem** (1 sentence): What problem are you solving?
2. **Limitation of existing work** (1 sentence): Why current methods fall short
3. **Your approach** (1-2 sentences): What's your method?
4. **Key results** (2-3 sentences): Specific numbers on benchmarks
5. **Significance** (optional, 1 sentence): Why this matters

### Example Abstract (NeurIPS Style)

```
Transformers have achieved remarkable success in sequence modeling but
suffer from quadratic computational complexity, limiting their application
to long sequences. We introduce FlashAttention-2, an IO-aware exact
attention algorithm that achieves 2x speedup over FlashAttention and up
to 9x speedup over standard attention on sequences up to 16K tokens. Our
key insight is to reduce memory reads/writes by tiling and recomputation,
achieving optimal IO complexity. On the Long Range Arena benchmark,
FlashAttention-2 enables training with 8x longer sequences while matching
standard attention accuracy. Combined with sequence parallelism, we train
GPT-style models on sequences of 64K tokens at near-linear cost. We
release optimized CUDA kernels achieving 80% of theoretical peak FLOPS
on A100 GPUs. Code is available at [anonymous URL].
```

### Abstract Don'ts

❌ "We propose a novel method for X" (vague, no results)
❌ "Our method outperforms baselines" (no specific numbers)
❌ "This is an important problem" (self-evident claims)

✅ Include specific metrics: "achieves 94.5% accuracy, 3.2% improvement"
✅ Include scale: "on 1M samples" or "16K token sequences"
✅ Include comparison: "2x faster than previous SOTA"

---

## Introduction

### Structure (2-3 pages)

ML introductions have a distinctive structure with **numbered contributions**.

### Paragraph-by-Paragraph Guide

**Paragraph 1: Problem Motivation**
- Why is this problem important?
- What are the applications?
- Set up the technical challenge

```
"Large language models have demonstrated remarkable capabilities in
natural language understanding and generation. However, their quadratic
attention complexity presents a fundamental bottleneck for processing
long documents, multi-turn conversations, and reasoning over extended
contexts. As models scale to billions of parameters and context lengths
extend to tens of thousands of tokens, efficient attention mechanisms
become critical for practical deployment."
```

**Paragraph 2: Limitations of Existing Approaches**
- What methods exist?
- Why are they insufficient?
- Technical analysis of limitations

```
"Prior work has addressed this through sparse attention patterns,
linear attention approximations, and low-rank factorizations. While
these methods reduce theoretical complexity, they often sacrifice
accuracy, require specialized hardware, or introduce approximation
errors that compound in deep networks. Exact attention remains
preferable when computational resources permit."
```

**Paragraph 3: Your Approach (High-Level)**
- What's your key insight?
- How does your method work conceptually?
- Why should it succeed?

```
"We observe that the primary bottleneck in attention is not computation
but rather memory bandwidth—reading and writing the large N×N attention
matrix dominates runtime on modern GPUs. We propose FlashAttention-2,
which eliminates this bottleneck through a novel tiling strategy that
computes attention block-by-block without materializing the full matrix."
```

**Paragraph 4: Contribution List (CRITICAL)**

This is **mandatory and distinctive** for ML conferences:

```
Our contributions are as follows:

• We propose FlashAttention-2, an IO-aware exact attention algorithm
  that achieves optimal memory complexity O(N²d/M) where M is GPU
  SRAM size.

• We provide theoretical analysis showing that our algorithm achieves
  2-4x fewer HBM accesses than FlashAttention on typical GPU
  configurations.

• We demonstrate 2x speedup over FlashAttention and up to 9x over
  standard PyTorch attention across sequence lengths from 256 to 64K
  tokens.

• We show that FlashAttention-2 enables training with 8x longer
  contexts on the same hardware, unlocking new capabilities for
  long-range modeling.

• We release optimized CUDA kernels and PyTorch bindings at
  [anonymous URL].
```

### Contribution Bullet Guidelines

| Good Contribution Bullets | Bad Contribution Bullets |
|--------------------------|-------------------------|
| Specific, quantifiable | Vague claims |
| Self-contained | Requires reading paper to understand |
| Distinct from each other | Overlapping bullets |
| Emphasize novelty | State obvious facts |

### Related Work Placement

- **In introduction**: Brief positioning (1-2 paragraphs)
- **Separate section**: Detailed comparison (at end or before conclusion)
- **Appendix**: Extended discussion if space-limited

---

## Method

### Structure (2-3 pages)

```
METHOD
├── Problem Formulation
├── Method Overview / Architecture
├── Key Technical Components
│   ├── Component 1 (with equations)
│   ├── Component 2 (with equations)
│   └── Component 3 (with equations)
├── Theoretical Analysis (if applicable)
└── Implementation Details
```

### Mathematical Notation

- **Define all notation**: "Let X ∈ ℝ^{N×d} denote the input sequence..."
- **Consistent symbols**: Same symbol means same thing throughout
- **Number important equations**: Reference by number later

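For instance, in LaTeX source (a minimal sketch of the convention; the attention equation is the standard one, and `\eqref` assumes the amsmath package):

```
% Define notation once, then refer back by equation number.
Let $X \in \mathbb{R}^{N \times d}$ denote the input sequence and let
$W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ be learned projections.
We compute attention as
\begin{equation}
  \mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,
  \qquad Q = XW_Q, \quad K = XW_K, \quad V = XW_V.
  \label{eq:attention}
\end{equation}
% Later sections can then cite Eq.~\eqref{eq:attention} by number.
```
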
### Algorithm Pseudocode

Include clear pseudocode for reproducibility:

```
Algorithm 1: FlashAttention-2 Forward Pass
─────────────────────────────────────────
Input: Q, K, V ∈ ℝ^{N×d}, block sizes B_r, B_c
Output: O ∈ ℝ^{N×d}

1:  Divide Q into T_r = ⌈N/B_r⌉ blocks
2:  Divide K, V into T_c = ⌈N/B_c⌉ blocks
3:  Initialize O = 0, ℓ = 0, m = -∞
4:  for i = 1 to T_r do
5:    Load Q_i from HBM to SRAM
6:    for j = 1 to T_c do
7:      Load K_j, V_j from HBM to SRAM
8:      Compute S_ij = Q_i K_j^T
9:      Update running max and sum
10:     Update O_i incrementally
11:   end for
12:   Write O_i to HBM
13: end for
14: return O
```

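To make steps 9-10 concrete, here is a minimal NumPy sketch of the same tiling-plus-online-softmax idea (an illustration only; the names are ours, and real implementations fuse this into a single GPU kernel):

```
import numpy as np

def blocked_attention(Q, K, V, B_r=64, B_c=64):
    """Tiled exact attention with an online softmax (cf. Algorithm 1).

    Never materializes the full N×N score matrix: each query block
    keeps a running max m, running denominator l, and an unnormalized
    output accumulator.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for i in range(0, N, B_r):                 # loop over query blocks
        Q_i = Q[i:i + B_r]
        m = np.full(len(Q_i), -np.inf)         # running row-wise max
        l = np.zeros(len(Q_i))                 # running softmax denominator
        acc = np.zeros((len(Q_i), d))          # unnormalized output block
        for j in range(0, N, B_c):             # loop over key/value blocks
            K_j, V_j = K[j:j + B_c], V[j:j + B_c]
            S = (Q_i @ K_j.T) * scale          # scores for this tile
            m_new = np.maximum(m, S.max(axis=1))
            alpha = np.exp(m - m_new)          # rescales earlier partials
            P = np.exp(S - m_new[:, None])
            l = alpha * l + P.sum(axis=1)
            acc = alpha[:, None] * acc + P @ V_j
            m = m_new
        O[i:i + B_r] = acc / l[:, None]        # normalize once per block
    return O
```

Because the recurrence only rescales partial sums, the output matches naive softmax attention up to floating-point error.
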
### Architecture Diagrams

- **Clear, publication-quality figures**
- **Label all components**
- **Show data flow with arrows**
- **Use consistent visual language**

---

## Experiments

### Structure (2-3 pages)

```
EXPERIMENTS
├── Experimental Setup
│   ├── Datasets and Benchmarks
│   ├── Baselines
│   ├── Implementation Details
│   └── Evaluation Metrics
├── Main Results
│   └── Table/Figure with primary comparisons
├── Ablation Studies
│   └── Component-wise analysis
├── Analysis
│   ├── Scaling behavior
│   ├── Qualitative examples
│   └── Error analysis
└── Computational Efficiency
```

### Datasets and Benchmarks

- **Use standard benchmarks**: Establish comparability
- **Report dataset statistics**: Size, splits, preprocessing
- **Justify non-standard choices**: If using custom data, explain why

### Baselines

**Critical for acceptance.** Include:
- **Recent SOTA**: Not just old methods
- **Fair comparisons**: Same compute budget, hyperparameter tuning
- **Ablated versions**: Your method without key components
- **Strong baselines**: Don't cherry-pick weak competitors

### Main Results Table

Clear, comprehensive formatting:

```
Table 1: Results on Long Range Arena Benchmark (accuracy %)
──────────────────────────────────────────────────────────
Method         | ListOps | Text | Retrieval | Image | Path | Avg
──────────────────────────────────────────────────────────
Transformer    | 36.4    | 64.3 | 57.5      | 42.4  | 71.4 | 54.4
Performer      | 18.0    | 65.4 | 53.8      | 42.8  | 77.1 | 51.4
Linear Attn    | 16.1    | 65.9 | 53.1      | 42.3  | 75.3 | 50.5
FlashAttention | 37.1    | 64.5 | 57.8      | 42.7  | 71.2 | 54.7
FlashAttn-2    | 37.4    | 64.7 | 58.2      | 42.9  | 71.8 | 55.0
──────────────────────────────────────────────────────────
```

### Ablation Studies (MANDATORY)

Show what matters in your method:

```
Table 2: Ablation Study on FlashAttention-2 Components
──────────────────────────────────────────────────────
Variant                        | Speedup | Memory
──────────────────────────────────────────────────────
Full FlashAttention-2          | 2.0x    | 1.0x
- without sequence parallelism | 1.7x    | 1.0x
- without recomputation        | 1.3x    | 2.4x
- without block tiling         | 1.0x    | 4.0x
FlashAttention-1 (baseline)    | 1.0x    | 1.0x
──────────────────────────────────────────────────────
```

### What Ablations Should Show

- **Each component matters**: Removing it hurts performance
- **Design choices justified**: Why this architecture/hyperparameter?
- **Failure modes**: When does the method not work?
- **Sensitivity analysis**: Robustness to hyperparameters

---

## Related Work

### Placement Options

1. **After Introduction**: Common in CV papers
2. **Before Conclusion**: Common in NeurIPS/ICML
3. **Appendix**: When space is tight

### Writing Style

- **Organized by theme**: Not chronological
- **Position your work**: How you differ from each line of work
- **Fair characterization**: Don't misrepresent prior work
- **Recent citations**: Include 2023-2024 papers

### Example Structure

```
**Efficient Attention Mechanisms.** Prior work on efficient attention
falls into three categories: sparse patterns (Beltagy et al., 2020;
Zaheer et al., 2020), linear approximations (Katharopoulos et al., 2020;
Choromanski et al., 2021), and low-rank factorizations (Wang et al.,
2020). Our work differs in that we focus on IO-efficient exact
attention rather than approximations.

**Memory-Efficient Training.** Gradient checkpointing (Chen et al., 2016)
and activation recomputation (Korthikanti et al., 2022) reduce memory
by trading compute. We adopt similar ideas but apply them within the
attention operator itself.
```

---

## Limitations Section

### Why It Matters

**Increasingly required** at NeurIPS, ICML, ICLR. Honest limitations:
- Show scientific maturity
- Guide future work
- Prevent overselling

### What to Include

1. **Method limitations**: When does it fail?
2. **Experimental limitations**: What wasn't tested?
3. **Scope limitations**: What's out of scope?
4. **Computational limitations**: Resource requirements

### Example Limitations Section

```
**Limitations.** While FlashAttention-2 provides substantial speedups,
several limitations remain. First, our implementation is optimized for
NVIDIA GPUs and does not support AMD or other hardware. Second, the
speedup is most pronounced for medium to long sequences; for very short
sequences (<256 tokens), the overhead of our kernel launch dominates.
Third, we focus on dense attention; extending our approach to sparse
attention patterns remains future work. Finally, our theoretical
analysis assumes specific GPU memory hierarchy parameters that may not
hold for future hardware generations.
```

---

## Reproducibility

### Reproducibility Checklist (NeurIPS/ICML)

Most ML conferences require a reproducibility checklist covering:

- [ ] Code availability
- [ ] Dataset availability
- [ ] Hyperparameters specified
- [ ] Random seeds reported
- [ ] Compute requirements stated
- [ ] Number of runs and variance reported
- [ ] Statistical significance tests

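The last two items are easy to script. A minimal sketch (hypothetical numbers; assumes NumPy and SciPy):

```
import numpy as np
from scipy import stats

# Hypothetical per-seed test accuracies for your method and the
# strongest baseline, one entry per seed (0, 1, 2).
ours     = np.array([94.3, 94.6, 94.5])
baseline = np.array([91.2, 91.5, 91.1])

# Mean ± sample standard deviation over seeds.
print(f"ours:     {ours.mean():.1f} ± {ours.std(ddof=1):.1f}")
print(f"baseline: {baseline.mean():.1f} ± {baseline.std(ddof=1):.1f}")

# Paired t-test across matched seeds; report p alongside the table.
t, p = stats.ttest_rel(ours, baseline)
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```
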
### What to Report

**Hyperparameters**:
```
"We train with Adam (β₁=0.9, β₂=0.999, ε=1e-8) and learning rate 3e-4
with linear warmup over 1000 steps and cosine decay. Batch size is 256
across 8 A100 GPUs. We train for 100K steps (approximately 24 hours)."
```

**Random Seeds**:
```
"All experiments are averaged over 3 random seeds (0, 1, 2) with
standard deviation reported in parentheses."
```

**Compute**:
```
"Experiments were conducted on 8 NVIDIA A100-80GB GPUs. Total training
time was approximately 500 GPU-hours."
```

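Reporting seeds presumes you control them. One common recipe for a PyTorch codebase (a sketch; strict determinism may additionally require `torch.use_deterministic_algorithms(True)` and fixed dataloader workers):

```
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG that commonly affects a training run."""
    random.seed(seed)                 # Python stdlib RNG
    np.random.seed(seed)              # NumPy global RNG
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all CUDA devices
    torch.backends.cudnn.deterministic = True  # reproducible conv kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning

# Run once per reported seed, e.g. seeds (0, 1, 2).
for seed in (0, 1, 2):
    set_seed(seed)
    # ... build model, train, evaluate ...
```
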
---

## Figures

### Figure Quality

- **Vector graphics preferred**: PDF, SVG
- **High resolution for rasters**: 300+ dpi
- **Readable at publication size**: Test at actual column width
- **Colorblind-accessible**: Use patterns in addition to color

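These defaults are easy to bake into plotting code. A matplotlib sketch (placeholder data; `tableau-colorblind10` is one of matplotlib's built-in colorblind-safe styles):

```
import matplotlib.pyplot as plt

plt.style.use("tableau-colorblind10")  # colorblind-safe palette

steps = [10, 20, 30, 40]               # placeholder data (training steps, K)
baseline = [60.1, 64.8, 66.9, 67.5]
ours = [62.3, 68.7, 72.4, 74.9]

# Distinct markers and line styles keep series readable in grayscale.
fig, ax = plt.subplots(figsize=(3.25, 2.4))  # ≈ single-column width
ax.plot(steps, baseline, "--o", label="Baseline")
ax.plot(steps, ours, "-s", label="Ours")
ax.set_xlabel("Training steps (K)")
ax.set_ylabel("Accuracy (%)")
ax.legend(frameon=False)
fig.tight_layout()
fig.savefig("learning_curve.pdf")      # vector output, no resolution loss
```
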
### Common Figure Types

1. **Architecture diagram**: Show your method visually
2. **Performance plots**: Learning curves, scaling behavior
3. **Comparison tables**: Main results
4. **Ablation figures**: Component contributions
5. **Qualitative examples**: Input/output samples

### Figure Captions

Write self-contained captions that explain:
- What is shown
- How to read the figure
- The key takeaway

---

## References

### Citation Style

- **Numbered [1]** or **author-year (Smith et al., 2023)**
- Check venue-specific requirements
- Be consistent throughout

### Reference Guidelines

- **Cite recent work**: 2022-2024 papers expected
- **Don't over-cite yourself**: Raises bias concerns
- **Cite arXiv appropriately**: Use the published version when available
- **Include all relevant prior work**: Missing citations hurt reviews

---

## Venue-Specific Notes

Page limits shift from year to year; always confirm against the current call for papers.

### NeurIPS

- **9 pages** main text + unlimited appendix/references
- **Broader Impact** section sometimes required
- **Reproducibility checklist** mandatory
- OpenReview submission, public reviews

### ICML

- **8 pages** main + unlimited appendix/references
- Strong emphasis on **theory + experiments**
- Reproducibility statement encouraged

### ICLR

- **9 pages** main (camera-ready can exceed)
- OpenReview with **public reviews and discussion**
- Author response period is interactive
- Strong emphasis on **novelty and insight**

### CVPR/ICCV/ECCV

- **8 pages** main; extra pages allowed for references only
- **Supplementary video** encouraged
- Heavy emphasis on **visual results**
- Benchmark performance critical

---

## Common Mistakes

1. **Weak baselines**: Not comparing to recent SOTA
2. **Missing ablations**: Not showing component contributions
3. **Overclaiming**: "We solve X" when you only partially address X
4. **Vague contributions**: "We propose a novel method"
5. **Poor reproducibility**: Missing hyperparameters, seeds
6. **Wrong template**: Using last year's style file
7. **Anonymity violations**: Revealing author identity in blind review
8. **Missing limitations**: Not acknowledging failure modes

---

## Rebuttal Tips

ML conferences have author response periods. Tips:
- **Address key concerns first**: Prioritize critical issues
- **Run requested experiments**: When feasible in the time allowed
- **Be concise**: Reviewers read many rebuttals
- **Stay professional**: Even with unfair reviews
- **Reference specific lines**: "As stated in L127..."

---

## Pre-Submission Checklist

### Content
- [ ] Clear problem motivation
- [ ] Explicit contribution list
- [ ] Complete method description
- [ ] Comprehensive experiments
- [ ] Strong baselines included
- [ ] Ablation studies present
- [ ] Limitations acknowledged

### Technical
- [ ] Correct venue style file (current year)
- [ ] Anonymized (no author names, no identifiable URLs)
- [ ] Page limit respected
- [ ] References complete
- [ ] Supplementary organized

### Reproducibility
- [ ] Hyperparameters listed
- [ ] Random seeds specified
- [ ] Compute requirements stated
- [ ] Code/data availability noted
- [ ] Reproducibility checklist completed

---

## See Also

- `venue_writing_styles.md` - Master style overview
- `conferences_formatting.md` - Technical formatting requirements
- `reviewer_expectations.md` - What ML reviewers seek