557 lines
17 KiB
Markdown
557 lines
17 KiB
Markdown
# ML Conference Writing Style Guide
|
||
|
||
Comprehensive writing guide for NeurIPS, ICML, ICLR, CVPR, ECCV, ICCV, and other major machine learning and computer vision conferences.
|
||
|
||
**Last Updated**: 2024
|
||
|
||
---
|
||
|
||
## Overview
|
||
|
||
ML conferences prioritize **novelty**, **rigorous empirical evaluation**, and **reproducibility**. Papers are evaluated on clear contribution, strong baselines, comprehensive ablations, and honest discussion of limitations.
|
||
|
||
### Key Philosophy
|
||
|
||
> "Show don't tell—your experiments should demonstrate your claims, not just your prose."
|
||
|
||
**Primary Goal**: Advance the state of the art with novel methods validated through rigorous experimentation.
|
||
|
||
---
|
||
|
||
## Audience and Tone
|
||
|
||
### Target Reader
|
||
|
||
- ML researchers and practitioners
|
||
- Experts in the specific subfield
|
||
- Familiar with recent literature
|
||
- Expect technical depth and precision
|
||
|
||
### Tone Characteristics
|
||
|
||
| Characteristic | Description |
|
||
|---------------|-------------|
|
||
| **Technical** | Dense with methodology details |
|
||
| **Precise** | Exact terminology, no ambiguity |
|
||
| **Empirical** | Claims backed by experiments |
|
||
| **Direct** | State contributions clearly |
|
||
| **Honest** | Acknowledge limitations |
|
||
|
||
### Voice
|
||
|
||
- **First person plural ("we")**: "We propose..." "Our method..."
|
||
- **Active voice**: "We introduce a novel architecture..."
|
||
- **Confident but measured**: Strong claims require strong evidence
|
||
|
||
---
|
||
|
||
## Abstract
|
||
|
||
### Style Requirements
|
||
|
||
- **Dense and numbers-focused**
|
||
- **150-250 words** (varies by venue)
|
||
- **Key results upfront**: Include specific metrics
|
||
- **Flowing paragraph** (not structured)
|
||
|
||
### Abstract Structure
|
||
|
||
1. **Problem** (1 sentence): What problem are you solving?
|
||
2. **Limitation of existing work** (1 sentence): Why current methods fall short
|
||
3. **Your approach** (1-2 sentences): What's your method?
|
||
4. **Key results** (2-3 sentences): Specific numbers on benchmarks
|
||
5. **Significance** (optional, 1 sentence): Why this matters
|
||
|
||
### Example Abstract (NeurIPS Style)
|
||
|
||
```
|
||
Transformers have achieved remarkable success in sequence modeling but
|
||
suffer from quadratic computational complexity, limiting their application
|
||
to long sequences. We introduce FlashAttention-2, an IO-aware exact
|
||
attention algorithm that achieves 2x speedup over FlashAttention and up
|
||
to 9x speedup over standard attention on sequences up to 16K tokens. Our
|
||
key insight is to reduce memory reads/writes by tiling and recomputation,
|
||
achieving optimal IO complexity. On the Long Range Arena benchmark,
|
||
FlashAttention-2 enables training with 8x longer sequences while matching
|
||
standard attention accuracy. Combined with sequence parallelism, we train
|
||
GPT-style models on sequences of 64K tokens at near-linear cost. We
|
||
release optimized CUDA kernels achieving 80% of theoretical peak FLOPS
|
||
on A100 GPUs. Code is available at [anonymous URL].
|
||
```
|
||
|
||
### Abstract Don'ts
|
||
|
||
❌ "We propose a novel method for X" (vague, no results)
|
||
❌ "Our method outperforms baselines" (no specific numbers)
|
||
❌ "This is an important problem" (self-evident claims)
|
||
|
||
✅ Include specific metrics: "achieves 94.5% accuracy, 3.2% improvement"
|
||
✅ Include scale: "on 1M samples" or "16K token sequences"
|
||
✅ Include comparison: "2x faster than previous SOTA"
|
||
|
||
---
|
||
|
||
## Introduction
|
||
|
||
### Structure (2-3 pages)
|
||
|
||
ML introductions have a distinctive structure with **numbered contributions**.
|
||
|
||
### Paragraph-by-Paragraph Guide
|
||
|
||
**Paragraph 1: Problem Motivation**
|
||
- Why is this problem important?
|
||
- What are the applications?
|
||
- Set up the technical challenge
|
||
|
||
```
|
||
"Large language models have demonstrated remarkable capabilities in
|
||
natural language understanding and generation. However, their quadratic
|
||
attention complexity presents a fundamental bottleneck for processing
|
||
long documents, multi-turn conversations, and reasoning over extended
|
||
contexts. As models scale to billions of parameters and context lengths
|
||
extend to tens of thousands of tokens, efficient attention mechanisms
|
||
become critical for practical deployment."
|
||
```
|
||
|
||
**Paragraph 2: Limitations of Existing Approaches**
|
||
- What methods exist?
|
||
- Why are they insufficient?
|
||
- Technical analysis of limitations
|
||
|
||
```
|
||
"Prior work has addressed this through sparse attention patterns,
|
||
linear attention approximations, and low-rank factorizations. While
|
||
these methods reduce theoretical complexity, they often sacrifice
|
||
accuracy, require specialized hardware, or introduce approximation
|
||
errors that compound in deep networks. Exact attention remains
|
||
preferable when computational resources permit."
|
||
```
|
||
|
||
**Paragraph 3: Your Approach (High-Level)**
|
||
- What's your key insight?
|
||
- How does your method work conceptually?
|
||
- Why should it succeed?
|
||
|
||
```
|
||
"We observe that the primary bottleneck in attention is not computation
|
||
but rather memory bandwidth—reading and writing the large N×N attention
|
||
matrix dominates runtime on modern GPUs. We propose FlashAttention-2,
|
||
which eliminates this bottleneck through a novel tiling strategy that
|
||
computes attention block-by-block without materializing the full matrix."
|
||
```
|
||
|
||
**Paragraph 4: Contribution List (CRITICAL)**
|
||
|
||
This is **mandatory and distinctive** for ML conferences:
|
||
|
||
```
|
||
Our contributions are as follows:
|
||
|
||
• We propose FlashAttention-2, an IO-aware exact attention algorithm
|
||
that achieves optimal memory complexity O(N²d/M) where M is GPU
|
||
SRAM size.
|
||
|
||
• We provide theoretical analysis showing that our algorithm achieves
|
||
2-4x fewer HBM accesses than FlashAttention on typical GPU
|
||
configurations.
|
||
|
||
• We demonstrate 2x speedup over FlashAttention and up to 9x over
|
||
standard PyTorch attention across sequence lengths from 256 to 64K
|
||
tokens.
|
||
|
||
• We show that FlashAttention-2 enables training with 8x longer
|
||
contexts on the same hardware, unlocking new capabilities for
|
||
long-range modeling.
|
||
|
||
• We release optimized CUDA kernels and PyTorch bindings at
|
||
[anonymous URL].
|
||
```
|
||
|
||
### Contribution Bullet Guidelines
|
||
|
||
| Good Contribution Bullets | Bad Contribution Bullets |
|
||
|--------------------------|-------------------------|
|
||
| Specific, quantifiable | Vague claims |
|
||
| Self-contained | Requires reading paper to understand |
|
||
| Distinct from each other | Overlapping bullets |
|
||
| Emphasize novelty | State obvious facts |
|
||
|
||
### Related Work Placement
|
||
|
||
- **In introduction**: Brief positioning (1-2 paragraphs)
|
||
- **Separate section**: Detailed comparison (at end or before conclusion)
|
||
- **Appendix**: Extended discussion if space-limited
|
||
|
||
---
|
||
|
||
## Method
|
||
|
||
### Structure (2-3 pages)
|
||
|
||
```
|
||
METHOD
|
||
├── Problem Formulation
|
||
├── Method Overview / Architecture
|
||
├── Key Technical Components
|
||
│ ├── Component 1 (with equations)
|
||
│ ├── Component 2 (with equations)
|
||
│ └── Component 3 (with equations)
|
||
├── Theoretical Analysis (if applicable)
|
||
└── Implementation Details
|
||
```
|
||
|
||
### Mathematical Notation
|
||
|
||
- **Define all notation**: "Let X ∈ ℝ^{N×d} denote the input sequence..."
|
||
- **Consistent symbols**: Same symbol means same thing throughout
|
||
- **Number important equations**: Reference by number later
|
||
|
||
### Algorithm Pseudocode
|
||
|
||
Include clear pseudocode for reproducibility:
|
||
|
||
```
|
||
Algorithm 1: FlashAttention-2 Forward Pass
|
||
─────────────────────────────────────────
|
||
Input: Q, K, V ∈ ℝ^{N×d}, block size B_r, B_c
|
||
Output: O ∈ ℝ^{N×d}
|
||
|
||
1: Divide Q into T_r = ⌈N/B_r⌉ blocks
|
||
2: Divide K, V into T_c = ⌈N/B_c⌉ blocks
|
||
3: Initialize O = 0, ℓ = 0, m = -∞
|
||
4: for i = 1 to T_r do
|
||
5: Load Q_i from HBM to SRAM
|
||
6: for j = 1 to T_c do
|
||
7: Load K_j, V_j from HBM to SRAM
|
||
8: Compute S_ij = Q_i K_j^T
|
||
9: Update running max and sum
|
||
10: Update O_i incrementally
|
||
11: end for
|
||
12: Write O_i to HBM
|
||
13: end for
|
||
14: return O
|
||
```
|
||
|
||
### Architecture Diagrams
|
||
|
||
- **Clear, publication-quality figures**
|
||
- **Label all components**
|
||
- **Show data flow with arrows**
|
||
- **Use consistent visual language**
|
||
|
||
---
|
||
|
||
## Experiments
|
||
|
||
### Structure (2-3 pages)
|
||
|
||
```
|
||
EXPERIMENTS
|
||
├── Experimental Setup
|
||
│ ├── Datasets and Benchmarks
|
||
│ ├── Baselines
|
||
│ ├── Implementation Details
|
||
│ └── Evaluation Metrics
|
||
├── Main Results
|
||
│ └── Table/Figure with primary comparisons
|
||
├── Ablation Studies
|
||
│ └── Component-wise analysis
|
||
├── Analysis
|
||
│ ├── Scaling behavior
|
||
│ ├── Qualitative examples
|
||
│ └── Error analysis
|
||
└── Computational Efficiency
|
||
```
|
||
|
||
### Datasets and Benchmarks
|
||
|
||
- **Use standard benchmarks**: Establish comparability
|
||
- **Report dataset statistics**: Size, splits, preprocessing
|
||
- **Justify non-standard choices**: If using custom data, explain why
|
||
|
||
### Baselines
|
||
|
||
**Critical for acceptance.** Include:
|
||
- **Recent SOTA**: Not just old methods
|
||
- **Fair comparisons**: Same compute budget, hyperparameter tuning
|
||
- **Ablated versions**: Your method without key components
|
||
- **Strong baselines**: Don't cherry-pick weak competitors
|
||
|
||
### Main Results Table
|
||
|
||
Clear, comprehensive formatting:
|
||
|
||
```
|
||
Table 1: Results on Long Range Arena Benchmark (accuracy %)
|
||
──────────────────────────────────────────────────────────
|
||
Method | ListOps | Text | Retrieval | Image | Path | Avg
|
||
──────────────────────────────────────────────────────────
|
||
Transformer | 36.4 | 64.3 | 57.5 | 42.4 | 71.4 | 54.4
|
||
Performer | 18.0 | 65.4 | 53.8 | 42.8 | 77.1 | 51.4
|
||
Linear Attn | 16.1 | 65.9 | 53.1 | 42.3 | 75.3 | 50.5
|
||
FlashAttention | 37.1 | 64.5 | 57.8 | 42.7 | 71.2 | 54.7
|
||
FlashAttn-2 | 37.4 | 64.7 | 58.2 | 42.9 | 71.8 | 55.0
|
||
──────────────────────────────────────────────────────────
|
||
```
|
||
|
||
### Ablation Studies (MANDATORY)
|
||
|
||
Show what matters in your method:
|
||
|
||
```
|
||
Table 2: Ablation Study on FlashAttention-2 Components
|
||
──────────────────────────────────────────────────────
|
||
Variant | Speedup | Memory
|
||
──────────────────────────────────────────────────────
|
||
Full FlashAttention-2 | 2.0x | 1.0x
|
||
- without sequence parallelism | 1.7x | 1.0x
|
||
- without recomputation | 1.3x | 2.4x
|
||
- without block tiling | 1.0x | 4.0x
|
||
FlashAttention-1 (baseline) | 1.0x | 1.0x
|
||
──────────────────────────────────────────────────────
|
||
```
|
||
|
||
### What Ablations Should Show
|
||
|
||
- **Each component matters**: Removing it hurts performance
|
||
- **Design choices justified**: Why this architecture/hyperparameter?
|
||
- **Failure modes**: When does method not work?
|
||
- **Sensitivity analysis**: Robustness to hyperparameters
|
||
|
||
---
|
||
|
||
## Related Work
|
||
|
||
### Placement Options
|
||
|
||
1. **After Introduction**: Common in CV papers
|
||
2. **Before Conclusion**: Common in NeurIPS/ICML
|
||
3. **Appendix**: When space is tight
|
||
|
||
### Writing Style
|
||
|
||
- **Organized by theme**: Not chronological
|
||
- **Position your work**: How you differ from each line of work
|
||
- **Fair characterization**: Don't misrepresent prior work
|
||
- **Recent citations**: Include 2023-2024 papers
|
||
|
||
### Example Structure
|
||
|
||
```
|
||
**Efficient Attention Mechanisms.** Prior work on efficient attention
|
||
falls into three categories: sparse patterns (Beltagy et al., 2020;
|
||
Zaheer et al., 2020), linear approximations (Katharopoulos et al., 2020;
|
||
Choromanski et al., 2021), and low-rank factorizations (Wang et al.,
|
||
2020). Our work differs in that we focus on IO-efficient exact
|
||
attention rather than approximations.
|
||
|
||
**Memory-Efficient Training.** Gradient checkpointing (Chen et al., 2016)
|
||
and activation recomputation (Korthikanti et al., 2022) reduce memory
|
||
by trading compute. We adopt similar ideas but apply them within the
|
||
attention operator itself.
|
||
```
|
||
|
||
---
|
||
|
||
## Limitations Section
|
||
|
||
### Why It Matters
|
||
|
||
**Increasingly required** at NeurIPS, ICML, ICLR. Honest limitations:
|
||
- Show scientific maturity
|
||
- Guide future work
|
||
- Prevent overselling
|
||
|
||
### What to Include
|
||
|
||
1. **Method limitations**: When does it fail?
|
||
2. **Experimental limitations**: What wasn't tested?
|
||
3. **Scope limitations**: What's out of scope?
|
||
4. **Computational limitations**: Resource requirements
|
||
|
||
### Example Limitations Section
|
||
|
||
```
|
||
**Limitations.** While FlashAttention-2 provides substantial speedups,
|
||
several limitations remain. First, our implementation is optimized for
|
||
NVIDIA GPUs and does not support AMD or other hardware. Second, the
|
||
speedup is most pronounced for medium to long sequences; for very short
|
||
sequences (<256 tokens), the overhead of our kernel launch dominates.
|
||
Third, we focus on dense attention; extending our approach to sparse
|
||
attention patterns remains future work. Finally, our theoretical
|
||
analysis assumes specific GPU memory hierarchy parameters that may not
|
||
hold for future hardware generations.
|
||
```
|
||
|
||
---
|
||
|
||
## Reproducibility
|
||
|
||
### Reproducibility Checklist (NeurIPS/ICML)
|
||
|
||
Most ML conferences require a reproducibility checklist covering:
|
||
|
||
- [ ] Code availability
|
||
- [ ] Dataset availability
|
||
- [ ] Hyperparameters specified
|
||
- [ ] Random seeds reported
|
||
- [ ] Compute requirements stated
|
||
- [ ] Number of runs and variance reported
|
||
- [ ] Statistical significance tests
|
||
|
||
### What to Report
|
||
|
||
**Hyperparameters**:
|
||
```
|
||
"We train with Adam (β₁=0.9, β₂=0.999, ε=1e-8) and learning rate 3e-4
|
||
with linear warmup over 1000 steps and cosine decay. Batch size is 256
|
||
across 8 A100 GPUs. We train for 100K steps (approximately 24 hours)."
|
||
```
|
||
|
||
**Random Seeds**:
|
||
```
|
||
"All experiments are averaged over 3 random seeds (0, 1, 2) with
|
||
standard deviation reported in parentheses."
|
||
```
|
||
|
||
**Compute**:
|
||
```
|
||
"Experiments were conducted on 8 NVIDIA A100-80GB GPUs. Total training
|
||
time was approximately 500 GPU-hours."
|
||
```
|
||
|
||
---
|
||
|
||
## Figures
|
||
|
||
### Figure Quality
|
||
|
||
- **Vector graphics preferred**: PDF, SVG
|
||
- **High resolution for rasters**: 300+ dpi
|
||
- **Readable at publication size**: Test at actual column width
|
||
- **Colorblind-accessible**: Use patterns in addition to color
|
||
|
||
### Common Figure Types
|
||
|
||
1. **Architecture diagram**: Show your method visually
|
||
2. **Performance plots**: Learning curves, scaling behavior
|
||
3. **Comparison tables**: Main results
|
||
4. **Ablation figures**: Component contributions
|
||
5. **Qualitative examples**: Input/output samples
|
||
|
||
### Figure Captions
|
||
|
||
Self-contained captions that explain:
|
||
- What is shown
|
||
- How to read the figure
|
||
- Key takeaway
|
||
|
||
---
|
||
|
||
## References
|
||
|
||
### Citation Style
|
||
|
||
- **Numbered [1]** or **author-year (Smith et al., 2023)**
|
||
- Check venue-specific requirements
|
||
- Be consistent throughout
|
||
|
||
### Reference Guidelines
|
||
|
||
- **Cite recent work**: 2022-2024 papers expected
|
||
- **Don't over-cite yourself**: Raises bias concerns
|
||
- **Cite arxiv appropriately**: Use published version when available
|
||
- **Include all relevant prior work**: Missing citations hurt review
|
||
|
||
---
|
||
|
||
## Venue-Specific Notes
|
||
|
||
### NeurIPS
|
||
|
||
- **8 pages** main + unlimited appendix/references
|
||
- **Broader Impact** section sometimes required
|
||
- **Reproducibility checklist** mandatory
|
||
- OpenReview submission, public reviews
|
||
|
||
### ICML
|
||
|
||
- **8 pages** main + unlimited appendix/references
|
||
- Strong emphasis on **theory + experiments**
|
||
- Reproducibility statement encouraged
|
||
|
||
### ICLR
|
||
|
||
- **8 pages** main (camera-ready can exceed)
|
||
- OpenReview with **public reviews and discussion**
|
||
- Author response period is interactive
|
||
- Strong emphasis on **novelty and insight**
|
||
|
||
### CVPR/ICCV/ECCV
|
||
|
||
- **8 pages** main including references
|
||
- **Supplementary video** encouraged
|
||
- Heavy emphasis on **visual results**
|
||
- Benchmark performance critical
|
||
|
||
---
|
||
|
||
## Common Mistakes
|
||
|
||
1. **Weak baselines**: Not comparing to recent SOTA
|
||
2. **Missing ablations**: Not showing component contributions
|
||
3. **Overclaiming**: "We solve X" when you partially address X
|
||
4. **Vague contributions**: "We propose a novel method"
|
||
5. **Poor reproducibility**: Missing hyperparameters, seeds
|
||
6. **Wrong template**: Using last year's style file
|
||
7. **Anonymous violations**: Revealing identity in blind review
|
||
8. **Missing limitations**: Not acknowledging failure modes
|
||
|
||
---
|
||
|
||
## Rebuttal Tips
|
||
|
||
ML conferences have author response periods. Tips:
|
||
- **Address key concerns first**: Prioritize critical issues
|
||
- **Run requested experiments**: When feasible in time
|
||
- **Be concise**: Reviewers read many rebuttals
|
||
- **Stay professional**: Even with unfair reviews
|
||
- **Reference specific lines**: "As stated in L127..."
|
||
|
||
---
|
||
|
||
## Pre-Submission Checklist
|
||
|
||
### Content
|
||
- [ ] Clear problem motivation
|
||
- [ ] Explicit contribution list
|
||
- [ ] Complete method description
|
||
- [ ] Comprehensive experiments
|
||
- [ ] Strong baselines included
|
||
- [ ] Ablation studies present
|
||
- [ ] Limitations acknowledged
|
||
|
||
### Technical
|
||
- [ ] Correct venue style file (current year)
|
||
- [ ] Anonymized (no author names, no identifiable URLs)
|
||
- [ ] Page limit respected
|
||
- [ ] References complete
|
||
- [ ] Supplementary organized
|
||
|
||
### Reproducibility
|
||
- [ ] Hyperparameters listed
|
||
- [ ] Random seeds specified
|
||
- [ ] Compute requirements stated
|
||
- [ ] Code/data availability noted
|
||
- [ ] Reproducibility checklist completed
|
||
|
||
---
|
||
|
||
## See Also
|
||
|
||
- `venue_writing_styles.md` - Master style overview
|
||
- `conferences_formatting.md` - Technical formatting requirements
|
||
- `reviewer_expectations.md` - What ML reviewers seek
|
||
|