ML Conference Writing Style Guide

Comprehensive writing guide for NeurIPS, ICML, ICLR, CVPR, ECCV, ICCV, and other major machine learning and computer vision conferences.

Last Updated: 2024


Overview

ML conferences prioritize novelty, rigorous empirical evaluation, and reproducibility. Papers are evaluated on clear contribution, strong baselines, comprehensive ablations, and honest discussion of limitations.

Key Philosophy

"Show don't tell—your experiments should demonstrate your claims, not just your prose."

Primary Goal: Advance the state of the art with novel methods validated through rigorous experimentation.


Audience and Tone

Target Reader

  • ML researchers and practitioners
  • Experts in the specific subfield
  • Familiar with recent literature
  • Expect technical depth and precision

Tone Characteristics

  • Technical: Dense with methodology details
  • Precise: Exact terminology, no ambiguity
  • Empirical: Claims backed by experiments
  • Direct: State contributions clearly
  • Honest: Acknowledge limitations

Voice

  • First person plural ("we"): "We propose..." "Our method..."
  • Active voice: "We introduce a novel architecture..."
  • Confident but measured: Strong claims require strong evidence

Abstract

Style Requirements

  • Dense and numbers-focused
  • 150-250 words (varies by venue)
  • Key results upfront: Include specific metrics
  • Flowing paragraph (not structured)

Abstract Structure

  1. Problem (1 sentence): What problem are you solving?
  2. Limitation of existing work (1 sentence): Why current methods fall short
  3. Your approach (1-2 sentences): What's your method?
  4. Key results (2-3 sentences): Specific numbers on benchmarks
  5. Significance (optional, 1 sentence): Why this matters

Example Abstract (NeurIPS Style)

Transformers have achieved remarkable success in sequence modeling but 
suffer from quadratic computational complexity, limiting their application 
to long sequences. We introduce FlashAttention-2, an IO-aware exact 
attention algorithm that achieves 2x speedup over FlashAttention and up 
to 9x speedup over standard attention on sequences up to 16K tokens. Our 
key insight is to reduce memory reads/writes by tiling and recomputation, 
achieving optimal IO complexity. On the Long Range Arena benchmark, 
FlashAttention-2 enables training with 8x longer sequences while matching 
standard attention accuracy. Combined with sequence parallelism, we train 
GPT-style models on sequences of 64K tokens at near-linear cost. We 
release optimized CUDA kernels achieving 80% of theoretical peak FLOPS 
on A100 GPUs. Code is available at [anonymous URL].

Abstract Don'ts

"We propose a novel method for X" (vague, no results) "Our method outperforms baselines" (no specific numbers) "This is an important problem" (self-evident claims)

Include specific metrics: "achieves 94.5% accuracy, 3.2% improvement" Include scale: "on 1M samples" or "16K token sequences" Include comparison: "2x faster than previous SOTA"


Introduction

Structure (2-3 pages)

ML introductions have a distinctive structure that ends with an explicit list of contributions.

Paragraph-by-Paragraph Guide

Paragraph 1: Problem Motivation

  • Why is this problem important?
  • What are the applications?
  • Set up the technical challenge
"Large language models have demonstrated remarkable capabilities in 
natural language understanding and generation. However, their quadratic 
attention complexity presents a fundamental bottleneck for processing 
long documents, multi-turn conversations, and reasoning over extended 
contexts. As models scale to billions of parameters and context lengths 
extend to tens of thousands of tokens, efficient attention mechanisms 
become critical for practical deployment."

Paragraph 2: Limitations of Existing Approaches

  • What methods exist?
  • Why are they insufficient?
  • Technical analysis of limitations
"Prior work has addressed this through sparse attention patterns, 
linear attention approximations, and low-rank factorizations. While 
these methods reduce theoretical complexity, they often sacrifice 
accuracy, require specialized hardware, or introduce approximation 
errors that compound in deep networks. Exact attention remains 
preferable when computational resources permit."

Paragraph 3: Your Approach (High-Level)

  • What's your key insight?
  • How does your method work conceptually?
  • Why should it succeed?
"We observe that the primary bottleneck in attention is not computation 
but rather memory bandwidth—reading and writing the large N×N attention 
matrix dominates runtime on modern GPUs. We propose FlashAttention-2, 
which eliminates this bottleneck through a novel tiling strategy that 
computes attention block-by-block without materializing the full matrix."

Paragraph 4: Contribution List (CRITICAL)

This is mandatory and distinctive for ML conferences:

Our contributions are as follows:

• We propose FlashAttention-2, an IO-aware exact attention algorithm 
  that achieves optimal memory complexity O(N²d/M) where M is GPU 
  SRAM size.

• We provide theoretical analysis showing that our algorithm achieves 
  2-4x fewer HBM accesses than FlashAttention on typical GPU 
  configurations.

• We demonstrate 2x speedup over FlashAttention and up to 9x over 
  standard PyTorch attention across sequence lengths from 256 to 64K 
  tokens.

• We show that FlashAttention-2 enables training with 8x longer 
  contexts on the same hardware, unlocking new capabilities for 
  long-range modeling.

• We release optimized CUDA kernels and PyTorch bindings at 
  [anonymous URL].

Contribution Bullet Guidelines

Good contribution bullets:

  • Specific and quantifiable
  • Self-contained
  • Distinct from each other
  • Emphasize novelty

Bad contribution bullets:

  • Vague claims
  • Require reading the paper to understand
  • Overlapping bullets
  • State obvious facts

Related Work Placement

  • In introduction: Brief positioning (1-2 paragraphs)
  • Separate section: Detailed comparison (at end or before conclusion)
  • Appendix: Extended discussion if space-limited

Method

Structure (2-3 pages)

METHOD
├── Problem Formulation
├── Method Overview / Architecture
├── Key Technical Components
│   ├── Component 1 (with equations)
│   ├── Component 2 (with equations)
│   └── Component 3 (with equations)
├── Theoretical Analysis (if applicable)
└── Implementation Details

Mathematical Notation

  • Define all notation: "Let X ∈ ℝ^{N×d} denote the input sequence..."
  • Consistent symbols: Same symbol means same thing throughout
  • Number important equations: Reference by number later (see the LaTeX sketch below)
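
Where an equation is central, it helps to show both practices at once. A minimal LaTeX sketch (assuming amsmath/amssymb in the preamble; the label name is illustrative):

```latex
% In the preamble: \usepackage{amsmath, amssymb}
Let $X \in \mathbb{R}^{N \times d}$ denote the input sequence of $N$ tokens
with hidden dimension $d$. Attention is computed as
\begin{equation}
  \operatorname{Attention}(Q, K, V)
    = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,
  \label{eq:attention}
\end{equation}
where $Q, K, V \in \mathbb{R}^{N \times d}$. Later text can then point back
to Eq.~\eqref{eq:attention} by number.
```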

Algorithm Pseudocode

Include clear pseudocode for reproducibility:

Algorithm 1: FlashAttention-2 Forward Pass
─────────────────────────────────────────
Input: Q, K, V ∈ ℝ^{N×d}, block sizes B_r, B_c
Output: O ∈ ℝ^{N×d}

1:  Divide Q into T_r = ⌈N/B_r⌉ blocks
2:  Divide K, V into T_c = ⌈N/B_c⌉ blocks
3:  Initialize O = 0, ℓ = 0, m = -∞
4:  for i = 1 to T_r do
5:    Load Q_i from HBM to SRAM
6:    for j = 1 to T_c do
7:      Load K_j, V_j from HBM to SRAM
8:      Compute S_ij = Q_i K_j^T
9:      Update running max and sum
10:     Update O_i incrementally
11:   end for
12:   Write O_i to HBM
13: end for
14: return O
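
To make the control flow concrete, here is a minimal NumPy sketch of the block-wise online-softmax pattern Algorithm 1 describes. The function name and default block sizes are illustrative, and this is a readability aid rather than the optimized kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, block_r=64, block_c=64):
    """Exact attention computed block-by-block with an online softmax,
    so the full N x N score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for i in range(0, N, block_r):
        Qi = Q[i:i + block_r]                     # one query block (SRAM in Alg. 1)
        m = np.full(len(Qi), -np.inf)             # running row-wise max
        l = np.zeros(len(Qi))                     # running softmax denominator (ℓ)
        Oi = np.zeros_like(Qi)                    # unnormalized output accumulator
        for j in range(0, N, block_c):
            Kj, Vj = K[j:j + block_c], V[j:j + block_c]
            S = Qi @ Kj.T * scale                 # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))  # update running max
            P = np.exp(S - m_new[:, None])        # tile-local softmax numerator
            alpha = np.exp(m - m_new)             # rescales previously accumulated terms
            l = l * alpha + P.sum(axis=1)
            Oi = Oi * alpha[:, None] + P @ Vj
            m = m_new
        O[i:i + block_r] = Oi / l[:, None]        # write normalized block (HBM in Alg. 1)
    return O
```

On small inputs its output should match naive attention, softmax(QKᵀ/√d)V, computed with the full score matrix.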

Architecture Diagrams

  • Clear, publication-quality figures
  • Label all components
  • Show data flow with arrows
  • Use consistent visual language

Experiments

Structure (2-3 pages)

EXPERIMENTS
├── Experimental Setup
│   ├── Datasets and Benchmarks
│   ├── Baselines
│   ├── Implementation Details
│   └── Evaluation Metrics
├── Main Results
│   └── Table/Figure with primary comparisons
├── Ablation Studies
│   └── Component-wise analysis
├── Analysis
│   ├── Scaling behavior
│   ├── Qualitative examples
│   └── Error analysis
└── Computational Efficiency

Datasets and Benchmarks

  • Use standard benchmarks: Establish comparability
  • Report dataset statistics: Size, splits, preprocessing
  • Justify non-standard choices: If using custom data, explain why

Baselines

Critical for acceptance. Include:

  • Recent SOTA: Not just old methods
  • Fair comparisons: Same compute budget, hyperparameter tuning
  • Ablated versions: Your method without key components
  • Strong baselines: Don't cherry-pick weak competitors

Main Results Table

Clear, comprehensive formatting:

Table 1: Results on Long Range Arena Benchmark (accuracy %)
──────────────────────────────────────────────────────────
Method          | ListOps | Text  | Retrieval | Image | Path  | Avg
──────────────────────────────────────────────────────────
Transformer     |  36.4   | 64.3  |   57.5    | 42.4  | 71.4  | 54.4
Performer       |  18.0   | 65.4  |   53.8    | 42.8  | 77.1  | 51.4
Linear Attn     |  16.1   | 65.9  |   53.1    | 42.3  | 75.3  | 50.5
FlashAttention  |  37.1   | 64.5  |   57.8    | 42.7  | 71.2  | 54.7
FlashAttn-2     |  37.4   | 64.7  |   58.2    | 42.9  | 71.8  | 55.0
──────────────────────────────────────────────────────────
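
Summary columns are easy to get wrong by hand; a small pandas sketch that rebuilds Table 1 and derives the Avg column programmatically:

```python
import pandas as pd

# Scores from Table 1; the Avg column is computed rather than hand-entered.
results = pd.DataFrame(
    {
        "ListOps":   [36.4, 18.0, 16.1, 37.1, 37.4],
        "Text":      [64.3, 65.4, 65.9, 64.5, 64.7],
        "Retrieval": [57.5, 53.8, 53.1, 57.8, 58.2],
        "Image":     [42.4, 42.8, 42.3, 42.7, 42.9],
        "Path":      [71.4, 77.1, 75.3, 71.2, 71.8],
    },
    index=["Transformer", "Performer", "Linear Attn",
           "FlashAttention", "FlashAttn-2"],
)
results["Avg"] = results.mean(axis=1).round(1)
print(results.to_string())
```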

Ablation Studies (MANDATORY)

Show what matters in your method:

Table 2: Ablation Study on FlashAttention-2 Components
──────────────────────────────────────────────────────
Variant                              | Speedup | Memory
──────────────────────────────────────────────────────
Full FlashAttention-2                |   2.0x  |  1.0x
  - without sequence parallelism     |   1.7x  |  1.0x
  - without recomputation            |   1.3x  |  2.4x
  - without block tiling             |   1.0x  |  4.0x
FlashAttention-1 (baseline)          |   1.0x  |  1.0x
──────────────────────────────────────────────────────

What Ablations Should Show

  • Each component matters: Removing it hurts performance
  • Design choices justified: Why this architecture/hyperparameter?
  • Failure modes: When does the method not work?
  • Sensitivity analysis: Robustness to hyperparameters

Related Work

Placement Options

  1. After Introduction: Common in CV papers
  2. Before Conclusion: Common in NeurIPS/ICML
  3. Appendix: When space is tight

Writing Style

  • Organized by theme: Not chronological
  • Position your work: How you differ from each line of work
  • Fair characterization: Don't misrepresent prior work
  • Recent citations: Include 2023-2024 papers

Example Structure

**Efficient Attention Mechanisms.** Prior work on efficient attention 
falls into three categories: sparse patterns (Beltagy et al., 2020; 
Zaheer et al., 2020), linear approximations (Katharopoulos et al., 2020; 
Choromanski et al., 2021), and low-rank factorizations (Wang et al., 
2020). Our work differs in that we focus on IO-efficient exact 
attention rather than approximations.

**Memory-Efficient Training.** Gradient checkpointing (Chen et al., 2016) 
and activation recomputation (Korthikanti et al., 2022) reduce memory 
by trading compute. We adopt similar ideas but apply them within the 
attention operator itself.

Limitations Section

Why It Matters

Increasingly required at NeurIPS, ICML, ICLR. Honest limitations:

  • Show scientific maturity
  • Guide future work
  • Prevent overselling

What to Include

  1. Method limitations: When does it fail?
  2. Experimental limitations: What wasn't tested?
  3. Scope limitations: What's out of scope?
  4. Computational limitations: Resource requirements

Example Limitations Section

**Limitations.** While FlashAttention-2 provides substantial speedups, 
several limitations remain. First, our implementation is optimized for 
NVIDIA GPUs and does not support AMD or other hardware. Second, the 
speedup is most pronounced for medium to long sequences; for very short 
sequences (<256 tokens), the overhead of our kernel launch dominates. 
Third, we focus on dense attention; extending our approach to sparse 
attention patterns remains future work. Finally, our theoretical 
analysis assumes specific GPU memory hierarchy parameters that may not 
hold for future hardware generations.

Reproducibility

Reproducibility Checklist (NeurIPS/ICML)

Most ML conferences require a reproducibility checklist covering:

  • Code availability
  • Dataset availability
  • Hyperparameters specified
  • Random seeds reported
  • Compute requirements stated
  • Number of runs and variance reported
  • Statistical significance tests

What to Report

Hyperparameters:

"We train with Adam (β₁=0.9, β₂=0.999, ε=1e-8) and learning rate 3e-4 
with linear warmup over 1000 steps and cosine decay. Batch size is 256 
across 8 A100 GPUs. We train for 100K steps (approximately 24 hours)."
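
A hedged PyTorch sketch matching that description; the model and step counts are placeholders:

```python
import math
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)  # placeholder for the actual model

WARMUP, TOTAL = 1_000, 100_000

def warmup_cosine(step: int) -> float:
    """Linear warmup over WARMUP steps, then cosine decay to zero."""
    if step < WARMUP:
        return step / WARMUP
    progress = (step - WARMUP) / max(1, TOTAL - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

# In the training loop: optimizer.step(); scheduler.step()
```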

Random Seeds:

"All experiments are averaged over 3 random seeds (0, 1, 2) with 
standard deviation reported in parentheses."
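
A minimal sketch of what "seed everything, then aggregate" can look like, assuming PyTorch; `train_and_evaluate` is a hypothetical entry point:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG that can affect a run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

scores = []
for seed in (0, 1, 2):
    set_seed(seed)
    scores.append(train_and_evaluate())  # hypothetical training/eval entry point
print(f"{np.mean(scores):.1f} ({np.std(scores):.1f})")  # mean (std), as reported
```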

Compute:

"Experiments were conducted on 8 NVIDIA A100-80GB GPUs. Total training 
time was approximately 500 GPU-hours."

Figures

Figure Quality

  • Vector graphics preferred: PDF, SVG
  • High resolution for rasters: 300+ dpi
  • Readable at publication size: Test at actual column width
  • Colorblind-accessible: Use patterns in addition to color (see the sketch below)
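
A hedged matplotlib sketch of these defaults; the figure size and curves are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 8,        # stays readable at single-column width
    "pdf.fonttype": 42,    # embed TrueType fonts; some venues require this
})

steps = np.arange(0, 100_000, 1_000)
baseline = 0.80 - 0.30 * np.exp(-steps / 2.0e4)   # placeholder curves
ours = 0.85 - 0.30 * np.exp(-steps / 1.5e4)

fig, ax = plt.subplots(figsize=(3.25, 2.2))        # approx. one column wide
ax.plot(steps, baseline, linestyle="--", label="Baseline")  # vary linestyle so the
ax.plot(steps, ours, linestyle="-", label="Ours")           # plot reads without color
ax.set_xlabel("Training steps")
ax.set_ylabel("Accuracy")
ax.legend()
fig.savefig("figure1.pdf", bbox_inches="tight")           # vector for publication
fig.savefig("figure1.png", dpi=300, bbox_inches="tight")  # 300+ dpi raster fallback
```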

Common Figure Types

  1. Architecture diagram: Show your method visually
  2. Performance plots: Learning curves, scaling behavior
  3. Comparison tables: Main results
  4. Ablation figures: Component contributions
  5. Qualitative examples: Input/output samples

Figure Captions

Self-contained captions that explain:

  • What is shown
  • How to read the figure
  • Key takeaway

References

Citation Style

  • Numbered [1] or author-year (Smith et al., 2023)
  • Check venue-specific requirements
  • Be consistent throughout

Reference Guidelines

  • Cite recent work: 2022-2024 papers expected
  • Don't over-cite yourself: Raises bias concerns
  • Cite arXiv appropriately: Use published version when available
  • Include all relevant prior work: Missing citations hurt review

Venue-Specific Notes

NeurIPS

  • 9 pages main + unlimited appendix/references
  • Broader Impact section sometimes required
  • Reproducibility checklist mandatory
  • OpenReview submission, public reviews

ICML

  • 8 pages main + unlimited appendix/references
  • Strong emphasis on theory + experiments
  • Reproducibility statement encouraged

ICLR

  • 9 pages main (camera-ready can exceed)
  • OpenReview with public reviews and discussion
  • Author response period is interactive
  • Strong emphasis on novelty and insight

CVPR/ICCV/ECCV

  • 8 pages main; extra pages allowed for references only
  • Supplementary video encouraged
  • Heavy emphasis on visual results
  • Benchmark performance critical

Common Mistakes

  1. Weak baselines: Not comparing to recent SOTA
  2. Missing ablations: Not showing component contributions
  3. Overclaiming: "We solve X" when you partially address X
  4. Vague contributions: "We propose a novel method"
  5. Poor reproducibility: Missing hyperparameters, seeds
  6. Wrong template: Using last year's style file
  7. Anonymous violations: Revealing identity in blind review
  8. Missing limitations: Not acknowledging failure modes

Rebuttal Tips

ML conferences have author response periods. Tips:

  • Address key concerns first: Prioritize critical issues
  • Run requested experiments: When feasible in time
  • Be concise: Reviewers read many rebuttals
  • Stay professional: Even with unfair reviews
  • Reference specific lines: "As stated in L127..."

Pre-Submission Checklist

Content

  • Clear problem motivation
  • Explicit contribution list
  • Complete method description
  • Comprehensive experiments
  • Strong baselines included
  • Ablation studies present
  • Limitations acknowledged

Technical

  • Correct venue style file (current year)
  • Anonymized (no author names, no identifiable URLs)
  • Page limit respected
  • References complete
  • Supplementary organized

Reproducibility

  • Hyperparameters listed
  • Random seeds specified
  • Compute requirements stated
  • Code/data availability noted
  • Reproducibility checklist completed

See Also

  • venue_writing_styles.md - Master style overview
  • conferences_formatting.md - Technical formatting requirements
  • reviewer_expectations.md - What ML reviewers seek