# NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.

---

## Full Introduction Example

**Paper Topic**: Efficient Long-Context Transformers

---

### Paragraph 1: Problem Motivation

```
Large language models (LLMs) have demonstrated remarkable capabilities in
natural language understanding, code generation, and reasoning tasks [1, 2, 3].
These capabilities scale with both model size and context length—longer
contexts enable processing of entire documents, multi-turn conversations,
and complex reasoning chains that span many steps [4, 5]. However, the
standard Transformer attention mechanism [6] has O(N²) time and memory
complexity with respect to sequence length N, creating a fundamental
bottleneck for processing long sequences. For a context window of 100K
tokens, the attention matrix alone contains 10 billion entries and occupies
40 GB of memory in single precision, making training and inference
prohibitively expensive on current hardware.
```

**Key features**:
- States why this matters (LLM capabilities)
- Connects to scaling (longer contexts = better performance)
- Specific numbers (O(N²), 100K tokens, 10 billion entries, 40 GB); the arithmetic is checked in the sketch below
- Citations to establish credibility
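
The specific numbers in the example are easy to sanity-check. A minimal Python sketch of that arithmetic, assuming (as the example does) a 100K-token window and a full attention matrix stored in fp32; half-precision storage would halve the memory figure:

```
# Back-of-the-envelope check of the figures quoted in the example paragraph.
# Assumptions (illustrative only): 100K-token context, full fp32 attention matrix.
n_tokens = 100_000
entries = n_tokens ** 2               # one score per query-key pair
gigabytes = entries * 4 / 1e9         # 4 bytes per fp32 scalar

print(f"attention matrix entries: {entries:,}")        # 10,000,000,000 (10 billion)
print(f"memory for the matrix:    {gigabytes:.1f} GB")  # 40.0 GB
```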

---

### Paragraph 2: Limitations of Existing Approaches

```
Prior work has addressed attention efficiency through three main approaches.
Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N)
by restricting attention to local windows, fixed stride patterns, or learned
sparse masks. Linear attention approximations [10, 11, 12] reformulate
attention using kernel feature maps that enable O(N) computation, but
sacrifice the ability to model arbitrary pairwise interactions. Low-rank
factorizations [13, 14] approximate the attention matrix as a product of
smaller matrices, achieving efficiency at the cost of expressivity. While
these methods reduce theoretical complexity, they introduce approximation
errors that compound in deep networks, often resulting in 2-5% accuracy
degradation on long-range modeling benchmarks [15]. Perhaps more importantly,
they fundamentally change the attention mechanism, making it difficult to
apply advances in standard attention (e.g., rotary positional embeddings,
grouped-query attention) to efficient variants.
```

**Key features**:
- Organized categorization of prior work
- Complexity stated for each approach (the sketch below illustrates the O(N) linear-attention reformulation)
- Limitations clearly identified
- Quantified shortcomings (2-5% degradation)
- Deeper issue identified (incompatibility with advances)
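
To make the complexity claims concrete, here is a minimal NumPy sketch of the linear-attention reformulation the example alludes to (a generic kernel-feature-map variant, not any specific published method). Contracting keys with values first brings the cost from O(N²d) to O(Nd²), and the N × N matrix is never formed:

```
import numpy as np

def feature_map(x):
    # A common positive feature map in the linear-attention literature: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V) with a per-query normaliser,
    # so no N x N attention matrix is ever materialized.
    Qf, Kf = feature_map(Q), feature_map(K)   # (N, d) each
    kv = Kf.T @ V                             # (d, d) key-value summary
    z = Qf @ Kf.sum(axis=0)                   # (N,) normaliser
    return (Qf @ kv) / z[:, None]             # (N, d)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 1024, 64))
out = linear_attention(Q, K, V)               # O(N * d^2) time and memory
```

The trade-off the example points out is visible here: the feature map fixes the cost, but the kernelized form is no longer exact softmax attention, which is the expressivity loss the paragraph refers to.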

---

### Paragraph 3: Your Approach (High-Level)

```
We take a different approach: rather than approximating attention, we
accelerate exact attention by optimizing memory access patterns. Our key
observation is that on modern GPUs, attention is bottlenecked by memory
bandwidth, not compute. Reading and writing the N × N attention matrix to
and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's
tensor cores remain underutilized. We propose LongFlash, an IO-aware exact
attention algorithm that computes attention block-by-block in fast on-chip
SRAM, never materializing the full attention matrix in HBM. By carefully
orchestrating the tiling pattern and fusing the softmax computation with
matrix multiplications, LongFlash reduces HBM accesses from O(N²) to
O(N²d/M) where d is the head dimension and M is the SRAM size, achieving
asymptotically optimal IO complexity.
```

**Key features**:
- Clear differentiation from prior work ("different approach")
- Key insight stated explicitly
- Technical mechanism explained (a simplified tiling sketch follows this list)
- Complexity improvement quantified
- Method name introduced
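
The mechanism described here, exact attention computed tile-by-tile with an online (running) softmax, can be illustrated in a few lines. The NumPy sketch below is a conceptual illustration only: LongFlash is this example's fictional method, and a real kernel would keep each tile in on-chip SRAM and fuse these steps on the GPU.

```
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Exact attention computed one key/value block at a time with a running
    softmax, so the full N x N score matrix is never stored."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of scores per query
    row_sum = np.zeros(N)           # running softmax denominator per query
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                    # (N, block) scores for this tile
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])          # unnormalized probabilities
        rescale = np.exp(row_max - new_max)       # correct earlier partial sums
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

def full_attention(Q, K, V):
    # Reference implementation that materializes the whole N x N matrix.
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 512, 64))
assert np.allclose(blockwise_attention(Q, K, V), full_attention(Q, K, V))
```

The IO argument in the paragraph corresponds to the fact that only Q, K, V, and the output need to live in slow memory; each tile of scores is computed, used, and discarded.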

---

### Paragraph 4: Contributions (CRITICAL)

```
Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves
2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch
attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal
IO complexity of O(N²d/M) among all algorithms that compute exact
attention, and analyze the regime where our algorithm provides maximum
benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to
scale to sequences of 1M+ tokens across multiple GPUs with near-linear
weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts
on the same hardware: we train a 7B parameter model on 128K token
contexts using the same memory that previously limited us to 16K tokens
(Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak
FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at
[anonymous URL] (Section 6).
```

**Key features**:
- Numbered/bulleted format
- Each contribution is specific and quantified
- Section references for each claim
- Both methodological and empirical contributions
- Code release mentioned
- Self-contained bullets (each makes sense alone)

---

## Alternative Opening Paragraphs

### For a Methods Paper

```
Scalable optimization algorithms are fundamental to modern machine learning.
Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled
training of models with billions of parameters on massive datasets. However,
these first-order methods exhibit slow convergence on ill-conditioned
problems, often requiring thousands of iterations to converge on tasks
where second-order methods would converge in tens of iterations [4, 5].
```

### For an Applications Paper

```
Drug discovery is a costly and time-consuming process, with the average new
drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning
offers the potential to accelerate this process by predicting molecular
properties, identifying promising candidates, and optimizing lead compounds
computationally [2, 3]. Recent successes in protein structure prediction [4]
and molecular generation [5] have demonstrated that deep learning can
capture complex chemical patterns, raising hopes for ML-driven drug discovery.
```

### For a Theory Paper

```
Understanding why deep neural networks generalize well despite having more
parameters than training examples remains one of the central puzzles of
modern machine learning [1, 2]. Classical statistical learning theory
predicts that such overparameterized models should overfit dramatically,
yet in practice, large networks trained with SGD achieve excellent test
accuracy [3]. This gap between theory and practice has motivated a rich
literature on implicit regularization [4], neural tangent kernels [5],
and feature learning [6], but a complete theoretical picture remains elusive.
```

---

## Contribution Bullet Templates

### For a New Method

```
• We propose [Method Name], a novel [type of method] that [key innovation],
achieving [performance improvement] over [baseline] on [benchmark].
```

### For Theoretical Analysis

```
• We prove that [statement], providing the first [type of result] for
[problem setting]. This resolves an open question from [prior work].
```

### For Empirical Study

```
• We conduct a comprehensive evaluation of [N] methods across [M] datasets,
revealing that [key finding] and identifying [failure mode/best practice].
```

### For Code/Data Release

```
• We release [resource name], a [description] containing [scale/scope],
available at [URL]. This enables [future work/reproducibility].
```

---

## Common Mistakes to Avoid

### Vague Contributions

❌ **Bad**:
```
• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis
```

✅ **Good**:
```
• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget
```

### Missing Quantification

❌ **Bad**: "Our method significantly outperforms prior work"

✅ **Good**: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

### Overlapping Bullets

❌ **Bad**:
```
• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...
```
(These say the same thing three times)

### Buried Contributions

❌ **Bad**: Contribution bullets at the end of page 2

✅ **Good**: Contribution bullets clearly visible by end of page 1

---

## See Also

- `ml_conference_style.md` - Comprehensive ML conference guide
- `venue_writing_styles.md` - Style comparison across venues