# NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.

---

## Full Introduction Example

**Paper Topic**: Efficient Long-Context Transformers

---

### Paragraph 1: Problem Motivation

```
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and reasoning tasks [1, 2, 3]. These capabilities scale with both model size and context length—longer contexts enable processing of entire documents, multi-turn conversations, and complex reasoning chains that span many steps [4, 5]. However, the standard Transformer attention mechanism [6] has O(N²) time and memory complexity with respect to sequence length N, creating a fundamental bottleneck for processing long sequences. For a context window of 100K tokens, the attention matrix alone contains 10 billion entries and requires roughly 40 GB of memory at single precision, making training and inference prohibitively expensive on current hardware.
```

**Key features**:
- States why this matters (LLM capabilities)
- Connects to scaling (longer contexts = better performance)
- Specific numbers (O(N²), 100K tokens, 10 billion entries, 40 GB)
- Citations to establish credibility

---

### Paragraph 2: Limitations of Existing Approaches

```
Prior work has addressed attention efficiency through three main approaches. Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N) by restricting attention to local windows, fixed stride patterns, or learned sparse masks. Linear attention approximations [10, 11, 12] reformulate attention using kernel feature maps that enable O(N) computation, but sacrifice the ability to model arbitrary pairwise interactions. Low-rank factorizations [13, 14] approximate the attention matrix as a product of smaller matrices, achieving efficiency at the cost of expressivity. While these methods reduce theoretical complexity, they introduce approximation errors that compound in deep networks, often resulting in 2-5% accuracy degradation on long-range modeling benchmarks [15]. Perhaps more importantly, they fundamentally change the attention mechanism, making it difficult to apply advances in standard attention (e.g., rotary positional embeddings, grouped-query attention) to efficient variants.
```

**Key features**:
- Organized categorization of prior work
- Complexity stated for each approach
- Limitations clearly identified
- Quantified shortcomings (2-5% degradation)
- Deeper issue identified (incompatibility with advances)

---

### Paragraph 3: Your Approach (High-Level)

```
We take a different approach: rather than approximating attention, we accelerate exact attention by optimizing memory access patterns. Our key observation is that on modern GPUs, attention is bottlenecked by memory bandwidth, not compute. Reading and writing the N × N attention matrix to and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's tensor cores remain underutilized. We propose LongFlash, an IO-aware exact attention algorithm that computes attention block-by-block in fast on-chip SRAM, never materializing the full attention matrix in HBM. By carefully orchestrating the tiling pattern and fusing the softmax computation with matrix multiplications, LongFlash reduces HBM accesses from O(N²) to O(N²d/M), where d is the head dimension and M is the SRAM size, achieving asymptotically optimal IO complexity.
```

**Key features**:
- Clear differentiation from prior work ("different approach")
- Key insight stated explicitly
- Technical mechanism explained
- Complexity improvement quantified
- Method name introduced
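To make the mechanism in the Paragraph 3 example concrete, here is a minimal NumPy sketch of exact attention computed one key/value block at a time with a running ("online") softmax, so the full N × N score matrix is never materialized. This is an illustrative toy under assumed names and block sizes, not the LongFlash method from the example and not a fused GPU kernel; a real IO-aware implementation also tiles over queries and keeps the working set in on-chip SRAM.

```python
# Toy sketch of block-wise exact attention with an online softmax.
# Names, block size, and the pure-NumPy setting are illustrative assumptions.
import numpy as np

def blockwise_attention(q, k, v, block=128):
    """Exact softmax attention without materializing the N x N score matrix.

    q, k, v: arrays of shape (N, d). For N = 100_000 the full fp32 score
    matrix would hold N * N = 1e10 entries, i.e. roughly 40 GB (the figure
    quoted in the Paragraph 1 example).
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores seen so far, per query
    row_sum = np.zeros(n)           # running softmax denominator, per query

    for start in range(0, n, block):
        kb = k[start:start + block]            # key block, shape (b, d)
        vb = v[start:start + block]            # value block, shape (b, d)
        scores = (q @ kb.T) * scale            # scores for this block only, shape (n, b)

        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)    # correction for previously accumulated terms
        p = np.exp(scores - new_max[:, None])  # numerically stable block numerator

        row_sum = row_sum * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention on a small input.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
p = np.exp(s - s.max(axis=1, keepdims=True))
naive = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), naive)
```

Nothing is approximated here; the exactness costs only some extra bookkeeping (the running max and sum) in exchange for never holding an N × N buffer in memory.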
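For contrast, the "linear attention approximations" in the Paragraph 2 example replace the softmax with a kernel feature map phi, computing phi(Q) (phi(K)^T V) instead of softmax(QK^T) V, so that a small d × d summary is shared across all queries. The sketch below is a generic illustration of that idea (elu(x) + 1 is a common feature-map choice in the linear-attention literature); the function names are assumptions, not code from any cited system.

```python
# Minimal sketch of kernelized linear attention (illustrative only).
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features positive so the normalizer is well defined
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Approximate attention in O(N d^2) time and O(d^2) extra memory.

    Instead of softmax(q @ k.T) @ v, compute phi(q) @ (phi(k).T @ v):
    the (d, d) summary phi(k).T @ v is built once and reused for every query.
    """
    qf, kf = feature_map(q), feature_map(k)   # (N, d) feature-mapped queries/keys
    kv = kf.T @ v                             # (d, d) key-value summary
    normalizer = qf @ kf.sum(axis=0)          # (N,) per-query normalization
    return (qf @ kv) / normalizer[:, None]
```

The limitation named in the example paragraph is visible here: the pairwise softmax weights are never formed, so arbitrary query-key interactions are captured only to the extent the feature map allows.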
---

### Paragraph 4: Contributions (CRITICAL)

```
Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves 2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal IO complexity of O(N²d/M) among all algorithms that compute exact attention, and analyze the regime where our algorithm provides maximum benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to scale to sequences of 1M+ tokens across multiple GPUs with near-linear weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts on the same hardware: we train a 7B parameter model on 128K token contexts using the same memory that previously limited us to 16K tokens (Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at [anonymous URL] (Section 6).
```

**Key features**:
- Numbered/bulleted format
- Each contribution is specific and quantified
- Section references for each claim
- Both methodological and empirical contributions
- Code release mentioned
- Self-contained bullets (each makes sense alone)

---

## Alternative Opening Paragraphs

### For a Methods Paper

```
Scalable optimization algorithms are fundamental to modern machine learning. Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled training of models with billions of parameters on massive datasets. However, these first-order methods exhibit slow convergence on ill-conditioned problems, often requiring thousands of iterations to converge on tasks where second-order methods would converge in tens of iterations [4, 5].
```

### For an Applications Paper

```
Drug discovery is a costly and time-consuming process, with the average new drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning offers the potential to accelerate this process by predicting molecular properties, identifying promising candidates, and optimizing lead compounds computationally [2, 3]. Recent successes in protein structure prediction [4] and molecular generation [5] have demonstrated that deep learning can capture complex chemical patterns, raising hopes for ML-driven drug discovery.
```

### For a Theory Paper

```
Understanding why deep neural networks generalize well despite having more parameters than training examples remains one of the central puzzles of modern machine learning [1, 2]. Classical statistical learning theory predicts that such overparameterized models should overfit dramatically, yet in practice, large networks trained with SGD achieve excellent test accuracy [3]. This gap between theory and practice has motivated a rich literature on implicit regularization [4], neural tangent kernels [5], and feature learning [6], but a complete theoretical picture remains elusive.
```

---

## Contribution Bullet Templates

### For a New Method

```
• We propose [Method Name], a novel [type of method] that [key innovation] achieving [performance improvement] over [baseline] on [benchmark].
```
### For Theoretical Analysis

```
• We prove that [statement], providing the first [type of result] for [problem setting]. This resolves an open question from [prior work].
```

### For Empirical Study

```
• We conduct a comprehensive evaluation of [N] methods across [M] datasets, revealing that [key finding] and identifying [failure mode/best practice].
```

### For Code/Data Release

```
• We release [resource name], a [description] containing [scale/scope], available at [URL]. This enables [future work/reproducibility].
```

---

## Common Mistakes to Avoid

### Vague Contributions

❌ **Bad**:
```
• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis
```

✅ **Good**:
```
• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget
```

### Missing Quantification

❌ **Bad**: "Our method significantly outperforms prior work"

✅ **Good**: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

### Overlapping Bullets

❌ **Bad**:
```
• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...
```
(These say the same thing three times)

### Buried Contributions

❌ **Bad**: Contribution bullets at the end of page 2

✅ **Good**: Contribution bullets clearly visible by end of page 1

---

## See Also

- `ml_conference_style.md` - Comprehensive ML conference guide
- `venue_writing_styles.md` - Style comparison across venues