NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.


Full Introduction Example

Paper Topic: Efficient Long-Context Transformers


Paragraph 1: Problem Motivation

Large language models (LLMs) have demonstrated remarkable capabilities in 
natural language understanding, code generation, and reasoning tasks [1, 2, 3]. 
These capabilities scale with both model size and context length—longer 
contexts enable processing of entire documents, multi-turn conversations, 
and complex reasoning chains that span many steps [4, 5]. However, the 
standard Transformer attention mechanism [6] has O(N²) time and memory 
complexity with respect to sequence length N, creating a fundamental 
bottleneck for processing long sequences. For a context window of 100K 
tokens, the attention matrix alone contains 10 billion pairwise scores and 
occupies 40 GB of memory in single precision, making training and 
inference prohibitively expensive on current hardware.

Key features:

  • States why this matters (LLM capabilities)
  • Connects to scaling (longer contexts = better performance)
  • Specific numbers (O(N²), 100K tokens, 10 billion scores, 40 GB)
  • Citations to establish credibility
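
A quick way to keep such numbers honest is to check them explicitly. The short 
Python check below reproduces the example's figures, assuming a single 
attention head's full score matrix stored in single precision (4 bytes per 
entry); the names are illustrative only.

N = 100_000                        # context length in tokens
scores = N * N                     # pairwise attention scores: 1e10 (10 billion)
memory_gb = scores * 4 / 1e9       # fp32 -> 4 bytes per score
print(f"{scores:.0e} scores, {memory_gb:.0f} GB")   # 1e+10 scores, 40 GB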

Paragraph 2: Limitations of Existing Approaches

Prior work has addressed attention efficiency through three main approaches. 
Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N) 
by restricting attention to local windows, fixed stride patterns, or learned 
sparse masks. Linear attention approximations [10, 11, 12] reformulate 
attention using kernel feature maps that enable O(N) computation, but 
sacrifice the ability to model arbitrary pairwise interactions. Low-rank 
factorizations [13, 14] approximate the attention matrix as a product of 
smaller matrices, achieving efficiency at the cost of expressivity. While 
these methods reduce theoretical complexity, they introduce approximation 
errors that compound in deep networks, often resulting in 2-5% accuracy 
degradation on long-range modeling benchmarks [15]. Perhaps more importantly, 
they fundamentally change the attention mechanism, making it difficult to 
apply advances in standard attention (e.g., rotary positional embeddings, 
grouped-query attention) to efficient variants.

Key features:

  • Organized categorization of prior work
  • Complexity stated for each approach
  • Limitations clearly identified
  • Quantified shortcomings (2-5% degradation)
  • Deeper issue identified (incompatibility with advances)
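
For readers less familiar with the linear-attention family mentioned above, 
the NumPy sketch below shows the kernel-feature-map idea in its simplest 
form. The elu(x) + 1 feature map and single-head (N, d) shapes are 
assumptions chosen for illustration, not a description of any particular 
cited method.

import numpy as np

def phi(x):
    # A simple positive feature map, elu(x) + 1 (one common illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Associativity makes this O(N * d^2): phi(Q) (phi(K)^T V) is computed
    # without ever forming the N x N matrix phi(Q) phi(K)^T.
    Qf, Kf = phi(Q), phi(K)        # (N, d)
    KV = Kf.T @ V                  # (d, d) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)        # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]  # (N, d)

The sketch also makes the stated limitation concrete: queries interact with 
keys and values only through the d × d summary KV, so arbitrary pairwise 
interactions cannot be represented.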

Paragraph 3: Your Approach (High-Level)

We take a different approach: rather than approximating attention, we 
accelerate exact attention by optimizing memory access patterns. Our key 
observation is that on modern GPUs, attention is bottlenecked by memory 
bandwidth, not compute. Reading and writing the N × N attention matrix to 
and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's 
tensor cores remain underutilized. We propose LongFlash, an IO-aware exact 
attention algorithm that computes attention block-by-block in fast on-chip 
SRAM, never materializing the full attention matrix in HBM. By carefully 
orchestrating the tiling pattern and fusing the softmax computation with 
matrix multiplications, LongFlash reduces HBM accesses from O(N²) to 
O(N²d/M) where d is the head dimension and M is the SRAM size, achieving 
asymptotically optimal IO complexity.

Key features:

  • Clear differentiation from prior work ("different approach")
  • Key insight stated explicitly
  • Technical mechanism explained
  • Complexity improvement quantified
  • Method name introduced
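
LongFlash is a name invented for this example, so the sketch below is not its 
implementation; it is a minimal NumPy illustration of the tiling idea the 
paragraph describes: exact attention computed one key/value block at a time 
with an online softmax, so the full N × N score matrix is never materialized. 
Single-head float inputs of shape (N, d) are assumed.

import numpy as np

def blocked_attention(Q, K, V, block=1024):
    # Exact softmax attention, processing key/value blocks one at a time.
    # A running ("online") softmax rescales earlier partial sums, so only an
    # (N, block) slice of scores exists at any moment.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    acc = np.zeros((N, V.shape[1]))      # running weighted sum of values
    m = np.full(N, -np.inf)              # running max score per query
    l = np.zeros(N)                      # running sum of exp(score - max)
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                   # (N, block) scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        rescale = np.exp(m - m_new)              # correct earlier partial sums
        P = np.exp(S - m_new[:, None])
        acc = acc * rescale[:, None] + P @ Vb
        l = l * rescale + P.sum(axis=1)
        m = m_new
    return acc / l[:, None]

Up to floating-point error this matches softmax(QKᵀ/√d)V; the memory behavior 
is what the example paragraph attributes to keeping blocks in on-chip SRAM, 
though a real kernel would fuse these steps on the GPU rather than loop in 
NumPy.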

Paragraph 4: Contributions (CRITICAL)

Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves 
  2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch 
  attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal 
  IO complexity of O(N²d/M) among all algorithms that compute exact 
  attention, and analyze the regime where our algorithm provides maximum 
  benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to 
  scale to sequences of 1M+ tokens across multiple GPUs with near-linear 
  weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts 
  on the same hardware: we train a 7B parameter model on 128K token 
  contexts using the same memory that previously limited us to 16K tokens 
  (Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak 
  FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at 
  [anonymous URL] (Section 6).

Key features:

  • Numbered/bulleted format
  • Each contribution is specific and quantified
  • Section references for each claim
  • Both methodological and empirical contributions
  • Code release mentioned
  • Self-contained bullets (each makes sense alone)

Alternative Opening Paragraphs

For a Methods Paper

Scalable optimization algorithms are fundamental to modern machine learning. 
Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled 
training of models with billions of parameters on massive datasets. However, 
these first-order methods exhibit slow convergence on ill-conditioned 
problems, often requiring thousands of iterations to converge on tasks 
where second-order methods would converge in tens of iterations [4, 5].

For an Applications Paper

Drug discovery is a costly and time-consuming process, with the average new 
drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning 
offers the potential to accelerate this process by predicting molecular 
properties, identifying promising candidates, and optimizing lead compounds 
computationally [2, 3]. Recent successes in protein structure prediction [4] 
and molecular generation [5] have demonstrated that deep learning can 
capture complex chemical patterns, raising hopes for ML-driven drug discovery.

For a Theory Paper

Understanding why deep neural networks generalize well despite having more 
parameters than training examples remains one of the central puzzles of 
modern machine learning [1, 2]. Classical statistical learning theory 
predicts that such overparameterized models should overfit dramatically, 
yet in practice, large networks trained with SGD achieve excellent test 
accuracy [3]. This gap between theory and practice has motivated a rich 
literature on implicit regularization [4], neural tangent kernels [5], 
and feature learning [6], but a complete theoretical picture remains elusive.

Contribution Bullet Templates

For a New Method

• We propose [Method Name], a novel [type of method] that [key innovation], 
  achieving [performance improvement] over [baseline] on [benchmark].

For Theoretical Analysis

• We prove that [statement], providing the first [type of result] for 
  [problem setting]. This resolves an open question from [prior work].

For Empirical Study

• We conduct a comprehensive evaluation of [N] methods across [M] datasets, 
  revealing that [key finding] and identifying [failure mode/best practice].

For Code/Data Release

• We release [resource name], a [description] containing [scale/scope], 
  available at [URL]. This enables [future work/reproducibility].

Common Mistakes to Avoid

Vague Contributions

Bad:

• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis

Good:

• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget

Missing Quantification

Bad: "Our method significantly outperforms prior work" Good: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

Overlapping Bullets

Bad:

• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...

(These say the same thing three times)

Buried Contributions

Bad: Contribution bullets at the end of page 2
Good: Contribution bullets clearly visible by end of page 1


See Also

  • ml_conference_style.md - Comprehensive ML conference guide
  • venue_writing_styles.md - Style comparison across venues