# NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.

---

## Full Introduction Example

**Paper Topic**: Efficient Long-Context Transformers

---

### Paragraph 1: Problem Motivation

```
Large language models (LLMs) have demonstrated remarkable capabilities in
natural language understanding, code generation, and reasoning tasks [1, 2, 3].
These capabilities scale with both model size and context length—longer
contexts enable processing of entire documents, multi-turn conversations,
and complex reasoning chains that span many steps [4, 5]. However, the
standard Transformer attention mechanism [6] has O(N²) time and memory
complexity with respect to sequence length N, creating a fundamental
bottleneck for processing long sequences. For a context window of 100K
tokens, computing full attention requires 10 billion scalar operations
and 40 GB of memory for the attention matrix alone, making training and
inference prohibitively expensive on current hardware.
```

**Key features**:
- States why this matters (LLM capabilities)
- Connects to scaling (longer contexts = better performance)
- Specific numbers (O(N²), 100K tokens, 10 billion ops, 40 GB)
- Citations to establish credibility

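The specific numbers in this paragraph are easy to verify before you quote them, which is good practice for any introduction. A back-of-the-envelope check in Python (assuming a single attention head and float32 storage for the N × N score matrix; the "10 billion" figure is the count of pairwise scores):

```python
# Cost of materializing full attention for a 100K-token context.
# Assumptions: one float32 (4 bytes) per score, single head, no batching.
n_tokens = 100_000                        # context window N

scores = n_tokens ** 2                    # entries in the N x N attention matrix
memory_gb = scores * 4 / 1e9              # float32 bytes for the score matrix

print(f"pairwise scores: {scores:.1e}")        # 1.0e+10 -> the "10 billion" figure
print(f"score matrix:    {memory_gb:.0f} GB")  # 40 GB
```
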
---

### Paragraph 2: Limitations of Existing Approaches

```
Prior work has addressed attention efficiency through three main approaches.
Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N)
by restricting attention to local windows, fixed stride patterns, or learned
sparse masks. Linear attention approximations [10, 11, 12] reformulate
attention using kernel feature maps that enable O(N) computation, but
sacrifice the ability to model arbitrary pairwise interactions. Low-rank
factorizations [13, 14] approximate the attention matrix as a product of
smaller matrices, achieving efficiency at the cost of expressivity. While
these methods reduce theoretical complexity, they introduce approximation
errors that compound in deep networks, often resulting in 2-5% accuracy
degradation on long-range modeling benchmarks [15]. Perhaps more importantly,
they fundamentally change the attention mechanism, making it difficult to
apply advances in standard attention (e.g., rotary positional embeddings,
grouped-query attention) to efficient variants.
```

**Key features**:
- Organized categorization of prior work
- Complexity stated for each approach
- Limitations clearly identified
- Quantified shortcomings (2-5% degradation)
- Deeper issue identified (incompatibility with advances)

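The kernel-feature-map reformulation mentioned above is easiest to see in code: once the softmax is replaced by a feature map φ, matrix associativity lets you avoid ever forming the N × N score matrix. A minimal NumPy sketch (the feature map elu(x) + 1 and all names here are illustrative choices, not any particular paper's implementation):

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1; stands in for the kernel map in the text.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
N, d = 2048, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic ordering: forms the N x N matrix -> O(N^2 d) time, O(N^2) memory.
scores = phi(Q) @ phi(K).T
out_quadratic = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Linear ordering: phi(K).T @ V is only d x d, so the N x N matrix never exists.
kv = phi(K).T @ V                              # (d, d) summary of keys and values
z = phi(K).sum(axis=0)                         # (d,) normalizer
out_linear = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))  # True: same result, different cost
```

Both orderings give identical outputs; only the cost differs, which is exactly the trade-off the paragraph describes: linear complexity, but interactions constrained to the chosen kernel.
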
---

### Paragraph 3: Your Approach (High-Level)

```
We take a different approach: rather than approximating attention, we
accelerate exact attention by optimizing memory access patterns. Our key
observation is that on modern GPUs, attention is bottlenecked by memory
bandwidth, not compute. Reading and writing the N × N attention matrix to
and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's
tensor cores remain underutilized. We propose LongFlash, an IO-aware exact
attention algorithm that computes attention block-by-block in fast on-chip
SRAM, never materializing the full attention matrix in HBM. By carefully
orchestrating the tiling pattern and fusing the softmax computation with
matrix multiplications, LongFlash reduces HBM accesses from O(N²) to
O(N²d/M) where d is the head dimension and M is the SRAM size, achieving
asymptotically optimal IO complexity.
```

**Key features**:
- Clear differentiation from prior work ("different approach")
- Key insight stated explicitly
- Technical mechanism explained
- Complexity improvement quantified
- Method name introduced

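LongFlash is a fictitious method invented for this example, but the mechanism the paragraph describes, exact attention computed tile by tile with a running softmax so the full N × N matrix is never materialized, is concrete enough to sketch. A minimal single-head NumPy version of that blocked/online-softmax idea (block size, names, and the pure-Python loops are illustrative; a real kernel would keep the tiles in on-chip SRAM):

```python
import numpy as np

def blocked_attention(Q, K, V, block=256):
    """Exact softmax attention computed tile by tile.

    Only (block x block) score tiles are ever formed, mirroring the
    "never materialize the full N x N matrix" idea from the paragraph.
    """
    N, d = Q.shape
    out = np.zeros_like(V)
    scale = 1.0 / np.sqrt(d)
    for i in range(0, N, block):
        q = Q[i:i + block] * scale
        # Running max, normalizer, and weighted sum for the online softmax.
        m = np.full(q.shape[0], -np.inf)
        l = np.zeros(q.shape[0])
        acc = np.zeros((q.shape[0], d))
        for j in range(0, N, block):
            s = q @ K[j:j + block].T              # small score tile
            m_new = np.maximum(m, s.max(axis=1))
            correction = np.exp(m - m_new)        # rescale old stats to new max
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

# Quick check against the naive quadratic reference.
rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
S = (Q / np.sqrt(d)) @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.allclose(blocked_attention(Q, K, V), ref))   # True
```

The blocked result matches the naive quadratic reference; only the memory-access pattern changes, which is the insight the paragraph leads with.
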
---

### Paragraph 4: Contributions (CRITICAL)

```
Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves
2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch
attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal
IO complexity of O(N²d/M) among all algorithms that compute exact
attention, and analyze the regime where our algorithm provides maximum
benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to
scale to sequences of 1M+ tokens across multiple GPUs with near-linear
weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts
on the same hardware: we train a 7B parameter model on 128K token
contexts using the same memory that previously limited us to 16K tokens
(Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak
FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at
[anonymous URL] (Section 6).
```

**Key features**:
- Numbered/bulleted format
- Each contribution is specific and quantified
- Section references for each claim
- Both methodological and empirical contributions
- Code release mentioned
- Self-contained bullets (each makes sense alone)

---

## Alternative Opening Paragraphs

### For a Methods Paper

```
Scalable optimization algorithms are fundamental to modern machine learning.
Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled
training of models with billions of parameters on massive datasets. However,
these first-order methods exhibit slow convergence on ill-conditioned
problems, often requiring thousands of iterations to converge on tasks
where second-order methods would converge in tens of iterations [4, 5].
```

### For an Applications Paper

```
Drug discovery is a costly and time-consuming process, with the average new
drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning
offers the potential to accelerate this process by predicting molecular
properties, identifying promising candidates, and optimizing lead compounds
computationally [2, 3]. Recent successes in protein structure prediction [4]
and molecular generation [5] have demonstrated that deep learning can
capture complex chemical patterns, raising hopes for ML-driven drug discovery.
```

### For a Theory Paper

```
Understanding why deep neural networks generalize well despite having more
parameters than training examples remains one of the central puzzles of
modern machine learning [1, 2]. Classical statistical learning theory
predicts that such overparameterized models should overfit dramatically,
yet in practice, large networks trained with SGD achieve excellent test
accuracy [3]. This gap between theory and practice has motivated a rich
literature on implicit regularization [4], neural tangent kernels [5],
and feature learning [6], but a complete theoretical picture remains elusive.
```

---

## Contribution Bullet Templates

### For a New Method

```
• We propose [Method Name], a novel [type of method] that [key innovation]
achieving [performance improvement] over [baseline] on [benchmark].
```

### For Theoretical Analysis

```
• We prove that [statement], providing the first [type of result] for
[problem setting]. This resolves an open question from [prior work].
```

### For Empirical Study

```
• We conduct a comprehensive evaluation of [N] methods across [M] datasets,
revealing that [key finding] and identifying [failure mode/best practice].
```

### For Code/Data Release

```
• We release [resource name], a [description] containing [scale/scope],
available at [URL]. This enables [future work/reproducibility].
```

---

## Common Mistakes to Avoid

### Vague Contributions

❌ **Bad**:
```
• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis
```

✅ **Good**:
```
• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget
```

### Missing Quantification

❌ **Bad**: "Our method significantly outperforms prior work"

✅ **Good**: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

### Overlapping Bullets

❌ **Bad**:
```
• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...
```
(These say the same thing three times)

### Buried Contributions

❌ **Bad**: Contribution bullets at the end of page 2

✅ **Good**: Contribution bullets clearly visible by end of page 1

---

## See Also

- `ml_conference_style.md` - Comprehensive ML conference guide
- `venue_writing_styles.md` - Style comparison across venues