NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.


Full Introduction Example

Paper Topic: Efficient Long-Context Transformers


Paragraph 1: Problem Motivation

Large language models (LLMs) have demonstrated remarkable capabilities in 
natural language understanding, code generation, and reasoning tasks [1, 2, 3]. 
These capabilities scale with both model size and context length—longer 
contexts enable processing of entire documents, multi-turn conversations, 
and complex reasoning chains that span many steps [4, 5]. However, the 
standard Transformer attention mechanism [6] has O(N²) time and memory 
complexity with respect to sequence length N, creating a fundamental 
bottleneck for processing long sequences. For a context window of 100K 
tokens, the attention matrix alone contains 10 billion pairwise scores and 
occupies 40 GB of memory in single precision, making training and 
inference prohibitively expensive on current hardware.

Key features:

  • States why this matters (LLM capabilities)
  • Connects to scaling (longer contexts = better performance)
  • Specific numbers (O(N²), 100K tokens, 10 billion scores, 40 GB)
  • Citations to establish credibility
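
A quick way to keep such numbers honest is to check them explicitly. The short 
Python check below reproduces the example's figures, assuming a single 
attention head's full score matrix stored in single precision (4 bytes per 
entry); the names are illustrative only.

N = 100_000                        # context length in tokens
scores = N * N                     # pairwise attention scores: 1e10 (10 billion)
memory_gb = scores * 4 / 1e9       # fp32 -> 4 bytes per score
print(f"{scores:.0e} scores, {memory_gb:.0f} GB")   # 1e+10 scores, 40 GB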

Paragraph 2: Limitations of Existing Approaches

Prior work has addressed attention efficiency through three main approaches. 
Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N) 
by restricting attention to local windows, fixed stride patterns, or learned 
sparse masks. Linear attention approximations [10, 11, 12] reformulate 
attention using kernel feature maps that enable O(N) computation, but 
sacrifice the ability to model arbitrary pairwise interactions. Low-rank 
factorizations [13, 14] approximate the attention matrix as a product of 
smaller matrices, achieving efficiency at the cost of expressivity. While 
these methods reduce theoretical complexity, they introduce approximation 
errors that compound in deep networks, often resulting in 2-5% accuracy 
degradation on long-range modeling benchmarks [15]. Perhaps more importantly, 
they fundamentally change the attention mechanism, making it difficult to 
apply advances in standard attention (e.g., rotary positional embeddings, 
grouped-query attention) to efficient variants.

Key features:

  • Organized categorization of prior work
  • Complexity stated for each approach
  • Limitations clearly identified
  • Quantified shortcomings (2-5% degradation)
  • Deeper issue identified (incompatibility with advances)
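
For readers less familiar with the linear-attention family mentioned above, 
the NumPy sketch below shows the kernel-feature-map idea in its simplest 
form. The elu(x) + 1 feature map and single-head (N, d) shapes are 
assumptions chosen for illustration, not a description of any particular 
cited method.

import numpy as np

def phi(x):
    # A simple positive feature map, elu(x) + 1 (one common illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Associativity makes this O(N * d^2): phi(Q) (phi(K)^T V) is computed
    # without ever forming the N x N matrix phi(Q) phi(K)^T.
    Qf, Kf = phi(Q), phi(K)        # (N, d)
    KV = Kf.T @ V                  # (d, d) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)        # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]  # (N, d)

The sketch also makes the stated limitation concrete: queries interact with 
keys and values only through the d × d summary KV, so arbitrary pairwise 
interactions cannot be represented.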

Paragraph 3: Your Approach (High-Level)

We take a different approach: rather than approximating attention, we 
accelerate exact attention by optimizing memory access patterns. Our key 
observation is that on modern GPUs, attention is bottlenecked by memory 
bandwidth, not compute. Reading and writing the N × N attention matrix to 
and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's 
tensor cores remain underutilized. We propose LongFlash, an IO-aware exact 
attention algorithm that computes attention block-by-block in fast on-chip 
SRAM, never materializing the full attention matrix in HBM. By carefully 
orchestrating the tiling pattern and fusing the softmax computation with 
matrix multiplications, LongFlash reduces HBM accesses from O(N²) to 
O(N²d/M) where d is the head dimension and M is the SRAM size, achieving 
asymptotically optimal IO complexity.

Key features:

  • Clear differentiation from prior work ("different approach")
  • Key insight stated explicitly
  • Technical mechanism explained
  • Complexity improvement quantified
  • Method name introduced
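
LongFlash is a name invented for this example, so the sketch below is not its 
implementation; it is a minimal NumPy illustration of the tiling idea the 
paragraph describes: exact attention computed one key/value block at a time 
with an online softmax, so the full N × N score matrix is never materialized. 
Single-head float inputs of shape (N, d) are assumed.

import numpy as np

def blocked_attention(Q, K, V, block=1024):
    # Exact softmax attention, processing key/value blocks one at a time.
    # A running ("online") softmax rescales earlier partial sums, so only an
    # (N, block) slice of scores exists at any moment.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    acc = np.zeros((N, V.shape[1]))      # running weighted sum of values
    m = np.full(N, -np.inf)              # running max score per query
    l = np.zeros(N)                      # running sum of exp(score - max)
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                   # (N, block) scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        rescale = np.exp(m - m_new)              # correct earlier partial sums
        P = np.exp(S - m_new[:, None])
        acc = acc * rescale[:, None] + P @ Vb
        l = l * rescale + P.sum(axis=1)
        m = m_new
    return acc / l[:, None]

Up to floating-point error this matches softmax(QKᵀ/√d)V; the memory behavior 
is what the example paragraph attributes to keeping blocks in on-chip SRAM, 
though a real kernel would fuse these steps on the GPU rather than loop in 
NumPy.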

Paragraph 4: Contributions (CRITICAL)

Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves 
  2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch 
  attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal 
  IO complexity of O(N²d/M) among all algorithms that compute exact 
  attention, and analyze the regime where our algorithm provides maximum 
  benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to 
  scale to sequences of 1M+ tokens across multiple GPUs with near-linear 
  weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts 
  on the same hardware: we train a 7B parameter model on 128K token 
  contexts using the same memory that previously limited us to 16K tokens 
  (Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak 
  FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at 
  [anonymous URL] (Section 6).

Key features:

  • Numbered/bulleted format
  • Each contribution is specific and quantified
  • Section references for each claim
  • Both methodological and empirical contributions
  • Code release mentioned
  • Self-contained bullets (each makes sense alone)

Alternative Opening Paragraphs

For a Methods Paper

Scalable optimization algorithms are fundamental to modern machine learning. 
Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled 
training of models with billions of parameters on massive datasets. However, 
these first-order methods exhibit slow convergence on ill-conditioned 
problems, often requiring thousands of iterations to converge on tasks 
where second-order methods would converge in tens of iterations [4, 5].

For an Applications Paper

Drug discovery is a costly and time-consuming process, with the average new 
drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning 
offers the potential to accelerate this process by predicting molecular 
properties, identifying promising candidates, and optimizing lead compounds 
computationally [2, 3]. Recent successes in protein structure prediction [4] 
and molecular generation [5] have demonstrated that deep learning can 
capture complex chemical patterns, raising hopes for ML-driven drug discovery.

For a Theory Paper

Understanding why deep neural networks generalize well despite having more 
parameters than training examples remains one of the central puzzles of 
modern machine learning [1, 2]. Classical statistical learning theory 
predicts that such overparameterized models should overfit dramatically, 
yet in practice, large networks trained with SGD achieve excellent test 
accuracy [3]. This gap between theory and practice has motivated a rich 
literature on implicit regularization [4], neural tangent kernels [5], 
and feature learning [6], but a complete theoretical picture remains elusive.

Contribution Bullet Templates

For a New Method

• We propose [Method Name], a novel [type of method] that [key innovation], 
  achieving [performance improvement] over [baseline] on [benchmark].

For Theoretical Analysis

• We prove that [statement], providing the first [type of result] for 
  [problem setting]. This resolves an open question from [prior work].

For Empirical Study

• We conduct a comprehensive evaluation of [N] methods across [M] datasets, 
  revealing that [key finding] and identifying [failure mode/best practice].

For Code/Data Release

• We release [resource name], a [description] containing [scale/scope], 
  available at [URL]. This enables [future work/reproducibility].

Common Mistakes to Avoid

Vague Contributions

Bad:

• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis

Good:

• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget

Missing Quantification

Bad: "Our method significantly outperforms prior work" Good: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

Overlapping Bullets

Bad:

• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...

(These say the same thing three times)

Buried Contributions

Bad: Contribution bullets at the end of page 2
Good: Contribution bullets clearly visible by end of page 1


See Also

  • ml_conference_style.md - Comprehensive ML conference guide
  • venue_writing_styles.md - Style comparison across venues