# NeurIPS/ICML Introduction Example

This example demonstrates the distinctive ML conference introduction structure with numbered contributions and technical precision.

---

## Full Introduction Example

**Paper Topic**: Efficient Long-Context Transformers

---

### Paragraph 1: Problem Motivation

```
Large language models (LLMs) have demonstrated remarkable capabilities in
natural language understanding, code generation, and reasoning tasks [1, 2, 3].
These capabilities scale with both model size and context length—longer
contexts enable processing of entire documents, multi-turn conversations,
and complex reasoning chains that span many steps [4, 5]. However, the
standard Transformer attention mechanism [6] has O(N²) time and memory
complexity with respect to sequence length N, creating a fundamental
bottleneck for processing long sequences. For a context window of 100K
tokens, computing full attention requires 10 billion scalar operations
and 40 GB of memory for the attention matrix alone, making training and
inference prohibitively expensive on current hardware.
```

**Key features**:
- States why this matters (LLM capabilities)
- Connects to scaling (longer contexts = better performance)
- Specific numbers (O(N²), 100K tokens, 10 billion ops, 40 GB)
- Citations to establish credibility
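
The scale claims in this example paragraph are easy to sanity-check: the N × N score matrix at N = 100,000 has N² = 10¹⁰ entries (so on the order of 10 billion scalar values to compute), and at 4 bytes per float32 entry that is 40 GB. An illustrative check of the arithmetic:

```python
# Sanity-check the scale claims in the example paragraph (illustrative only).
N = 100_000               # context length in tokens
entries = N * N           # entries in the full N x N attention matrix
gb_fp32 = entries * 4 / 1e9  # 4 bytes per float32 entry

print(f"attention matrix entries: {entries:.0e}")  # 1e+10, i.e. 10 billion
print(f"fp32 memory: {gb_fp32:.0f} GB")            # 40 GB
```

Checking your own headline numbers this way before submission is cheap insurance; reviewers do the same arithmetic.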

---

### Paragraph 2: Limitations of Existing Approaches

```
Prior work has addressed attention efficiency through three main approaches.
Sparse attention patterns [7, 8, 9] reduce complexity to O(N√N) or O(N log N)
by restricting attention to local windows, fixed stride patterns, or learned
sparse masks. Linear attention approximations [10, 11, 12] reformulate
attention using kernel feature maps that enable O(N) computation, but
sacrifice the ability to model arbitrary pairwise interactions. Low-rank
factorizations [13, 14] approximate the attention matrix as a product of
smaller matrices, achieving efficiency at the cost of expressivity. While
these methods reduce theoretical complexity, they introduce approximation
errors that compound in deep networks, often resulting in 2-5% accuracy
degradation on long-range modeling benchmarks [15]. Perhaps more importantly,
they fundamentally change the attention mechanism, making it difficult to
apply advances in standard attention (e.g., rotary positional embeddings,
grouped-query attention) to efficient variants.
```

**Key features**:
- Organized categorization of prior work
- Complexity stated for each approach
- Limitations clearly identified
- Quantified shortcomings (2-5% degradation)
- Deeper issue identified (incompatibility with advances)
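
The complexity tiers named in this paragraph become more vivid with rough numbers. A small illustrative calculation (constants omitted, so these are orders of magnitude rather than real operation counts) at the 100K-token context used in this example:

```python
import math

# Rough growth rates (up to constants) for each complexity class
# mentioned in the example paragraph, at a 100K-token context.
N = 100_000
classes = {
    "full attention O(N^2)": N * N,
    "sparse O(N*sqrt(N))":   N * math.isqrt(N),
    "sparse O(N log N)":     N * math.log2(N),
    "linear O(N)":           N,
}
for name, count in classes.items():
    print(f"{name:24s} ~{count:.1e}")
```

The spread (roughly five orders of magnitude between O(N²) and O(N)) is what makes the stated 2-5% accuracy trade-off tempting in the first place, which is the tension this paragraph sets up.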

---

### Paragraph 3: Your Approach (High-Level)

```
We take a different approach: rather than approximating attention, we
accelerate exact attention by optimizing memory access patterns. Our key
observation is that on modern GPUs, attention is bottlenecked by memory
bandwidth, not compute. Reading and writing the N × N attention matrix to
and from GPU high-bandwidth memory (HBM) dominates runtime, while the GPU's
tensor cores remain underutilized. We propose LongFlash, an IO-aware exact
attention algorithm that computes attention block-by-block in fast on-chip
SRAM, never materializing the full attention matrix in HBM. By carefully
orchestrating the tiling pattern and fusing the softmax computation with
matrix multiplications, LongFlash reduces HBM accesses from O(N²) to
O(N²d/M) where d is the head dimension and M is the SRAM size, achieving
asymptotically optimal IO complexity.
```

**Key features**:
- Clear differentiation from prior work ("different approach")
- Key insight stated explicitly
- Technical mechanism explained
- Complexity improvement quantified
- Method name introduced
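
The mechanism this paragraph describes (computing exact attention tile-by-tile with an online softmax so the full score matrix is never stored) can be sketched in plain NumPy. LongFlash is the example's fictional method; the sketch below illustrates only the general FlashAttention-style idea, and `blockwise_attention` and its `block` parameter are names chosen for this illustration:

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Exact attention computed block-by-block with an online softmax,
    so the full N x N score matrix is never materialized. Illustrative
    sketch of the tiling idea only; real kernels do this in on-chip SRAM."""
    N, d = Q.shape
    out = np.zeros_like(V)
    for i in range(0, N, block):
        q = Q[i:i + block]                    # query tile
        m = np.full(q.shape[0], -np.inf)      # running row-wise max
        l = np.zeros(q.shape[0])              # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1]))
        for j in range(0, N, block):
            s = q @ K[j:j + block].T / np.sqrt(d)  # scores for this tile only
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)          # rescale previous partial sums
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

# The blockwise result matches naive full attention to numerical precision.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))  # True
```

This is exactly the distinction the paragraph draws: the method is exact (the output matches full attention), and only the memory access pattern changes, since each tile of scores lives only transiently.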

---

### Paragraph 4: Contributions (CRITICAL)

```
Our contributions are as follows:

• We propose LongFlash, an IO-aware exact attention algorithm that achieves
  2-4× speedup over FlashAttention [16] and up to 9× over standard PyTorch
  attention on sequences from 1K to 128K tokens (Section 3).

• We provide theoretical analysis proving that LongFlash achieves optimal
  IO complexity of O(N²d/M) among all algorithms that compute exact
  attention, and analyze the regime where our algorithm provides maximum
  benefit (Section 3.3).

• We introduce sequence parallelism techniques that enable LongFlash to
  scale to sequences of 1M+ tokens across multiple GPUs with near-linear
  weak scaling efficiency (Section 4).

• We demonstrate that LongFlash enables training with 8× longer contexts
  on the same hardware: we train a 7B parameter model on 128K token
  contexts using the same memory that previously limited us to 16K tokens
  (Section 5).

• We release optimized CUDA kernels achieving 80% of theoretical peak
  FLOPS on A100 and H100 GPUs, along with PyTorch and JAX bindings, at
  [anonymous URL] (Section 6).
```

**Key features**:
- Numbered/bulleted format
- Each contribution is specific and quantified
- Section references for each claim
- Both methodological and empirical contributions
- Code release mentioned
- Self-contained bullets (each makes sense alone)

---

## Alternative Opening Paragraphs

### For a Methods Paper

```
Scalable optimization algorithms are fundamental to modern machine learning.
Stochastic gradient descent (SGD) and its variants [1, 2, 3] have enabled
training of models with billions of parameters on massive datasets. However,
these first-order methods exhibit slow convergence on ill-conditioned
problems, often requiring thousands of iterations to converge on tasks
where second-order methods would converge in tens of iterations [4, 5].
```

### For an Applications Paper

```
Drug discovery is a costly and time-consuming process, with the average new
drug requiring 10-15 years and $2.6 billion to develop [1]. Machine learning
offers the potential to accelerate this process by predicting molecular
properties, identifying promising candidates, and optimizing lead compounds
computationally [2, 3]. Recent successes in protein structure prediction [4]
and molecular generation [5] have demonstrated that deep learning can
capture complex chemical patterns, raising hopes for ML-driven drug discovery.
```

### For a Theory Paper

```
Understanding why deep neural networks generalize well despite having more
parameters than training examples remains one of the central puzzles of
modern machine learning [1, 2]. Classical statistical learning theory
predicts that such overparameterized models should overfit dramatically,
yet in practice, large networks trained with SGD achieve excellent test
accuracy [3]. This gap between theory and practice has motivated a rich
literature on implicit regularization [4], neural tangent kernels [5],
and feature learning [6], but a complete theoretical picture remains elusive.
```

---

## Contribution Bullet Templates

### For a New Method

```
• We propose [Method Name], a novel [type of method] that [key innovation]
  achieving [performance improvement] over [baseline] on [benchmark].
```

### For Theoretical Analysis

```
• We prove that [statement], providing the first [type of result] for
  [problem setting]. This resolves an open question from [prior work].
```

### For Empirical Study

```
• We conduct a comprehensive evaluation of [N] methods across [M] datasets,
  revealing that [key finding] and identifying [failure mode/best practice].
```

### For Code/Data Release

```
• We release [resource name], a [description] containing [scale/scope],
  available at [URL]. This enables [future work/reproducibility].
```

---

## Common Mistakes to Avoid

### Vague Contributions

❌ **Bad**:
```
• We propose a novel method for attention
• We show our method is better than baselines
• We provide theoretical analysis
```

✅ **Good**:
```
• We propose LongFlash, achieving 2-4× speedup over FlashAttention
• We prove LongFlash achieves optimal O(N²d/M) IO complexity
• We enable 8× longer context training on fixed hardware budget
```

### Missing Quantification

❌ **Bad**: "Our method significantly outperforms prior work"

✅ **Good**: "Our method improves accuracy by 3.2% on GLUE and 4.1% on SuperGLUE"

### Overlapping Bullets

❌ **Bad**:
```
• We propose a new attention mechanism
• We introduce LongFlash attention
• Our novel attention approach...
```
(These say the same thing three times)

### Buried Contributions

❌ **Bad**: Contribution bullets at the end of page 2

✅ **Good**: Contribution bullets clearly visible by end of page 1

---

## See Also

- `ml_conference_style.md` - Comprehensive ML conference guide
- `venue_writing_styles.md` - Style comparison across venues