mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00

Add ArXiv skill

scientific-skills/arxiv-database/SKILL.md (new file, 362 lines)

---
name: arxiv-database
description: Search and retrieve preprints from arXiv via the Atom API. Use this skill when searching for papers in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, or economics by keywords, authors, arXiv IDs, date ranges, or categories.
license: MIT
metadata:
  skill-author: Orchestra Research
---

# arXiv Database

## Overview

This skill provides Python tools for searching and retrieving preprints from arXiv.org via its public Atom API. It supports keyword search, author search, category filtering, arXiv ID lookup, and PDF download. Results are returned as structured JSON with titles, abstracts, authors, categories, and links.

## When to Use This Skill

Use this skill when:
- Searching for preprints in CS, ML, AI, physics, math, statistics, q-bio, q-fin, or economics
- Looking up specific papers by arXiv ID (e.g., `2309.10668`)
- Tracking an author's recent preprints
- Filtering papers by arXiv category (e.g., `cs.LG`, `cs.CL`, `stat.ML`)
- Downloading PDFs for full-text analysis
- Building literature review datasets for AI/ML research
- Monitoring new submissions in a subfield

Consider alternatives when:
- Searching specifically for biomedical literature -> use **pubmed-database** or **biorxiv-database**
- You need citation counts or impact metrics -> use **openalex-database**
- You need peer-reviewed journal articles only -> use **pubmed-database**

## Core Search Capabilities

### 1. Keyword Search

Search for papers by keywords in titles, abstracts, or all fields.

```bash
python scripts/arxiv_search.py \
    --keywords "sparse autoencoders" "mechanistic interpretability" \
    --max-results 20 \
    --output results.json
```

With category filter:
```bash
python scripts/arxiv_search.py \
    --keywords "transformer" "attention mechanism" \
    --category cs.LG \
    --max-results 50 \
    --output transformer_papers.json
```

Search specific fields:
```bash
# Title only
python scripts/arxiv_search.py \
    --keywords "GRPO" \
    --search-field ti \
    --max-results 10

# Abstract only
python scripts/arxiv_search.py \
    --keywords "reward model" "RLHF" \
    --search-field abs \
    --max-results 30
```

### 2. Author Search

```bash
python scripts/arxiv_search.py \
    --author "Anthropic" \
    --max-results 50 \
    --output anthropic_papers.json
```

```bash
python scripts/arxiv_search.py \
    --author "Ilya Sutskever" \
    --category cs.LG \
    --max-results 20
```

### 3. arXiv ID Lookup

Retrieve metadata for specific papers:

```bash
python scripts/arxiv_search.py \
    --ids 2309.10668 2406.04093 2310.01405 \
    --output sae_papers.json
```

Full arXiv URLs are also accepted:
```bash
python scripts/arxiv_search.py \
    --ids "https://arxiv.org/abs/2309.10668"
```

### 4. Category Browsing

List recent papers in a category:
```bash
python scripts/arxiv_search.py \
    --category cs.AI \
    --max-results 100 \
    --sort-by submittedDate \
    --output recent_cs_ai.json
```

### 5. PDF Download

```bash
python scripts/arxiv_search.py \
    --ids 2309.10668 \
    --download-pdf papers/
```

Batch download from search results:
```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Search first
results = searcher.search(query="ti:sparse autoencoder", max_results=5)

# Download all (assumes the papers/ directory already exists)
for paper in results:
    arxiv_id = paper["arxiv_id"]
    searcher.download_pdf(arxiv_id, f"papers/{arxiv_id.replace('/', '_')}.pdf")
```

## arXiv Categories

### Computer Science (cs.*)
| Category | Description |
|----------|-------------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.NE` | Neural and Evolutionary Computing |
| `cs.RO` | Robotics |
| `cs.CR` | Cryptography and Security |
| `cs.DS` | Data Structures and Algorithms |
| `cs.IR` | Information Retrieval |
| `cs.SE` | Software Engineering |

### Statistics & Math
| Category | Description |
|----------|-------------|
| `stat.ML` | Machine Learning (Statistics) |
| `stat.ME` | Methodology |
| `math.OC` | Optimization and Control |
| `math.ST` | Statistics Theory |

### Other Relevant Categories
| Category | Description |
|----------|-------------|
| `q-bio.BM` | Biomolecules |
| `q-bio.GN` | Genomics |
| `q-bio.QM` | Quantitative Methods |
| `q-fin.ST` | Statistical Finance |
| `eess.SP` | Signal Processing |
| `physics.comp-ph` | Computational Physics |

Full list: see [references/api_reference.md](references/api_reference.md).

## Query Syntax

The arXiv API uses prefix-based field searches combined with Boolean operators.

**Field prefixes:**
- `ti:` - Title
- `au:` - Author
- `abs:` - Abstract
- `cat:` - Category
- `all:` - All fields (default)
- `co:` - Comment
- `jr:` - Journal reference
- `id:` - arXiv ID

**Boolean operators** (must be UPPERCASE):
```
ti:transformer AND abs:attention
au:bengio OR au:lecun
cat:cs.LG ANDNOT cat:cs.CV
```

**Grouping with parentheses:**
```
(ti:sparse AND ti:autoencoder) AND cat:cs.LG
au:anthropic AND (abs:interpretability OR abs:alignment)
```

**Examples:**
```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Papers about SAEs in ML
results = searcher.search(
    query='ti:"sparse autoencoder" AND cat:cs.LG',
    max_results=50,
    sort_by="submittedDate"
)

# Specific author in specific field
results = searcher.search(
    query='au:"neel nanda" AND cat:cs.LG',
    max_results=20
)

# Complex boolean query
results = searcher.search(
    query='(abs:RLHF OR abs:"reinforcement learning from human feedback") AND cat:cs.CL',
    max_results=100
)
```

## Output Format

All searches return structured JSON:

```json
{
  "query": "ti:sparse autoencoder AND cat:cs.LG",
  "result_count": 15,
  "results": [
    {
      "arxiv_id": "2309.10668",
      "title": "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning",
      "authors": ["Trenton Bricken", "Adly Templeton", "..."],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "primary_category": "cs.LG",
      "published": "2023-09-19T17:58:00Z",
      "updated": "2023-10-04T14:22:00Z",
      "doi": "10.48550/arXiv.2309.10668",
      "pdf_url": "http://arxiv.org/pdf/2309.10668v1",
      "abs_url": "http://arxiv.org/abs/2309.10668v1",
      "comment": "42 pages, 30 figures",
      "journal_ref": ""
    }
  ]
}
```

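Saved output in this format can be consumed directly for downstream filtering. A minimal sketch, where the inline `data` dict is a hypothetical stand-in for a loaded `results.json` (the second entry is fabricated for illustration):

```python
import json

# Stand-in for json.load(open("results.json"))
data = {
    "query": "ti:sparse autoencoder AND cat:cs.LG",
    "result_count": 2,
    "results": [
        {"arxiv_id": "2309.10668",
         "title": "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning",
         "primary_category": "cs.LG", "categories": ["cs.LG", "cs.AI"]},
        {"arxiv_id": "2406.04093",
         "title": "Illustrative second entry",
         "primary_category": "cs.AI", "categories": ["cs.AI"]},
    ],
}

# Keep only papers whose primary category is cs.LG
ml_papers = [p for p in data["results"] if p["primary_category"] == "cs.LG"]
print(len(ml_papers))  # 1
```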
## Common Usage Patterns

### Literature Review Workflow

```python
import json

import pandas as pd

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# 1. Broad search
results = searcher.search(
    query='abs:"mechanistic interpretability" AND cat:cs.LG',
    max_results=200,
    sort_by="submittedDate"
)

# 2. Save results
with open("interp_papers.json", "w") as f:
    json.dump({"result_count": len(results), "results": results}, f, indent=2)

# 3. Filter and analyze
df = pd.DataFrame(results)
print(f"Total papers: {len(df)}")
print(f"Date range: {df['published'].min()} to {df['published'].max()}")
print("\nTop categories:")
print(df["primary_category"].value_counts().head(10))
```

### Track a Research Group

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

groups = {
    "anthropic": "au:anthropic AND (cat:cs.LG OR cat:cs.CL)",
    "openai": "au:openai AND cat:cs.CL",
    "deepmind": "au:deepmind AND cat:cs.LG",
}

for name, query in groups.items():
    results = searcher.search(query=query, max_results=50, sort_by="submittedDate")
    print(f"{name}: {len(results)} recent papers")
```

### Monitor New Submissions

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Most recent ML papers
results = searcher.search(
    query="cat:cs.LG",
    max_results=50,
    sort_by="submittedDate",
    sort_order="descending"
)

for paper in results[:10]:
    print(f"[{paper['published'][:10]}] {paper['title']}")
    print(f"    {paper['abs_url']}\n")
```

## Python API

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher(verbose=True)

# Free-form query (uses arXiv query syntax)
results = searcher.search(query="...", max_results=50)

# Lookup by ID
papers = searcher.get_by_ids(["2309.10668", "2406.04093"])

# Download PDF
searcher.download_pdf("2309.10668", "paper.pdf")

# Build query from components
query = ArxivSearcher.build_query(
    title="sparse autoencoder",
    author="anthropic",
    category="cs.LG"
)
results = searcher.search(query=query, max_results=20)
```

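The query-building step boils down to joining field-prefixed terms with `AND` and quoting multi-word phrases. A minimal standalone sketch of that idea (an illustration, not the script's exact implementation):

```python
def build_query(title=None, author=None, abstract=None, category=None):
    """Combine field-prefixed terms with AND, quoting multi-word phrases."""
    parts = []
    for prefix, value in (("ti", title), ("au", author),
                          ("abs", abstract), ("cat", category)):
        if value:
            # Quote phrases so arXiv treats them as a single unit
            term = f'"{value}"' if " " in value else value
            parts.append(f"{prefix}:{term}")
    return " AND ".join(parts)

query = build_query(title="sparse autoencoder", author="anthropic", category="cs.LG")
print(query)  # ti:"sparse autoencoder" AND au:anthropic AND cat:cs.LG
```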
## Best Practices

1. **Respect rate limits**: arXiv asks for a 3-second delay between API calls. The script handles this automatically.
2. **Use category filters**: Dramatically reduces noise. `cs.LG` is where most ML papers live.
3. **Cache results**: Save to JSON to avoid re-fetching.
4. **Use `sort_by=submittedDate`** for recent papers, `relevance` for keyword searches.
5. **Max 300 results per query**: The arXiv API caps single requests at this. For larger sets, paginate with the `start` parameter.
6. **arXiv IDs**: Use bare IDs (`2309.10668`), not full URLs, in programmatic code.
7. **Combine with openalex-database**: For the citation counts and impact metrics arXiv doesn't provide.

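The caching practice can be wrapped in a small helper. A sketch with illustrative names (`cached_search` is not part of the skill's API):

```python
import json
import os

def cached_search(searcher, query, cache_path, **kwargs):
    """Return cached results if present; otherwise search and cache to JSON."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    results = searcher.search(query=query, **kwargs)
    with open(cache_path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```

Repeated calls with the same `cache_path` hit the JSON file instead of the API, which also sidesteps the rate limit during iterative analysis.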
## Limitations

- **No full-text search**: Only searches metadata (title, abstract, authors, comments)
- **No citation data**: Use openalex-database or Semantic Scholar for citations
- **Max 300 results per query**: Use pagination for larger sets
- **Rate limited**: ~1 request per 3 seconds recommended
- **Atom XML responses**: The script parses these into JSON automatically
- **Search lag**: New papers may take hours to appear in API results

## Reference Documentation

- **API Reference**: See [references/api_reference.md](references/api_reference.md) for full endpoint specs, all categories, and response schemas

scientific-skills/arxiv-database/references/api_reference.md (new file, 430 lines)

# arXiv API Reference

## Overview

The arXiv API provides programmatic access to preprint metadata via an Atom XML feed. It supports search queries with field-specific operators, boolean logic, ID-based retrieval, sorting, and pagination. No authentication required.

## Base URL

```
http://export.arxiv.org/api/query
```

## Rate Limiting

- Recommended: **1 request per 3 seconds**
- Aggressive crawling will result in temporary IP bans
- Use `time.sleep(3)` between requests
- Include a descriptive `User-Agent` header

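The delay is easy to enforce with a tiny wrapper that tracks the last request time. A sketch (class and method names are illustrative):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=3.0)
# Call limiter.wait() immediately before each API request.
```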
## Query Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `search_query` | Query string with field prefixes and boolean operators | (none) |
| `id_list` | Comma-separated arXiv IDs | (none) |
| `start` | Starting index for pagination (0-based) | `0` |
| `max_results` | Number of results to return (max 300) | `10` |
| `sortBy` | Sort field: `relevance`, `lastUpdatedDate`, `submittedDate` | `relevance` |
| `sortOrder` | Sort direction: `ascending`, `descending` | `descending` |

**Note**: `search_query` and `id_list` can be used together (results are ANDed) or separately.

## Search Query Syntax

### Field Prefixes

| Prefix | Field | Example |
|--------|-------|---------|
| `ti:` | Title | `ti:transformer` |
| `au:` | Author | `au:bengio` |
| `abs:` | Abstract | `abs:"attention mechanism"` |
| `co:` | Comment | `co:"accepted at NeurIPS"` |
| `jr:` | Journal Reference | `jr:Nature` |
| `cat:` | Category | `cat:cs.LG` |
| `all:` | All fields | `all:"deep learning"` |
| `id:` | arXiv ID | `id:2309.10668` |

### Boolean Operators

Operators **must** be uppercase:

```
ti:transformer AND abs:attention    # Both conditions
au:bengio OR au:lecun               # Either condition
cat:cs.LG ANDNOT cat:cs.CV          # Exclude category
```

### Grouping

Use parentheses for complex queries:

```
(ti:sparse AND ti:autoencoder) AND cat:cs.LG
au:anthropic AND (abs:interpretability OR abs:alignment)
(cat:cs.LG OR cat:cs.CL) AND ti:"reinforcement learning"
```

### Phrase Search

Quotes for exact phrases:

```
ti:"sparse autoencoder"
au:"Yoshua Bengio"
abs:"reinforcement learning from human feedback"
```

### Wildcards

Not supported by the arXiv API. Use broader terms and filter client-side.

## Example Requests

### Basic keyword search
```
GET http://export.arxiv.org/api/query?search_query=all:sparse+autoencoder&max_results=10
```

### Author + category
```
GET http://export.arxiv.org/api/query?search_query=au:anthropic+AND+cat:cs.LG&max_results=50&sortBy=submittedDate
```

### ID lookup
```
GET http://export.arxiv.org/api/query?id_list=2309.10668,2406.04093
```

### Combined search + ID
```
GET http://export.arxiv.org/api/query?search_query=cat:cs.LG&id_list=2309.10668
```

### Paginated results
```
# Page 1 (results 0-99)
GET ...?search_query=cat:cs.LG&start=0&max_results=100&sortBy=submittedDate

# Page 2 (results 100-199)
GET ...?search_query=cat:cs.LG&start=100&max_results=100&sortBy=submittedDate
```

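Rather than hand-encoding these URLs, the standard library can assemble them; `urllib.parse.urlencode` percent-encodes colons and turns spaces into `+`. A sketch:

```python
from urllib.parse import urlencode

params = {
    "search_query": "au:anthropic AND cat:cs.LG",
    "max_results": 50,
    "sortBy": "submittedDate",
}
# urlencode handles the +/%XX escaping of spaces and colons
url = "http://export.arxiv.org/api/query?" + urlencode(params)
print(url)
```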
## Response Format (Atom XML)

The API returns an Atom 1.0 XML feed.

### Feed-level elements

```xml
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">

  <title>ArXiv Query: ...</title>
  <id>http://arxiv.org/api/...</id>
  <updated>2024-01-15T00:00:00-05:00</updated>

  <!-- Total results available (not just returned) -->
  <opensearch:totalResults>1500</opensearch:totalResults>
  <opensearch:startIndex>0</opensearch:startIndex>
  <opensearch:itemsPerPage>50</opensearch:itemsPerPage>

  <entry>...</entry>
  <entry>...</entry>
</feed>
```

### Entry elements

```xml
<entry>
  <!-- Unique identifier (includes version) -->
  <id>http://arxiv.org/abs/2309.10668v2</id>

  <!-- Dates -->
  <published>2023-09-19T17:58:00Z</published>
  <updated>2023-10-04T14:22:00Z</updated>

  <!-- Metadata -->
  <title>Towards Monosemanticity: Decomposing Language Models...</title>
  <summary>We attempt to reverse-engineer a trained neural network...</summary>

  <!-- Authors -->
  <author>
    <name>Trenton Bricken</name>
  </author>
  <author>
    <name>Adly Templeton</name>
  </author>

  <!-- Categories -->
  <arxiv:primary_category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>

  <!-- Links -->
  <link href="http://arxiv.org/abs/2309.10668v2" rel="alternate" type="text/html"/>
  <link href="http://arxiv.org/pdf/2309.10668v2" rel="related" type="application/pdf" title="pdf"/>

  <!-- Optional -->
  <arxiv:comment>42 pages, 30 figures</arxiv:comment>
  <arxiv:doi>10.48550/arXiv.2309.10668</arxiv:doi>
  <arxiv:journal_ref>...</arxiv:journal_ref>
</entry>
```

### Entry field descriptions

| Field | Description |
|-------|-------------|
| `id` | Canonical arXiv URL with version (e.g., `http://arxiv.org/abs/2309.10668v2`) |
| `published` | First submission date (ISO 8601) |
| `updated` | Last update date (ISO 8601) |
| `title` | Paper title (may contain line breaks in XML) |
| `summary` | Full abstract text |
| `author/name` | Author full name (one per `<author>` element) |
| `arxiv:primary_category` | Primary arXiv category |
| `category` | All categories (multiple elements) |
| `link[@type='text/html']` | Abstract page URL |
| `link[@title='pdf']` | PDF download URL |
| `arxiv:comment` | Author comment (page count, conference, etc.) |
| `arxiv:doi` | Associated DOI (if one exists) |
| `arxiv:journal_ref` | Journal publication reference (if published) |

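Because the feed is namespaced Atom, every lookup needs the namespace prefix. A minimal parsing sketch with the standard library on an inline sample entry:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed-down sample of the feed structure shown above
sample = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2309.10668v2</id>
    <title>Towards Monosemanticity</title>
    <author><name>Trenton Bricken</name></author>
  </entry>
</feed>"""

root = ET.fromstring(sample)
entry = root.find(f"{ATOM}entry")
title = entry.find(f"{ATOM}title").text
authors = [a.find(f"{ATOM}name").text for a in entry.findall(f"{ATOM}author")]
print(title, authors)
```

Without the `{http://www.w3.org/2005/Atom}` prefix, `find("entry")` returns `None`, which is a common first stumble with this feed.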
## Complete Category List

### Computer Science (cs.*)

| Category | Name |
|----------|------|
| `cs.AI` | Artificial Intelligence |
| `cs.AR` | Hardware Architecture |
| `cs.CC` | Computational Complexity |
| `cs.CE` | Computational Engineering, Finance, and Science |
| `cs.CG` | Computational Geometry |
| `cs.CL` | Computation and Language |
| `cs.CR` | Cryptography and Security |
| `cs.CV` | Computer Vision and Pattern Recognition |
| `cs.CY` | Computers and Society |
| `cs.DB` | Databases |
| `cs.DC` | Distributed, Parallel, and Cluster Computing |
| `cs.DL` | Digital Libraries |
| `cs.DM` | Discrete Mathematics |
| `cs.DS` | Data Structures and Algorithms |
| `cs.ET` | Emerging Technologies |
| `cs.FL` | Formal Languages and Automata Theory |
| `cs.GL` | General Literature |
| `cs.GR` | Graphics |
| `cs.GT` | Computer Science and Game Theory |
| `cs.HC` | Human-Computer Interaction |
| `cs.IR` | Information Retrieval |
| `cs.IT` | Information Theory |
| `cs.LG` | Machine Learning |
| `cs.LO` | Logic in Computer Science |
| `cs.MA` | Multiagent Systems |
| `cs.MM` | Multimedia |
| `cs.MS` | Mathematical Software |
| `cs.NA` | Numerical Analysis |
| `cs.NE` | Neural and Evolutionary Computing |
| `cs.NI` | Networking and Internet Architecture |
| `cs.OH` | Other Computer Science |
| `cs.OS` | Operating Systems |
| `cs.PF` | Performance |
| `cs.PL` | Programming Languages |
| `cs.RO` | Robotics |
| `cs.SC` | Symbolic Computation |
| `cs.SD` | Sound |
| `cs.SE` | Software Engineering |
| `cs.SI` | Social and Information Networks |
| `cs.SY` | Systems and Control |

### Statistics (stat.*)

| Category | Name |
|----------|------|
| `stat.AP` | Applications |
| `stat.CO` | Computation |
| `stat.ME` | Methodology |
| `stat.ML` | Machine Learning |
| `stat.OT` | Other Statistics |
| `stat.TH` | Statistics Theory |

### Mathematics (math.*)

| Category | Name |
|----------|------|
| `math.AC` | Commutative Algebra |
| `math.AG` | Algebraic Geometry |
| `math.AP` | Analysis of PDEs |
| `math.AT` | Algebraic Topology |
| `math.CA` | Classical Analysis and ODEs |
| `math.CO` | Combinatorics |
| `math.CT` | Category Theory |
| `math.CV` | Complex Variables |
| `math.DG` | Differential Geometry |
| `math.DS` | Dynamical Systems |
| `math.FA` | Functional Analysis |
| `math.GM` | General Mathematics |
| `math.GN` | General Topology |
| `math.GR` | Group Theory |
| `math.GT` | Geometric Topology |
| `math.HO` | History and Overview |
| `math.IT` | Information Theory |
| `math.KT` | K-Theory and Homology |
| `math.LO` | Logic |
| `math.MG` | Metric Geometry |
| `math.MP` | Mathematical Physics |
| `math.NA` | Numerical Analysis |
| `math.NT` | Number Theory |
| `math.OA` | Operator Algebras |
| `math.OC` | Optimization and Control |
| `math.PR` | Probability |
| `math.QA` | Quantum Algebra |
| `math.RA` | Rings and Algebras |
| `math.RT` | Representation Theory |
| `math.SG` | Symplectic Geometry |
| `math.SP` | Spectral Theory |
| `math.ST` | Statistics Theory |

### Physics

| Category | Name |
|----------|------|
| `astro-ph` | Astrophysics (+ subcategories: .CO, .EP, .GA, .HE, .IM, .SR) |
| `cond-mat` | Condensed Matter (+ subcategories) |
| `gr-qc` | General Relativity and Quantum Cosmology |
| `hep-ex` | High Energy Physics - Experiment |
| `hep-lat` | High Energy Physics - Lattice |
| `hep-ph` | High Energy Physics - Phenomenology |
| `hep-th` | High Energy Physics - Theory |
| `math-ph` | Mathematical Physics |
| `nlin` | Nonlinear Sciences (+ subcategories) |
| `nucl-ex` | Nuclear Experiment |
| `nucl-th` | Nuclear Theory |
| `physics` | Physics (+ subcategories: .comp-ph, .data-an, .bio-ph, etc.) |
| `quant-ph` | Quantum Physics |

### Quantitative Biology (q-bio.*)

| Category | Name |
|----------|------|
| `q-bio.BM` | Biomolecules |
| `q-bio.CB` | Cell Behavior |
| `q-bio.GN` | Genomics |
| `q-bio.MN` | Molecular Networks |
| `q-bio.NC` | Neurons and Cognition |
| `q-bio.OT` | Other Quantitative Biology |
| `q-bio.PE` | Populations and Evolution |
| `q-bio.QM` | Quantitative Methods |
| `q-bio.SC` | Subcellular Processes |
| `q-bio.TO` | Tissues and Organs |

### Quantitative Finance (q-fin.*)

| Category | Name |
|----------|------|
| `q-fin.CP` | Computational Finance |
| `q-fin.EC` | Economics |
| `q-fin.GN` | General Finance |
| `q-fin.MF` | Mathematical Finance |
| `q-fin.PM` | Portfolio Management |
| `q-fin.PR` | Pricing of Securities |
| `q-fin.RM` | Risk Management |
| `q-fin.ST` | Statistical Finance |
| `q-fin.TR` | Trading and Market Microstructure |

### Electrical Engineering and Systems Science (eess.*)

| Category | Name |
|----------|------|
| `eess.AS` | Audio and Speech Processing |
| `eess.IV` | Image and Video Processing |
| `eess.SP` | Signal Processing |
| `eess.SY` | Systems and Control |

### Economics (econ.*)

| Category | Name |
|----------|------|
| `econ.EM` | Econometrics |
| `econ.GN` | General Economics |
| `econ.TH` | Theoretical Economics |

## Pagination

The API returns at most 300 results per request. For larger result sets, paginate:

```python
import time

all_results = []
start = 0
batch_size = 100

while True:
    params = {
        "search_query": "cat:cs.LG",
        "start": start,
        "max_results": batch_size,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    results = fetch(params)  # your fetch function
    if not results:
        break
    all_results.extend(results)
    start += batch_size
    time.sleep(3)  # respect rate limit
```

The total number of results available is in the `opensearch:totalResults` element of the feed.

## Downloading Papers

### PDF
```
http://arxiv.org/pdf/{arxiv_id}
http://arxiv.org/pdf/{arxiv_id}v{version}
```

### Abstract page
```
http://arxiv.org/abs/{arxiv_id}
```

### Source (LaTeX)
```
http://arxiv.org/e-print/{arxiv_id}
```

### HTML (experimental)
```
http://arxiv.org/html/{arxiv_id}
```

## arXiv ID Formats

| Format | Era | Example |
|--------|-----|---------|
| `YYMM.NNNNN` | 2015+ | `2309.10668` |
| `YYMM.NNNN` | 2007-2014 | `0706.0001` |
| `archive/YYMMNNN` | Pre-2007 | `hep-th/9901001` |

All formats are accepted by the API.

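All three formats, plus abs/pdf URLs and version suffixes, can be reduced to a bare ID with two regex passes. A sketch (`normalize_arxiv_id` is an illustrative helper):

```python
import re

def normalize_arxiv_id(raw):
    """Strip any arXiv URL prefix and version suffix, keeping the bare ID."""
    raw = re.sub(r"^https?://arxiv\.org/(abs|pdf)/", "", raw.strip())
    return re.sub(r"v\d+$", "", raw)

print(normalize_arxiv_id("https://arxiv.org/abs/2309.10668v2"))  # 2309.10668
print(normalize_arxiv_id("hep-th/9901001"))                      # hep-th/9901001
```

Note the version-suffix regex is anchored at the end of the string, so pre-2007 archive names like `hep-th/...` pass through untouched.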
## Common Pitfalls

1. **Boolean operators must be UPPERCASE**: `AND`, `OR`, `ANDNOT` (lowercase is treated as search terms)
2. **URL encoding**: Spaces in queries must be encoded as `+` or `%20`
3. **No full-text search**: The API only searches metadata (title, abstract, authors, etc.)
4. **Empty result placeholder**: When no results are found, arXiv may return a single entry with an empty title and the id `http://arxiv.org/api/errors` - filter this out
5. **Version numbering**: The `published` date is the v1 submission date; `updated` is the latest version's date
6. **Rate limiting**: Exceeding limits can result in 403 errors or temporary bans
7. **Max 300 per request**: Even if `max_results` is set higher, only 300 are returned

## External Resources

- arXiv API documentation: https://info.arxiv.org/help/api/index.html
- arXiv API user manual: https://info.arxiv.org/help/api/user-manual.html
- arXiv bulk data access: https://info.arxiv.org/help/bulk_data.html
- arXiv category taxonomy: https://arxiv.org/category_taxonomy
- OAI-PMH interface (for bulk metadata): http://export.arxiv.org/oai2

scientific-skills/arxiv-database/scripts/arxiv_search.py (new file, 414 lines)

#!/usr/bin/env python3
|
||||
"""
|
||||
arXiv Search Tool
|
||||
Search and retrieve preprints from arXiv via the Atom API.
|
||||
Supports keyword search, author search, category filtering, ID lookup, and PDF download.
|
||||
"""
|
||||
|
||||
import requests
|
||||
import json
|
||||
import argparse
|
||||
import xml.etree.ElementTree as ET
|
||||
import time
|
||||
import sys
|
||||
import os
|
||||
import re
|
||||
from typing import List, Dict, Optional
|
||||
from urllib.parse import quote
|
||||
|
||||
|
||||
class ArxivSearcher:
|
||||
"""Search interface for arXiv preprints via the Atom API."""
|
||||
|
||||
BASE_URL = "http://export.arxiv.org/api/query"
|
||||
ATOM_NS = "{http://www.w3.org/2005/Atom}"
|
||||
ARXIV_NS = "{http://arxiv.org/schemas/atom}"
|
||||
|
||||
VALID_SORT_BY = ["relevance", "lastUpdatedDate", "submittedDate"]
|
||||
VALID_SORT_ORDER = ["ascending", "descending"]
|
||||
VALID_SEARCH_FIELDS = ["ti", "au", "abs", "co", "jr", "cat", "all", "id"]
|
||||
|
||||
def __init__(self, verbose: bool = False, delay: float = 3.0):
|
||||
self.verbose = verbose
|
||||
self.delay = delay
|
||||
self.session = requests.Session()
|
||||
self.session.headers.update({
|
||||
"User-Agent": "ArxivSearchTool/1.0 (scientific-skills)"
|
||||
})
|
||||
self._last_request_time = 0.0
|
||||
|
||||
def _log(self, message: str):
|
||||
if self.verbose:
|
||||
print(f"[INFO] {message}", file=sys.stderr)
|
||||
|
||||
def _rate_limit(self):
|
||||
"""Enforce minimum delay between requests."""
|
||||
elapsed = time.time() - self._last_request_time
|
||||
if elapsed < self.delay:
|
||||
wait = self.delay - elapsed
|
||||
self._log(f"Rate limiting: waiting {wait:.1f}s")
|
||||
time.sleep(wait)
|
||||
self._last_request_time = time.time()
|
||||
|
||||
    def _parse_entry(self, entry: ET.Element) -> Dict:
        """Parse a single Atom entry into a dict."""

        def text(tag, ns=None):
            ns = ns or self.ATOM_NS
            el = entry.find(f"{ns}{tag}")
            return el.text.strip() if el is not None and el.text else ""

        # Authors
        authors = []
        for author_el in entry.findall(f"{self.ATOM_NS}author"):
            name_el = author_el.find(f"{self.ATOM_NS}name")
            if name_el is not None and name_el.text:
                authors.append(name_el.text.strip())

        # Categories
        categories = []
        primary_category = ""
        for cat_el in entry.findall(f"{self.ATOM_NS}category"):
            term = cat_el.get("term", "")
            if term:
                categories.append(term)
        prim_el = entry.find(f"{self.ARXIV_NS}primary_category")
        if prim_el is not None:
            primary_category = prim_el.get("term", "")

        # Links
        pdf_url = ""
        abs_url = ""
        for link_el in entry.findall(f"{self.ATOM_NS}link"):
            href = link_el.get("href", "")
            link_type = link_el.get("type", "")
            link_title = link_el.get("title", "")
            if link_title == "pdf" or link_type == "application/pdf":
                pdf_url = href
            elif link_type == "text/html" or (not link_type and "/abs/" in href):
                abs_url = href

        # Extract arXiv ID from the Atom id field
        raw_id = text("id")
        arxiv_id = re.sub(r"^https?://arxiv\.org/abs/", "", raw_id)
        # Strip version suffix for the canonical ID
        arxiv_id_bare = re.sub(r"v\d+$", "", arxiv_id)

        return {
            "arxiv_id": arxiv_id_bare,
            "title": " ".join(text("title").split()),  # collapse whitespace
            "authors": authors,
            "abstract": " ".join(text("summary").split()),
            "categories": categories,
            "primary_category": primary_category,
            "published": text("published"),
            "updated": text("updated"),
            "doi": text("doi", self.ARXIV_NS),
            "comment": text("comment", self.ARXIV_NS),
            "journal_ref": text("journal_ref", self.ARXIV_NS),
            "pdf_url": pdf_url,
            "abs_url": abs_url or f"http://arxiv.org/abs/{arxiv_id}",
        }

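For reference, a minimal standalone sketch (not part of the class) of how the namespaced Atom lookups in `_parse_entry` behave. The entry XML and author name are hypothetical sample data; the namespace URIs mirror the `ATOM_NS`/`ARXIV_NS` constants assumed earlier in this file.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefixes, as ElementTree expects for namespaced tags.
ATOM_NS = "{http://www.w3.org/2005/Atom}"
ARXIV_NS = "{http://arxiv.org/schemas/atom}"

sample = """
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/2309.10668v2</id>
  <title> Language Modeling Is
    Compression </title>
  <author><name>Alice Example</name></author>
  <category term="cs.LG"/>
  <arxiv:primary_category term="cs.LG"/>
</entry>
"""

entry = ET.fromstring(sample)
# Collapse the internal whitespace arXiv puts in wrapped titles.
title = " ".join(entry.find(f"{ATOM_NS}title").text.split())
authors = [a.find(f"{ATOM_NS}name").text for a in entry.findall(f"{ATOM_NS}author")]
primary = entry.find(f"{ARXIV_NS}primary_category").get("term")
print(title, authors, primary)
```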
    def _fetch(self, params: Dict) -> List[Dict]:
        """Execute API request and parse results."""
        self._rate_limit()
        self._log(f"Query params: {params}")

        try:
            response = self.session.get(self.BASE_URL, params=params, timeout=30)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            self._log(f"Request error: {e}")
            return []

        try:
            root = ET.fromstring(response.text)
        except ET.ParseError as e:
            self._log(f"XML parse error: {e}")
            return []
        entries = root.findall(f"{self.ATOM_NS}entry")
        self._log(f"Parsed {len(entries)} entries")

        results = []
        for entry in entries:
            parsed = self._parse_entry(entry)
            # Skip the "no results" placeholder entry arXiv returns
            if not parsed["title"] or parsed["arxiv_id"] == "":
                continue
            results.append(parsed)

        return results

    def search(
        self,
        query: str,
        max_results: int = 50,
        start: int = 0,
        sort_by: str = "relevance",
        sort_order: str = "descending",
    ) -> List[Dict]:
        """
        Search arXiv with a query string.

        Args:
            query: arXiv query string (e.g., "ti:transformer AND cat:cs.LG")
            max_results: Maximum number of results (max 300 per request)
            start: Starting index for pagination
            sort_by: One of "relevance", "lastUpdatedDate", "submittedDate"
            sort_order: "ascending" or "descending"

        Returns:
            List of paper dicts
        """
        if sort_by not in self.VALID_SORT_BY:
            raise ValueError(f"sort_by must be one of {self.VALID_SORT_BY}")
        if sort_order not in self.VALID_SORT_ORDER:
            raise ValueError(f"sort_order must be one of {self.VALID_SORT_ORDER}")

        max_results = min(max_results, 300)

        params = {
            "search_query": query,
            "start": start,
            "max_results": max_results,
            "sortBy": sort_by,
            "sortOrder": sort_order,
        }

        return self._fetch(params)

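A hedged sketch of how results beyond the 300-per-request cap could be collected by advancing the `start` offset that `search` exposes. No network calls here; only the paging arithmetic is shown, and the parameter names mirror the arXiv API fields used above.

```python
def paged_params(query: str, total: int, page_size: int = 100):
    """Yield one arXiv API parameter dict per page of results."""
    for start in range(0, total, page_size):
        yield {
            "search_query": query,
            "start": start,
            # Last page may be short.
            "max_results": min(page_size, total - start),
        }

pages = list(paged_params("cat:cs.LG", total=250, page_size=100))
print(pages)
```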
    def get_by_ids(self, arxiv_ids: List[str]) -> List[Dict]:
        """
        Retrieve papers by their arXiv IDs.

        Args:
            arxiv_ids: List of arXiv IDs (e.g., ["2309.10668", "2406.04093"])

        Returns:
            List of paper dicts
        """
        # Clean IDs: strip URLs, versions
        clean_ids = []
        for aid in arxiv_ids:
            aid = re.sub(r"^https?://arxiv\.org/abs/", "", aid.strip())
            aid = re.sub(r"v\d+$", "", aid)
            clean_ids.append(aid)

        params = {
            "id_list": ",".join(clean_ids),
            "max_results": len(clean_ids),
        }

        return self._fetch(params)

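The ID-normalization step shared by `get_by_ids` and `download_pdf` can be sketched standalone; `normalize_arxiv_id` is a hypothetical helper name, but the regexes match those used above, so abs URLs and version suffixes resolve to the same canonical ID.

```python
import re

def normalize_arxiv_id(raw: str) -> str:
    # Strip an abs URL prefix, then a trailing version suffix like "v2".
    aid = re.sub(r"^https?://arxiv\.org/abs/", "", raw.strip())
    return re.sub(r"v\d+$", "", aid)

ids = [
    "2309.10668v2",
    "https://arxiv.org/abs/2406.04093",
    "math/0211159v1",  # old-style IDs keep their slash
]
print([normalize_arxiv_id(i) for i in ids])
```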
    def download_pdf(self, arxiv_id: str, output_path: str) -> bool:
        """
        Download a paper's PDF.

        Args:
            arxiv_id: arXiv ID (e.g., "2309.10668")
            output_path: File path or directory to save to

        Returns:
            True if successful
        """
        arxiv_id = re.sub(r"^https?://arxiv\.org/abs/", "", arxiv_id.strip())
        arxiv_id = re.sub(r"v\d+$", "", arxiv_id)

        pdf_url = f"http://arxiv.org/pdf/{arxiv_id}"
        self._log(f"Downloading: {pdf_url}")

        # If output_path is a directory, generate filename
        if os.path.isdir(output_path):
            filename = arxiv_id.replace("/", "_") + ".pdf"
            output_path = os.path.join(output_path, filename)

        self._rate_limit()

        try:
            response = self.session.get(pdf_url, timeout=60)
            response.raise_for_status()

            os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
            with open(output_path, "wb") as f:
                f.write(response.content)

            self._log(f"Saved to: {output_path}")
            return True
        except Exception as e:
            self._log(f"Download error: {e}")
            return False

    @staticmethod
    def build_query(
        title: Optional[str] = None,
        author: Optional[str] = None,
        abstract: Optional[str] = None,
        category: Optional[str] = None,
        all_fields: Optional[str] = None,
    ) -> str:
        """
        Build an arXiv query string from components.

        Args:
            title: Search in title
            author: Search by author name
            abstract: Search in abstract
            category: Filter by category (e.g., "cs.LG")
            all_fields: Search all fields

        Returns:
            arXiv query string
        """
        parts = []
        if all_fields:
            parts.append(f"all:{all_fields}")
        if title:
            parts.append(f"ti:{title}")
        if author:
            parts.append(f"au:{author}")
        if abstract:
            parts.append(f"abs:{abstract}")
        if category:
            parts.append(f"cat:{category}")

        return " AND ".join(parts)


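A standalone sketch of the fielded-query grammar that `build_query` emits; this mirrors the static method above rather than importing it, so it runs on its own.

```python
from typing import Optional

def build_query(
    title: Optional[str] = None,
    author: Optional[str] = None,
    abstract: Optional[str] = None,
    category: Optional[str] = None,
    all_fields: Optional[str] = None,
) -> str:
    # Each populated field becomes one "prefix:value" clause, AND-joined.
    parts = []
    if all_fields:
        parts.append(f"all:{all_fields}")
    if title:
        parts.append(f"ti:{title}")
    if author:
        parts.append(f"au:{author}")
    if abstract:
        parts.append(f"abs:{abstract}")
    if category:
        parts.append(f"cat:{category}")
    return " AND ".join(parts)

q = build_query(title="transformer", category="cs.LG")
print(q)  # ti:transformer AND cat:cs.LG
```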
def main():
    parser = argparse.ArgumentParser(
        description="Search arXiv preprints",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --keywords "sparse autoencoder" --category cs.LG --max-results 20
  %(prog)s --author "Anthropic" --max-results 50
  %(prog)s --ids 2309.10668 2406.04093
  %(prog)s --query "ti:GRPO AND cat:cs.LG" --sort-by submittedDate
  %(prog)s --ids 2309.10668 --download-pdf papers/
""",
    )

    parser.add_argument("--verbose", "-v", action="store_true")

    search_group = parser.add_argument_group("Search options")
    search_group.add_argument("--keywords", "-k", nargs="+", help="Keywords to search")
    search_group.add_argument("--author", "-a", help="Author name")
    search_group.add_argument("--ids", nargs="+", help="arXiv IDs to look up")
    search_group.add_argument("--query", "-q", help="Raw arXiv query string")
    search_group.add_argument(
        "--search-field",
        choices=ArxivSearcher.VALID_SEARCH_FIELDS,
        default="all",
        help="Field to search keywords in (default: all)",
    )

    filter_group = parser.add_argument_group("Filter options")
    filter_group.add_argument("--category", "-c", help="arXiv category (e.g., cs.LG)")
    filter_group.add_argument("--max-results", type=int, default=50, help="Max results (default: 50, max: 300)")
    filter_group.add_argument(
        "--sort-by",
        choices=ArxivSearcher.VALID_SORT_BY,
        default="relevance",
        help="Sort order (default: relevance)",
    )
    filter_group.add_argument(
        "--sort-order",
        choices=ArxivSearcher.VALID_SORT_ORDER,
        default="descending",
    )

    output_group = parser.add_argument_group("Output options")
    output_group.add_argument("--output", "-o", help="Output JSON file (default: stdout)")
    output_group.add_argument("--download-pdf", help="Download PDFs to this directory")

    args = parser.parse_args()
    searcher = ArxivSearcher(verbose=args.verbose)

    # --- ID lookup ---
    if args.ids:
        if args.download_pdf:
            for aid in args.ids:
                searcher.download_pdf(aid, args.download_pdf)
            return 0

        results = searcher.get_by_ids(args.ids)
        query_desc = f"id_list:{','.join(args.ids)}"

    # --- Raw query ---
    elif args.query:
        query = args.query
        if args.category and f"cat:{args.category}" not in query:
            query = f"({query}) AND cat:{args.category}"

        results = searcher.search(
            query=query,
            max_results=args.max_results,
            sort_by=args.sort_by,
            sort_order=args.sort_order,
        )
        query_desc = query

    # --- Keyword search ---
    elif args.keywords:
        field = args.search_field
        keyword_parts = [f'{field}:"{kw}"' if " " in kw else f"{field}:{kw}" for kw in args.keywords]
        query = " AND ".join(keyword_parts)
        if args.category:
            query = f"({query}) AND cat:{args.category}"

        results = searcher.search(
            query=query,
            max_results=args.max_results,
            sort_by=args.sort_by,
            sort_order=args.sort_order,
        )
        query_desc = query

    # --- Author search ---
    elif args.author:
        query = f'au:"{args.author}"'
        if args.category:
            query = f"{query} AND cat:{args.category}"

        results = searcher.search(
            query=query,
            max_results=args.max_results,
            sort_by=args.sort_by,
            sort_order=args.sort_order,
        )
        query_desc = query

    # --- Category browse ---
    elif args.category:
        query = f"cat:{args.category}"
        results = searcher.search(
            query=query,
            max_results=args.max_results,
            # Relevance ranking is meaningless for a bare category browse,
            # so fall back to newest submissions first.
            sort_by="submittedDate" if args.sort_by == "relevance" else args.sort_by,
            sort_order=args.sort_order,
        )
        query_desc = query

    else:
        parser.error("Provide --keywords, --author, --ids, --query, or --category")

    # Output
    output_data = {
        "query": query_desc,
        "result_count": len(results),
        "results": results,
    }

    output_json = json.dumps(output_data, indent=2, ensure_ascii=False)

    if args.output:
        os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
        with open(args.output, "w") as f:
            f.write(output_json)
        print(f"Results written to {args.output}", file=sys.stderr)
    else:
        print(output_json)

    return 0


if __name__ == "__main__":
    sys.exit(main())