# arXiv API Reference ## Overview The arXiv API provides programmatic access to preprint metadata via an Atom XML feed. It supports search queries with field-specific operators, boolean logic, ID-based retrieval, sorting, and pagination. No authentication required. ## Base URL ``` http://export.arxiv.org/api/query ``` ## Rate Limiting - Recommended: **1 request per 3 seconds** - Aggressive crawling will result in temporary IP bans - Use `time.sleep(3)` between requests - Include a descriptive `User-Agent` header ## Query Parameters | Parameter | Description | Default | |-----------|-------------|---------| | `search_query` | Query string with field prefixes and boolean operators | (none) | | `id_list` | Comma-separated arXiv IDs | (none) | | `start` | Starting index for pagination (0-based) | `0` | | `max_results` | Number of results to return (max 300) | `10` | | `sortBy` | Sort field: `relevance`, `lastUpdatedDate`, `submittedDate` | `relevance` | | `sortOrder` | Sort direction: `ascending`, `descending` | `descending` | **Note**: `search_query` and `id_list` can be used together (results are ANDed) or separately. ## Search Query Syntax ### Field Prefixes | Prefix | Field | Example | |--------|-------|---------| | `ti:` | Title | `ti:transformer` | | `au:` | Author | `au:bengio` | | `abs:` | Abstract | `abs:attention mechanism` | | `co:` | Comment | `co:accepted at NeurIPS` | | `jr:` | Journal Reference | `jr:Nature` | | `cat:` | Category | `cat:cs.LG` | | `all:` | All fields | `all:deep learning` | | `id:` | arXiv ID | `id:2309.10668` | ### Boolean Operators Operators **must** be uppercase: ``` ti:transformer AND abs:attention # Both conditions au:bengio OR au:lecun # Either condition cat:cs.LG ANDNOT cat:cs.CV # Exclude category ``` ### Grouping Use parentheses for complex queries: ``` (ti:sparse AND ti:autoencoder) AND cat:cs.LG au:anthropic AND (abs:interpretability OR abs:alignment) (cat:cs.LG OR cat:cs.CL) AND ti:reinforcement learning ``` ### Phrase Search Quotes for exact phrases: ``` ti:"sparse autoencoder" au:"Yoshua Bengio" abs:"reinforcement learning from human feedback" ``` ### Wildcards Not supported by the arXiv API. Use broader terms and filter client-side. ## Example Requests ### Basic keyword search ``` GET http://export.arxiv.org/api/query?search_query=all:sparse+autoencoder&max_results=10 ``` ### Author + category ``` GET http://export.arxiv.org/api/query?search_query=au:anthropic+AND+cat:cs.LG&max_results=50&sortBy=submittedDate ``` ### ID lookup ``` GET http://export.arxiv.org/api/query?id_list=2309.10668,2406.04093 ``` ### Combined search + ID ``` GET http://export.arxiv.org/api/query?search_query=cat:cs.LG&id_list=2309.10668 ``` ### Paginated results ``` # Page 1 (results 0-99) GET ...?search_query=cat:cs.LG&start=0&max_results=100&sortBy=submittedDate # Page 2 (results 100-199) GET ...?search_query=cat:cs.LG&start=100&max_results=100&sortBy=submittedDate ``` ## Response Format (Atom XML) The API returns an Atom 1.0 XML feed. ### Feed-level elements ```xml ArXiv Query: ... http://arxiv.org/api/... 2024-01-15T00:00:00-05:00 1500 0 50 ... ... ``` ### Entry elements ```xml http://arxiv.org/abs/2309.10668v2 2023-09-19T17:58:00Z 2023-10-04T14:22:00Z Towards Monosemanticity: Decomposing Language Models... We attempt to reverse-engineer a trained neural network... Trenton Bricken Adly Templeton 42 pages, 30 figures 10.48550/arXiv.2309.10668 ... ``` ### Entry field descriptions | Field | Description | |-------|-------------| | `id` | Canonical arXiv URL with version (e.g., `http://arxiv.org/abs/2309.10668v2`) | | `published` | First submission date (ISO 8601) | | `updated` | Last update date (ISO 8601) | | `title` | Paper title (may contain line breaks in XML) | | `summary` | Full abstract text | | `author/name` | Author full name (one per `` element) | | `arxiv:primary_category` | Primary arXiv category | | `category` | All categories (multiple elements) | | `link[@type='text/html']` | Abstract page URL | | `link[@title='pdf']` | PDF download URL | | `arxiv:comment` | Author comment (page count, conference, etc.) | | `arxiv:doi` | Associated DOI (if exists) | | `arxiv:journal_ref` | Journal publication reference (if published) | ## Complete Category List ### Computer Science (cs.*) | Category | Name | |----------|------| | `cs.AI` | Artificial Intelligence | | `cs.AR` | Hardware Architecture | | `cs.CC` | Computational Complexity | | `cs.CE` | Computational Engineering, Finance, and Science | | `cs.CG` | Computational Geometry | | `cs.CL` | Computation and Language | | `cs.CR` | Cryptography and Security | | `cs.CV` | Computer Vision and Pattern Recognition | | `cs.CY` | Computers and Society | | `cs.DB` | Databases | | `cs.DC` | Distributed, Parallel, and Cluster Computing | | `cs.DL` | Digital Libraries | | `cs.DM` | Discrete Mathematics | | `cs.DS` | Data Structures and Algorithms | | `cs.ET` | Emerging Technologies | | `cs.FL` | Formal Languages and Automata Theory | | `cs.GL` | General Literature | | `cs.GR` | Graphics | | `cs.GT` | Computer Science and Game Theory | | `cs.HC` | Human-Computer Interaction | | `cs.IR` | Information Retrieval | | `cs.IT` | Information Theory | | `cs.LG` | Machine Learning | | `cs.LO` | Logic in Computer Science | | `cs.MA` | Multiagent Systems | | `cs.MM` | Multimedia | | `cs.MS` | Mathematical Software | | `cs.NA` | Numerical Analysis | | `cs.NE` | Neural and Evolutionary Computing | | `cs.NI` | Networking and Internet Architecture | | `cs.OH` | Other Computer Science | | `cs.OS` | Operating Systems | | `cs.PF` | Performance | | `cs.PL` | Programming Languages | | `cs.RO` | Robotics | | `cs.SC` | Symbolic Computation | | `cs.SD` | Sound | | `cs.SE` | Software Engineering | | `cs.SI` | Social and Information Networks | | `cs.SY` | Systems and Control | ### Statistics (stat.*) | Category | Name | |----------|------| | `stat.AP` | Applications | | `stat.CO` | Computation | | `stat.ME` | Methodology | | `stat.ML` | Machine Learning | | `stat.OT` | Other Statistics | | `stat.TH` | Statistics Theory | ### Mathematics (math.*) | Category | Name | |----------|------| | `math.AC` | Commutative Algebra | | `math.AG` | Algebraic Geometry | | `math.AP` | Analysis of PDEs | | `math.AT` | Algebraic Topology | | `math.CA` | Classical Analysis and ODEs | | `math.CO` | Combinatorics | | `math.CT` | Category Theory | | `math.CV` | Complex Variables | | `math.DG` | Differential Geometry | | `math.DS` | Dynamical Systems | | `math.FA` | Functional Analysis | | `math.GM` | General Mathematics | | `math.GN` | General Topology | | `math.GR` | Group Theory | | `math.GT` | Geometric Topology | | `math.HO` | History and Overview | | `math.IT` | Information Theory | | `math.KT` | K-Theory and Homology | | `math.LO` | Logic | | `math.MG` | Metric Geometry | | `math.MP` | Mathematical Physics | | `math.NA` | Numerical Analysis | | `math.NT` | Number Theory | | `math.OA` | Operator Algebras | | `math.OC` | Optimization and Control | | `math.PR` | Probability | | `math.QA` | Quantum Algebra | | `math.RA` | Rings and Algebras | | `math.RT` | Representation Theory | | `math.SG` | Symplectic Geometry | | `math.SP` | Spectral Theory | | `math.ST` | Statistics Theory | ### Physics | Category | Name | |----------|------| | `astro-ph` | Astrophysics (+ subcategories: .CO, .EP, .GA, .HE, .IM, .SR) | | `cond-mat` | Condensed Matter (+ subcategories) | | `gr-qc` | General Relativity and Quantum Cosmology | | `hep-ex` | High Energy Physics - Experiment | | `hep-lat` | High Energy Physics - Lattice | | `hep-ph` | High Energy Physics - Phenomenology | | `hep-th` | High Energy Physics - Theory | | `math-ph` | Mathematical Physics | | `nlin` | Nonlinear Sciences (+ subcategories) | | `nucl-ex` | Nuclear Experiment | | `nucl-th` | Nuclear Theory | | `physics` | Physics (+ subcategories: .comp-ph, .data-an, .bio-ph, etc.) | | `quant-ph` | Quantum Physics | ### Quantitative Biology (q-bio.*) | Category | Name | |----------|------| | `q-bio.BM` | Biomolecules | | `q-bio.CB` | Cell Behavior | | `q-bio.GN` | Genomics | | `q-bio.MN` | Molecular Networks | | `q-bio.NC` | Neurons and Cognition | | `q-bio.OT` | Other Quantitative Biology | | `q-bio.PE` | Populations and Evolution | | `q-bio.QM` | Quantitative Methods | | `q-bio.SC` | Subcellular Processes | | `q-bio.TO` | Tissues and Organs | ### Quantitative Finance (q-fin.*) | Category | Name | |----------|------| | `q-fin.CP` | Computational Finance | | `q-fin.EC` | Economics | | `q-fin.GN` | General Finance | | `q-fin.MF` | Mathematical Finance | | `q-fin.PM` | Portfolio Management | | `q-fin.PR` | Pricing of Securities | | `q-fin.RM` | Risk Management | | `q-fin.ST` | Statistical Finance | | `q-fin.TR` | Trading and Market Microstructure | ### Electrical Engineering and Systems Science (eess.*) | Category | Name | |----------|------| | `eess.AS` | Audio and Speech Processing | | `eess.IV` | Image and Video Processing | | `eess.SP` | Signal Processing | | `eess.SY` | Systems and Control | ### Economics (econ.*) | Category | Name | |----------|------| | `econ.EM` | Econometrics | | `econ.GN` | General Economics | | `econ.TH` | Theoretical Economics | ## Pagination The API returns at most 300 results per request. For larger result sets, paginate: ```python all_results = [] start = 0 batch_size = 100 while True: params = { "search_query": "cat:cs.LG", "start": start, "max_results": batch_size, "sortBy": "submittedDate", "sortOrder": "descending", } results = fetch(params) # your fetch function if not results: break all_results.extend(results) start += batch_size time.sleep(3) # respect rate limit ``` The total number of results available is in the `opensearch:totalResults` element of the feed. ## Downloading Papers ### PDF ``` http://arxiv.org/pdf/{arxiv_id} http://arxiv.org/pdf/{arxiv_id}v{version} ``` ### Abstract page ``` http://arxiv.org/abs/{arxiv_id} ``` ### Source (LaTeX) ``` http://arxiv.org/e-print/{arxiv_id} ``` ### HTML (experimental) ``` http://arxiv.org/html/{arxiv_id} ``` ## arXiv ID Formats | Format | Era | Example | |--------|-----|---------| | `YYMM.NNNNN` | 2015+ | `2309.10668` | | `YYMM.NNNN` | 2007-2014 | `0706.0001` | | `archive/YYMMNNN` | Pre-2007 | `hep-th/9901001` | All formats are accepted by the API. ## Common Pitfalls 1. **Boolean operators must be UPPERCASE**: `AND`, `OR`, `ANDNOT` (lowercase is treated as search terms) 2. **URL encoding**: Spaces in queries must be encoded as `+` or `%20` 3. **No full-text search**: The API only searches metadata (title, abstract, authors, etc.) 4. **Empty result placeholder**: When no results are found, arXiv may return a single entry with an empty title and the id `http://arxiv.org/api/errors` - filter this out 5. **Version numbering**: `published` date is v1 submission; `updated` is latest version date 6. **Rate limiting**: Exceeding limits can result in 403 errors or temporary bans 7. **Max 300 per request**: Even if `max_results` is set higher, only 300 are returned ## External Resources - arXiv API documentation: https://info.arxiv.org/help/api/index.html - arXiv API user manual: https://info.arxiv.org/help/api/user-manual.html - arXiv bulk data access: https://info.arxiv.org/help/bulk_data.html - arXiv category taxonomy: https://arxiv.org/category_taxonomy - OAI-PMH interface (for bulk metadata): http://export.arxiv.org/oai2