Merge pull request #66 from K-Dense-AI/add-parallel

Add parallel-web skill and update research-lookup
This commit is contained in:
Timothy Kassis
2026-03-03 07:47:29 -08:00
committed by GitHub
13 changed files with 3969 additions and 769 deletions


@@ -0,0 +1,314 @@
---
name: parallel-web
description: Search the web, extract URL content, and run deep research using the Parallel Chat API and Extract API. Use for ALL web searches, research queries, and general information gathering. Provides synthesized summaries with citations.
allowed-tools: Read Write Edit Bash
license: MIT
compatibility: PARALLEL_API_KEY required
metadata:
  skill-author: K-Dense Inc.
---
# Parallel Web Systems API
## Overview
This skill provides access to **Parallel Web Systems** APIs for web search, deep research, and content extraction. It is the **primary tool for all web-related operations** in the scientific writer workflow.
**Primary interface:** Parallel Chat API (OpenAI-compatible) for search and research.
**Secondary interface:** Extract API for URL verification and special cases only.
**API Documentation:** https://docs.parallel.ai
**API Key:** https://platform.parallel.ai
**Environment Variable:** `PARALLEL_API_KEY`
## When to Use This Skill
Use this skill for **ALL** of the following:
- **Web Search**: Any query that requires searching the internet for information
- **Deep Research**: Comprehensive research reports on any topic
- **Market Research**: Industry analysis, competitive intelligence, market data
- **Current Events**: News, recent developments, announcements
- **Technical Information**: Documentation, specifications, product details
- **Statistical Data**: Market sizes, growth rates, industry figures
- **General Information**: Company profiles, facts, comparisons
**Use Extract API only for:**
- Citation verification (confirming a specific URL's content)
- Special cases where you need raw content from a known URL
**Do NOT use this skill for:**
- Academic-specific paper searches (use `research-lookup` which routes to Perplexity for purely academic queries)
- Google Scholar / PubMed database searches (use `citation-management` skill)
---
## Capabilities
### 1. Web Search (`search` command)
Search the web via the Parallel Chat API (`base` model) and get a **synthesized summary** with cited sources.
**Best for:** General web searches, current events, fact-finding, technical lookups, news, market data.
```bash
# Basic search
python scripts/parallel_web.py search "latest advances in quantum computing 2025"
# Use core model for more complex queries
python scripts/parallel_web.py search "compare EV battery chemistries NMC vs LFP" --model core
# Save results to file
python scripts/parallel_web.py search "renewable energy policy updates" -o results.txt
# JSON output for programmatic use
python scripts/parallel_web.py search "AI regulation landscape" --json -o results.json
```
**Key Parameters:**
- `objective`: Natural language description of what you want to find
- `--model`: Chat model to use (`base` default, or `core` for deeper research)
- `-o`: Output file path
- `--json`: Output as JSON
**Response includes:** Synthesized summary organized by themes, with inline citations and a sources list.
### 2. Deep Research (`research` command)
Run comprehensive multi-source research via the Parallel Chat API (`core` model) that produces detailed intelligence reports with citations.
**Best for:** Market research, comprehensive analysis, competitive intelligence, technology surveys, industry reports, any research question requiring synthesis of multiple sources.
```bash
# Default deep research (core model)
python scripts/parallel_web.py research "comprehensive analysis of the global EV battery market"
# Save research report to file
python scripts/parallel_web.py research "AI adoption in healthcare 2025" -o report.md
# Use base model for faster, lighter research
python scripts/parallel_web.py research "latest funding rounds in AI startups" --model base
# JSON output
python scripts/parallel_web.py research "renewable energy storage market in Europe" --json -o data.json
```
**Key Parameters:**
- `query`: Research question or topic
- `--model`: Chat model to use (`core` default for deep research, or `base` for faster results)
- `-o`: Output file path
- `--json`: Output as JSON
### 3. URL Extraction (`extract` command) — Verification Only
Extract content from specific URLs. **Use only for citation verification and special cases.**
For general research, use `search` or `research` instead.
```bash
# Verify a citation's content
python scripts/parallel_web.py extract "https://example.com/article" --objective "key findings"
# Get full page content for verification
python scripts/parallel_web.py extract "https://docs.example.com/api" --full-content
# Save extraction to file
python scripts/parallel_web.py extract "https://paper-url.com" --objective "methodology" -o extracted.md
```
---
## Model Selection Guide
This skill's script exposes two Chat API research models: use `base` for most searches and `core` for deep research.
| Model | Latency | Strengths | Use When |
|--------|------------|----------------------------------|-----------------------------|
| `base` | 15s-100s | Standard research, factual queries | Web searches, quick lookups |
| `core` | 60s-5min | Complex research, multi-source synthesis | Deep research, comprehensive reports |
**Recommendations:**
- `search` command defaults to `base` — fast, good for most queries
- `research` command defaults to `core` — thorough, good for comprehensive reports
- Override with `--model` when you need different depth/speed tradeoffs
---
## Python API Usage
### Search
```python
from parallel_web import ParallelSearch
searcher = ParallelSearch()
result = searcher.search(
objective="Find latest information about transformer architectures in NLP",
model="base",
)
if result["success"]:
print(result["response"]) # Synthesized summary
for src in result["sources"]:
print(f" {src['title']}: {src['url']}")
```
### Deep Research
```python
from parallel_web import ParallelDeepResearch
researcher = ParallelDeepResearch()
result = researcher.research(
query="Comprehensive analysis of AI regulation in the EU and US",
model="core",
)
if result["success"]:
print(result["response"]) # Full research report
print(f"Citations: {result['citation_count']}")
```
### Extract (Verification Only)
```python
from parallel_web import ParallelExtract
extractor = ParallelExtract()
result = extractor.extract(
urls=["https://docs.example.com/api-reference"],
objective="API authentication methods and rate limits",
)
if result["success"]:
for r in result["results"]:
print(r["excerpts"])
```
---
## MANDATORY: Save All Results to Sources Folder
**Every web search and deep research result MUST be saved to the project's `sources/` folder.**
This ensures all research is preserved for reproducibility, auditability, and context window recovery.
### Saving Rules
| Operation | `-o` Flag Target | Filename Pattern |
|-----------|-----------------|------------------|
| Web Search | `sources/search_<topic>.md` | `search_YYYYMMDD_HHMMSS_<brief_topic>.md` |
| Deep Research | `sources/research_<topic>.md` | `research_YYYYMMDD_HHMMSS_<brief_topic>.md` |
| URL Extract | `sources/extract_<source>.md` | `extract_YYYYMMDD_HHMMSS_<brief_source>.md` |
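The timestamped patterns above can be generated with a small helper (a sketch; `sources_path` is a hypothetical name, not part of the skill's scripts):

```python
from datetime import datetime

def sources_path(kind: str, topic: str) -> str:
    """Build a sources/ filename following the documented pattern:
    <kind>_YYYYMMDD_HHMMSS_<brief_topic>.md"""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    slug = "_".join(topic.lower().split())[:40]  # brief, filesystem-safe topic
    return f"sources/{kind}_{stamp}_{slug}.md"
```

Pass the result directly to the `-o` flag.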
### How to Save (Always Use `-o` Flag)
**CRITICAL: Every call to `parallel_web.py` MUST include the `-o` flag pointing to the `sources/` folder.**
```bash
# Web search — ALWAYS save to sources/
python scripts/parallel_web.py search "latest advances in quantum computing 2025" \
-o sources/search_20250217_143000_quantum_computing.md
# Deep research — ALWAYS save to sources/
python scripts/parallel_web.py research "comprehensive analysis of the global EV battery market" \
-o sources/research_20250217_144000_ev_battery_market.md
# URL extraction (verification only) — save to sources/
python scripts/parallel_web.py extract "https://example.com/article" --objective "key findings" \
-o sources/extract_20250217_143500_example_article.md
```
### Why Save Everything
1. **Reproducibility**: Every claim in the final document can be traced back to its raw source material
2. **Context Window Recovery**: If context is compacted mid-task, saved results can be re-read from `sources/`
3. **Audit Trail**: The `sources/` folder provides complete transparency into how information was gathered
4. **Reuse Across Sections**: Saved research can be referenced by multiple sections without duplicate API calls
5. **Cost Efficiency**: Avoid redundant API calls by checking `sources/` for existing results
6. **Peer Review Support**: Reviewers can verify the research backing every claim
### Logging
When saving research results, always log:
```
[HH:MM:SS] SAVED: Search results to sources/search_20250217_143000_quantum_computing.md
[HH:MM:SS] SAVED: Deep research report to sources/research_20250217_144000_ev_battery_market.md
```
### Before Making a New Query, Check Sources First
Before calling `parallel_web.py`, check if a relevant result already exists in `sources/`:
```bash
ls sources/ # Check existing saved results
```
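The same check can be done programmatically when deciding whether to re-query (a sketch; `cached_results` is a hypothetical helper):

```python
from pathlib import Path

def cached_results(topic_keyword: str, folder: str = "sources") -> list[str]:
    """Return saved result filenames whose names mention the topic."""
    root = Path(folder)
    if not root.exists():
        return []
    return sorted(p.name for p in root.glob("*.md")
                  if topic_keyword.lower() in p.name.lower())
```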
---
## Integration with Scientific Writer
### Routing Table
| Task | Tool | Command |
|------|------|---------|
| Web search (any) | `parallel_web.py search` | `python scripts/parallel_web.py search "query" -o sources/search_<topic>.md` |
| Deep research | `parallel_web.py research` | `python scripts/parallel_web.py research "query" -o sources/research_<topic>.md` |
| Citation verification | `parallel_web.py extract` | `python scripts/parallel_web.py extract "url" -o sources/extract_<source>.md` |
| Academic paper search | `research_lookup.py` | Routes to Perplexity sonar-pro-search |
| DOI/metadata lookup | `parallel_web.py extract` | Extract from DOI URLs (verification) |
### When Writing Scientific Documents
1. **Before writing any section**, use `search` or `research` to gather background information — **save results to `sources/`**
2. **For academic citations**, use `research-lookup` (which routes academic queries to Perplexity) — **save results to `sources/`**
3. **For citation verification** (confirming a specific URL), use `parallel_web.py extract` and **save results to `sources/`**
4. **For current market/industry data**, use `parallel_web.py research --model core` and **save results to `sources/`**
5. **Before any new query**, check `sources/` for existing results to avoid duplicate API calls
---
## Environment Setup
```bash
# Required: Set your Parallel API key
export PARALLEL_API_KEY="your_api_key_here"
# Required Python packages
pip install openai # For Chat API (search/research)
pip install parallel-web # For Extract API (verification only)
```
Get your API key at https://platform.parallel.ai
---
## Error Handling
The script handles errors gracefully and returns structured error responses:
```json
{
"success": false,
"error": "Error description",
"timestamp": "2025-02-14 12:00:00"
}
```
**Common issues:**
- `PARALLEL_API_KEY not set`: Set the environment variable
- `openai not installed`: Run `pip install openai`
- `parallel-web not installed`: Run `pip install parallel-web` (only needed for extract)
- `Rate limit exceeded`: Wait and retry (default: 300 req/min for Chat API)
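A caller can branch on this shape, retrying only rate-limit errors (a sketch; `run_with_retry` is a hypothetical helper, and the error strings match the list above):

```python
import time

def run_with_retry(call, max_retries=3):
    """Invoke `call` (a zero-arg function returning the script's result dict),
    retrying with exponential backoff only when the error is a rate limit."""
    result = call()
    for attempt in range(max_retries):
        if result.get("success"):
            return result
        if "Rate limit" not in result.get("error", ""):
            raise RuntimeError(result.get("error", "unknown error"))
        time.sleep(2 ** attempt)  # 1s, 2s, 4s ...
        result = call()
    return result
```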
---
## Complementary Skills
| Skill | Use For |
|-------|---------|
| `research-lookup` | Academic paper searches (routes to Perplexity for scholarly queries) |
| `citation-management` | Google Scholar, PubMed, CrossRef database searches |
| `literature-review` | Systematic literature reviews across academic databases |
| `scientific-schematics` | Generate diagrams from research findings |


@@ -0,0 +1,244 @@
# Parallel Web Systems API Quick Reference
**Full Documentation:** https://docs.parallel.ai
**API Key:** https://platform.parallel.ai
**Python SDK:** `pip install parallel-web`
**Environment Variable:** `PARALLEL_API_KEY`
---
## Search API (Beta)
**Endpoint:** `POST https://api.parallel.ai/v1beta/search`
**Header:** `parallel-beta: search-extract-2025-10-10`
### Request
```json
{
"objective": "Natural language search goal (max 5000 chars)",
"search_queries": ["keyword query 1", "keyword query 2"],
"max_results": 10,
"excerpts": {
"max_chars_per_result": 10000,
"max_chars_total": 50000
},
"source_policy": {
"allow_domains": ["example.com"],
"deny_domains": ["spam.com"],
"after_date": "2024-01-01"
}
}
```
### Response
```json
{
"search_id": "search_...",
"results": [
{
"url": "https://...",
"title": "Page Title",
"publish_date": "2025-01-15",
"excerpts": ["Relevant content..."]
}
]
}
```
### Python SDK
```python
from parallel import Parallel
client = Parallel(api_key="...")
result = client.beta.search(
objective="...",
search_queries=["..."],
max_results=10,
excerpts={"max_chars_per_result": 10000},
)
```
**Cost:** $5 per 1,000 requests (default 10 results each)
**Rate Limit:** 600 requests/minute
---
## Extract API (Beta)
**Endpoint:** `POST https://api.parallel.ai/v1beta/extract`
**Header:** `parallel-beta: search-extract-2025-10-10`
### Request
```json
{
"urls": ["https://example.com/page"],
"objective": "What to focus on",
"excerpts": true,
"full_content": false
}
```
### Response
```json
{
"extract_id": "extract_...",
"results": [
{
"url": "https://...",
"title": "Page Title",
"excerpts": ["Focused content..."],
"full_content": null
}
],
"errors": []
}
```
### Python SDK
```python
result = client.beta.extract(
urls=["https://..."],
objective="...",
excerpts=True,
full_content=False,
)
```
**Cost:** $1 per 1,000 URLs
**Rate Limit:** 600 requests/minute
---
## Task API (Deep Research)
**Endpoint:** `POST https://api.parallel.ai/v1/tasks/runs`
### Create Task Run
```json
{
"input": "Research question (max 15,000 chars)",
"processor": "pro-fast",
"task_spec": {
"output_schema": {
"type": "text"
}
}
}
```
### Response (immediate)
```json
{
"run_id": "trun_...",
"status": "queued"
}
```
### Get Result (blocking)
**Endpoint:** `GET https://api.parallel.ai/v1/tasks/runs/{run_id}/result`
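If you are not using the SDK's blocking `result` call, the endpoint can be polled manually. A minimal sketch (the `fetch_status` callable and the `running` status value are assumptions; only `queued` appears in the response above):

```python
import time

def poll_result(fetch_status, interval=5.0, timeout=3600.0):
    """Poll until the run leaves a pending state. `fetch_status` is any
    zero-arg callable that GETs /v1/tasks/runs/{run_id}/result and
    returns the parsed JSON."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch_status()
        if run.get("status") not in ("queued", "running"):
            return run
        time.sleep(interval)
    raise TimeoutError("task run did not complete within the timeout")
```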
### Python SDK
```python
# Text output (markdown report with citations)
from parallel.types import TaskSpecParam
task_run = client.task_run.create(
input="Research question",
processor="pro-fast",
task_spec=TaskSpecParam(output_schema={"type": "text"}),
)
result = client.task_run.result(task_run.run_id, api_timeout=3600)
print(result.output.content)
# Auto-schema output (structured JSON)
task_run = client.task_run.create(
input="Research question",
processor="pro-fast",
)
result = client.task_run.result(task_run.run_id, api_timeout=3600)
print(result.output.content) # structured dict
print(result.output.basis) # citations per field
```
### Processors
| Processor | Latency | Cost/1000 | Best For |
|-----------|---------|-----------|----------|
| `lite-fast` | 10-20s | $5 | Basic metadata |
| `base-fast` | 15-50s | $10 | Standard enrichments |
| `core-fast` | 15s-100s | $25 | Cross-referenced |
| `core2x-fast` | 15s-3min | $50 | High complexity |
| **`pro-fast`** | **30s-5min** | **$100** | **Default: exploratory research** |
| `ultra-fast` | 1-10min | $300 | Deep multi-source |
| `ultra2x-fast` | 1-20min | $600 | Difficult research |
| `ultra4x-fast` | 1-40min | $1200 | Very difficult |
| `ultra8x-fast` | 1hr | $2400 | Most difficult |
Standard (non-fast) processors cost the same but run at higher latency and prioritize the freshest data.
---
## Chat API (Beta)
**Endpoint:** `POST https://api.parallel.ai/chat/completions`
**Compatible with OpenAI SDK.**
### Models
| Model | Latency (TTFT) | Cost/1000 | Use Case |
|-------|----------------|-----------|----------|
| `speed` | ~3s | $5 | Low-latency chat |
| `lite` | 10-60s | $5 | Simple lookups with basis |
| `base` | 15-100s | $10 | Standard research with basis |
| `core` | 1-5min | $25 | Complex research with basis |
### Python SDK (OpenAI-compatible)
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PARALLEL_API_KEY"],  # read the key from the environment
    base_url="https://api.parallel.ai",
)
response = client.chat.completions.create(
model="speed",
messages=[{"role": "user", "content": "What is Parallel Web Systems?"}],
)
```
---
## Rate Limits
| API | Default Limit |
|-----|---------------|
| Search | 600 req/min |
| Extract | 600 req/min |
| Chat | 300 req/min |
| Task | Varies by processor |
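To stay under these quotas client-side, a simple sliding-window limiter can gate outgoing calls (a sketch, not part of any SDK):

```python
import time

class MinuteRateLimiter:
    """Block before a call would exceed max_per_minute requests."""
    def __init__(self, max_per_minute: int):
        self.max = max_per_minute
        self.calls: list[float] = []

    def acquire(self) -> None:
        now = time.monotonic()
        # drop timestamps older than the 60-second window
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max:
            time.sleep(60 - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Usage: create one limiter per API (e.g. `MinuteRateLimiter(300)` for Chat) and call `acquire()` before each request.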
---
## Source Policy
Control which sources are used in searches:
```json
{
"source_policy": {
"allow_domains": ["nature.com", "science.org"],
"deny_domains": ["unreliable-source.com"],
"after_date": "2024-01-01"
}
}
```
Works with Search API and can be used to focus results on specific authoritative domains.
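A helper can attach the policy to any Search API request body (a sketch; field names follow the JSON above, and `restrict_sources` is a hypothetical name):

```python
def restrict_sources(payload: dict, allow=None, deny=None, after_date=None) -> dict:
    """Return a copy of a Search API request body with a source_policy attached."""
    policy = {}
    if allow:
        policy["allow_domains"] = list(allow)
    if deny:
        policy["deny_domains"] = list(deny)
    if after_date:
        policy["after_date"] = after_date
    return {**payload, "source_policy": policy}
```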


@@ -0,0 +1,362 @@
# Deep Research Guide
Comprehensive guide to using Parallel's Task API for deep research, including processor selection, output formats, structured schemas, and advanced patterns.
---
## Overview
Deep Research transforms natural language research queries into comprehensive intelligence reports. Unlike simple search, it performs multi-step web exploration across authoritative sources and synthesizes findings with inline citations and confidence levels.
**Key characteristics:**
- Multi-step, multi-source research
- Automatic citation and source attribution
- Structured or text output formats
- Asynchronous processing (30 seconds to 25+ minutes)
- Research basis with confidence levels per finding
---
## Processor Selection
Choosing the right processor is the most important decision. It determines research depth, speed, and cost.
### Decision Matrix
| Scenario | Recommended Processor | Why |
|----------|----------------------|-----|
| Quick background for a paper section | `pro-fast` | Fast, good depth, low cost |
| Comprehensive market research report | `ultra-fast` | Deep multi-source synthesis |
| Simple fact lookup or metadata | `base-fast` | Fast, low cost |
| Competitive landscape analysis | `pro-fast` | Good balance of depth and speed |
| Background for grant proposal | `pro-fast` | Thorough but timely |
| State-of-the-art review for a topic | `ultra-fast` | Maximum source coverage |
| Quick question during writing | `core-fast` | Sub-2-minute response |
| Breaking news or very recent events | `pro` (standard) | Freshest data prioritized |
| Large-scale data enrichment | `base-fast` | Cost-effective at scale |
### Processor Tiers Explained
**`pro-fast`** (default, recommended for most tasks):
- Latency: 30 seconds to 5 minutes
- Depth: Explores 10-20+ web sources
- Best for: Section-level research, background gathering, comparative analysis
- Cost: $0.10 per query
**`ultra-fast`** (for comprehensive research):
- Latency: 1 to 10 minutes
- Depth: Explores 20-50+ web sources, multiple reasoning steps
- Best for: Full reports, market analysis, complex multi-faceted questions
- Cost: $0.30 per query
**`core-fast`** (quick cross-referenced answers):
- Latency: 15 seconds to 100 seconds
- Depth: Cross-references 5-10 sources
- Best for: Moderate complexity questions, verification tasks
- Cost: $0.025 per query
**`base-fast`** (simple enrichment):
- Latency: 15 to 50 seconds
- Depth: Standard web lookup, 3-5 sources
- Best for: Simple factual queries, metadata enrichment
- Cost: $0.01 per query
### Standard vs Fast
- **Fast processors** (`-fast`): 2-5x faster, very fresh data, ideal for interactive use
- **Standard processors** (no suffix): Highest data freshness, better for background jobs
**Rule of thumb:** Always use `-fast` variants unless you specifically need the freshest possible data (breaking news, live financial data, real-time events).
---
## Output Formats
### Text Mode (Markdown Reports)
Returns a comprehensive markdown report with inline citations. Best for human consumption and document integration.
```python
researcher = ParallelDeepResearch()
result = researcher.research(
query="Comprehensive analysis of mRNA vaccine technology platforms and their applications beyond COVID-19",
processor="pro-fast",
description="Focus on clinical trials, approved applications, pipeline developments, and key companies. Include market size data."
)
# result["output"] contains a full markdown report
# result["citations"] contains source URLs with excerpts
```
**When to use text mode:**
- Writing scientific documents (papers, reviews, reports)
- Background research for a topic
- Creating summaries for human readers
- When you need flowing prose, not structured data
**Guiding text output with `description`:**
The `description` parameter steers the report content:
```python
# Focus on specific aspects
result = researcher.research(
query="Electric vehicle battery technology landscape",
description="Focus on: (1) solid-state battery progress, (2) charging speed improvements, (3) cost per kWh trends, (4) key patents and IP. Format as a structured report with clear sections."
)
# Control length and depth
result = researcher.research(
query="AI in drug discovery",
description="Provide a concise 500-word executive summary covering key applications, notable successes, leading companies, and market projections."
)
```
### Auto-Schema Mode (Structured JSON)
Lets the processor determine the best output structure automatically. Returns structured JSON with per-field citations.
```python
result = researcher.research_structured(
query="Top 5 cloud computing companies: revenue, market share, key products, and recent developments",
processor="pro-fast",
)
# result["content"] contains structured data (dict)
# result["basis"] contains per-field citations with confidence
```
**When to use auto-schema:**
- Data extraction and enrichment
- Comparative analysis with specific fields
- When you need programmatic access to individual data points
- Integration with databases or spreadsheets
### Custom JSON Schema
Define exactly what fields you want returned:
```python
schema = {
"type": "object",
"properties": {
"market_size_2024": {
"type": "string",
"description": "Global market size in USD billions for 2024. Include source."
},
"growth_rate": {
"type": "string",
"description": "CAGR percentage for 2024-2030 forecast period."
},
"top_companies": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Company name"},
"market_share": {"type": "string", "description": "Approximate market share percentage"},
"revenue": {"type": "string", "description": "Most recent annual revenue"}
},
"required": ["name", "market_share", "revenue"]
},
"description": "Top 5 companies by market share"
},
"key_trends": {
"type": "array",
"items": {"type": "string"},
"description": "Top 3-5 industry trends driving growth"
}
},
"required": ["market_size_2024", "growth_rate", "top_companies", "key_trends"],
"additionalProperties": False
}
result = researcher.research_structured(
query="Global cybersecurity market analysis",
output_schema=schema,
)
```
---
## Writing Effective Research Queries
### Query Construction Framework
Structure your query as: **[Topic] + [Specific Aspect] + [Scope/Time] + [Output Expectations]**
**Good queries:**
```
"Comprehensive analysis of the global lithium-ion battery recycling market,
including market size, key players, regulatory drivers, and technology
approaches. Focus on 2023-2025 developments."
"Compare the efficacy, safety profiles, and cost-effectiveness of GLP-1
receptor agonists (semaglutide, tirzepatide, liraglutide) for type 2
diabetes management based on recent clinical trial data."
"Survey of federated learning approaches for healthcare AI, covering
privacy-preserving techniques, real-world deployments, regulatory
compliance, and performance benchmarks from 2023-2025 publications."
```
**Poor queries:**
```
"Tell me about batteries" # Too vague
"AI" # No specific aspect
"What's new?" # No topic at all
"Everything about quantum computing from all time" # Too broad
```
### Tips for Better Results
1. **Be specific about what you need**: "market size" vs "tell me about the market"
2. **Include time bounds**: "2024-2025" narrows to relevant data
3. **Name entities**: "semaglutide vs tirzepatide" vs "diabetes drugs"
4. **Specify output expectations**: "Include statistics, key players, and growth projections"
5. **Keep under 15,000 characters**: Concise queries work better than massive prompts
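The framework and tips above can be mechanized (a sketch; `build_query` is a hypothetical helper):

```python
def build_query(topic: str, aspects: list[str], scope: str = "",
                expectations: str = "") -> str:
    """Assemble a query as [Topic] + [Specific Aspects] + [Scope/Time]
    + [Output Expectations], keeping it under the 15,000-char limit."""
    parts = [f"{topic}, covering {', '.join(aspects)}."]
    if scope:
        parts.append(f"Focus on {scope}.")
    if expectations:
        parts.append(expectations)
    query = " ".join(parts)
    if len(query) >= 15000:
        raise ValueError("query exceeds the 15,000-character limit")
    return query
```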
---
## Working with Research Basis
Every deep research result includes a **basis**: citations, reasoning, and confidence levels for each finding.
### Text Mode Basis
```python
result = researcher.research(query="...", processor="pro-fast")
# Citations are deduplicated and include URLs + excerpts
for citation in result["citations"]:
print(f"Source: {citation['title']}")
print(f"URL: {citation['url']}")
if citation.get("excerpts"):
print(f"Excerpt: {citation['excerpts'][0][:200]}")
```
### Structured Mode Basis
```python
result = researcher.research_structured(query="...", processor="pro-fast")
for basis_entry in result["basis"]:
print(f"Field: {basis_entry['field']}")
print(f"Confidence: {basis_entry['confidence']}")
print(f"Reasoning: {basis_entry['reasoning']}")
for cit in basis_entry["citations"]:
print(f" Source: {cit['url']}")
```
### Confidence Levels
| Level | Meaning | Action |
|-------|---------|--------|
| `high` | Multiple authoritative sources agree | Use directly |
| `medium` | Some supporting evidence, minor uncertainty | Use with caveat |
| `low` | Limited evidence, significant uncertainty | Verify independently |
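When citing structured results, it is easy to gate on confidence first (a sketch using the basis shape shown above):

```python
def usable_findings(basis: list[dict], accept=("high", "medium")) -> list[dict]:
    """Keep basis entries whose confidence is acceptable; low-confidence
    fields should be verified independently before use."""
    return [entry for entry in basis if entry.get("confidence") in accept]
```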
---
## Advanced Patterns
### Multi-Stage Research
Use different processors in sequence for progressively deeper research:
```python
# Stage 1: Quick overview with base-fast
overview = researcher.research(
query="What are the main approaches to quantum error correction?",
processor="base-fast",
)
# Stage 2: Deep dive on the most promising approach
deep_dive = researcher.research(
query=f"Detailed analysis of surface code quantum error correction: "
f"recent breakthroughs, implementation challenges, and leading research groups. "
f"Context: {overview['output'][:500]}",
processor="pro-fast",
)
```
### Comparative Research
```python
result = researcher.research(
query="Compare and contrast three leading large language model architectures: "
"GPT-4, Claude, and Gemini. Cover architecture differences, benchmark performance, "
"pricing, context window, and unique capabilities. Include specific benchmark scores.",
processor="pro-fast",
description="Create a structured comparison with a summary table. Include specific numbers and benchmarks."
)
```
### Research with Follow-Up Extraction
```python
# Step 1: Research to find relevant sources
research_result = researcher.research(
query="Most influential papers on attention mechanisms in 2024",
processor="pro-fast",
)
# Step 2: Extract full content from the most relevant sources
from parallel_web import ParallelExtract
extractor = ParallelExtract()
key_urls = [c["url"] for c in research_result["citations"][:5]]
for url in key_urls:
extracted = extractor.extract(
urls=[url],
objective="Key methodology, results, and conclusions",
)
```
---
## Performance Optimization
### Reducing Latency
1. **Use `-fast` processors**: 2-5x faster than standard
2. **Use `core-fast` for moderate queries**: Sub-2-minute for most questions
3. **Be specific in queries**: Vague queries require more exploration
4. **Set appropriate timeouts**: Don't over-wait
### Reducing Cost
1. **Start with `base-fast`**: Upgrade only if depth is insufficient
2. **Use `core-fast` for moderate complexity**: $0.025 vs $0.10 for pro
3. **Batch related queries**: One well-crafted query > multiple simple ones
4. **Cache results**: Store research output for reuse across sections
### Maximizing Quality
1. **Use `pro-fast` or `ultra-fast`**: More sources = better synthesis
2. **Provide context**: "I'm writing a paper for Nature Medicine about..."
3. **Use `description` parameter**: Guide the output structure and focus
4. **Verify critical findings**: Cross-check with Search API or Extract
---
## Common Mistakes
| Mistake | Impact | Fix |
|---------|--------|-----|
| Query too vague | Scattered, unfocused results | Add specific aspects and time bounds |
| Query too long (>15K chars) | API rejection or degraded results | Summarize context, focus on key question |
| Wrong processor | Too slow or too shallow | Use decision matrix above |
| Not using `description` | Report structure not aligned with needs | Add description to guide output |
| Ignoring confidence levels | Using low-confidence data as fact | Check basis confidence before citing |
| Not verifying citations | Risk of outdated or misattributed data | Cross-check key citations with Extract |
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For quick web searches
- [Extraction Patterns](extraction_patterns.md) - For reading specific URLs
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns


@@ -0,0 +1,338 @@
# Extraction Patterns
Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.
---
## Overview
The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.
**Key capabilities:**
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing
---
## When to Use Extract vs Search
| Scenario | Use Extract | Use Search |
|----------|-------------|------------|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |
**Rule of thumb:** If you have a URL, use Extract. If you need to find URLs, use Search.
---
## Excerpt Mode vs Full Content Mode
### Excerpt Mode (Default)
Returns focused content aligned to your objective. Smaller token footprint, higher relevance.
```python
extractor = ParallelExtract()
result = extractor.extract(
urls=["https://arxiv.org/abs/2301.12345"],
objective="Key methodology and experimental results",
excerpts=True, # Default
full_content=False # Default
)
```
**Best for:**
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points
### Full Content Mode
Returns the complete page content as clean markdown.
```python
result = extractor.extract(
urls=["https://docs.example.com/api-reference"],
objective="Complete API documentation",
excerpts=False,
full_content=True,
)
```
**Best for:**
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content
### Both Modes
You can request both excerpts and full content:
```python
result = extractor.extract(
urls=["https://example.com/report"],
objective="Executive summary and key recommendations",
excerpts=True,
full_content=True,
)
# Use excerpts for focused analysis
# Use full_content for complete reference
```
---
## Objective Writing for Extraction
The `objective` parameter focuses extraction on relevant content. It dramatically improves excerpt quality.
### Good Objectives
```python
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"
# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"
# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
```
### Poor Objectives
```python
# Too vague
objective="Tell me about this page"
# No objective at all (still works but excerpts are less focused)
extractor.extract(urls=["https://..."])
```
### Objective Templates by Use Case
**Academic Paper:**
```python
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
```
**Product/Company Page:**
```python
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
```
**Technical Documentation:**
```python
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
```
**News Article:**
```python
objective="Main story, key quotes, data points, timeline of events, and named sources"
```
**Government/Policy Document:**
```python
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
```
---
## Batch Extraction
Extract from multiple URLs in a single call:
```python
result = extractor.extract(
urls=[
"https://nature.com/articles/s12345",
"https://science.org/doi/full/10.1234/science.xyz",
"https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext"
],
objective="Key findings, sample sizes, and statistical results from each study",
)
# Results are returned in the same order as input URLs
for r in result["results"]:
print(f"=== {r['title']} ===")
print(f"URL: {r['url']}")
for excerpt in r["excerpts"]:
print(excerpt[:500])
```
**Batch limits:**
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the `errors` field without blocking successful ones
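Since large batches take longer and failures are reported per URL, one option is to split very long URL lists into smaller batches client-side. A minimal chunking sketch (the batch size of 10 is an arbitrary choice, not an API limit):
```python
def chunk_urls(urls, batch_size=10):
    """Split a URL list into batches of at most batch_size items, preserving order."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# Each chunk can then be passed to extractor.extract() as its own call
```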
---
## Handling Different Content Types
### Web Pages (HTML)
Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.
```python
# Standard web page
result = extractor.extract(
urls=["https://example.com/article"],
objective="Main article content",
)
```
### PDFs
PDFs are automatically detected and converted to text.
```python
# PDF extraction
result = extractor.extract(
urls=["https://example.com/whitepaper.pdf"],
objective="Executive summary and key recommendations",
)
```
### Documentation Sites
Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.
```python
result = extractor.extract(
urls=["https://docs.example.com/getting-started"],
objective="Installation instructions and quickstart guide",
full_content=True,
)
```
---
## Common Extraction Patterns
### Pattern 1: Search Then Extract
Find relevant pages with Search, then extract full content from the best results.
```python
from parallel_web import ParallelSearch, ParallelExtract
searcher = ParallelSearch()
extractor = ParallelExtract()
# Step 1: Find relevant pages
search_result = searcher.search(
objective="Find the original transformer paper and its key follow-up papers",
search_queries=["attention is all you need paper", "transformer architecture paper"],
)
# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
urls=top_urls,
objective="Abstract, architecture description, key results, and ablation studies",
)
```
### Pattern 2: DOI Resolution and Paper Reading
```python
# Extract content from a DOI URL
result = extractor.extract(
urls=["https://doi.org/10.1038/s41586-024-07487-w"],
objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
```
### Pattern 3: Competitive Intelligence from Company Pages
```python
companies = [
"https://openai.com/about",
"https://anthropic.com/company",
"https://deepmind.google/about/",
]
result = extractor.extract(
urls=companies,
objective="Company mission, team size, key products, recent announcements, and funding information",
)
```
### Pattern 4: Documentation Extraction for Reference
```python
result = extractor.extract(
urls=["https://docs.parallel.ai/search/search-quickstart"],
objective="Complete API usage guide including request format, response format, and code examples",
full_content=True,
)
```
### Pattern 5: Metadata Verification
```python
# Verify citation metadata for a specific paper
result = extractor.extract(
urls=["https://doi.org/10.1234/example-doi"],
objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
```
---
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| URL not accessible | Page requires authentication, is behind paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
### Checking for Errors
```python
result = extractor.extract(urls=["https://example.com/page"])
if not result["success"]:
print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
print(f"Some URLs failed: {result['errors']}")
else:
print(f"Successfully extracted {len(result['results'])} pages")
```
---
## Tips and Best Practices
1. **Always provide an objective**: Even a general one improves excerpt quality significantly
2. **Use excerpts by default**: Full content is only needed when you truly need everything
3. **Batch related URLs**: One call with 5 URLs is better than 5 separate calls
4. **Check for errors**: Not all URLs are extractable (paywalls, auth, etc.)
5. **Combine with Search**: Search finds URLs, Extract reads them in detail
6. **Use for DOI resolution**: Extract handles DOI redirects automatically
7. **Prefer Extract over manual fetching**: Handles JavaScript, PDFs, and complex layouts
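For tip 4, a retry wrapper with exponential backoff is one way to handle transient failures such as rate limits. A sketch (it retries on any exception, which is a simplification; the SDK's actual exception types may allow finer filtering):
```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage: result = with_retries(lambda: extractor.extract(urls=[...]))
```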
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For finding URLs to extract
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns


# Search API Best Practices
Comprehensive guide to getting the best results from Parallel's Search API.
---
## Core Concepts
The Search API returns ranked, LLM-optimized excerpts from web sources based on natural language objectives. Results are designed to serve directly as model input, enabling faster reasoning and higher-quality completions.
### Key Advantages Over Traditional Search
- **Context engineering for token efficiency**: Results are ranked by reasoning utility, not engagement
- **Single-hop resolution**: Complex multi-topic queries resolved in one request
- **Multi-hop efficiency**: Deep research workflows complete in fewer tool calls
---
## Crafting Effective Search Queries
### Provide Both `objective` AND `search_queries`
The `objective` describes your broader goal; `search_queries` ensures specific keywords are prioritized. Using both together gives significantly better results.
**Good:**
```python
searcher.search(
objective="I'm writing a literature review on Alzheimer's treatments. Find peer-reviewed research papers and clinical trial results from the past 2 years on amyloid-beta targeted therapies.",
search_queries=[
"amyloid beta clinical trials 2024-2025",
"Alzheimer's monoclonal antibody treatment results",
"lecanemab donanemab trial outcomes"
],
)
```
**Poor:**
```python
# Too vague - no context about intent
searcher.search(objective="Alzheimer's treatment")
# Missing objective - no context for ranking
searcher.search(search_queries=["Alzheimer's drugs"])
```
### Objective Writing Tips
1. **State your broader task**: "I'm writing a research paper on...", "I'm analyzing the market for...", "I'm preparing a presentation about..."
2. **Be specific about source preferences**: "Prefer official government websites", "Focus on peer-reviewed journals", "From major news outlets"
3. **Include freshness requirements**: "From the past 6 months", "Published in 2024-2025", "Most recent data available"
4. **Specify content type**: "Technical documentation", "Clinical trial results", "Market analysis reports", "Product announcements"
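The four tips can be combined mechanically; a small builder sketch (a convenience for composing objective strings, not part of the API):
```python
def build_objective(task, content_type=None, source_pref=None, freshness=None):
    """Compose an objective string from the four elements above."""
    parts = [task]
    if content_type:
        parts.append(f"Focus on {content_type}.")
    if source_pref:
        parts.append(f"Prefer {source_pref}.")
    if freshness:
        parts.append(f"{freshness}.")
    return " ".join(parts)

# build_objective(
#     "I'm writing a literature review on amyloid-beta therapies.",
#     content_type="clinical trial results",
#     source_pref="peer-reviewed journals",
#     freshness="Published in 2024-2025",
# )
```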
### Example Objectives by Use Case
**Academic Research:**
```
"I'm writing a literature review on CRISPR gene editing applications in cancer therapy.
Find peer-reviewed papers from Nature, Science, Cell, and other high-impact journals
published in 2023-2025. Prefer clinical trial results and systematic reviews."
```
**Market Intelligence:**
```
"I'm preparing Q1 2025 investor materials for a fintech startup.
Find recent announcements from the Federal Reserve and SEC about digital asset
regulations and banking partnerships with crypto firms. Past 3 months only."
```
**Technical Documentation:**
```
"I'm designing a machine learning course. Find technical documentation and API guides
that explain how transformer attention mechanisms work, preferably from official
framework documentation like PyTorch or Hugging Face."
```
**Current Events:**
```
"I'm tracking AI regulation developments. Find official policy announcements,
legislative actions, and regulatory guidance from the EU, US, and UK governments
from the past month."
```
---
## Search Modes
Use the `mode` parameter to optimize for your workflow:
| Mode | Best For | Excerpt Style | Latency |
|------|----------|---------------|---------|
| `one-shot` (default) | Direct queries, single-request workflows | Comprehensive, longer | Lower |
| `agentic` | Multi-step reasoning loops, agent workflows | Concise, token-efficient | Slightly higher |
| `fast` | Real-time applications, UI auto-complete | Minimal, speed-optimized | ~1 second |
### When to Use Each Mode
**`one-shot`** (default):
- Single research question that needs comprehensive answer
- Writing a section of a paper and need full context
- Background research before starting a document
- Any case where you'll make only one search call
**`agentic`**:
- Multi-step research workflows (search → analyze → search again)
- Agent loops where token efficiency matters
- Iterative refinement of research queries
- When integrating with other tools (search → extract → synthesize)
**`fast`**:
- Live autocomplete or suggestion systems
- Quick fact-checking during writing
- Real-time metadata lookups
- Any latency-sensitive application
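The selection logic above can be condensed into a helper (a sketch of the decision only, with mode names taken from the table; the trade-offs are the ones described above):
```python
def pick_mode(latency_sensitive=False, multi_step=False):
    """Choose a search mode following the guidance above."""
    if latency_sensitive:
        return "fast"      # real-time, minimal excerpts
    if multi_step:
        return "agentic"   # token-efficient for agent loops
    return "one-shot"      # comprehensive single-call default
```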
---
## Source Policy
Control which domains are included or excluded from results:
```python
searcher.search(
objective="Find clinical trial results for new cancer immunotherapy drugs",
search_queries=["checkpoint inhibitor clinical trials 2025"],
source_policy={
"allow_domains": ["clinicaltrials.gov", "nejm.org", "thelancet.com", "nature.com"],
"deny_domains": ["reddit.com", "quora.com"],
"after_date": "2024-01-01"
},
)
```
### Source Policy Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `allow_domains` | list[str] | Only include results from these domains |
| `deny_domains` | list[str] | Exclude results from these domains |
| `after_date` | str (YYYY-MM-DD) | Only include content published after this date |
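A lightweight client-side check can catch malformed policies before sending a request. A sketch (the API performs its own validation; this only mirrors the parameter table above):
```python
from datetime import datetime

def check_source_policy(policy):
    """Raise ValueError if a source_policy dict doesn't match the table above."""
    allowed = {"allow_domains", "deny_domains", "after_date"}
    unknown = set(policy) - allowed
    if unknown:
        raise ValueError(f"Unknown source_policy keys: {sorted(unknown)}")
    for key in ("allow_domains", "deny_domains"):
        if key in policy and not isinstance(policy[key], list):
            raise ValueError(f"{key} must be a list of domain strings")
    if "after_date" in policy:
        datetime.strptime(policy["after_date"], "%Y-%m-%d")  # raises on bad format
    return policy
```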
### Domain Lists by Use Case
**Academic Research:**
```python
allow_domains = [
"nature.com", "science.org", "cell.com", "thelancet.com",
"nejm.org", "bmj.com", "pnas.org", "arxiv.org",
"pubmed.ncbi.nlm.nih.gov", "scholar.google.com"
]
```
**Technology/AI:**
```python
allow_domains = [
"arxiv.org", "openai.com", "anthropic.com", "deepmind.google",
"huggingface.co", "pytorch.org", "tensorflow.org",
"proceedings.neurips.cc", "proceedings.mlr.press"
]
```
**Market Intelligence:**
```python
deny_domains = [
"reddit.com", "quora.com", "medium.com",
"wikipedia.org" # Good for facts, not for market data
]
```
**Government/Policy:**
```python
allow_domains = [
"gov", "europa.eu", "who.int", "worldbank.org",
"imf.org", "oecd.org", "un.org"
]
```
---
## Controlling Result Volume
### `max_results` Parameter
- Range: 1-20 (default: 10)
- More results = broader coverage but more tokens to process
- Fewer results = more focused but may miss relevant sources
**Recommendations:**
- Quick fact check: `max_results=3`
- Standard research: `max_results=10` (default)
- Comprehensive survey: `max_results=20`
### Excerpt Length Control
```python
searcher.search(
objective="...",
max_chars_per_result=10000, # Default: 10000
)
```
- **Short excerpts (1000-3000)**: Quick summaries, metadata extraction
- **Medium excerpts (5000-10000)**: Standard research, balanced depth
- **Long excerpts (10000-50000)**: Full article content, deep analysis
---
## Common Patterns
### Pattern 1: Research Before Writing
```python
# Before writing each section, search for relevant information
result = searcher.search(
objective="Find recent advances in transformer attention mechanisms for a NeurIPS paper introduction",
search_queries=["attention mechanism innovations 2024", "efficient transformers"],
max_results=10,
)
# Extract key findings for the section
for r in result["results"]:
print(f"Source: {r['title']} ({r['url']})")
# Use excerpts to inform writing
```
### Pattern 2: Fact Verification
```python
# Quick verification of a specific claim
result = searcher.search(
objective="Verify: Did GPT-4 achieve 86.4% on MMLU benchmark?",
search_queries=["GPT-4 MMLU benchmark score"],
max_results=5,
)
```
### Pattern 3: Competitive Intelligence
```python
result = searcher.search(
objective="Find recent product launches and funding announcements for AI coding assistants in 2025",
search_queries=[
"AI coding assistant funding 2025",
"code generation tool launch",
"AI developer tools new product"
],
source_policy={"after_date": "2025-01-01"},
max_results=15,
)
```
### Pattern 4: Multi-Language Research
```python
# Search includes multilingual results automatically
result = searcher.search(
objective="Find global perspectives on AI regulation, including EU, China, and US approaches",
search_queries=[
"EU AI Act implementation 2025",
"China AI regulation policy",
"US AI executive order updates"
],
)
```
---
## Troubleshooting
### Few or No Results
- **Broaden your objective**: Remove overly specific constraints
- **Add more search queries**: Different phrasings of the same concept
- **Remove source policy**: Domain restrictions may be too narrow
- **Check date filters**: `after_date` may be too recent
### Irrelevant Results
- **Make objective more specific**: Add context about your task
- **Use source policy**: Allow only authoritative domains
- **Add negative context**: "Not about [unrelated topic]"
- **Refine search queries**: Use more precise keywords
### Too Many Tokens in Results
- **Reduce `max_results`**: From 10 to 5 or 3
- **Reduce excerpt length**: Lower `max_chars_per_result`
- **Use `agentic` mode**: More concise excerpts
- **Use `fast` mode**: Minimal excerpts
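The "few or no results" advice can be automated as a relaxation ladder: retry the same search with progressively fewer constraints. A sketch over plain kwargs dicts (the retry loop that actually calls `searcher.search` is left to the caller):
```python
def relaxation_ladder(params):
    """Yield progressively less restrictive copies of search kwargs."""
    step = dict(params)
    yield dict(step)
    policy = dict(step.get("source_policy") or {})
    if policy.pop("after_date", None) is not None:
        step["source_policy"] = policy      # 1. drop the date filter
        yield dict(step)
    if step.get("source_policy"):
        step = {k: v for k, v in step.items() if k != "source_policy"}
        yield dict(step)                    # 2. drop domain restrictions entirely
```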
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Extraction Patterns](extraction_patterns.md) - For reading specific URLs
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns

# Workflow Recipes
Common multi-step patterns combining Parallel's Search, Extract, and Deep Research APIs for scientific writing tasks.
---
## Recipe Index
| Recipe | APIs Used | Time | Use Case |
|--------|-----------|------|----------|
| [Section Research Pipeline](#recipe-1-section-research-pipeline) | Research + Search | 2-5 min | Writing a paper section |
| [Citation Verification](#recipe-2-citation-verification) | Search + Extract | 1-2 min | Verifying paper metadata |
| [Literature Survey](#recipe-3-literature-survey) | Research + Search + Extract | 5-15 min | Comprehensive lit review |
| [Market Intelligence Report](#recipe-4-market-intelligence-report) | Research (multi-stage) | 10-30 min | Market/industry analysis |
| [Competitive Analysis](#recipe-5-competitive-analysis) | Search + Extract + Research | 5-10 min | Comparing companies/products |
| [Fact-Check Pipeline](#recipe-6-fact-check-pipeline) | Search + Extract | 1-3 min | Verifying claims |
| [Current Events Briefing](#recipe-7-current-events-briefing) | Search + Research | 3-5 min | News synthesis |
| [Technical Documentation Gathering](#recipe-8-technical-documentation-gathering) | Search + Extract | 2-5 min | API/framework docs |
| [Grant Background Research](#recipe-9-grant-background-research) | Research + Search | 5-10 min | Grant proposal background |
---
## Recipe 1: Section Research Pipeline
**Goal:** Gather research and citations for writing a single section of a scientific paper.
**APIs:** Deep Research (pro-fast) + Search
```bash
# Step 1: Deep research for comprehensive background
python scripts/parallel_web.py research \
"Recent advances in federated learning for healthcare AI, focusing on privacy-preserving training methods, real-world deployments, and regulatory considerations (2023-2025)" \
--processor pro-fast -o sources/section_background.md
# Step 2: Targeted search for specific citations
python scripts/parallel_web.py search \
"Find peer-reviewed papers on federated learning in hospitals" \
--queries "federated learning clinical deployment" "privacy preserving ML healthcare" \
--max-results 10 -o sources/section_citations.txt
```
**Python version:**
```python
from parallel_web import ParallelDeepResearch, ParallelSearch
researcher = ParallelDeepResearch()
searcher = ParallelSearch()
# Step 1: Deep background research
background = researcher.research(
query="Recent advances in federated learning for healthcare AI (2023-2025): "
"privacy-preserving methods, real-world deployments, regulatory landscape",
processor="pro-fast",
description="Structure as: (1) Key approaches, (2) Clinical deployments, "
"(3) Regulatory considerations, (4) Open challenges. Include statistics."
)
# Step 2: Find specific papers to cite
papers = searcher.search(
objective="Find recent peer-reviewed papers on federated learning deployed in hospital settings",
search_queries=[
"federated learning hospital clinical study 2024",
"privacy preserving machine learning healthcare deployment"
],
source_policy={"allow_domains": ["nature.com", "thelancet.com", "arxiv.org", "pubmed.ncbi.nlm.nih.gov"]},
)
# Combine: use background for writing, papers for citations
```
**When to use:** Before writing each major section of a research paper, literature review, or grant proposal.
---
## Recipe 2: Citation Verification
**Goal:** Verify that a citation is real and get complete metadata (DOI, volume, pages, year).
**APIs:** Search + Extract
```bash
# Option A: Search for the paper
python scripts/parallel_web.py search \
"Vaswani et al 2017 Attention is All You Need paper NeurIPS" \
--queries "Attention is All You Need DOI" --max-results 5
# Option B: Extract metadata from a DOI
python scripts/parallel_web.py extract \
"https://doi.org/10.48550/arXiv.1706.03762" \
--objective "Complete citation: authors, title, venue, year, pages, DOI"
```
**Python version:**
```python
from parallel_web import ParallelSearch, ParallelExtract
searcher = ParallelSearch()
extractor = ParallelExtract()
# Step 1: Find the paper
result = searcher.search(
objective="Find the exact citation details for the Attention Is All You Need paper by Vaswani et al.",
search_queries=["Attention is All You Need Vaswani 2017 NeurIPS DOI"],
max_results=5,
)
# Step 2: Extract full metadata from the paper's page
paper_url = result["results"][0]["url"]
metadata = extractor.extract(
urls=[paper_url],
objective="Complete BibTeX citation: all authors, title, conference/journal, year, pages, DOI, volume",
)
```
**When to use:** After writing a section, verify every citation in references.bib has correct and complete metadata.
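Once metadata has been extracted, it can be shaped into a BibTeX entry for references.bib. A minimal formatting sketch (the `entry` dict's field names are illustrative; the Extract API returns excerpts, not structured fields, so parsing into this shape is up to you):
```python
def to_bibtex(key, entry):
    """Format a metadata dict as a BibTeX @article entry."""
    fields = {
        "author": " and ".join(entry["authors"]),
        "title": entry["title"],
        "journal": entry.get("venue", ""),
        "year": str(entry["year"]),
        "doi": entry.get("doi", ""),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@article{{{key},\n{body}\n}}"
```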
---
## Recipe 3: Literature Survey
**Goal:** Comprehensive survey of a research field, identifying key papers, themes, and gaps.
**APIs:** Deep Research + Search + Extract
```python
from parallel_web import ParallelDeepResearch, ParallelSearch, ParallelExtract
researcher = ParallelDeepResearch()
searcher = ParallelSearch()
extractor = ParallelExtract()
topic = "CRISPR-based diagnostics for infectious diseases"
# Stage 1: Broad research overview
overview = researcher.research(
query=f"Comprehensive review of {topic}: key developments, clinical applications, "
f"regulatory status, commercial products, and future directions (2020-2025)",
processor="ultra-fast",
description="Structure as a literature review: (1) Historical development, "
"(2) Current technologies, (3) Clinical applications, "
"(4) Regulatory landscape, (5) Commercial products, "
"(6) Limitations and future directions. Include key statistics and milestones."
)
# Stage 2: Find specific landmark papers
key_papers = searcher.search(
objective=f"Find the most cited and influential papers on {topic} from Nature, Science, Cell, NEJM",
search_queries=[
"CRISPR diagnostics SHERLOCK DETECTR Nature",
"CRISPR point-of-care testing clinical study",
"nucleic acid detection CRISPR review"
],
source_policy={
"allow_domains": ["nature.com", "science.org", "cell.com", "nejm.org", "thelancet.com"],
},
max_results=15,
)
# Stage 3: Extract detailed content from top 5 papers
top_urls = [r["url"] for r in key_papers["results"][:5]]
detailed = extractor.extract(
urls=top_urls,
objective="Study design, key results, sensitivity/specificity data, and clinical implications",
)
```
**When to use:** Starting a literature review, systematic review, or comprehensive background section.
---
## Recipe 4: Market Intelligence Report
**Goal:** Generate a comprehensive market research report on an industry or product category.
**APIs:** Deep Research (multi-stage)
```python
researcher = ParallelDeepResearch()
industry = "AI-powered drug discovery"
# Stage 1: Market overview (ultra-fast for maximum depth)
market_overview = researcher.research(
query=f"Comprehensive market analysis of {industry}: market size, growth rate, "
f"key segments, geographic distribution, and forecast through 2030",
processor="ultra-fast",
description="Include specific dollar figures, CAGR percentages, and data sources. "
"Break down by segment and geography."
)
# Stage 2: Competitive landscape
competitors = researcher.research_structured(
query=f"Top 10 companies in {industry}: revenue, funding, key products, partnerships, and market position",
processor="pro-fast",
)
# Stage 3: Technology and innovation trends
tech_trends = researcher.research(
query=f"Technology trends and innovation landscape in {industry}: "
f"emerging approaches, breakthrough technologies, patent landscape, and R&D investment",
processor="pro-fast",
description="Focus on specific technologies, quantify R&D spending, and identify emerging leaders."
)
# Stage 4: Regulatory and risk analysis
regulatory = researcher.research(
query=f"Regulatory landscape and risk factors for {industry}: "
f"FDA guidance, EMA requirements, compliance challenges, and market risks",
processor="pro-fast",
)
```
**When to use:** Creating market research reports, investor presentations, or strategic analysis documents.
---
## Recipe 5: Competitive Analysis
**Goal:** Compare multiple companies, products, or technologies side-by-side.
**APIs:** Search + Extract + Research
```python
searcher = ParallelSearch()
extractor = ParallelExtract()
researcher = ParallelDeepResearch()
companies = ["OpenAI", "Anthropic", "Google DeepMind"]
# Step 1: Search for recent data on each company, keeping results per company
company_news = {}
for company in companies:
    company_news[company] = searcher.search(
        objective=f"Latest product launches, funding, team size, and strategy for {company} in 2025",
        search_queries=[f"{company} product launch 2025", f"{company} funding valuation"],
        source_policy={"after_date": "2024-06-01"},
    )
# Step 2: Extract from company pages
company_pages = [
"https://openai.com/about",
"https://anthropic.com/company",
"https://deepmind.google/about/",
]
company_data = extractor.extract(
urls=company_pages,
objective="Mission, key products, team size, founding date, and recent milestones",
)
# Step 3: Deep research for synthesis
comparison = researcher.research(
query=f"Detailed comparison of {', '.join(companies)}: "
f"products, pricing, technology approach, market position, strengths, weaknesses",
processor="pro-fast",
description="Create a structured comparison covering: "
"(1) Product portfolio, (2) Technology approach, (3) Pricing, "
"(4) Market position, (5) Strengths/weaknesses, (6) Future outlook. "
"Include a summary comparison table."
)
```
---
## Recipe 6: Fact-Check Pipeline
**Goal:** Verify specific claims or statistics before including in a document.
**APIs:** Search + Extract
```python
searcher = ParallelSearch()
extractor = ParallelExtract()
claim = "The global AI market is expected to reach $1.8 trillion by 2030"
# Step 1: Search for corroborating sources
result = searcher.search(
objective=f"Verify this claim: '{claim}'. Find authoritative sources that confirm or contradict this figure.",
search_queries=["global AI market size 2030 forecast", "artificial intelligence market projection trillion"],
max_results=8,
)
# Step 2: Extract specific figures from top sources
source_urls = [r["url"] for r in result["results"][:3]]
details = extractor.extract(
urls=source_urls,
objective="Specific market size figures, forecast years, CAGR, and methodology of the projection",
)
# Analyze: Do multiple authoritative sources agree?
```
**When to use:** Before including any specific statistic, market figure, or factual claim in a paper or report.
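The final agreement check can be partly automated by pulling dollar figures out of the extracted excerpts and comparing them across sources. A regex sketch (it only handles "$X billion/trillion" phrasing; real excerpts will need broader patterns):
```python
import re

def extract_dollar_figures(text):
    """Return dollar figures found in text, normalized to plain dollars."""
    scales = {"billion": 1e9, "trillion": 1e12}
    pattern = r"\$(\d+(?:\.\d+)?)\s*(billion|trillion)"
    return [
        float(amount) * scales[unit.lower()]
        for amount, unit in re.findall(pattern, text, flags=re.IGNORECASE)
    ]
```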
---
## Recipe 7: Current Events Briefing
**Goal:** Get up-to-date synthesis of recent developments on a topic.
**APIs:** Search + Research
```python
searcher = ParallelSearch()
researcher = ParallelDeepResearch()
topic = "EU AI Act implementation"
# Step 1: Find the latest news
latest = searcher.search(
objective=f"Latest news and developments on {topic} from the past month",
search_queries=[f"{topic} 2025", f"{topic} latest updates"],
source_policy={"after_date": "2025-01-15"},
max_results=15,
)
# Step 2: Synthesize into a briefing
briefing = researcher.research(
query=f"Summarize the latest developments in {topic} as of February 2025: "
f"key milestones, compliance deadlines, industry reactions, and implications",
processor="pro-fast",
description="Write a concise 500-word executive briefing with timeline of key events."
)
```
---
## Recipe 8: Technical Documentation Gathering
**Goal:** Collect and synthesize technical documentation for a framework or API.
**APIs:** Search + Extract
```python
searcher = ParallelSearch()
extractor = ParallelExtract()
# Step 1: Find documentation pages
docs = searcher.search(
objective="Find official PyTorch documentation for implementing custom attention mechanisms",
search_queries=["PyTorch attention mechanism tutorial", "PyTorch MultiheadAttention documentation"],
source_policy={"allow_domains": ["pytorch.org", "github.com/pytorch"]},
)
# Step 2: Extract full content from documentation pages
doc_urls = [r["url"] for r in docs["results"][:3]]
full_docs = extractor.extract(
urls=doc_urls,
objective="Complete API reference, parameters, usage examples, and code snippets",
full_content=True,
)
```
---
## Recipe 9: Grant Background Research
**Goal:** Build a comprehensive background section for a grant proposal with verified statistics.
**APIs:** Deep Research + Search
```python
researcher = ParallelDeepResearch()
searcher = ParallelSearch()
research_area = "AI-guided antibiotic discovery to combat antimicrobial resistance"
# Step 1: Significance and burden of disease
significance = researcher.research(
query=f"Burden of antimicrobial resistance: mortality statistics, economic impact, "
f"WHO priority pathogens, and projections. Include specific numbers.",
processor="pro-fast",
description="Focus on statistics suitable for NIH Significance section: "
"deaths per year, economic cost, resistance trends, and urgency."
)
# Step 2: Innovation landscape
innovation = researcher.research(
query=f"Current approaches to {research_area}: successes (halicin, etc.), "
f"limitations of current methods, and what makes our approach novel",
processor="pro-fast",
description="Focus on Innovation section: what has been tried, what gaps remain, "
"and what new approaches are emerging."
)
# Step 3: Find specific papers for preliminary data context
papers = searcher.search(
objective="Find landmark papers on AI-discovered antibiotics and ML approaches to drug discovery",
search_queries=[
"halicin AI antibiotic discovery Nature",
"machine learning antibiotic resistance prediction",
"deep learning drug discovery antibiotics"
],
source_policy={"allow_domains": ["nature.com", "science.org", "cell.com", "pnas.org"]},
)
```
**When to use:** Writing Significance, Innovation, or Background sections for NIH, NSF, or other grant proposals.
---
## Combining with Other Skills
### With `research-lookup` (Academic Papers)
```python
# Use parallel-web for general research
researcher.research("Current state of quantum computing applications")
# Use research-lookup for academic paper search (auto-routes to Perplexity)
# python research_lookup.py "find papers on quantum error correction in Nature and Science"
```
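The routing rule (purely academic queries go to `research-lookup`, everything else to `parallel-web`) can be sketched as a keyword heuristic (the hint list is illustrative; the actual skill routing may be more sophisticated):
```python
ACADEMIC_HINTS = ("paper", "papers", "journal", "doi", "peer-reviewed", "arxiv", "preprint")

def route_query(query):
    """Pick a skill for a query: academic lookups vs. general web research."""
    q = query.lower()
    return "research-lookup" if any(hint in q for hint in ACADEMIC_HINTS) else "parallel-web"
```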
### With `citation-management` (BibTeX)
```python
# Step 1: Find paper with parallel search
result = searcher.search(objective="Vaswani et al Attention Is All You Need paper")
# Step 2: Get DOI from results
doi = "10.48550/arXiv.1706.03762"
# Step 3: Convert to BibTeX with citation-management skill
# python scripts/doi_to_bibtex.py 10.48550/arXiv.1706.03762
```
### With `scientific-schematics` (Diagrams)
```python
# Step 1: Research a process
result = researcher.research("How does the CRISPR-Cas9 gene editing mechanism work step by step")
# Step 2: Use the research to inform a schematic
# python scripts/generate_schematic.py "CRISPR-Cas9 gene editing workflow: guide RNA design -> Cas9 binding -> DNA cleavage -> repair pathway" -o figures/crispr_mechanism.png
```
---
## Performance Cheat Sheet
| Task | Processor | Expected Time | Approximate Cost |
|------|-----------|---------------|------------------|
| Quick fact lookup | `base-fast` | 15-50s | $0.01 |
| Section background | `pro-fast` | 30s-5min | $0.10 |
| Comprehensive report | `ultra-fast` | 1-10min | $0.30 |
| Web search (10 results) | Search API | 1-3s | $0.005 |
| URL extraction (1 URL) | Extract API | 1-20s | $0.001 |
| URL extraction (5 URLs) | Extract API | 5-30s | $0.005 |
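The figures above can be folded into a rough budget estimator for a multi-step workflow. A sketch using the approximate costs from the table (real billing may differ; treat these as order-of-magnitude numbers):
```python
APPROX_COST = {
    "base-fast": 0.01,
    "pro-fast": 0.10,
    "ultra-fast": 0.30,
    "search": 0.005,       # per search call
    "extract_url": 0.001,  # per extracted URL
}

def estimate_cost(research_calls=(), searches=0, extract_urls=0):
    """Rough dollar estimate for a workflow, per the cheat sheet above."""
    total = sum(APPROX_COST[p] for p in research_calls)
    total += searches * APPROX_COST["search"]
    total += extract_urls * APPROX_COST["extract_url"]
    return round(total, 3)
```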
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - Effective search queries
- [Deep Research Guide](deep_research_guide.md) - Processor selection and output formats
- [Extraction Patterns](extraction_patterns.md) - URL content extraction

#!/usr/bin/env python3
"""
Parallel Web Systems API Client

Provides web search, URL content extraction, and deep research capabilities
using the Parallel Web Systems APIs (https://docs.parallel.ai).

Primary interface: Parallel Chat API (OpenAI-compatible) for search and research.
Secondary interface: Extract API for URL verification and special cases.

Main classes:
    - ParallelChat: Core Chat API client (base/core models)
    - ParallelSearch: Web search via Chat API (base model)
    - ParallelDeepResearch: Deep research via Chat API (core model)
    - ParallelExtract: URL content extraction (Extract API, verification only)

Environment variable required:
    PARALLEL_API_KEY - Your Parallel API key from https://platform.parallel.ai
"""

import os
import sys
import json
import argparse
from datetime import datetime
from typing import Any, Dict, List, Optional


def _get_api_key():
    """Validate and return the Parallel API key."""
    api_key = os.getenv("PARALLEL_API_KEY")
    if not api_key:
        raise ValueError(
            "PARALLEL_API_KEY environment variable not set.\n"
            "Get your key at https://platform.parallel.ai and set it:\n"
            "  export PARALLEL_API_KEY='your_key_here'"
        )
    return api_key


def _get_extract_client():
    """Create and return a Parallel SDK client for the Extract API."""
    try:
        from parallel import Parallel
    except ImportError:
        raise ImportError(
            "The 'parallel-web' package is required for extract. Install it with:\n"
            "  pip install parallel-web"
        )
    return Parallel(api_key=_get_api_key())


class ParallelChat:
    """Core client for the Parallel Chat API.

    OpenAI-compatible chat completions endpoint that performs web research
    and returns synthesized responses with citations.

    Models:
        - base : Standard research, factual queries (15-100s latency)
        - core : Complex research, multi-source synthesis (60s-5min latency)
    """

    CHAT_BASE_URL = "https://api.parallel.ai"

    def __init__(self):
        try:
            from openai import OpenAI
        except ImportError:
            raise ImportError(
                "The 'openai' package is required. Install it with:\n"
                "  pip install openai"
            )
        self.client = OpenAI(
            api_key=_get_api_key(),
            base_url=self.CHAT_BASE_URL,
        )

    def query(
        self,
        user_message: str,
        system_message: Optional[str] = None,
        model: str = "base",
    ) -> Dict[str, Any]:
        """Send a query to the Parallel Chat API.

        Args:
            user_message: The research query or question.
            system_message: Optional system prompt to guide response style.
            model: Chat model to use ('base' or 'core').

        Returns:
            Dict with 'content' (response text), 'sources' (citations), and metadata.
        """
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": user_message})
        try:
            print(f"[Parallel Chat] Querying model={model}...", file=sys.stderr)
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                stream=False,
            )
            content = ""
            if response.choices and len(response.choices) > 0:
                content = response.choices[0].message.content or ""
            sources = self._extract_basis(response)
            return {
                "success": True,
                "content": content,
                "sources": sources,
                "citation_count": len(sources),
                "model": model,
                "timestamp": timestamp,
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "model": model,
                "timestamp": timestamp,
            }

    def _extract_basis(self, response) -> List[Dict[str, str]]:
        """Extract citation sources from the Chat API research basis."""
        sources = []
        basis = getattr(response, "basis", None)
        if not basis:
            return sources
        seen_urls = set()
        if isinstance(basis, list):
            for item in basis:
                citations = (
                    item.get("citations", []) if isinstance(item, dict)
                    else getattr(item, "citations", None) or []
                )
                for cit in citations:
                    url = cit.get("url", "") if isinstance(cit, dict) else getattr(cit, "url", "")
                    if url and url not in seen_urls:
                        seen_urls.add(url)
                        title = cit.get("title", "") if isinstance(cit, dict) else getattr(cit, "title", "")
                        excerpts = cit.get("excerpts", []) if isinstance(cit, dict) else getattr(cit, "excerpts", [])
                        sources.append({
                            "type": "source",
                            "url": url,
                            "title": title,
                            "excerpts": excerpts,
                        })
        return sources


class ParallelSearch:
    """Web search using the Parallel Chat API (base model).

    Sends a search query to the Chat API which performs web research and
    returns a synthesized summary with cited sources.
    """

    SYSTEM_PROMPT = (
        "You are a web research assistant. Search the web and synthesize information "
        "about the user's query. Provide a clear, well-organized summary with:\n"
        "- Key facts, data points, and statistics\n"
        "- Specific names, dates, and numbers when available\n"
        "- Multiple perspectives if the topic is debated\n"
        "Cite your sources inline. Be comprehensive but concise."
    )

    def __init__(self):
        self.chat = ParallelChat()

    def search(
        self,
        objective: str,
        model: str = "base",
    ) -> Dict[str, Any]:
        """Execute a web search via the Chat API.

        Args:
            objective: Natural language description of the search goal.
            model: Chat model to use ('base' or 'core', default 'base').

        Returns:
            Dict with 'response' (synthesized text), 'sources', and metadata.
        """
        result = self.chat.query(
            user_message=objective,
            system_message=self.SYSTEM_PROMPT,
            model=model,
        )
        if not result["success"]:
            return {
                "success": False,
                "objective": objective,
                "error": result.get("error", "Unknown error"),
                "timestamp": result["timestamp"],
            }
        return {
            "success": True,
            "objective": objective,
            "response": result["content"],
            "sources": result["sources"],
            "citation_count": result["citation_count"],
            "model": result["model"],
            "backend": "parallel-chat",
            "timestamp": result["timestamp"],
        }


class ParallelExtract:
    """Extract clean content from URLs using Parallel's Extract API.

    Converts any public URL into clean, LLM-optimized markdown.
    Use for citation verification and special cases only.
    For general research, use ParallelSearch or ParallelDeepResearch instead.
    """

    def __init__(self):
        self.client = _get_extract_client()

    def extract(
        self,
        urls: List[str],
        objective: Optional[str] = None,
        excerpts: bool = True,
        full_content: bool = False,
    ) -> Dict[str, Any]:
        """Extract content from one or more URLs.

        Args:
            urls: List of URLs to extract content from.
            objective: Optional objective to focus extraction.
            excerpts: Whether to return focused excerpts (default True).
            full_content: Whether to return full page content (default False).

        Returns:
            Dict with 'results' list containing url, title, excerpts/content.
        """
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        kwargs = {
            "urls": urls,
            "excerpts": excerpts,
            "full_content": full_content,
        }
        if objective:
            kwargs["objective"] = objective
        try:
            response = self.client.beta.extract(**kwargs)
            results = []
            if hasattr(response, "results") and response.results:
                for r in response.results:
                    result = {
                        "url": getattr(r, "url", ""),
                        "title": getattr(r, "title", ""),
                        "publish_date": getattr(r, "publish_date", None),
                        "excerpts": getattr(r, "excerpts", []),
                        "full_content": getattr(r, "full_content", None),
                    }
                    results.append(result)
            errors = []
            if hasattr(response, "errors") and response.errors:
                errors = [str(e) for e in response.errors]
            return {
                "success": True,
                "urls": urls,
                "results": results,
                "errors": errors,
                "timestamp": timestamp,
                "extract_id": getattr(response, "extract_id", None),
            }
        except Exception as e:
            return {
                "success": False,
                "urls": urls,
                "error": str(e),
                "timestamp": timestamp,
            }


class ParallelDeepResearch:
    """Deep research using the Parallel Chat API (core model).

    Sends complex research queries to the Chat API which performs
    multi-source web research and returns comprehensive reports with citations.
    """

    SYSTEM_PROMPT = (
        "You are a deep research analyst. Provide a comprehensive, well-structured "
        "research report on the user's topic. Include:\n"
        "- Executive summary of key findings\n"
        "- Detailed analysis organized by themes\n"
        "- Specific data, statistics, and quantitative evidence\n"
        "- Multiple authoritative sources\n"
        "- Implications and future outlook where relevant\n"
        "Use markdown formatting with clear section headers. "
        "Cite all sources inline."
    )

    def __init__(self):
        self.chat = ParallelChat()

    def research(
        self,
        query: str,
        model: str = "core",
        system_prompt: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Run deep research via the Chat API.

        Args:
            query: The research question or topic.
            model: Chat model to use ('base' or 'core', default 'core').
            system_prompt: Optional override for the system prompt.

        Returns:
            Dict with 'response' (markdown report), 'citations', and metadata.
        """
        result = self.chat.query(
            user_message=query,
            system_message=system_prompt or self.SYSTEM_PROMPT,
            model=model,
        )
        if not result["success"]:
            return {
                "success": False,
                "query": query,
                "error": result.get("error", "Unknown error"),
                "model": model,
                "timestamp": result["timestamp"],
            }
        return {
            "success": True,
            "query": query,
            "response": result["content"],
            "output": result["content"],
            "citations": result["sources"],
            "sources": result["sources"],
            "citation_count": result["citation_count"],
            "model": model,
            "backend": "parallel-chat",
            "timestamp": result["timestamp"],
        }


# ---------------------------------------------------------------------------
# CLI Interface
# ---------------------------------------------------------------------------

def _print_search_results(result: Dict[str, Any], output_file=None):
    """Print search results (synthesized summary + sources)."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return
    write(f"\n{'='*80}")
    write(f"Search: {result['objective']}")
    write(f"Model: {result['model']} | Time: {result['timestamp']}")
    write(f"{'='*80}\n")
    write(result.get("response", "No response received."))
    sources = result.get("sources", [])
    if sources:
        write(f"\n\n{'='*40} SOURCES {'='*40}")
        for i, src in enumerate(sources):
            title = src.get("title", "Untitled")
            url = src.get("url", "")
            write(f"  [{i+1}] {title}")
            if url:
                write(f"      {url}")


def _print_extract_results(result: Dict[str, Any], output_file=None):
    """Pretty-print extract results."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return
    write(f"\n{'='*80}")
    write(f"Extracted from: {', '.join(result['urls'])}")
    write(f"Time: {result['timestamp']}")
    write(f"{'='*80}")
    for i, r in enumerate(result["results"]):
        write(f"\n--- [{i+1}] {r['title']} ---")
        write(f"URL: {r['url']}")
        if r.get("full_content"):
            write(f"\n{r['full_content']}")
        elif r.get("excerpts"):
            for j, excerpt in enumerate(r["excerpts"]):
                write(f"\nExcerpt {j+1}:")
                write(excerpt[:2000] if len(excerpt) > 2000 else excerpt)
    if result.get("errors"):
        write(f"\nErrors: {result['errors']}")


def _print_research_results(result: Dict[str, Any], output_file=None):
    """Print deep research results (report + sources)."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return
    write(f"\n{'='*80}")
    query_display = result['query'][:100]
    if len(result['query']) > 100:
        query_display += "..."
    write(f"Research: {query_display}")
    write(f"Model: {result['model']} | Citations: {result.get('citation_count', 0)} | Time: {result['timestamp']}")
    write(f"{'='*80}\n")
    write(result.get("response", result.get("output", "No output received.")))
    citations = result.get("citations", result.get("sources", []))
    if citations:
        write(f"\n\n{'='*40} SOURCES {'='*40}")
        seen_urls = set()
        for cit in citations:
            url = cit.get("url", "")
            if url and url not in seen_urls:
                seen_urls.add(url)
                title = cit.get("title", "Untitled")
                write(f"  [{len(seen_urls)}] {title}")
                write(f"      {url}")


def main():
    parser = argparse.ArgumentParser(
        description="Parallel Web Systems API Client - Search, Extract, and Deep Research",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python parallel_web.py search "latest advances in quantum computing"
  python parallel_web.py search "climate policy 2025" --model core
  python parallel_web.py extract "https://example.com" --objective "key findings"
  python parallel_web.py research "comprehensive analysis of EV battery market"
  python parallel_web.py research "compare mRNA vs protein subunit vaccines" --model base
  python parallel_web.py research "AI regulation landscape 2025" -o report.md
""",
    )
    subparsers = parser.add_subparsers(dest="command", help="API command")

    # --- search subcommand ---
    search_parser = subparsers.add_parser("search", help="Web search via Chat API (synthesized results)")
    search_parser.add_argument("objective", help="Natural language search objective")
    search_parser.add_argument("--model", default="base", choices=["base", "core"],
                               help="Chat model to use (default: base)")
    search_parser.add_argument("-o", "--output", help="Write output to file")
    search_parser.add_argument("--json", action="store_true", help="Output as JSON")

    # --- extract subcommand ---
    extract_parser = subparsers.add_parser("extract", help="Extract content from URLs (verification only)")
    extract_parser.add_argument("urls", nargs="+", help="One or more URLs to extract")
    extract_parser.add_argument("--objective", help="Objective to focus extraction")
    extract_parser.add_argument("--full-content", action="store_true", help="Return full page content")
    extract_parser.add_argument("-o", "--output", help="Write output to file")
    extract_parser.add_argument("--json", action="store_true", help="Output as JSON")

    # --- research subcommand ---
    research_parser = subparsers.add_parser("research", help="Deep research via Chat API (comprehensive report)")
    research_parser.add_argument("query", help="Research question or topic")
    research_parser.add_argument("--model", default="core", choices=["base", "core"],
                                 help="Chat model to use (default: core)")
    research_parser.add_argument("-o", "--output", help="Write output to file")
    research_parser.add_argument("--json", action="store_true", help="Output as JSON")

    args = parser.parse_args()
    if not args.command:
        parser.print_help()
        return 1

    output_file = None
    if hasattr(args, "output") and args.output:
        output_file = open(args.output, "w", encoding="utf-8")
    try:
        if args.command == "search":
            searcher = ParallelSearch()
            result = searcher.search(
                objective=args.objective,
                model=args.model,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_search_results(result, output_file)
        elif args.command == "extract":
            extractor = ParallelExtract()
            result = extractor.extract(
                urls=args.urls,
                objective=args.objective,
                full_content=args.full_content,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_extract_results(result, output_file)
        elif args.command == "research":
            researcher = ParallelDeepResearch()
            result = researcher.research(
                query=args.query,
                model=args.model,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_research_results(result, output_file)
        return 0
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1
    finally:
        if output_file:
            output_file.close()


if __name__ == "__main__":
    sys.exit(main())
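The classes above can also be imported directly rather than invoked via the CLI. A minimal sketch of consuming the dict that `ParallelSearch.search` returns on success; the live call requires `PARALLEL_API_KEY` and network access, so the result is stubbed here and the URLs/titles are illustrative only:

```python
# Stubbed result matching the shape ParallelSearch.search returns on success.
# A real call requires PARALLEL_API_KEY; values below are placeholders.
result = {
    "success": True,
    "objective": "recent advances in perovskite solar cells",
    "response": "Synthesized summary text...",
    "sources": [
        {"type": "source", "url": "https://example.com/a", "title": "Source A", "excerpts": []},
        {"type": "source", "url": "https://example.com/b", "title": "Source B", "excerpts": []},
    ],
    "citation_count": 2,
    "model": "base",
    "backend": "parallel-chat",
    "timestamp": "2026-03-03 00:00:00",
}

def format_source_list(result):
    """Render the 'sources' list as numbered reference lines."""
    return [
        f"[{i + 1}] {src.get('title') or 'Untitled'} - {src['url']}"
        for i, src in enumerate(result.get("sources", []))
    ]
```

Downstream code should always branch on `result["success"]` first, since failures carry only `error`, `model`, and `timestamp`.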


@@ -0,0 +1,156 @@
# Research Lookup Skill
This skill provides real-time research information lookup using Perplexity's Sonar Pro Search model through OpenRouter.
## Setup
1. **Get OpenRouter API Key:**
- Visit [openrouter.ai](https://openrouter.ai)
- Create account and generate API key
- Add credits to your account
2. **Configure Environment:**
```bash
export OPENROUTER_API_KEY="your_api_key_here"
```
3. **Test Setup:**
```bash
python scripts/research_lookup.py --model-info
```
## Usage
### Command Line Usage
```bash
# Single research query
python scripts/research_lookup.py "Recent advances in CRISPR gene editing 2024"
# Multiple queries with delay
python scripts/research_lookup.py --batch "CRISPR applications" "gene therapy trials" "ethical considerations"
# Claude Code integration (called automatically)
python lookup.py "your research query here"
```
### Claude Code Integration
The research lookup tool is automatically available in Claude Code when you:
1. **Ask research questions:** "Research recent advances in quantum computing"
2. **Request literature reviews:** "Find current studies on climate change impacts"
3. **Need citations:** "What are the latest papers on transformer attention mechanisms?"
4. **Want technical information:** "Standard protocols for flow cytometry"
## Features
- **Academic Focus:** Prioritizes peer-reviewed papers and reputable sources
- **Current Information:** Focuses on recent publications (2020-2024)
- **Complete Citations:** Provides full bibliographic information with DOIs
- **Multiple Formats:** Supports various query types and research needs
- **High Search Context:** Always uses high search context for deeper, more comprehensive research
- **Quality Prioritization:** Automatically prioritizes highly-cited papers from top venues
- **Cost Effective:** Typically $0.01-0.05 per research query
## Paper Quality Prioritization
This skill **always prioritizes high-impact, influential papers** over obscure publications. Results are ranked by:
### Citation-Based Ranking
| Paper Age | Citation Threshold | Classification |
|-----------|-------------------|----------------|
| 0-3 years | 20+ citations | Noteworthy |
| 0-3 years | 100+ citations | Highly Influential |
| 3-7 years | 100+ citations | Significant |
| 3-7 years | 500+ citations | Landmark |
| 7+ years | 500+ citations | Seminal |
| 7+ years | 1000+ citations | Foundational |
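The ranking table above can be sketched as a small classifier. Thresholds come directly from the table; the age-bucket boundaries at exactly 3 and 7 years and the fallback label for papers below every threshold are assumptions:

```python
def classify_paper(age_years: float, citations: int) -> str:
    """Classify a paper using the citation-based ranking table above."""
    if age_years <= 3:
        if citations >= 100:
            return "Highly Influential"
        if citations >= 20:
            return "Noteworthy"
    elif age_years <= 7:
        if citations >= 500:
            return "Landmark"
        if citations >= 100:
            return "Significant"
    else:
        if citations >= 1000:
            return "Foundational"
        if citations >= 500:
            return "Seminal"
    # Below every threshold for its age bucket (label is an assumption)
    return "Standard"
```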
### Venue Quality Tiers
Papers from higher-tier venues are always preferred:
- **Tier 1 (Highest Priority):** Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS, Nature Medicine, Nature Biotechnology
- **Tier 2 (High Priority):** High-impact journals (IF>10), top conferences (NeurIPS, ICML, ICLR for ML/AI)
- **Tier 3 (Good):** Respected specialized journals (IF 5-10)
- **Tier 4 (Use Sparingly):** Other peer-reviewed venues
### Author Reputation
The skill prefers papers from:
- Senior researchers with high h-index
- Established research groups at recognized institutions
- Authors with multiple publications in Tier-1 venues
- Researchers with recognized expertise (awards, editorial positions)
### Relevance Priority
1. Papers directly addressing the research question
2. Papers with applicable methods/data
3. Tangentially related papers (only from top venues or highly cited)
## Query Examples
### Academic Research
- "Recent systematic reviews on AI in medical diagnosis 2024"
- "Meta-analysis of randomized controlled trials for depression treatment"
- "Current state of quantum computing error correction research"
### Technical Methods
- "Standard protocols for immunohistochemistry in tissue samples"
- "Best practices for machine learning model validation"
- "Statistical methods for analyzing longitudinal data"
### Statistical Data
- "Global renewable energy adoption statistics 2024"
- "Prevalence of diabetes in different populations"
- "Market size for autonomous vehicles industry"
## Response Format
Each research result includes:
- **Summary:** Brief overview of key findings
- **Key Studies:** 3-5 most relevant recent papers
- **Citations:** Complete bibliographic information
- **Usage Stats:** Token usage for cost tracking
- **Timestamp:** When the research was performed
## Integration with Scientific Writing
This skill enhances the scientific writing process by providing:
1. **Literature Reviews:** Current research for introduction sections
2. **Methods Validation:** Verify protocols against current standards
3. **Results Context:** Compare findings with recent similar studies
4. **Discussion Support:** Latest evidence for arguments
5. **Citation Management:** Properly formatted references
## Troubleshooting
**"API key not found"**
- Ensure `OPENROUTER_API_KEY` environment variable is set
- Check that you have credits in your OpenRouter account
**"Model not available"**
- Verify your API key has access to Perplexity models
- Check OpenRouter status page for service issues
**"Rate limit exceeded"**
- Add delays between requests using `--delay` option
- Check your OpenRouter account limits
**"No relevant results"**
- Try more specific or broader queries
- Include time frames (e.g., "2023-2024")
- Use academic keywords and technical terms
## Cost Management
- Monitor usage through OpenRouter dashboard
- Typical costs: $0.01-0.05 per research query
- Batch processing available for multiple queries
- Consider query specificity to optimize token usage
This skill is designed for academic and research purposes, providing high-quality, cited information to support scientific writing and research activities.


@@ -1,27 +1,35 @@
---
name: research-lookup
description: "Look up current research information using Perplexity's Sonar Pro Search or Sonar Reasoning Pro models through OpenRouter. Automatically selects the best model based on query complexity. Search academic papers, recent studies, technical documentation, and general research information with citations."
description: Look up current research information using the Parallel Chat API (primary) or Perplexity sonar-pro-search (academic paper searches). Automatically routes queries to the best backend. Use for finding papers, gathering research data, and verifying scientific information.
allowed-tools: Read Write Edit Bash
license: MIT license
compatibility: PARALLEL_API_KEY and OPENROUTER_API_KEY required
metadata:
skill-author: K-Dense Inc.
---
# Research Information Lookup
## Overview
This skill enables real-time research information lookup using Perplexity's Sonar models through OpenRouter. It intelligently selects between **Sonar Pro Search** (fast, efficient lookup) and **Sonar Reasoning Pro** (deep analytical reasoning) based on query complexity. The skill provides access to current academic literature, recent studies, technical documentation, and general research information with proper citations and source attribution.
This skill provides real-time research information lookup with **intelligent backend routing**:
- **Parallel Chat API** (`core` model): Default backend for all general research queries. Provides comprehensive, multi-source research reports with inline citations via the OpenAI-compatible Chat API at `https://api.parallel.ai`.
- **Perplexity sonar-pro-search** (via OpenRouter): Used only for academic-specific paper searches where scholarly database access is critical.
The skill automatically detects query type and routes to the optimal backend.
## When to Use This Skill
Use this skill when you need:
- **Current Research Information**: Latest studies, papers, and findings in a specific field
- **Current Research Information**: Latest studies, papers, and findings
- **Literature Verification**: Check facts, statistics, or claims against current research
- **Background Research**: Gather context and supporting evidence for scientific writing
- **Citation Sources**: Find relevant papers and studies to cite in manuscripts
- **Citation Sources**: Find relevant papers and studies to cite
- **Technical Documentation**: Look up specifications, protocols, or methodologies
- **Recent Developments**: Stay current with emerging trends and breakthroughs
- **Statistical Data**: Find recent statistics, survey results, or research findings
- **Expert Opinions**: Access insights from recent interviews, reviews, or commentary
- **Market/Industry Data**: Current statistics, trends, competitive intelligence
- **Recent Developments**: Emerging trends, breakthroughs, announcements
## Visual Enhancement with Scientific Schematics
@@ -30,269 +38,133 @@ Use this skill when you need:
If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic
**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.
**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```
The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory
---
**When to add schematics:**
- Research information flow diagrams
- Query processing workflow illustrations
- Model selection decision trees
- System integration architecture diagrams
- Information retrieval pipeline visualizations
- Knowledge synthesis frameworks
- Any complex concept that benefits from visualization
For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
## Automatic Backend Selection
The skill automatically routes queries to the best backend based on content:
### Routing Logic
```
Query arrives
|
+-- Contains academic keywords? (papers, DOI, journal, peer-reviewed, etc.)
| YES --> Perplexity sonar-pro-search (academic search mode)
|
+-- Everything else (general research, market data, technical info, analysis)
--> Parallel Chat API (core model)
```
### Academic Keywords (Routes to Perplexity)
Queries containing these terms are routed to Perplexity for academic-focused search:
- Paper finding: `find papers`, `find articles`, `research papers on`, `published studies`
- Citations: `cite`, `citation`, `doi`, `pubmed`, `pmid`
- Academic sources: `peer-reviewed`, `journal article`, `scholarly`, `arxiv`, `preprint`
- Review types: `systematic review`, `meta-analysis`, `literature search`
- Paper quality: `foundational papers`, `seminal papers`, `landmark papers`, `highly cited`
### Everything Else (Routes to Parallel)
All other queries go to the Parallel Chat API (core model), including:
- General research questions
- Market and industry analysis
- Technical information and documentation
- Current events and recent developments
- Comparative analysis
- Statistical data retrieval
- Complex analytical queries
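The routing above reduces to a keyword check. A hedged sketch (the shipped `research_lookup.py` may use a longer keyword list and word-boundary matching rather than plain substring tests):

```python
# Keywords taken from the routing rules above; list may be incomplete.
ACADEMIC_KEYWORDS = [
    "find papers", "find articles", "research papers on", "published studies",
    "cite", "citation", "doi", "pubmed", "pmid",
    "peer-reviewed", "journal article", "scholarly", "arxiv", "preprint",
    "systematic review", "meta-analysis", "literature search",
    "foundational papers", "seminal papers", "landmark papers", "highly cited",
]

def choose_backend(query: str) -> str:
    """Return 'perplexity' for academic paper searches, else 'parallel'."""
    q = query.lower()
    if any(kw in q for kw in ACADEMIC_KEYWORDS):
        return "perplexity"
    return "parallel"
```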
### Manual Override
You can force a specific backend:
```bash
# Force Parallel Deep Research
python research_lookup.py "your query" --force-backend parallel
# Force Perplexity academic search
python research_lookup.py "your query" --force-backend perplexity
```
---
## Core Capabilities
### 1. Academic Research Queries
**Search Academic Literature**: Query for recent papers, studies, and reviews in specific domains:
### 1. General Research Queries (Parallel Chat API)
**Default backend.** Provides comprehensive, multi-source research with citations via the Chat API (`core` model).
```
Query Examples:
- "Recent advances in CRISPR gene editing 2024"
- "Latest clinical trials for Alzheimer's disease treatment"
- "Machine learning applications in drug discovery systematic review"
- "Climate change impacts on biodiversity meta-analysis"
- "Recent advances in CRISPR gene editing 2025"
- "Compare mRNA vaccines vs traditional vaccines for cancer treatment"
- "AI adoption in healthcare industry statistics"
- "Global renewable energy market trends and projections"
- "Explain the mechanism underlying gut microbiome and depression"
```
**Expected Response Format**:
- Summary of key findings from recent literature
- Citation of 3-5 most relevant papers with authors, titles, journals, and years
- Key statistics or findings highlighted
- Identification of research gaps or controversies
- Links to full papers when available
**Response includes:**
- Comprehensive research report in markdown
- Inline citations from authoritative web sources
- Structured sections with key findings
- Multiple perspectives and data points
- Source URLs for verification
### 2. Technical and Methodological Information
**Protocol and Method Lookups**: Find detailed procedures, specifications, and methodologies:
### 2. Academic Paper Search (Perplexity sonar-pro-search)
**Used for academic-specific queries.** Prioritizes scholarly databases and peer-reviewed sources.
```
Query Examples:
- "Find papers on transformer attention mechanisms in NeurIPS 2024"
- "Foundational papers on quantum error correction"
- "Systematic review of immunotherapy in non-small cell lung cancer"
- "Cite the original BERT paper and its most influential follow-ups"
- "Published studies on CRISPR off-target effects in clinical trials"
```
**Response includes:**
- Summary of key findings from academic literature
- 5-8 high-quality citations with authors, titles, journals, years, DOIs
- Citation counts and venue tier indicators
- Key statistics and methodology highlights
- Research gaps and future directions
### 3. Technical and Methodological Information
```
Query Examples:
- "Western blot protocol for protein detection"
- "RNA sequencing library preparation methods"
- "Statistical power analysis for clinical trials"
- "Machine learning model evaluation metrics"
- "Machine learning model evaluation metrics comparison"
```
**Expected Response Format**:
- Step-by-step procedures or protocols
- Required materials and equipment
- Critical parameters and considerations
- Troubleshooting common issues
- References to standard protocols or seminal papers
### 3. Statistical and Data Information
**Research Statistics**: Look up current statistics, survey results, and research data:
### 4. Statistical and Market Data
```
Query Examples:
- "Prevalence of diabetes in US population 2024"
- "Global renewable energy adoption statistics"
- "Prevalence of diabetes in US population 2025"
- "Global AI market size and growth projections"
- "COVID-19 vaccination rates by country"
- "AI adoption in healthcare industry survey"
```
**Expected Response Format**:
- Current statistics with dates and sources
- Methodology of data collection
- Confidence intervals or margins of error when available
- Comparison with previous years or benchmarks
- Citations to original surveys or studies
### 4. Citation and Reference Assistance
**Citation Finding**: Locate the most influential, highly-cited papers from reputable authors and prestigious venues:
```
Query Examples:
- "Foundational papers on transformer architecture" (expect: Vaswani et al. 2017 in NeurIPS, 90,000+ citations)
- "Seminal works in quantum computing" (expect: papers from Nature, Science by leading researchers)
- "Key studies on climate change mitigation" (expect: IPCC-cited papers, Nature Climate Change)
- "Landmark trials in cancer immunotherapy" (expect: NEJM, Lancet trials with 1000+ citations)
```
**Expected Response Format**:
- 5-10 most influential papers, **ranked by impact and relevance**
- Complete citation information (authors, title, journal, year, DOI)
- **Citation count** for each paper (approximate if exact unavailable)
- **Venue tier** indication (Nature, Science, Cell = Tier 1, etc.)
- Brief description of each paper's contribution
- **Author credentials** when notable (e.g., "from the Hinton lab", "Nobel laureate")
- Journal impact factors when relevant
**Quality Criteria for Citation Selection**:
- Prefer papers with **100+ citations** (for papers 3+ years old)
- Prioritize **Tier-1 journals** (Nature, Science, Cell, NEJM, Lancet)
- Include work from **recognized leaders** in the field
- Balance **foundational papers** (high citations, older) with **recent advances** (emerging, high-impact venues)
## Automatic Model Selection
This skill features **intelligent model selection** based on query complexity:
### Model Types
**1. Sonar Pro Search** (`perplexity/sonar-pro-search`)
- **Use Case**: Straightforward information lookup
- **Best For**:
- Simple fact-finding queries
- Recent publication searches
- Basic protocol lookups
- Statistical data retrieval
- **Speed**: Fast responses
- **Cost**: Lower cost per query
**2. Sonar Reasoning Pro** (`perplexity/sonar-reasoning-pro`)
- **Use Case**: Complex analytical queries requiring deep reasoning
- **Best For**:
- Comparative analysis ("compare X vs Y")
- Synthesis of multiple studies
- Evaluating trade-offs or controversies
- Explaining mechanisms or relationships
- Critical analysis and interpretation
- **Speed**: Slower but more thorough
- **Cost**: Higher cost per query, but provides deeper insights
### Complexity Assessment
The skill automatically detects query complexity using these indicators:
**Reasoning Keywords** (triggers Sonar Reasoning Pro):
- Analytical: `compare`, `contrast`, `analyze`, `analysis`, `evaluate`, `critique`
- Comparative: `versus`, `vs`, `vs.`, `compared to`, `differences between`, `similarities`
- Synthesis: `meta-analysis`, `systematic review`, `synthesis`, `integrate`
- Causal: `mechanism`, `why`, `how does`, `how do`, `explain`, `relationship`, `causal relationship`, `underlying mechanism`
- Theoretical: `theoretical framework`, `implications`, `interpret`, `reasoning`
- Debate: `controversy`, `conflicting`, `paradox`, `debate`, `reconcile`
- Trade-offs: `pros and cons`, `advantages and disadvantages`, `trade-off`, `tradeoff`, `trade offs`
- Complexity: `multifaceted`, `complex interaction`, `critical analysis`
**Complexity Scoring**:
- Reasoning keywords: 3 points each (heavily weighted)
- Multiple questions: 2 points per question mark
- Complex sentence structures: 1.5 points per clause indicator (and, or, but, however, whereas, although)
- Very long queries: 1 point if >150 characters
- **Threshold**: Queries scoring ≥3 points trigger Sonar Reasoning Pro
**Practical Result**: Even a single strong reasoning keyword (compare, explain, analyze, etc.) will trigger the more powerful Sonar Reasoning Pro model, ensuring you get deep analysis when needed.
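The scoring scheme above can be sketched as a small function. This is a minimal illustration, not the skill's actual source: the keyword list is abbreviated, substring matching is deliberately naive, and the function names are assumptions.

```python
import re

# Abbreviated keyword list; the skill's full list appears in the tables above.
REASONING_KEYWORDS = [
    "compare", "contrast", "analyze", "analysis", "evaluate", "critique",
    "versus", "vs", "meta-analysis", "mechanism", "explain", "trade-off",
    "controversy", "pros and cons",
]
CLAUSE_INDICATORS = {"and", "or", "but", "however", "whereas", "although"}

def complexity_score(query: str) -> float:
    q = query.lower()
    # Naive substring matching for brevity; 3 points per reasoning keyword
    score = 3.0 * sum(1 for kw in REASONING_KEYWORDS if kw in q)
    score += 2.0 * q.count("?")                                     # 2 points per question mark
    words = re.findall(r"[a-z']+", q)
    score += 1.5 * sum(1 for w in words if w in CLAUSE_INDICATORS)  # 1.5 per clause indicator
    if len(query) > 150:                                            # 1 point for very long queries
        score += 1.0
    return score

def select_model(query: str) -> str:
    """Return 'reasoning' (Sonar Reasoning Pro) when the score meets the >=3 threshold."""
    return "reasoning" if complexity_score(query) >= 3 else "pro"
```

Note how a single keyword like "compare" already reaches the threshold, matching the behavior described above.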
**Example Query Classification**:
**Sonar Pro Search** (straightforward lookup):
- "Recent advances in CRISPR gene editing 2024"
- "Prevalence of diabetes in US population"
- "Western blot protocol for protein detection"
**Sonar Reasoning Pro** (complex analysis):
- "Compare and contrast mRNA vaccines vs traditional vaccines for cancer treatment"
- "Explain the mechanism underlying the relationship between gut microbiome and depression"
- "Analyze the controversy surrounding AI in medical diagnosis and evaluate trade-offs"
### Manual Override
You can force a specific model using the `force_model` parameter:
```python
# Force Sonar Pro Search for fast lookup
research = ResearchLookup(force_model='pro')
# Force Sonar Reasoning Pro for deep analysis
research = ResearchLookup(force_model='reasoning')
# Automatic selection (default)
research = ResearchLookup()
```
Command-line usage:
```bash
# Force Sonar Pro Search
python research_lookup.py "your query" --force-model pro
# Force Sonar Reasoning Pro
python research_lookup.py "your query" --force-model reasoning
# Automatic (no flag)
python research_lookup.py "your query"
# Save output to a file
python research_lookup.py "your query" -o results.txt
# Output as JSON (useful for programmatic access)
python research_lookup.py "your query" --json
# Combine: JSON output saved to file
python research_lookup.py "your query" --json -o results.json
```
## Technical Integration
### OpenRouter API Configuration
This skill integrates with OpenRouter (openrouter.ai) to access Perplexity's Sonar models:
**Model Specifications**:
- **Models**:
- `perplexity/sonar-pro-search` (fast lookup)
  - `perplexity/sonar-reasoning-pro` (deep analysis)
- **Search Mode**: Academic/scholarly mode (prioritizes peer-reviewed sources)
- **Search Context**: Always uses `high` search context for deeper, more comprehensive research results
- **Context Window**: 200K+ tokens for comprehensive research
- **Capabilities**: Academic paper search, citation generation, scholarly analysis
- **Output**: Rich responses with citations and source links from academic databases
**API Requirements**:
- OpenRouter API key (set as `OPENROUTER_API_KEY` environment variable)
- Account with sufficient credits for research queries
- Proper attribution and citation of sources
**Academic Mode Configuration**:
- System message configured to prioritize scholarly sources
- Search focused on peer-reviewed journals and academic publications
- Enhanced citation extraction for academic references
- Preference for recent academic literature (2020-2024)
- Direct access to academic databases and repositories
### Response Quality and Reliability
**Source Verification**: The skill prioritizes:
- Peer-reviewed academic papers and journals
- Reputable institutional sources (universities, government agencies, NGOs)
- Recent publications (within last 2-3 years preferred)
- High-impact journals and conferences
- Primary research over secondary sources
**Citation Standards**: All responses include:
- Complete bibliographic information
- DOI or stable URLs when available
- Access dates for web sources
- Clear attribution of direct quotes or data
---
## Paper Quality and Popularity Prioritization
**CRITICAL**: When searching for papers, ALWAYS prioritize high-quality, influential papers over obscure or low-impact publications. Quality matters more than quantity.
### Citation-Based Ranking
Prioritize papers based on citation count relative to their age:
| Paper Age | Citation Threshold | Classification |
|-----------|-------------------|----------------|
| 0-3 years | 20+ citations | Noteworthy |
@@ -302,305 +174,240 @@ Prioritize papers based on citation count relative to their age:
| 7+ years | 500+ citations | Seminal Work |
| 7+ years | 1000+ citations | Foundational |
**When reporting citations**: Always indicate approximate citation count when known (e.g., "cited 500+ times" or "highly cited").
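The visible rows of the table can be expressed as a small helper. Note that the middle tiers (roughly 3-7 years) are elided by the diff hunk above, so they are intentionally not encoded; the function name and the `None` fallback are illustrative assumptions.

```python
from typing import Optional

def classify_paper(age_years: float, citations: int) -> Optional[str]:
    """Classify a paper by citation count relative to age, per the visible table rows.

    The 3-7 year tiers are elided in the diff above, so they are not encoded here.
    """
    if age_years <= 3 and citations >= 20:
        return "Noteworthy"
    if age_years >= 7:
        if citations >= 1000:
            return "Foundational"
        if citations >= 500:
            return "Seminal Work"
    return None  # falls in an elided tier or below the visible thresholds
```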
### Venue Quality Tiers
Prioritize papers from higher-tier venues:
**Tier 1 - Premier Venues** (Always prefer):
- **General Science**: Nature, Science, Cell, PNAS
- **Medicine**: NEJM, Lancet, JAMA, BMJ
- **Field-Specific Flagships**: Nature Medicine, Nature Biotechnology, Nature Methods, Nature Genetics, Cell Stem Cell, Immunity
- **Top CS/AI**: NeurIPS, ICML, ICLR, ACL, CVPR (for ML/AI topics)
**Tier 2 - High-Impact Specialized** (Strong preference):
- Journals with Impact Factor > 10
- Top conferences in subfields (e.g., EMNLP, NAACL, ECCV, MICCAI)
- Society flagship journals (e.g., Blood, Circulation, Gastroenterology)
**Tier 3 - Respected Specialized** (Include when relevant):
- Journals with Impact Factor 5-10
- Established conferences in the field
- Well-indexed specialized journals
**Tier 4 - Other Peer-Reviewed** (Use sparingly):
- Lower-impact journals, only if directly relevant and no better source exists
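The tier hierarchy can be sketched as a lookup. The venue sets below are drawn from the lists above; Tier 3 is impact-factor-based so it cannot be enumerated, and treating an unknown venue without an impact factor as unclassifiable (`None`) is an assumption of this sketch.

```python
from typing import Optional

TIER_1 = {"Nature", "Science", "Cell", "PNAS", "NEJM", "Lancet", "JAMA", "BMJ",
          "Nature Medicine", "Nature Biotechnology", "Nature Methods",
          "NeurIPS", "ICML", "ICLR", "ACL", "CVPR"}
TIER_2_CONFERENCES = {"EMNLP", "NAACL", "ECCV", "MICCAI"}

def venue_tier(venue: str, impact_factor: Optional[float] = None) -> Optional[int]:
    """Map a venue to its quality tier; journal tiers 2-4 need an impact factor."""
    if venue in TIER_1:
        return 1
    if venue in TIER_2_CONFERENCES or (impact_factor is not None and impact_factor > 10):
        return 2
    if impact_factor is not None and 5 <= impact_factor <= 10:
        return 3
    # Tier 4 only when we at least know the impact factor; otherwise unclassifiable
    return 4 if impact_factor is not None else None
```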
---
### Author Reputation Indicators
Prefer papers from established, reputable researchers:
- **Senior authors with high h-index** (>40 in established fields)
- **Multiple publications in Tier-1 venues**
- **Leadership positions** at recognized research institutions
- **Recognized expertise**: Awards, editorial positions, society fellows
- **First/last author on landmark papers** in the field
### Direct Relevance Scoring
Always prioritize papers that directly address the research question:
1. **Primary Priority**: Papers directly addressing the exact research question
2. **Secondary Priority**: Papers with applicable methods, data, or conceptual frameworks
3. **Tertiary Priority**: Tangentially related papers (include ONLY if from Tier-1 venues or highly cited)
### Practical Application
When conducting research lookups:
1. **Start with the most influential papers** - Look for highly-cited, foundational work first
2. **Prioritize Tier-1 venues** - Nature, Science, Cell family journals, NEJM, Lancet for medical topics
3. **Check author credentials** - Prefer work from established research groups
4. **Balance recency with impact** - Recent highly-cited papers > older obscure papers > recent uncited papers
5. **Report quality indicators** - Include citation counts, journal names, and author affiliations in responses
**Example Quality-Focused Query Response**:
```
Key findings from high-impact literature:
1. Smith et al. (2023), Nature Medicine (IF: 82.9, cited 450+ times)
   - Senior author: Prof. John Smith, Harvard Medical School
   - Key finding: [finding]
2. Johnson & Lee (2024), Cell (IF: 64.5, cited 120+ times)
   - From the renowned Lee Lab at Stanford
   - Key finding: [finding]
3. Chen et al. (2022), NEJM (IF: 158.5, cited 890+ times)
   - Landmark clinical trial (N=5,000)
   - Key finding: [finding]
```
## Technical Integration
### Environment Variables
```bash
# Primary backend (Parallel Chat API) - REQUIRED
export PARALLEL_API_KEY="your_parallel_api_key"

# Academic search backend (Perplexity) - REQUIRED for academic queries
export OPENROUTER_API_KEY="your_openrouter_api_key"
```
### API Specifications
**Parallel Chat API:**
- Endpoint: `https://api.parallel.ai` (OpenAI SDK compatible)
- Model: `core` (60s-5min latency, complex multi-source synthesis)
- Output: Markdown text with inline citations
- Citations: Research basis with URLs, reasoning, and confidence levels
- Rate limits: 300 req/min
- Python package: `openai`
**Perplexity sonar-pro-search:**
- Model: `perplexity/sonar-pro-search` (via OpenRouter)
- Search mode: Academic (prioritizes peer-reviewed sources)
- Search context: High (comprehensive research)
- Response time: 5-15 seconds
### Command-Line Usage
```bash
# Auto-routed research (recommended) — ALWAYS save to sources/
python research_lookup.py "your query" -o sources/research_YYYYMMDD_HHMMSS_<topic>.md

# Force specific backend — ALWAYS save to sources/
python research_lookup.py "your query" --force-backend parallel -o sources/research_<topic>.md
python research_lookup.py "your query" --force-backend perplexity -o sources/papers_<topic>.md

# JSON output — ALWAYS save to sources/
python research_lookup.py "your query" --json -o sources/research_<topic>.json

# Batch queries — ALWAYS save to sources/
python research_lookup.py --batch "query 1" "query 2" "query 3" -o sources/batch_research_<topic>.md
```
## Query Best Practices
### 1. Model Selection Strategy
**For Simple Lookups (Sonar Pro Search)**:
- Recent papers on a specific topic
- Statistical data or prevalence rates
- Standard protocols or methodologies
- Citation finding for specific papers
- Factual information retrieval
**For Complex Analysis (Sonar Reasoning Pro)**:
- Comparative studies and synthesis
- Mechanism explanations
- Controversy evaluation
- Trade-off analysis
- Theoretical frameworks
- Multi-faceted relationships
**Pro Tip**: The automatic selection is optimized for most use cases. Only use `force_model` if you have specific requirements or know the query needs deeper reasoning than detected.
### 2. Specific and Focused Queries
**Good Queries** (will trigger the appropriate model):
- "Randomized controlled trials of mRNA vaccines for cancer treatment 2023-2024" → Sonar Pro Search
- "Compare the efficacy and safety of mRNA vaccines vs traditional vaccines for cancer treatment" → Sonar Reasoning Pro
- "Explain the mechanism by which CRISPR off-target effects occur and strategies to minimize them" → Sonar Reasoning Pro
**Poor Queries**:
- "Tell me about AI" (too broad)
- "Cancer research" (lacks specificity)
- "Latest news" (too vague)
### 3. Structured Query Format
**Recommended Structure**:
```
[Topic] + [Specific Aspect] + [Time Frame] + [Type of Information]
```
**Examples**:
- "CRISPR gene editing + off-target effects + 2024 + clinical trials"
- "Quantum computing + error correction + recent advances + review papers"
- "Renewable energy + solar efficiency + 2023-2024 + statistical data"
### 4. Follow-up Queries
**Effective Follow-ups**:
- "Show me the full citation for the Smith et al. 2024 paper"
- "What are the limitations of this methodology?"
- "Find similar studies using different approaches"
- "What controversies exist in this research area?"
---
## MANDATORY: Save All Results to Sources Folder
**Every research-lookup result MUST be saved to the project's `sources/` folder.**
This is non-negotiable. Research results are expensive to obtain and critical for reproducibility.
### Saving Rules
| Backend | `-o` Flag Target | Filename Pattern |
|---------|-----------------|------------------|
| Parallel Deep Research | `sources/research_<topic>.md` | `research_YYYYMMDD_HHMMSS_<brief_topic>.md` |
| Perplexity (academic) | `sources/papers_<topic>.md` | `papers_YYYYMMDD_HHMMSS_<brief_topic>.md` |
| Batch queries | `sources/batch_<topic>.md` | `batch_research_YYYYMMDD_HHMMSS_<brief_topic>.md` |
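Filenames matching these patterns can be generated mechanically. This is a sketch only: the slug sanitization rule and 40-character cap are assumptions, and `sources_filename` is a hypothetical helper, not part of `research_lookup.py`.

```python
import re
from datetime import datetime

# Prefix per backend, per the table above
PREFIXES = {"parallel": "research", "perplexity": "papers", "batch": "batch_research"}

def sources_filename(backend: str, topic: str, ext: str = "md") -> str:
    """Build a sources/ path following the filename patterns in the table above."""
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")[:40]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"sources/{PREFIXES[backend]}_{stamp}_{slug}.{ext}"
```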
### How to Save
**CRITICAL: Every call to `research_lookup.py` MUST include the `-o` flag pointing to the `sources/` folder.**
**CRITICAL: Saved files MUST preserve all citations, source URLs, and DOIs.** The default text output automatically includes a `Sources` section (with title, date, URL for each source) and an `Additional References` section (with DOIs and academic URLs extracted from the response text). For maximum citation metadata, use `--json`.
```bash
# General research — save to sources/ (includes Sources + Additional References sections)
python research_lookup.py "Recent advances in CRISPR gene editing 2025" \
-o sources/research_20250217_143000_crispr_advances.md
# Academic paper search — save to sources/ (includes paper citations with DOIs)
python research_lookup.py "Find papers on transformer attention mechanisms in NeurIPS 2024" \
-o sources/papers_20250217_143500_transformer_attention.md
# JSON format for maximum citation metadata (full citation objects with URLs, DOIs, snippets)
python research_lookup.py "CRISPR clinical trials" --json \
-o sources/research_20250217_143000_crispr_trials.json
# Forced backend — save to sources/
python research_lookup.py "AI regulation landscape" --force-backend parallel \
-o sources/research_20250217_144000_ai_regulation.md
# Batch queries — save to sources/
python research_lookup.py --batch "mRNA vaccines efficacy" "mRNA vaccines safety" \
-o sources/batch_research_20250217_144500_mrna_vaccines.md
```
### Citation Preservation in Saved Files
Each output format preserves citations differently:
| Format | Citations Included | When to Use |
|--------|-------------------|-------------|
| Text (default) | `Sources (N):` section with `[title] (date) + URL` + `Additional References (N):` with DOIs and academic URLs | Standard use — human-readable with all citations |
| JSON (`--json`) | Full citation objects: `url`, `title`, `date`, `snippet`, `doi`, `type` | When you need maximum citation metadata |
**For Parallel backend**, saved files include: research report + Sources list (title, URL) + Additional References (DOIs, academic URLs).
**For Perplexity backend**, saved files include: academic summary + Sources list (title, date, URL, snippet) + Additional References (DOIs, academic URLs).
**Use `--json` when you need to:**
- Parse citation metadata programmatically
- Preserve full DOI and URL data for BibTeX generation
- Maintain the structured citation objects for cross-referencing
### Why Save Everything
1. **Reproducibility**: Every citation and claim can be traced back to its raw research source
2. **Context Window Recovery**: If context is compacted, saved results can be re-read without re-querying
3. **Audit Trail**: The `sources/` folder documents exactly how all research information was gathered
4. **Reuse Across Sections**: Multiple sections can reference the same saved research without duplicate queries
5. **Cost Efficiency**: Check `sources/` for existing results before making new API calls
6. **Peer Review Support**: Reviewers can verify the research backing every citation
### Before Making a New Query, Check Sources First
Before calling `research_lookup.py`, check if a relevant result already exists:
```bash
ls sources/ # Check existing saved results
```
If a prior lookup covers the same topic, re-read the saved file instead of making a new API call.
### Logging
When saving research results, always log:
```
[HH:MM:SS] SAVED: Research lookup to sources/research_20250217_143000_crispr_advances.md (3,800 words, 8 citations)
[HH:MM:SS] SAVED: Paper search to sources/papers_20250217_143500_transformer_attention.md (6 papers found)
```
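The log-line format above can be produced with a one-line formatter; `log_saved` is a hypothetical helper name, not part of the skill's code.

```python
from datetime import datetime

def log_saved(kind: str, path: str, detail: str) -> str:
    """Format a save-log line matching the examples above."""
    stamp = datetime.now().strftime("%H:%M:%S")
    return f"[{stamp}] SAVED: {kind} to {path} ({detail})"
```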
---
## Integration with Scientific Writing
This skill enhances scientific writing by providing:
1. **Literature Review Support**: Gather current research for introduction and discussion sections
2. **Methods Validation**: Verify protocols and procedures against current standards
3. **Results Contextualization**: Compare findings with recent similar studies
4. **Discussion Enhancement**: Support arguments with latest evidence
5. **Citation Management**: Provide properly formatted citations in multiple styles
## Error Handling and Limitations
**Known Limitations**:
- Information cutoff: Model knowledge is bounded by training data (typically 2023-2024); live search mitigates but may not eliminate this
- Paywall content: May not access full text behind paywalls
- Emerging research: May miss very recent papers not yet indexed
- Specialized databases: Cannot access proprietary or restricted databases
**Error Conditions**:
- API rate limits or quota exceeded
- Network connectivity issues
- Malformed or ambiguous queries
- Model unavailability or maintenance
**Fallback Strategies**:
- Rephrase queries for better clarity
- Break complex queries into simpler components
- Use broader time frames if recent data unavailable
- Cross-reference with multiple query variations
## Usage Examples
### Example 1: Simple Literature Search (Sonar Pro Search)
**Query**: "Recent advances in transformer attention mechanisms 2024"
**Model Selected**: Sonar Pro Search (straightforward lookup)
**Response Includes**:
- Summary of 5 key papers from 2024
- Complete citations with DOIs
- Key innovations and improvements
- Performance benchmarks
- Future research directions
### Example 2: Comparative Analysis (Sonar Reasoning Pro)
**Query**: "Compare and contrast the advantages and limitations of transformer-based models versus traditional RNNs for sequence modeling"
**Model Selected**: Sonar Reasoning Pro (complex analysis required)
**Response Includes**:
- Detailed comparison across multiple dimensions
- Analysis of architectural differences
- Trade-offs in computational efficiency vs performance
- Use case recommendations
- Synthesis of evidence from multiple studies
- Discussion of ongoing debates in the field
### Example 3: Method Verification (Sonar Pro Search)
**Query**: "Standard protocols for flow cytometry analysis"
**Model Selected**: Sonar Pro Search (protocol lookup)
**Response Includes**:
- Step-by-step protocol from recent review
- Required controls and calibrations
- Common pitfalls and troubleshooting
- Reference to definitive methodology paper
- Alternative approaches with pros/cons
### Example 4: Mechanism Explanation (Sonar Reasoning Pro)
**Query**: "Explain the underlying mechanism of how mRNA vaccines trigger immune responses and why they differ from traditional vaccines"
**Model Selected**: Sonar Reasoning Pro (requires causal reasoning)
**Response Includes**:
- Detailed mechanistic explanation
- Step-by-step biological processes
- Comparative analysis with traditional vaccines
- Molecular-level interactions
- Integration of immunology and pharmacology concepts
- Evidence from recent research
### Example 5: Statistical Data (Sonar Pro Search)
**Query**: "Global AI adoption in healthcare statistics 2024"
**Model Selected**: Sonar Pro Search (data lookup)
**Response Includes**:
- Current adoption rates by region
- Market size and growth projections
- Survey methodology and sample size
- Comparison with previous years
- Citations to market research reports
## Performance and Cost Considerations
### Response Times
**Sonar Pro Search**:
- Typical response time: 5-15 seconds
- Best for rapid information gathering
- Suitable for batch queries
**Sonar Reasoning Pro**:
- Typical response time: 15-45 seconds
- Worth the wait for complex analytical queries
- Provides more thorough reasoning and synthesis
### Cost Optimization
**Automatic Selection Benefits**:
- Saves costs by using Sonar Pro Search for straightforward queries
- Reserves Sonar Reasoning Pro for queries that truly benefit from deeper analysis
- Optimizes the balance between cost and quality
**Manual Override Use Cases**:
- Force Sonar Pro Search when budget is constrained and speed is priority
- Force Sonar Reasoning Pro when working on critical research requiring maximum depth
- Use for specific sections of papers (e.g., Pro Search for methods, Reasoning for discussion)
**Best Practices**:
1. Trust the automatic selection for most use cases
2. Review query results - if Sonar Pro Search doesn't provide sufficient depth, rephrase with reasoning keywords
3. Use batch queries strategically - combine simple lookups to minimize total query count
4. For literature reviews, start with Sonar Pro Search for breadth, then use Sonar Reasoning Pro for synthesis
## Security and Ethical Considerations
**Responsible Use**:
- Verify all information against primary sources when possible
- Clearly attribute all data and quotes to original sources
- Avoid presenting AI-generated summaries as original research
- Respect copyright and licensing restrictions
- Use for research assistance, not to bypass paywalls or subscriptions
**Academic Integrity**:
- Always cite original sources, not the AI tool
- Use as a starting point for literature searches
- Follow institutional guidelines for AI tool usage
- Maintain transparency about research methods
1. **Literature Review Support**: Gather current research for introduction and discussion → **save to `sources/`**
2. **Methods Validation**: Verify protocols against current standards → **save to `sources/`**
3. **Results Contextualization**: Compare findings with recent similar studies → **save to `sources/`**
4. **Discussion Enhancement**: Support arguments with latest evidence → **save to `sources/`**
5. **Citation Management**: Provide properly formatted citations → **save to `sources/`**
## Complementary Tools
In addition to research-lookup, the scientific writer has access to **WebSearch** for:
- **Quick metadata verification**: Look up DOIs, publication years, journal names, volume/page numbers
- **Non-academic sources**: News, blogs, technical documentation, current events
- **General information**: Company info, product details, current statistics
- **Cross-referencing**: Verify citation details found through research-lookup
**When to use which tool:**
| Task | Tool |
|------|------|
| Find academic papers | research-lookup |
| Literature search | research-lookup |
| Deep analysis/comparison | research-lookup (Sonar Reasoning Pro) |
| Look up DOI/metadata | WebSearch |
| Verify publication year | WebSearch |
| Find journal volume/pages | WebSearch |
| Current events/news | WebSearch |
| Non-scholarly sources | WebSearch |
| General web search | `parallel-web` skill (`parallel_web.py search`) |
| Citation verification | `parallel-web` skill (`parallel_web.py extract`) |
| Deep research (any topic) | `research-lookup` or `parallel-web` skill |
| Academic paper search | `research-lookup` (auto-routes to Perplexity) |
| Google Scholar search | `citation-management` skill |
| PubMed search | `citation-management` skill |
| DOI to BibTeX | `citation-management` skill |
| Metadata verification | `parallel-web` skill (`parallel_web.py search` or `extract`) |
---
## Error Handling and Limitations
**Known Limitations:**
- Parallel Chat API (core model): Complex queries may take up to 5 minutes
- Perplexity: Information cutoff, may not access full text behind paywalls
- Both: Cannot access proprietary or restricted databases
**Fallback Behavior:**
- If the selected backend's API key is missing, tries the other backend
- If both backends fail, returns structured error response
- Rephrase queries for better results if initial response is insufficient
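The fallback behavior can be sketched as a backend-agnostic wrapper, assuming each backend is a callable returning the result dictionaries described above (with a `success` key); `with_fallback` is an illustrative helper, not the script's actual implementation.

```python
from typing import Any, Callable, Dict

Backend = Callable[[str], Dict[str, Any]]

def with_fallback(primary: Backend, secondary: Backend, query: str) -> Dict[str, Any]:
    """Try the selected backend; fall back to the other; else return a structured error."""
    result = primary(query)
    if result.get("success"):
        return result
    fallback = secondary(query)
    if fallback.get("success"):
        return fallback
    return {
        "success": False,
        "query": query,
        "error": f"both backends failed: {result.get('error')}; {fallback.get('error')}",
    }
```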
---
## Usage Examples
### Example 1: General Research (Routes to Parallel)
**Query**: "Recent advances in transformer attention mechanisms 2025"
**Backend**: Parallel Chat API (core model)
**Response**: Comprehensive markdown report with citations from authoritative sources, covering recent papers, key innovations, and performance benchmarks.
### Example 2: Academic Paper Search (Routes to Perplexity)
**Query**: "Find papers on CRISPR off-target effects in clinical trials"
**Backend**: Perplexity sonar-pro-search (academic mode)
**Response**: Curated list of 5-8 high-impact papers with full citations, DOIs, citation counts, and venue tier indicators.
### Example 3: Comparative Analysis (Routes to Parallel)
**Query**: "Compare and contrast mRNA vaccines vs traditional vaccines for cancer treatment"
**Backend**: Parallel Chat API (core model)
**Response**: Detailed comparative report with data from multiple sources, structured analysis, and cited evidence.
### Example 4: Market Data (Routes to Parallel)
**Query**: "Global AI adoption in healthcare statistics 2025"
**Backend**: Parallel Chat API (core model)
**Response**: Current market data, adoption rates, growth projections, and regional analysis with source citations.
---
## Summary
This skill serves as a powerful research assistant with intelligent dual-model selection:
- **Automatic Intelligence**: Analyzes query complexity and selects the optimal model (Sonar Pro Search or Sonar Reasoning Pro)
- **Cost-Effective**: Uses faster, cheaper Sonar Pro Search for straightforward lookups
- **Deep Analysis**: Automatically engages Sonar Reasoning Pro for complex comparative, analytical, and theoretical queries
- **Flexible Control**: Manual override available when you know exactly what level of analysis you need
- **Academic Focus**: Both models configured to prioritize peer-reviewed sources and scholarly literature
- **Complementary WebSearch**: Use alongside WebSearch for metadata verification and non-academic sources
Whether you need quick fact-finding or deep analytical synthesis, this skill automatically adapts to deliver the right level of research support for your scientific writing needs.
More broadly, it is the primary research interface with intelligent dual-backend routing:
- **Parallel Chat API** (default, `core` model): Comprehensive, multi-source research for any topic
- **Perplexity sonar-pro-search**: Academic-specific paper searches only
- **Automatic routing**: Detects academic queries and routes appropriately
- **Manual override**: Force any backend when needed
- **Complementary**: Works alongside `parallel-web` skill for web search and URL extraction
View File
@@ -0,0 +1,566 @@
#!/usr/bin/env python3
"""
Research Information Lookup Tool
Routes research queries to the best backend:
- Parallel Chat API (core model): Default for all general research queries
- Perplexity sonar-pro-search (via OpenRouter): Academic-specific paper searches
Environment variables:
PARALLEL_API_KEY - Required for Parallel Chat API (primary backend)
OPENROUTER_API_KEY - Required for Perplexity academic searches (fallback)
"""
import os
import sys
import json
import re
import time
import requests
from datetime import datetime
from typing import Any, Dict, List, Optional
class ResearchLookup:
"""Research information lookup with intelligent backend routing.
Routes queries to the Parallel Chat API (default) or Perplexity
sonar-pro-search (academic paper searches only).
"""
ACADEMIC_KEYWORDS = [
"find papers", "find paper", "find articles", "find article",
"cite ", "citation", "citations for",
"doi ", "doi:", "pubmed", "pmid",
"journal article", "peer-reviewed",
"systematic review", "meta-analysis",
"literature search", "literature on",
"academic papers", "academic paper",
"research papers on", "research paper on",
"published studies", "published study",
"scholarly", "scholar",
"arxiv", "preprint",
"foundational papers", "seminal papers", "landmark papers",
"highly cited", "most cited",
]
PARALLEL_SYSTEM_PROMPT = (
"You are a deep research analyst. Provide a comprehensive, well-cited "
"research report on the user's topic. Include:\n"
"- Key findings with specific data, statistics, and quantitative evidence\n"
"- Detailed analysis organized by themes\n"
"- Multiple authoritative sources cited inline\n"
"- Methodologies and implications where relevant\n"
"- Future outlook and research gaps\n"
"Use markdown formatting with clear section headers. "
"Prioritize authoritative and recent sources."
)
CHAT_BASE_URL = "https://api.parallel.ai"
def __init__(self, force_backend: Optional[str] = None):
"""Initialize the research lookup tool.
Args:
force_backend: Force a specific backend ('parallel' or 'perplexity').
If None, backend is auto-selected based on query content.
"""
self.force_backend = force_backend
self.parallel_available = bool(os.getenv("PARALLEL_API_KEY"))
self.perplexity_available = bool(os.getenv("OPENROUTER_API_KEY"))
if not self.parallel_available and not self.perplexity_available:
raise ValueError(
"No API keys found. Set at least one of:\n"
" PARALLEL_API_KEY (for Parallel Chat API - primary)\n"
" OPENROUTER_API_KEY (for Perplexity academic search - fallback)"
)
def _select_backend(self, query: str) -> str:
"""Select the best backend for a query."""
if self.force_backend:
if self.force_backend == "perplexity" and self.perplexity_available:
return "perplexity"
if self.force_backend == "parallel" and self.parallel_available:
return "parallel"
query_lower = query.lower()
is_academic = any(kw in query_lower for kw in self.ACADEMIC_KEYWORDS)
if is_academic and self.perplexity_available:
return "perplexity"
if self.parallel_available:
return "parallel"
if self.perplexity_available:
return "perplexity"
raise ValueError("No backend available. Check API keys.")
# ------------------------------------------------------------------
# Parallel Chat API backend
# ------------------------------------------------------------------
def _get_chat_client(self):
"""Lazy-load and cache the OpenAI client for Parallel Chat API."""
if not hasattr(self, "_chat_client"):
try:
from openai import OpenAI
except ImportError:
raise ImportError(
"The 'openai' package is required for Parallel Chat API.\n"
"Install it with: pip install openai"
)
self._chat_client = OpenAI(
api_key=os.getenv("PARALLEL_API_KEY"),
base_url=self.CHAT_BASE_URL,
)
return self._chat_client
def _parallel_lookup(self, query: str) -> Dict[str, Any]:
"""Run research via the Parallel Chat API (core model)."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
model = "core"
try:
client = self._get_chat_client()
print(f"[Research] Parallel Chat API (model={model})...", file=sys.stderr)
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": self.PARALLEL_SYSTEM_PROMPT},
{"role": "user", "content": query},
],
stream=False,
)
content = ""
if response.choices and len(response.choices) > 0:
content = response.choices[0].message.content or ""
api_citations = self._extract_basis_citations(response)
text_citations = self._extract_citations_from_text(content)
return {
"success": True,
"query": query,
"response": content,
"citations": api_citations + text_citations,
"sources": api_citations,
"timestamp": timestamp,
"backend": "parallel",
"model": f"parallel-chat/{model}",
}
except Exception as e:
return {
"success": False,
"query": query,
"error": str(e),
"timestamp": timestamp,
"backend": "parallel",
"model": f"parallel-chat/{model}",
}
def _extract_basis_citations(self, response) -> List[Dict[str, str]]:
"""Extract citation sources from the Chat API research basis."""
citations = []
basis = getattr(response, "basis", None)
if not basis:
return citations
seen_urls = set()
if isinstance(basis, list):
for item in basis:
cits = (
item.get("citations", []) if isinstance(item, dict)
else getattr(item, "citations", None) or []
)
for cit in cits:
url = cit.get("url", "") if isinstance(cit, dict) else getattr(cit, "url", "")
if url and url not in seen_urls:
seen_urls.add(url)
title = cit.get("title", "") if isinstance(cit, dict) else getattr(cit, "title", "")
excerpts = cit.get("excerpts", []) if isinstance(cit, dict) else getattr(cit, "excerpts", [])
citations.append({
"type": "source",
"url": url,
"title": title,
"excerpts": excerpts,
})
return citations
# ------------------------------------------------------------------
# Perplexity academic search backend
# ------------------------------------------------------------------
def _perplexity_lookup(self, query: str) -> Dict[str, Any]:
"""Run academic search via Perplexity sonar-pro-search through OpenRouter."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
api_key = os.getenv("OPENROUTER_API_KEY")
model = "perplexity/sonar-pro-search"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
"HTTP-Referer": "https://scientific-writer.local",
"X-Title": "Scientific Writer Research Tool",
}
research_prompt = self._format_academic_prompt(query)
messages = [
{
"role": "system",
"content": (
"You are an academic research assistant specializing in finding "
"HIGH-IMPACT, INFLUENTIAL research.\n\n"
"QUALITY PRIORITIZATION (CRITICAL):\n"
"- ALWAYS prefer highly-cited papers over obscure publications\n"
"- ALWAYS prioritize Tier-1 venues: Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS\n"
"- ALWAYS prefer papers from established researchers\n"
"- Include citation counts when known (e.g., 'cited 500+ times')\n"
"- Quality matters more than quantity\n\n"
"VENUE HIERARCHY:\n"
"1. Nature/Science/Cell family, NEJM, Lancet, JAMA (highest)\n"
"2. High-impact specialized journals (IF>10), top conferences (NeurIPS, ICML, ICLR)\n"
"3. Respected field-specific journals (IF 5-10)\n"
"4. Other peer-reviewed sources (only if no better option)\n\n"
"Focus exclusively on scholarly sources. Prioritize recent literature (2020-2026) "
"and provide complete citations with DOIs."
),
},
{"role": "user", "content": research_prompt},
]
data = {
"model": model,
"messages": messages,
"max_tokens": 8000,
"temperature": 0.1,
"search_mode": "academic",
"search_context_size": "high",
}
try:
response = requests.post(
"https://openrouter.ai/api/v1/chat/completions",
headers=headers,
json=data,
timeout=90,
)
response.raise_for_status()
resp_json = response.json()
if "choices" in resp_json and len(resp_json["choices"]) > 0:
choice = resp_json["choices"][0]
if "message" in choice and "content" in choice["message"]:
content = choice["message"]["content"]
api_citations = self._extract_api_citations(resp_json, choice)
text_citations = self._extract_citations_from_text(content)
citations = api_citations + text_citations
return {
"success": True,
"query": query,
"response": content,
"citations": citations,
"sources": api_citations,
"timestamp": timestamp,
"backend": "perplexity",
"model": model,
"usage": resp_json.get("usage", {}),
}
else:
raise Exception("Invalid response format from API")
else:
raise Exception("No response choices received from API")
except Exception as e:
return {
"success": False,
"query": query,
"error": str(e),
"timestamp": timestamp,
"backend": "perplexity",
"model": model,
}
# ------------------------------------------------------------------
# Shared utilities
# ------------------------------------------------------------------
def _format_academic_prompt(self, query: str) -> str:
"""Format a query for academic research results via Perplexity."""
return f"""You are an expert research assistant. Please provide comprehensive, accurate research information for the following query: "{query}"
IMPORTANT INSTRUCTIONS:
1. Focus on ACADEMIC and SCIENTIFIC sources (peer-reviewed papers, reputable journals, institutional research)
2. Include RECENT information (prioritize 2020-2026 publications)
3. Provide COMPLETE citations with authors, title, journal/conference, year, and DOI when available
4. Structure your response with clear sections and proper attribution
5. Be comprehensive but concise - aim for 800-1200 words
6. Include key findings, methodologies, and implications when relevant
7. Note any controversies, limitations, or conflicting evidence
PAPER QUALITY PRIORITIZATION (CRITICAL):
8. ALWAYS prioritize HIGHLY-CITED papers over obscure publications
9. ALWAYS prioritize papers from TOP-TIER VENUES (Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS)
10. PREFER papers from ESTABLISHED, REPUTABLE AUTHORS
11. For EACH citation include when available: citation count, venue tier, author credentials
12. PRIORITIZE papers that DIRECTLY address the research question
RESPONSE FORMAT:
- Start with a brief summary (2-3 sentences)
- Present key findings and studies in organized sections
- Rank papers by impact: most influential/cited first
- End with future directions or research gaps if applicable
- Include 5-8 high-quality citations
Remember: Quality over quantity. Prioritize influential, highly-cited papers from prestigious venues."""
def _extract_api_citations(self, response: Dict[str, Any], choice: Dict[str, Any]) -> List[Dict[str, str]]:
"""Extract citations from Perplexity API response fields."""
citations = []
search_results = (
response.get("search_results")
or choice.get("search_results")
or choice.get("message", {}).get("search_results")
or []
)
for result in search_results:
citation = {
"type": "source",
"title": result.get("title", ""),
"url": result.get("url", ""),
"date": result.get("date", ""),
}
if result.get("snippet"):
citation["snippet"] = result["snippet"]
citations.append(citation)
legacy_citations = (
response.get("citations")
or choice.get("citations")
or choice.get("message", {}).get("citations")
or []
)
for url in legacy_citations:
if isinstance(url, str):
citations.append({"type": "source", "url": url, "title": "", "date": ""})
elif isinstance(url, dict):
citations.append({
"type": "source",
"url": url.get("url", ""),
"title": url.get("title", ""),
"date": url.get("date", ""),
})
return citations
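The extraction order above — structured `search_results` entries first, then the legacy `citations` field — can be exercised standalone against a hypothetical OpenRouter-style payload (the URLs, titles, and snippet below are fabricated for illustration, not real API output):

```python
# Standalone sketch of the extraction logic in _extract_api_citations.
response = {
    "search_results": [
        {"title": "CRISPR base editing review",
         "url": "https://www.nature.com/articles/example",
         "date": "2024-05-01",
         "snippet": "Base editors enable precise edits."},
    ],
    "citations": ["https://pubmed.ncbi.nlm.nih.gov/12345678/"],
}
choice = {"message": {}}

citations = []
# New format: structured search_results, checked across possible locations.
for result in (response.get("search_results")
               or choice.get("search_results")
               or choice.get("message", {}).get("search_results")
               or []):
    entry = {"type": "source", "title": result.get("title", ""),
             "url": result.get("url", ""), "date": result.get("date", "")}
    if result.get("snippet"):
        entry["snippet"] = result["snippet"]
    citations.append(entry)
# Legacy format: a bare list of URL strings.
for url in (response.get("citations") or choice.get("citations")
            or choice.get("message", {}).get("citations") or []):
    if isinstance(url, str):
        citations.append({"type": "source", "url": url, "title": "", "date": ""})

print([c["url"] for c in citations])
```

The `or`-chains mirror the method's defensive lookup: OpenRouter has placed these fields at the top level, on the choice, or inside the message depending on version.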
def _extract_citations_from_text(self, text: str) -> List[Dict[str, str]]:
"""Extract DOIs and academic URLs from response text as fallback."""
citations = []
doi_pattern = r'(?:doi[:\s]*|https?://(?:dx\.)?doi\.org/)(10\.[0-9]{4,}/[^\s\)\]\,\[\<\>]+)'
doi_matches = re.findall(doi_pattern, text, re.IGNORECASE)
seen_dois = set()
for doi in doi_matches:
doi_clean = doi.strip().rstrip(".,;:)]")
if doi_clean and doi_clean not in seen_dois:
seen_dois.add(doi_clean)
citations.append({
"type": "doi",
"doi": doi_clean,
"url": f"https://doi.org/{doi_clean}",
})
url_pattern = (
r'https?://[^\s\)\]\,\<\>\"\']+(?:arxiv\.org|pubmed|ncbi\.nlm\.nih\.gov|'
r'nature\.com|science\.org|wiley\.com|springer\.com|ieee\.org|acm\.org)'
r'[^\s\)\]\,\<\>\"\']*'
)
url_matches = re.findall(url_pattern, text, re.IGNORECASE)
seen_urls = set()
for url in url_matches:
url_clean = url.rstrip(".")
if url_clean not in seen_urls:
seen_urls.add(url_clean)
citations.append({"type": "url", "url": url_clean})
return citations
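The DOI fallback pattern above can be sanity-checked on sample text (the citations in the snippet are fabricated purely to exercise the regex):

```python
import re

# Same DOI pattern as _extract_citations_from_text: matches doi:, DOI:, and
# https://doi.org/ (or dx.doi.org) prefixes, capturing the 10.xxxx/... suffix.
doi_pattern = r'(?:doi[:\s]*|https?://(?:dx\.)?doi\.org/)(10\.[0-9]{4,}/[^\s\)\]\,\[\<\>]+)'

text = (
    "See Smith et al., doi:10.1038/s41586-021-03819-2, and the preprint "
    "at https://doi.org/10.48550/arXiv.2303.08774."
)
# Trailing punctuation is stripped, matching the method's cleanup step.
dois = [m.strip().rstrip(".,;:)]") for m in re.findall(doi_pattern, text, re.IGNORECASE)]
print(dois)  # ['10.1038/s41586-021-03819-2', '10.48550/arXiv.2303.08774']
```

Note that the character class stops a match at whitespace, commas, and brackets, so the `rstrip` only has to handle sentence-final punctuation such as the period after the second DOI.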
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def lookup(self, query: str) -> Dict[str, Any]:
"""Perform a research lookup, routing to the best backend.
Parallel Chat API is used by default. Perplexity sonar-pro-search
is used only for academic-specific queries (paper searches, DOI lookups).
"""
backend = self._select_backend(query)
print(f"[Research] Backend: {backend} | Query: {query[:80]}...", file=sys.stderr)
if backend == "parallel":
return self._parallel_lookup(query)
else:
return self._perplexity_lookup(query)
def batch_lookup(self, queries: List[str], delay: float = 1.0) -> List[Dict[str, Any]]:
"""Perform multiple research lookups with delay between requests."""
results = []
for i, query in enumerate(queries):
if i > 0 and delay > 0:
time.sleep(delay)
result = self.lookup(query)
results.append(result)
print(f"[Research] Completed query {i+1}/{len(queries)}: {query[:50]}...", file=sys.stderr)
return results
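The routing decision `lookup` delegates to `_select_backend` reduces to a keyword scan; a simplified standalone sketch (using only a subset of `ACADEMIC_KEYWORDS`, with availability flags standing in for the API-key checks):

```python
# Simplified sketch of _select_backend: academic-sounding queries route to
# Perplexity, everything else defaults to the Parallel Chat API.
ACADEMIC_KEYWORDS = [
    "find papers", "doi:", "pubmed", "peer-reviewed",
    "systematic review", "meta-analysis", "arxiv", "highly cited",
]

def select_backend(query: str, parallel_ok: bool = True, perplexity_ok: bool = True) -> str:
    q = query.lower()
    if perplexity_ok and any(kw in q for kw in ACADEMIC_KEYWORDS):
        return "perplexity"
    if parallel_ok:
        return "parallel"
    if perplexity_ok:
        return "perplexity"
    raise ValueError("No backend available. Check API keys.")

print(select_backend("find papers on CRISPR gene editing clinical trials"))  # perplexity
print(select_backend("latest advances in quantum computing 2025"))           # parallel
```

This matches the CLI epilog: the "find papers" query auto-routes to Perplexity while the general research query stays on Parallel, unless `--force-backend` overrides the heuristic.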
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
"""Command-line interface for the research lookup tool."""
import argparse
parser = argparse.ArgumentParser(
description="Research Information Lookup Tool (Parallel Chat API + Perplexity)",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# General research (uses Parallel Chat API, core model)
python research_lookup.py "latest advances in quantum computing 2025"
# Academic paper search (auto-routes to Perplexity)
python research_lookup.py "find papers on CRISPR gene editing clinical trials"
# Force a specific backend
python research_lookup.py "topic" --force-backend parallel
python research_lookup.py "topic" --force-backend perplexity
# Save output to file
python research_lookup.py "topic" -o results.txt
# JSON output
python research_lookup.py "topic" --json -o results.json
""",
)
parser.add_argument("query", nargs="?", help="Research query to look up")
parser.add_argument("--batch", nargs="+", help="Run multiple queries")
parser.add_argument(
"--force-backend",
choices=["parallel", "perplexity"],
help="Force a specific backend (default: auto-select)",
)
parser.add_argument("-o", "--output", help="Write output to file")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
output_file = None
if args.output:
output_file = open(args.output, "w", encoding="utf-8")
def write_output(text):
if output_file:
output_file.write(text + "\n")
else:
print(text)
has_parallel = bool(os.getenv("PARALLEL_API_KEY"))
has_perplexity = bool(os.getenv("OPENROUTER_API_KEY"))
if not has_parallel and not has_perplexity:
print("Error: No API keys found. Set at least one:", file=sys.stderr)
print(" export PARALLEL_API_KEY='...' (primary - Parallel Chat API)", file=sys.stderr)
print(" export OPENROUTER_API_KEY='...' (fallback - Perplexity academic)", file=sys.stderr)
if output_file:
output_file.close()
return 1
if not args.query and not args.batch:
parser.print_help()
if output_file:
output_file.close()
return 1
try:
research = ResearchLookup(force_backend=args.force_backend)
if args.batch:
print(f"Running batch research for {len(args.batch)} queries...", file=sys.stderr)
results = research.batch_lookup(args.batch)
else:
print(f"Researching: {args.query}", file=sys.stderr)
results = [research.lookup(args.query)]
if args.json:
write_output(json.dumps(results, indent=2, ensure_ascii=False, default=str))
if output_file:
output_file.close()
return 0
for i, result in enumerate(results):
if result["success"]:
write_output(f"\n{'='*80}")
write_output(f"Query {i+1}: {result['query']}")
write_output(f"Timestamp: {result['timestamp']}")
write_output(f"Backend: {result.get('backend', 'unknown')} | Model: {result.get('model', 'unknown')}")
write_output(f"{'='*80}")
write_output(result["response"])
sources = result.get("sources", [])
if sources:
write_output(f"\nSources ({len(sources)}):")
for j, source in enumerate(sources):
title = source.get("title", "Untitled")
url = source.get("url", "")
date = source.get("date", "")
date_str = f" ({date})" if date else ""
write_output(f" [{j+1}] {title}{date_str}")
if url:
write_output(f" {url}")
citations = result.get("citations", [])
text_citations = [c for c in citations if c.get("type") in ("doi", "url")]
if text_citations:
write_output(f"\nAdditional References ({len(text_citations)}):")
for j, citation in enumerate(text_citations):
if citation.get("type") == "doi":
write_output(f" [{j+1}] DOI: {citation.get('doi', '')} - {citation.get('url', '')}")
elif citation.get("type") == "url":
write_output(f" [{j+1}] {citation.get('url', '')}")
if result.get("usage"):
write_output(f"\nUsage: {result['usage']}")
else:
write_output(f"\nError in query {i+1}: {result['error']}")
if output_file:
output_file.close()
return 0
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
if output_file:
output_file.close()
return 1
if __name__ == "__main__":
sys.exit(main())


@@ -1,208 +1,269 @@
#!/usr/bin/env python3
"""
Research Information Lookup Tool
Uses Perplexity's Sonar Pro Search model through OpenRouter for academic research queries.
Routes research queries to the best backend:
- Parallel Chat API (core model): Default for all general research queries
- Perplexity sonar-pro-search (via OpenRouter): Academic-specific paper searches
Environment variables:
PARALLEL_API_KEY - Required for Parallel Chat API (primary backend)
OPENROUTER_API_KEY - Required for Perplexity academic searches (fallback)
"""
import os
import sys
import json
import requests
import re
import time
import requests
from datetime import datetime
from typing import Dict, List, Optional, Any
from urllib.parse import quote
from typing import Any, Dict, List, Optional
class ResearchLookup:
"""Research information lookup using Perplexity Sonar models via OpenRouter."""
"""Research information lookup with intelligent backend routing.
# Available models
MODELS = {
"pro": "perplexity/sonar-pro", # Fast lookup, cost-effective
"reasoning": "perplexity/sonar-reasoning-pro", # Deep analysis with reasoning
}
Routes queries to the Parallel Chat API (default) or Perplexity
sonar-pro-search (academic paper searches only).
"""
# Keywords that indicate complex queries requiring reasoning model
REASONING_KEYWORDS = [
"compare", "contrast", "analyze", "analysis", "evaluate", "critique",
"versus", "vs", "vs.", "compared to", "differences between", "similarities",
"meta-analysis", "systematic review", "synthesis", "integrate",
"mechanism", "why", "how does", "how do", "explain", "relationship",
"theoretical framework", "implications", "interpret", "reasoning",
"controversy", "conflicting", "paradox", "debate", "reconcile",
"pros and cons", "advantages and disadvantages", "trade-off", "tradeoff",
ACADEMIC_KEYWORDS = [
"find papers", "find paper", "find articles", "find article",
"cite ", "citation", "citations for",
"doi ", "doi:", "pubmed", "pmid",
"journal article", "peer-reviewed",
"systematic review", "meta-analysis",
"literature search", "literature on",
"academic papers", "academic paper",
"research papers on", "research paper on",
"published studies", "published study",
"scholarly", "scholar",
"arxiv", "preprint",
"foundational papers", "seminal papers", "landmark papers",
"highly cited", "most cited",
]
def __init__(self, force_model: Optional[str] = None):
"""
Initialize the research lookup tool.
Args:
force_model: Optional model override ('pro' or 'reasoning').
If None, model is auto-selected based on query complexity.
"""
self.api_key = os.getenv("OPENROUTER_API_KEY")
if not self.api_key:
raise ValueError("OPENROUTER_API_KEY environment variable not set")
PARALLEL_SYSTEM_PROMPT = (
"You are a deep research analyst. Provide a comprehensive, well-cited "
"research report on the user's topic. Include:\n"
"- Key findings with specific data, statistics, and quantitative evidence\n"
"- Detailed analysis organized by themes\n"
"- Multiple authoritative sources cited inline\n"
"- Methodologies and implications where relevant\n"
"- Future outlook and research gaps\n"
"Use markdown formatting with clear section headers. "
"Prioritize authoritative and recent sources."
)
self.base_url = "https://openrouter.ai/api/v1"
self.force_model = force_model
self.headers = {
"Authorization": f"Bearer {self.api_key}",
CHAT_BASE_URL = "https://api.parallel.ai"
def __init__(self, force_backend: Optional[str] = None):
"""Initialize the research lookup tool.
Args:
force_backend: Force a specific backend ('parallel' or 'perplexity').
If None, backend is auto-selected based on query content.
"""
self.force_backend = force_backend
self.parallel_available = bool(os.getenv("PARALLEL_API_KEY"))
self.perplexity_available = bool(os.getenv("OPENROUTER_API_KEY"))
if not self.parallel_available and not self.perplexity_available:
raise ValueError(
"No API keys found. Set at least one of:\n"
" PARALLEL_API_KEY (for Parallel Chat API - primary)\n"
" OPENROUTER_API_KEY (for Perplexity academic search - fallback)"
)
def _select_backend(self, query: str) -> str:
"""Select the best backend for a query."""
if self.force_backend:
if self.force_backend == "perplexity" and self.perplexity_available:
return "perplexity"
if self.force_backend == "parallel" and self.parallel_available:
return "parallel"
query_lower = query.lower()
is_academic = any(kw in query_lower for kw in self.ACADEMIC_KEYWORDS)
if is_academic and self.perplexity_available:
return "perplexity"
if self.parallel_available:
return "parallel"
if self.perplexity_available:
return "perplexity"
raise ValueError("No backend available. Check API keys.")
# ------------------------------------------------------------------
# Parallel Chat API backend
# ------------------------------------------------------------------
def _get_chat_client(self):
"""Lazy-load and cache the OpenAI client for Parallel Chat API."""
if not hasattr(self, "_chat_client"):
try:
from openai import OpenAI
except ImportError:
raise ImportError(
"The 'openai' package is required for Parallel Chat API.\n"
"Install it with: pip install openai"
)
self._chat_client = OpenAI(
api_key=os.getenv("PARALLEL_API_KEY"),
base_url=self.CHAT_BASE_URL,
)
return self._chat_client
def _parallel_lookup(self, query: str) -> Dict[str, Any]:
"""Run research via the Parallel Chat API (core model)."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
model = "core"
try:
client = self._get_chat_client()
print(f"[Research] Parallel Chat API (model={model})...", file=sys.stderr)
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": self.PARALLEL_SYSTEM_PROMPT},
{"role": "user", "content": query},
],
stream=False,
)
content = ""
if response.choices and len(response.choices) > 0:
content = response.choices[0].message.content or ""
api_citations = self._extract_basis_citations(response)
text_citations = self._extract_citations_from_text(content)
return {
"success": True,
"query": query,
"response": content,
"citations": api_citations + text_citations,
"sources": api_citations,
"timestamp": timestamp,
"backend": "parallel",
"model": f"parallel-chat/{model}",
}
except Exception as e:
return {
"success": False,
"query": query,
"error": str(e),
"timestamp": timestamp,
"backend": "parallel",
"model": f"parallel-chat/{model}",
}
def _extract_basis_citations(self, response) -> List[Dict[str, str]]:
"""Extract citation sources from the Chat API research basis."""
citations = []
basis = getattr(response, "basis", None)
if not basis:
return citations
seen_urls = set()
if isinstance(basis, list):
for item in basis:
cits = (
item.get("citations", []) if isinstance(item, dict)
else getattr(item, "citations", None) or []
)
for cit in cits:
url = cit.get("url", "") if isinstance(cit, dict) else getattr(cit, "url", "")
if url and url not in seen_urls:
seen_urls.add(url)
title = cit.get("title", "") if isinstance(cit, dict) else getattr(cit, "title", "")
excerpts = cit.get("excerpts", []) if isinstance(cit, dict) else getattr(cit, "excerpts", [])
citations.append({
"type": "source",
"url": url,
"title": title,
"excerpts": excerpts,
})
return citations
# ------------------------------------------------------------------
# Perplexity academic search backend
# ------------------------------------------------------------------
def _perplexity_lookup(self, query: str) -> Dict[str, Any]:
"""Run academic search via Perplexity sonar-pro-search through OpenRouter."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
api_key = os.getenv("OPENROUTER_API_KEY")
model = "perplexity/sonar-pro-search"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
"HTTP-Referer": "https://scientific-writer.local",
"X-Title": "Scientific Writer Research Tool"
"X-Title": "Scientific Writer Research Tool",
}
def _select_model(self, query: str) -> str:
"""
Select the appropriate model based on query complexity.
Args:
query: The research query
Returns:
Model identifier string
"""
if self.force_model:
return self.MODELS.get(self.force_model, self.MODELS["reasoning"])
# Check for reasoning keywords (case-insensitive)
query_lower = query.lower()
for keyword in self.REASONING_KEYWORDS:
if keyword in query_lower:
return self.MODELS["reasoning"]
# Check for multiple questions or complex structure
question_count = query.count("?")
if question_count >= 2:
return self.MODELS["reasoning"]
# Check for very long queries (likely complex)
if len(query) > 200:
return self.MODELS["reasoning"]
# Default to pro for simple lookups
return self.MODELS["pro"]
research_prompt = self._format_academic_prompt(query)
messages = [
{
"role": "system",
"content": (
"You are an academic research assistant specializing in finding "
"HIGH-IMPACT, INFLUENTIAL research.\n\n"
"QUALITY PRIORITIZATION (CRITICAL):\n"
"- ALWAYS prefer highly-cited papers over obscure publications\n"
"- ALWAYS prioritize Tier-1 venues: Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS\n"
"- ALWAYS prefer papers from established researchers\n"
"- Include citation counts when known (e.g., 'cited 500+ times')\n"
"- Quality matters more than quantity\n\n"
"VENUE HIERARCHY:\n"
"1. Nature/Science/Cell family, NEJM, Lancet, JAMA (highest)\n"
"2. High-impact specialized journals (IF>10), top conferences (NeurIPS, ICML, ICLR)\n"
"3. Respected field-specific journals (IF 5-10)\n"
"4. Other peer-reviewed sources (only if no better option)\n\n"
"Focus exclusively on scholarly sources. Prioritize recent literature (2020-2026) "
"and provide complete citations with DOIs."
),
},
{"role": "user", "content": research_prompt},
]
def _make_request(self, messages: List[Dict[str, str]], model: str, **kwargs) -> Dict[str, Any]:
"""Make a request to the OpenRouter API with academic search mode."""
data = {
"model": model,
"messages": messages,
"max_tokens": 8000,
"temperature": 0.1, # Low temperature for factual research
# Perplexity-specific parameters for academic search
"search_mode": "academic", # Prioritize scholarly sources (peer-reviewed papers, journals)
"search_context_size": "high", # Always use high context for deeper research
**kwargs
"temperature": 0.1,
"search_mode": "academic",
"search_context_size": "high",
}
try:
response = requests.post(
f"{self.base_url}/chat/completions",
headers=self.headers,
"https://openrouter.ai/api/v1/chat/completions",
headers=headers,
json=data,
timeout=90 # Increased timeout for academic search
timeout=90,
)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
raise Exception(f"API request failed: {str(e)}")
resp_json = response.json()
def _format_research_prompt(self, query: str) -> str:
"""Format the query for optimal research results."""
return f"""You are an expert research assistant. Please provide comprehensive, accurate research information for the following query: "{query}"
IMPORTANT INSTRUCTIONS:
1. Focus on ACADEMIC and SCIENTIFIC sources (peer-reviewed papers, reputable journals, institutional research)
2. Include RECENT information (prioritize 2020-2026 publications)
3. Provide COMPLETE citations with authors, title, journal/conference, year, and DOI when available
4. Structure your response with clear sections and proper attribution
5. Be comprehensive but concise - aim for 800-1200 words
6. Include key findings, methodologies, and implications when relevant
7. Note any controversies, limitations, or conflicting evidence
PAPER QUALITY AND POPULARITY PRIORITIZATION (CRITICAL):
8. ALWAYS prioritize HIGHLY-CITED papers over obscure publications:
- Recent papers (0-3 years): prefer 20+ citations, highlight 100+ as highly influential
- Mid-age papers (3-7 years): prefer 100+ citations, highlight 500+ as landmark
- Older papers (7+ years): prefer 500+ citations, highlight 1000+ as foundational
9. ALWAYS prioritize papers from TOP-TIER VENUES:
- Tier 1 (highest priority): Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS, Nature Medicine, Nature Biotechnology
- Tier 2 (high priority): High-impact specialized journals (IF>10), top conferences (NeurIPS, ICML, ICLR for AI/ML)
- Tier 3: Respected specialized journals (IF 5-10)
- Only cite lower-tier venues if directly relevant AND no better source exists
10. PREFER papers from ESTABLISHED, REPUTABLE AUTHORS:
- Senior researchers with high h-index and multiple high-impact publications
- Leading research groups at recognized institutions
- Authors with recognized expertise (awards, editorial positions)
11. For EACH citation, include when available:
- Approximate citation count (e.g., "cited 500+ times")
- Journal/venue tier indicator
- Notable author credentials if relevant
12. PRIORITIZE papers that DIRECTLY address the research question over tangentially related work
RESPONSE FORMAT:
- Start with a brief summary (2-3 sentences)
- Present key findings and studies in organized sections
- Rank papers by impact: most influential/cited first
- End with future directions or research gaps if applicable
- Include 5-8 high-quality citations, emphasizing Tier-1 venues and highly-cited papers
Remember: Quality over quantity. Prioritize influential, highly-cited papers from prestigious venues and established researchers."""
def lookup(self, query: str) -> Dict[str, Any]:
"""Perform a research lookup for the given query."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# Select model based on query complexity
model = self._select_model(query)
# Format the research prompt
research_prompt = self._format_research_prompt(query)
# Prepare messages for the API with system message for academic mode
messages = [
{
"role": "system",
"content": """You are an academic research assistant specializing in finding HIGH-IMPACT, INFLUENTIAL research.
QUALITY PRIORITIZATION (CRITICAL):
- ALWAYS prefer highly-cited papers over obscure publications
- ALWAYS prioritize Tier-1 venues: Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS, and their family journals
- ALWAYS prefer papers from established researchers with strong publication records
- Include citation counts when known (e.g., "cited 500+ times")
- Quality matters more than quantity - 5 excellent papers beats 10 mediocre ones
VENUE HIERARCHY:
1. Nature/Science/Cell family, NEJM, Lancet, JAMA (highest priority)
2. High-impact specialized journals (IF>10), top ML conferences (NeurIPS, ICML, ICLR)
3. Respected field-specific journals (IF 5-10)
4. Other peer-reviewed sources (only if no better option exists)
Focus exclusively on scholarly sources: peer-reviewed journals, academic papers, research institutions. Prioritize recent academic literature (2020-2026) and provide complete citations with DOIs. Always indicate paper impact through citation counts and venue prestige."""
},
{"role": "user", "content": research_prompt}
]
try:
# Make the API request
response = self._make_request(messages, model)
# Extract the response content
if "choices" in response and len(response["choices"]) > 0:
choice = response["choices"][0]
if "choices" in resp_json and len(resp_json["choices"]) > 0:
choice = resp_json["choices"][0]
if "message" in choice and "content" in choice["message"]:
content = choice["message"]["content"]
# Extract citations from API response (Perplexity provides these)
api_citations = self._extract_api_citations(response, choice)
# Also extract citations from text as fallback
api_citations = self._extract_api_citations(resp_json, choice)
text_citations = self._extract_citations_from_text(content)
# Combine: prioritize API citations, add text citations if no duplicates
citations = api_citations + text_citations
return {
@@ -210,10 +271,11 @@ Focus exclusively on scholarly sources: peer-reviewed journals, academic papers,
"query": query,
"response": content,
"citations": citations,
"sources": api_citations, # Separate field for API-provided sources
"sources": api_citations,
"timestamp": timestamp,
"backend": "perplexity",
"model": model,
"usage": response.get("usage", {})
"usage": resp_json.get("usage", {}),
}
else:
raise Exception("Invalid response format from API")
@@ -226,22 +288,54 @@ Focus exclusively on scholarly sources: peer-reviewed journals, academic papers,
"query": query,
"error": str(e),
"timestamp": timestamp,
"model": model
"backend": "perplexity",
"model": model,
}
# ------------------------------------------------------------------
# Shared utilities
# ------------------------------------------------------------------
def _format_academic_prompt(self, query: str) -> str:
"""Format a query for academic research results via Perplexity."""
return f"""You are an expert research assistant. Please provide comprehensive, accurate research information for the following query: "{query}"
IMPORTANT INSTRUCTIONS:
1. Focus on ACADEMIC and SCIENTIFIC sources (peer-reviewed papers, reputable journals, institutional research)
2. Include RECENT information (prioritize 2020-2026 publications)
3. Provide COMPLETE citations with authors, title, journal/conference, year, and DOI when available
4. Structure your response with clear sections and proper attribution
5. Be comprehensive but concise - aim for 800-1200 words
6. Include key findings, methodologies, and implications when relevant
7. Note any controversies, limitations, or conflicting evidence
PAPER QUALITY PRIORITIZATION (CRITICAL):
8. ALWAYS prioritize HIGHLY-CITED papers over obscure publications
9. ALWAYS prioritize papers from TOP-TIER VENUES (Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS)
10. PREFER papers from ESTABLISHED, REPUTABLE AUTHORS
11. For EACH citation include when available: citation count, venue tier, author credentials
12. PRIORITIZE papers that DIRECTLY address the research question
RESPONSE FORMAT:
- Start with a brief summary (2-3 sentences)
- Present key findings and studies in organized sections
- Rank papers by impact: most influential/cited first
- End with future directions or research gaps if applicable
- Include 5-8 high-quality citations
Remember: Quality over quantity. Prioritize influential, highly-cited papers from prestigious venues."""
def _extract_api_citations(self, response: Dict[str, Any], choice: Dict[str, Any]) -> List[Dict[str, str]]:
"""Extract citations from Perplexity API response fields."""
citations = []
# Perplexity returns citations in search_results field (new format)
# Check multiple possible locations where OpenRouter might place them
search_results = (
response.get("search_results") or
choice.get("search_results") or
choice.get("message", {}).get("search_results") or
[]
response.get("search_results")
or choice.get("search_results")
or choice.get("message", {}).get("search_results")
or []
)
for result in search_results:
citation = {
"type": "source",
@@ -249,162 +343,164 @@ Focus exclusively on scholarly sources: peer-reviewed journals, academic papers,
"url": result.get("url", ""),
"date": result.get("date", ""),
}
# Add snippet if available (newer API feature)
if result.get("snippet"):
citation["snippet"] = result.get("snippet")
citation["snippet"] = result["snippet"]
citations.append(citation)
# Also check for legacy citations field (backward compatibility)
legacy_citations = (
response.get("citations") or
choice.get("citations") or
choice.get("message", {}).get("citations") or
[]
response.get("citations")
or choice.get("citations")
or choice.get("message", {}).get("citations")
or []
)
for url in legacy_citations:
if isinstance(url, str):
# Legacy format was just URLs
citations.append({
"type": "source",
"url": url,
"title": "",
"date": ""
})
citations.append({"type": "source", "url": url, "title": "", "date": ""})
elif isinstance(url, dict):
citations.append({
"type": "source",
"url": url.get("url", ""),
"title": url.get("title", ""),
"date": url.get("date", "")
"date": url.get("date", ""),
})
return citations
def _extract_citations_from_text(self, text: str) -> List[Dict[str, str]]:
"""Extract potential citations from the response text as fallback."""
import re
"""Extract DOIs and academic URLs from response text as fallback."""
citations = []
# Look for DOI patterns first (most reliable)
# Matches: doi:10.xxx, DOI: 10.xxx, https://doi.org/10.xxx
doi_pattern = r'(?:doi[:\s]*|https?://(?:dx\.)?doi\.org/)(10\.[0-9]{4,}/[^\s\)\]\,\[\<\>]+)'
doi_matches = re.findall(doi_pattern, text, re.IGNORECASE)
seen_dois = set()
for doi in doi_matches:
# Clean up DOI - remove trailing punctuation and brackets
doi_clean = doi.strip().rstrip('.,;:)]')
doi_clean = doi.strip().rstrip(".,;:)]")
if doi_clean and doi_clean not in seen_dois:
seen_dois.add(doi_clean)
citations.append({
"type": "doi",
"doi": doi_clean,
"url": f"https://doi.org/{doi_clean}"
"url": f"https://doi.org/{doi_clean}",
})
# Look for URLs that might be sources
url_pattern = r'https?://[^\s\)\]\,\<\>\"\']+(?:arxiv\.org|pubmed|ncbi\.nlm\.nih\.gov|nature\.com|science\.org|wiley\.com|springer\.com|ieee\.org|acm\.org)[^\s\)\]\,\<\>\"\']*'
url_pattern = (
r'https?://[^\s\)\]\,\<\>\"\']+(?:arxiv\.org|pubmed|ncbi\.nlm\.nih\.gov|'
r'nature\.com|science\.org|wiley\.com|springer\.com|ieee\.org|acm\.org)'
r'[^\s\)\]\,\<\>\"\']*'
)
url_matches = re.findall(url_pattern, text, re.IGNORECASE)
seen_urls = set()
for url in url_matches:
url_clean = url.rstrip(".")
if url_clean not in seen_urls:
seen_urls.add(url_clean)
citations.append({"type": "url", "url": url_clean})
return citations
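A minimal, self-contained sketch of the DOI-extraction path above (same pattern and punctuation cleanup, pulled out for quick sanity checks):

```python
import re

# Same DOI pattern as _extract_citations_from_text above.
DOI_PATTERN = r'(?:doi[:\s]*|https?://(?:dx\.)?doi\.org/)(10\.[0-9]{4,}/[^\s\)\]\,\[\<\>]+)'

def extract_dois(text):
    seen, out = set(), []
    for doi in re.findall(DOI_PATTERN, text, re.IGNORECASE):
        doi = doi.strip().rstrip(".,;:)]")  # drop trailing punctuation and brackets
        if doi and doi not in seen:
            seen.add(doi)
            out.append(doi)
    return out

sample = "See https://doi.org/10.1038/s41586-021-03819-2, also cited as doi:10.1038/s41586-021-03819-2."
print(extract_dois(sample))  # → ['10.1038/s41586-021-03819-2']
```

Both spellings of the same DOI resolve to one deduplicated entry, since the trailing comma and period are stripped before the `seen` check.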
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def lookup(self, query: str) -> Dict[str, Any]:
"""Perform a research lookup, routing to the best backend.
Parallel Chat API is used by default. Perplexity sonar-pro-search
is used only for academic-specific queries (paper searches, DOI lookups).
"""
backend = self._select_backend(query)
print(f"[Research] Backend: {backend} | Query: {query[:80]}...", file=sys.stderr)
if backend == "parallel":
return self._parallel_lookup(query)
else:
return self._perplexity_lookup(query)
def batch_lookup(self, queries: List[str], delay: float = 1.0) -> List[Dict[str, Any]]:
"""Perform multiple research lookups with delay between requests."""
results = []
for i, query in enumerate(queries):
if i > 0 and delay > 0:
time.sleep(delay)
result = self.lookup(query)
results.append(result)
print(f"[Research] Completed query {i+1}/{len(queries)}: {query[:50]}...", file=sys.stderr)
return results
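`_select_backend()` is not shown in this hunk; one plausible keyword heuristic consistent with the `lookup()` docstring (the hint list and function name are assumptions, not the actual implementation):

```python
# Hypothetical routing heuristic: academic-sounding queries go to Perplexity,
# everything else to the Parallel Chat API.
ACADEMIC_HINTS = ("paper", "papers", "doi", "arxiv", "pubmed", "journal", "preprint")

def select_backend(query):
    q = query.lower()
    return "perplexity" if any(hint in q for hint in ACADEMIC_HINTS) else "parallel"

print(select_backend("find papers on CRISPR gene editing clinical trials"))  # perplexity
print(select_backend("latest advances in quantum computing 2025"))           # parallel
```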
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
"""Command-line interface for testing the research lookup tool."""
"""Command-line interface for the research lookup tool."""
import argparse
import sys
parser = argparse.ArgumentParser(
description="Research Information Lookup Tool (Parallel Chat API + Perplexity)",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
  # General research (uses Parallel Chat API, core model)
  python research_lookup.py "latest advances in quantum computing 2025"

  # Academic paper search (auto-routes to Perplexity)
  python research_lookup.py "find papers on CRISPR gene editing clinical trials"

  # Force a specific backend
  python research_lookup.py "topic" --force-backend parallel
  python research_lookup.py "topic" --force-backend perplexity

  # Save output to file
  python research_lookup.py "topic" -o results.txt

  # JSON output
  python research_lookup.py "topic" --json -o results.json
""",
)
parser.add_argument("query", nargs="?", help="Research query to look up")
parser.add_argument("--model-info", action="store_true", help="Show available models")
parser.add_argument("--batch", nargs="+", help="Run multiple queries")
parser.add_argument("--force-model", choices=["pro", "reasoning"],
help="Force specific model: 'pro' for fast lookup, 'reasoning' for deep analysis")
parser.add_argument("-o", "--output", help="Write output to file instead of stdout")
parser.add_argument("--json", action="store_true", help="Output results as JSON")
parser.add_argument(
"--force-backend",
choices=["parallel", "perplexity"],
help="Force a specific backend (default: auto-select)",
)
parser.add_argument("-o", "--output", help="Write output to file")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
# Set up output destination
output_file = None
if args.output:
output_file = open(args.output, "w", encoding="utf-8")
def write_output(text):
"""Write to file or stdout."""
if output_file:
output_file.write(text + "\n")
else:
print(text)
# Check for API keys
has_parallel = bool(os.getenv("PARALLEL_API_KEY"))
has_perplexity = bool(os.getenv("OPENROUTER_API_KEY"))
if not has_parallel and not has_perplexity:
print("Error: No API keys found. Set at least one:", file=sys.stderr)
print(" export PARALLEL_API_KEY='...' (primary - Parallel Chat API)", file=sys.stderr)
print(" export OPENROUTER_API_KEY='...' (fallback - Perplexity academic)", file=sys.stderr)
if output_file:
output_file.close()
return 1
if not args.query and not args.batch:
parser.print_help()
if output_file:
output_file.close()
return 1
try:
research = ResearchLookup(force_backend=args.force_backend)
if args.batch:
print(f"Running batch research for {len(args.batch)} queries...", file=sys.stderr)
@@ -413,27 +509,24 @@ def main():
print(f"Researching: {args.query}", file=sys.stderr)
results = [research.lookup(args.query)]
# Output as JSON if requested
if args.json:
write_output(json.dumps(results, indent=2, ensure_ascii=False, default=str))
if output_file:
output_file.close()
return 0
# Display results in human-readable format
for i, result in enumerate(results):
if result["success"]:
write_output(f"\n{'='*80}")
write_output(f"Query {i+1}: {result['query']}")
write_output(f"Timestamp: {result['timestamp']}")
write_output(f"Model: {result['model']}")
write_output(f"Backend: {result.get('backend', 'unknown')} | Model: {result.get('model', 'unknown')}")
write_output(f"{'='*80}")
write_output(result["response"])
# Display API-provided sources first (most reliable)
sources = result.get("sources", [])
if sources:
write_output(f"\n📚 Sources ({len(sources)}):")
write_output(f"\nSources ({len(sources)}):")
for j, source in enumerate(sources):
title = source.get("title", "Untitled")
url = source.get("url", "")
@@ -443,11 +536,10 @@ def main():
if url:
write_output(f" {url}")
# Display additional text-extracted citations
citations = result.get("citations", [])
text_citations = [c for c in citations if c.get("type") in ("doi", "url")]
if text_citations:
write_output(f"\n🔗 Additional References ({len(text_citations)}):")
write_output(f"\nAdditional References ({len(text_citations)}):")
for j, citation in enumerate(text_citations):
if citation.get("type") == "doi":
write_output(f" [{j+1}] DOI: {citation.get('doi', '')} - {citation.get('url', '')}")
@@ -464,11 +556,11 @@ def main():
return 0
except Exception as e:
print(f"Error: {str(e)}", file=sys.stderr)
print(f"Error: {e}", file=sys.stderr)
if output_file:
output_file.close()
return 1
if __name__ == "__main__":
sys.exit(main())