mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Added parallel-web skill
Refactor research lookup skill to enhance backend routing and update documentation. The skill now intelligently selects between the Parallel Chat API and Perplexity sonar-pro-search based on query type. Added compatibility notes, license information, and improved descriptions for clarity. Removed outdated example scripts to streamline the codebase.
scientific-skills/parallel-web/SKILL.md (new file, 314 lines)
---
name: parallel-web
description: Search the web, extract URL content, and run deep research using the Parallel Chat API and Extract API. Use for ALL web searches, research queries, and general information gathering. Provides synthesized summaries with citations.
allowed-tools: Read Write Edit Bash
license: MIT
compatibility: PARALLEL_API_KEY required
metadata:
  skill-author: K-Dense Inc.
---
# Parallel Web Systems API

## Overview

This skill provides access to **Parallel Web Systems** APIs for web search, deep research, and content extraction. It is the **primary tool for all web-related operations** in the scientific writer workflow.

**Primary interface:** Parallel Chat API (OpenAI-compatible) for search and research.
**Secondary interface:** Extract API for URL verification and special cases only.

**API Documentation:** https://docs.parallel.ai
**API Key:** https://platform.parallel.ai
**Environment Variable:** `PARALLEL_API_KEY`

## When to Use This Skill

Use this skill for **ALL** of the following:

- **Web Search**: Any query that requires searching the internet for information
- **Deep Research**: Comprehensive research reports on any topic
- **Market Research**: Industry analysis, competitive intelligence, market data
- **Current Events**: News, recent developments, announcements
- **Technical Information**: Documentation, specifications, product details
- **Statistical Data**: Market sizes, growth rates, industry figures
- **General Information**: Company profiles, facts, comparisons

**Use the Extract API only for:**
- Citation verification (confirming a specific URL's content)
- Special cases where you need raw content from a known URL

**Do NOT use this skill for:**
- Academic-specific paper searches (use `research-lookup`, which routes purely academic queries to Perplexity)
- Google Scholar / PubMed database searches (use the `citation-management` skill)

---
## Capabilities

### 1. Web Search (`search` command)

Search the web via the Parallel Chat API (`base` model) and get a **synthesized summary** with cited sources.

**Best for:** General web searches, current events, fact-finding, technical lookups, news, market data.

```bash
# Basic search
python scripts/parallel_web.py search "latest advances in quantum computing 2025"

# Use the core model for more complex queries
python scripts/parallel_web.py search "compare EV battery chemistries NMC vs LFP" --model core

# Save results to file
python scripts/parallel_web.py search "renewable energy policy updates" -o results.txt

# JSON output for programmatic use
python scripts/parallel_web.py search "AI regulation landscape" --json -o results.json
```

**Key Parameters:**
- `objective`: Natural language description of what you want to find
- `--model`: Chat model to use (`base` by default, or `core` for deeper research)
- `-o`: Output file path
- `--json`: Output as JSON

**Response includes:** A synthesized summary organized by theme, with inline citations and a sources list.
### 2. Deep Research (`research` command)

Run comprehensive multi-source research via the Parallel Chat API (`core` model) that produces detailed intelligence reports with citations.

**Best for:** Market research, comprehensive analysis, competitive intelligence, technology surveys, industry reports, any research question requiring synthesis of multiple sources.

```bash
# Default deep research (core model)
python scripts/parallel_web.py research "comprehensive analysis of the global EV battery market"

# Save research report to file
python scripts/parallel_web.py research "AI adoption in healthcare 2025" -o report.md

# Use base model for faster, lighter research
python scripts/parallel_web.py research "latest funding rounds in AI startups" --model base

# JSON output
python scripts/parallel_web.py research "renewable energy storage market in Europe" --json -o data.json
```

**Key Parameters:**
- `query`: Research question or topic
- `--model`: Chat model to use (`core` by default for deep research, or `base` for faster results)
- `-o`: Output file path
- `--json`: Output as JSON
### 3. URL Extraction (`extract` command) — Verification Only

Extract content from specific URLs. **Use only for citation verification and special cases.**

For general research, use `search` or `research` instead.

```bash
# Verify a citation's content
python scripts/parallel_web.py extract "https://example.com/article" --objective "key findings"

# Get full page content for verification
python scripts/parallel_web.py extract "https://docs.example.com/api" --full-content

# Save extraction to file
python scripts/parallel_web.py extract "https://paper-url.com" --objective "methodology" -o extracted.md
```

---
## Model Selection Guide

This skill's script exposes two Chat API research models: use `base` for most searches and `core` for deep research.

| Model  | Latency  | Strengths                                | Use When                             |
|--------|----------|------------------------------------------|--------------------------------------|
| `base` | 15s-100s | Standard research, factual queries       | Web searches, quick lookups          |
| `core` | 60s-5min | Complex research, multi-source synthesis | Deep research, comprehensive reports |

**Recommendations:**
- The `search` command defaults to `base`: fast, good for most queries
- The `research` command defaults to `core`: thorough, good for comprehensive reports
- Override with `--model` when you need a different depth/speed tradeoff

---
## Python API Usage

### Search

```python
from parallel_web import ParallelSearch

searcher = ParallelSearch()
result = searcher.search(
    objective="Find latest information about transformer architectures in NLP",
    model="base",
)

if result["success"]:
    print(result["response"])  # Synthesized summary
    for src in result["sources"]:
        print(f"  {src['title']}: {src['url']}")
```
### Deep Research

```python
from parallel_web import ParallelDeepResearch

researcher = ParallelDeepResearch()
result = researcher.research(
    query="Comprehensive analysis of AI regulation in the EU and US",
    model="core",
)

if result["success"]:
    print(result["response"])  # Full research report
    print(f"Citations: {result['citation_count']}")
```
### Extract (Verification Only)

```python
from parallel_web import ParallelExtract

extractor = ParallelExtract()
result = extractor.extract(
    urls=["https://docs.example.com/api-reference"],
    objective="API authentication methods and rate limits",
)

if result["success"]:
    for r in result["results"]:
        print(r["excerpts"])
```

---
## MANDATORY: Save All Results to Sources Folder

**Every web search and deep research result MUST be saved to the project's `sources/` folder.**

This ensures all research is preserved for reproducibility, auditability, and context window recovery.

### Saving Rules

| Operation     | `-o` Flag Target              | Filename Pattern                            |
|---------------|-------------------------------|---------------------------------------------|
| Web Search    | `sources/search_<topic>.md`   | `search_YYYYMMDD_HHMMSS_<brief_topic>.md`   |
| Deep Research | `sources/research_<topic>.md` | `research_YYYYMMDD_HHMMSS_<brief_topic>.md` |
| URL Extract   | `sources/extract_<source>.md` | `extract_YYYYMMDD_HHMMSS_<brief_source>.md` |

### How to Save (Always Use `-o` Flag)

**CRITICAL: Every call to `parallel_web.py` MUST include the `-o` flag pointing to the `sources/` folder.**

```bash
# Web search — ALWAYS save to sources/
python scripts/parallel_web.py search "latest advances in quantum computing 2025" \
    -o sources/search_20250217_143000_quantum_computing.md

# Deep research — ALWAYS save to sources/
python scripts/parallel_web.py research "comprehensive analysis of the global EV battery market" \
    -o sources/research_20250217_144000_ev_battery_market.md

# URL extraction (verification only) — save to sources/
python scripts/parallel_web.py extract "https://example.com/article" --objective "key findings" \
    -o sources/extract_20250217_143500_example_article.md
```
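The timestamped filenames in the pattern above can be generated instead of typed by hand; a minimal sketch using `date` (the topic value is illustrative):

```bash
# Build a sources/ filename matching the search_YYYYMMDD_HHMMSS_<topic>.md pattern
topic="quantum_computing"
outfile="sources/search_$(date +%Y%m%d_%H%M%S)_${topic}.md"
echo "$outfile"
```

Pass `$outfile` to the `-o` flag of any of the commands above.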
### Why Save Everything

1. **Reproducibility**: Every claim in the final document can be traced back to its raw source material
2. **Context Window Recovery**: If context is compacted mid-task, saved results can be re-read from `sources/`
3. **Audit Trail**: The `sources/` folder provides complete transparency into how information was gathered
4. **Reuse Across Sections**: Saved research can be referenced by multiple sections without duplicate API calls
5. **Cost Efficiency**: Avoid redundant API calls by checking `sources/` for existing results
6. **Peer Review Support**: Reviewers can verify the research backing every claim

### Logging

When saving research results, always log:

```
[HH:MM:SS] SAVED: Search results to sources/search_20250217_143000_quantum_computing.md
[HH:MM:SS] SAVED: Deep research report to sources/research_20250217_144000_ev_battery_market.md
```
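A log line in this format can be emitted directly from the shell; a small sketch (the saved filename is illustrative):

```bash
# Emit a SAVED log line stamped with the current HH:MM:SS time
saved_file="sources/search_20250217_143000_quantum_computing.md"
log_line="[$(date +%H:%M:%S)] SAVED: Search results to $saved_file"
echo "$log_line"
```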
### Before Making a New Query, Check Sources First

Before calling `parallel_web.py`, check whether a relevant result already exists in `sources/`:

```bash
ls sources/  # Check existing saved results
```
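One way to script that check; a sketch that searches saved filenames for a topic keyword (the keyword is illustrative):

```bash
# Look for a previously saved result on the topic before querying again
mkdir -p sources
ls sources/ | grep -i "quantum" || echo "no existing result; run a new search"
```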
---

## Integration with Scientific Writer

### Routing Table

| Task | Tool | Command |
|------|------|---------|
| Web search (any) | `parallel_web.py search` | `python scripts/parallel_web.py search "query" -o sources/search_<topic>.md` |
| Deep research | `parallel_web.py research` | `python scripts/parallel_web.py research "query" -o sources/research_<topic>.md` |
| Citation verification | `parallel_web.py extract` | `python scripts/parallel_web.py extract "url" -o sources/extract_<source>.md` |
| Academic paper search | `research_lookup.py` | Routes to Perplexity sonar-pro-search |
| DOI/metadata lookup | `parallel_web.py extract` | Extract from DOI URLs (verification) |

### When Writing Scientific Documents

1. **Before writing any section**, use `search` or `research` to gather background information — **save results to `sources/`**
2. **For academic citations**, use `research-lookup` (which routes academic queries to Perplexity) — **save results to `sources/`**
3. **For citation verification** (confirming a specific URL), use `parallel_web.py extract` — **save results to `sources/`**
4. **For current market/industry data**, use `parallel_web.py research --model core` — **save results to `sources/`**
5. **Before any new query**, check `sources/` for existing results to avoid duplicate API calls

---
## Environment Setup

```bash
# Required: Set your Parallel API key
export PARALLEL_API_KEY="your_api_key_here"

# Required Python packages
pip install openai        # For Chat API (search/research)
pip install parallel-web  # For Extract API (verification only)
```

Get your API key at https://platform.parallel.ai

---

## Error Handling

The script handles errors gracefully and returns structured error responses:

```json
{
  "success": false,
  "error": "Error description",
  "timestamp": "2025-02-14 12:00:00"
}
```

**Common issues:**
- `PARALLEL_API_KEY not set`: Set the environment variable
- `openai not installed`: Run `pip install openai`
- `parallel-web not installed`: Run `pip install parallel-web` (only needed for extract)
- `Rate limit exceeded`: Wait and retry (default: 300 req/min for Chat API)
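Callers can branch on the `success` field of that structured response; a minimal sketch (the error string is illustrative):

```python
import json

# A structured error response in the shape shown above
raw = '{"success": false, "error": "PARALLEL_API_KEY not set", "timestamp": "2025-02-14 12:00:00"}'
result = json.loads(raw)

if not result["success"]:
    # Surface the failure instead of silently continuing with an empty result
    print(f"parallel_web call failed at {result['timestamp']}: {result['error']}")
```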
---

## Complementary Skills

| Skill | Use For |
|-------|---------|
| `research-lookup` | Academic paper searches (routes to Perplexity for scholarly queries) |
| `citation-management` | Google Scholar, PubMed, CrossRef database searches |
| `literature-review` | Systematic literature reviews across academic databases |
| `scientific-schematics` | Generate diagrams from research findings |

scientific-skills/parallel-web/references/api_reference.md (new file, 244 lines)
# Parallel Web Systems API Quick Reference

**Full Documentation:** https://docs.parallel.ai
**API Key:** https://platform.parallel.ai
**Python SDK:** `pip install parallel-web`
**Environment Variable:** `PARALLEL_API_KEY`

---

## Search API (Beta)

**Endpoint:** `POST https://api.parallel.ai/v1beta/search`
**Header:** `parallel-beta: search-extract-2025-10-10`

### Request

```json
{
  "objective": "Natural language search goal (max 5000 chars)",
  "search_queries": ["keyword query 1", "keyword query 2"],
  "max_results": 10,
  "excerpts": {
    "max_chars_per_result": 10000,
    "max_chars_total": 50000
  },
  "source_policy": {
    "allow_domains": ["example.com"],
    "deny_domains": ["spam.com"],
    "after_date": "2024-01-01"
  }
}
```
### Response

```json
{
  "search_id": "search_...",
  "results": [
    {
      "url": "https://...",
      "title": "Page Title",
      "publish_date": "2025-01-15",
      "excerpts": ["Relevant content..."]
    }
  ]
}
```

### Python SDK

```python
from parallel import Parallel

client = Parallel(api_key="...")
result = client.beta.search(
    objective="...",
    search_queries=["..."],
    max_results=10,
    excerpts={"max_chars_per_result": 10000},
)
```

**Cost:** $5 per 1,000 requests (default 10 results each)
**Rate Limit:** 600 requests/minute

---
## Extract API (Beta)

**Endpoint:** `POST https://api.parallel.ai/v1beta/extract`
**Header:** `parallel-beta: search-extract-2025-10-10`

### Request

```json
{
  "urls": ["https://example.com/page"],
  "objective": "What to focus on",
  "excerpts": true,
  "full_content": false
}
```

### Response

```json
{
  "extract_id": "extract_...",
  "results": [
    {
      "url": "https://...",
      "title": "Page Title",
      "excerpts": ["Focused content..."],
      "full_content": null
    }
  ],
  "errors": []
}
```

### Python SDK

```python
result = client.beta.extract(
    urls=["https://..."],
    objective="...",
    excerpts=True,
    full_content=False,
)
```

**Cost:** $1 per 1,000 URLs
**Rate Limit:** 600 requests/minute

---
## Task API (Deep Research)

**Endpoint:** `POST https://api.parallel.ai/v1/tasks/runs`

### Create Task Run

```json
{
  "input": "Research question (max 15,000 chars)",
  "processor": "pro-fast",
  "task_spec": {
    "output_schema": {
      "type": "text"
    }
  }
}
```

### Response (immediate)

```json
{
  "run_id": "trun_...",
  "status": "queued"
}
```

### Get Result (blocking)

**Endpoint:** `GET https://api.parallel.ai/v1/tasks/runs/{run_id}/result`

### Python SDK

```python
# Text output (markdown report with citations)
from parallel.types import TaskSpecParam

task_run = client.task_run.create(
    input="Research question",
    processor="pro-fast",
    task_spec=TaskSpecParam(output_schema={"type": "text"}),
)
result = client.task_run.result(task_run.run_id, api_timeout=3600)
print(result.output.content)

# Auto-schema output (structured JSON)
task_run = client.task_run.create(
    input="Research question",
    processor="pro-fast",
)
result = client.task_run.result(task_run.run_id, api_timeout=3600)
print(result.output.content)  # structured dict
print(result.output.basis)    # citations per field
```

### Processors

| Processor | Latency | Cost/1000 | Best For |
|-----------|---------|-----------|----------|
| `lite-fast` | 10-20s | $5 | Basic metadata |
| `base-fast` | 15-50s | $10 | Standard enrichments |
| `core-fast` | 15s-100s | $25 | Cross-referenced |
| `core2x-fast` | 15s-3min | $50 | High complexity |
| **`pro-fast`** | **30s-5min** | **$100** | **Default: exploratory research** |
| `ultra-fast` | 1-10min | $300 | Deep multi-source |
| `ultra2x-fast` | 1-20min | $600 | Difficult research |
| `ultra4x-fast` | 1-40min | $1200 | Very difficult |
| `ultra8x-fast` | 1hr | $2400 | Most difficult |

Standard (non-fast) processors have the same cost but higher latency and the freshest data.

---
## Chat API (Beta)

**Endpoint:** `POST https://api.parallel.ai/chat/completions`
**Compatible with the OpenAI SDK.**

### Models

| Model | Latency (TTFT) | Cost/1000 | Use Case |
|-------|----------------|-----------|----------|
| `speed` | ~3s | $5 | Low-latency chat |
| `lite` | 10-60s | $5 | Simple lookups with basis |
| `base` | 15-100s | $10 | Standard research with basis |
| `core` | 1-5min | $25 | Complex research with basis |

### Python SDK (OpenAI-compatible)

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PARALLEL_API_KEY"],  # pass the key value, not the variable name
    base_url="https://api.parallel.ai",
)
response = client.chat.completions.create(
    model="speed",
    messages=[{"role": "user", "content": "What is Parallel Web Systems?"}],
)
```

---
## Rate Limits

| API | Default Limit |
|-----|---------------|
| Search | 600 req/min |
| Extract | 600 req/min |
| Chat | 300 req/min |
| Task | Varies by processor |
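When a call hits one of these limits, a client-side retry with exponential backoff is a common pattern; a generic sketch (`call_api` is a hypothetical stand-in for any of the requests above, returning the script's `{"success": ..., "error": ...}` shape):

```python
import time

def with_backoff(call_api, retries=3, delay=1.0):
    """Retry call_api on rate-limit errors, doubling the wait between attempts."""
    result = {}
    for _ in range(retries):
        result = call_api()
        error = str(result.get("error", "")).lower()
        if result.get("success") or "rate limit" not in error:
            return result
        time.sleep(delay)
        delay *= 2
    return result
```

For the per-minute limits above, a starting delay of about one second is usually sufficient.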
---

## Source Policy

Control which sources are used in searches:

```json
{
  "source_policy": {
    "allow_domains": ["nature.com", "science.org"],
    "deny_domains": ["unreliable-source.com"],
    "after_date": "2024-01-01"
  }
}
```

Works with the Search API and can be used to focus results on specific authoritative domains.

scientific-skills/parallel-web/references/deep_research_guide.md (new file, 362 lines)
# Deep Research Guide

Comprehensive guide to using Parallel's Task API for deep research, including processor selection, output formats, structured schemas, and advanced patterns.

---

## Overview

Deep Research transforms natural language research queries into comprehensive intelligence reports. Unlike simple search, it performs multi-step web exploration across authoritative sources and synthesizes findings with inline citations and confidence levels.

**Key characteristics:**
- Multi-step, multi-source research
- Automatic citation and source attribution
- Structured or text output formats
- Asynchronous processing (30 seconds to 25+ minutes)
- Research basis with confidence levels per finding

---

## Processor Selection

Choosing the right processor is the most important decision. It determines research depth, speed, and cost.

### Decision Matrix

| Scenario | Recommended Processor | Why |
|----------|----------------------|-----|
| Quick background for a paper section | `pro-fast` | Fast, good depth, low cost |
| Comprehensive market research report | `ultra-fast` | Deep multi-source synthesis |
| Simple fact lookup or metadata | `base-fast` | Fast, low cost |
| Competitive landscape analysis | `pro-fast` | Good balance of depth and speed |
| Background for grant proposal | `pro-fast` | Thorough but timely |
| State-of-the-art review for a topic | `ultra-fast` | Maximum source coverage |
| Quick question during writing | `core-fast` | Sub-2-minute response |
| Breaking news or very recent events | `pro` (standard) | Freshest data prioritized |
| Large-scale data enrichment | `base-fast` | Cost-effective at scale |
### Processor Tiers Explained

**`pro-fast`** (default, recommended for most tasks):
- Latency: 30 seconds to 5 minutes
- Depth: Explores 10-20+ web sources
- Best for: Section-level research, background gathering, comparative analysis
- Cost: $0.10 per query

**`ultra-fast`** (for comprehensive research):
- Latency: 1 to 10 minutes
- Depth: Explores 20-50+ web sources, multiple reasoning steps
- Best for: Full reports, market analysis, complex multi-faceted questions
- Cost: $0.30 per query

**`core-fast`** (quick cross-referenced answers):
- Latency: 15 seconds to 100 seconds
- Depth: Cross-references 5-10 sources
- Best for: Moderate complexity questions, verification tasks
- Cost: $0.025 per query

**`base-fast`** (simple enrichment):
- Latency: 15 to 50 seconds
- Depth: Standard web lookup, 3-5 sources
- Best for: Simple factual queries, metadata enrichment
- Cost: $0.01 per query

### Standard vs Fast

- **Fast processors** (`-fast`): 2-5x faster, very fresh data, ideal for interactive use
- **Standard processors** (no suffix): Highest data freshness, better for background jobs

**Rule of thumb:** Always use `-fast` variants unless you specifically need the freshest possible data (breaking news, live financial data, real-time events).
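The rule of thumb above can be encoded in a small helper; a sketch (the function name and depth labels are illustrative groupings of this guide's tiers, not part of the API):

```python
def pick_processor(depth: str, need_freshest: bool = False) -> str:
    """Map a research depth to a processor name, preferring -fast variants."""
    tier = {"simple": "base", "moderate": "core", "standard": "pro", "deep": "ultra"}[depth]
    # Drop the -fast suffix only when the freshest possible data is required
    return tier if need_freshest else f"{tier}-fast"
```

`pick_processor("standard")` yields `pro-fast`; only breaking-news style queries would pass `need_freshest=True` to get the standard `pro`.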
---

## Output Formats

### Text Mode (Markdown Reports)

Returns a comprehensive markdown report with inline citations. Best for human consumption and document integration.

```python
from parallel_web import ParallelDeepResearch

researcher = ParallelDeepResearch()

result = researcher.research(
    query="Comprehensive analysis of mRNA vaccine technology platforms and their applications beyond COVID-19",
    processor="pro-fast",
    description="Focus on clinical trials, approved applications, pipeline developments, and key companies. Include market size data."
)

# result["output"] contains a full markdown report
# result["citations"] contains source URLs with excerpts
```

**When to use text mode:**
- Writing scientific documents (papers, reviews, reports)
- Background research for a topic
- Creating summaries for human readers
- When you need flowing prose, not structured data

**Guiding text output with `description`:**

The `description` parameter steers the report content:

```python
# Focus on specific aspects
result = researcher.research(
    query="Electric vehicle battery technology landscape",
    description="Focus on: (1) solid-state battery progress, (2) charging speed improvements, (3) cost per kWh trends, (4) key patents and IP. Format as a structured report with clear sections."
)

# Control length and depth
result = researcher.research(
    query="AI in drug discovery",
    description="Provide a concise 500-word executive summary covering key applications, notable successes, leading companies, and market projections."
)
```
### Auto-Schema Mode (Structured JSON)

Lets the processor determine the best output structure automatically. Returns structured JSON with per-field citations.

```python
result = researcher.research_structured(
    query="Top 5 cloud computing companies: revenue, market share, key products, and recent developments",
    processor="pro-fast",
)

# result["content"] contains structured data (dict)
# result["basis"] contains per-field citations with confidence
```

**When to use auto-schema:**
- Data extraction and enrichment
- Comparative analysis with specific fields
- When you need programmatic access to individual data points
- Integration with databases or spreadsheets

### Custom JSON Schema

Define exactly what fields you want returned:

```python
schema = {
    "type": "object",
    "properties": {
        "market_size_2024": {
            "type": "string",
            "description": "Global market size in USD billions for 2024. Include source."
        },
        "growth_rate": {
            "type": "string",
            "description": "CAGR percentage for 2024-2030 forecast period."
        },
        "top_companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Company name"},
                    "market_share": {"type": "string", "description": "Approximate market share percentage"},
                    "revenue": {"type": "string", "description": "Most recent annual revenue"}
                },
                "required": ["name", "market_share", "revenue"]
            },
            "description": "Top 5 companies by market share"
        },
        "key_trends": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Top 3-5 industry trends driving growth"
        }
    },
    "required": ["market_size_2024", "growth_rate", "top_companies", "key_trends"],
    "additionalProperties": False
}

result = researcher.research_structured(
    query="Global cybersecurity market analysis",
    output_schema=schema,
)
```

---
## Writing Effective Research Queries

### Query Construction Framework

Structure your query as: **[Topic] + [Specific Aspect] + [Scope/Time] + [Output Expectations]**

**Good queries:**
```
"Comprehensive analysis of the global lithium-ion battery recycling market,
including market size, key players, regulatory drivers, and technology
approaches. Focus on 2023-2025 developments."

"Compare the efficacy, safety profiles, and cost-effectiveness of GLP-1
receptor agonists (semaglutide, tirzepatide, liraglutide) for type 2
diabetes management based on recent clinical trial data."

"Survey of federated learning approaches for healthcare AI, covering
privacy-preserving techniques, real-world deployments, regulatory
compliance, and performance benchmarks from 2023-2025 publications."
```

**Poor queries:**
```
"Tell me about batteries"                            # Too vague
"AI"                                                 # No specific aspect
"What's new?"                                        # No topic at all
"Everything about quantum computing from all time"   # Too broad
```

### Tips for Better Results

1. **Be specific about what you need**: "market size" vs "tell me about the market"
2. **Include time bounds**: "2024-2025" narrows to relevant data
3. **Name entities**: "semaglutide vs tirzepatide" vs "diabetes drugs"
4. **Specify output expectations**: "Include statistics, key players, and growth projections"
5. **Keep under 15,000 characters**: Concise queries work better than massive prompts

---
## Working with Research Basis

Every deep research result includes a **basis**: citations, reasoning, and confidence levels for each finding.

### Text Mode Basis

```python
result = researcher.research(query="...", processor="pro-fast")

# Citations are deduplicated and include URLs + excerpts
for citation in result["citations"]:
    print(f"Source: {citation['title']}")
    print(f"URL: {citation['url']}")
    if citation.get("excerpts"):
        print(f"Excerpt: {citation['excerpts'][0][:200]}")
```

### Structured Mode Basis

```python
result = researcher.research_structured(query="...", processor="pro-fast")

for basis_entry in result["basis"]:
    print(f"Field: {basis_entry['field']}")
    print(f"Confidence: {basis_entry['confidence']}")
    print(f"Reasoning: {basis_entry['reasoning']}")
    for cit in basis_entry["citations"]:
        print(f"  Source: {cit['url']}")
```

### Confidence Levels

| Level | Meaning | Action |
|-------|---------|--------|
| `high` | Multiple authoritative sources agree | Use directly |
| `medium` | Some supporting evidence, minor uncertainty | Use with a caveat |
| `low` | Limited evidence, significant uncertainty | Verify independently |
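These levels can be acted on programmatically when post-processing a structured basis; a minimal sketch with illustrative entries:

```python
# Partition findings by confidence so low-confidence fields get re-verified
basis = [
    {"field": "market_size_2024", "confidence": "high"},
    {"field": "growth_rate", "confidence": "low"},
    {"field": "key_trends", "confidence": "medium"},
]

usable = [b["field"] for b in basis if b["confidence"] in ("high", "medium")]
to_verify = [b["field"] for b in basis if b["confidence"] == "low"]
```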
|
||||
|
||||
---

## Advanced Patterns

### Multi-Stage Research

Use different processors in sequence for progressively deeper research:

```python
# Stage 1: Quick overview with base-fast
overview = researcher.research(
    query="What are the main approaches to quantum error correction?",
    processor="base-fast",
)

# Stage 2: Deep dive on the most promising approach
deep_dive = researcher.research(
    query=f"Detailed analysis of surface code quantum error correction: "
          f"recent breakthroughs, implementation challenges, and leading research groups. "
          f"Context: {overview['output'][:500]}",
    processor="pro-fast",
)
```

### Comparative Research

```python
result = researcher.research(
    query="Compare and contrast three leading large language model architectures: "
          "GPT-4, Claude, and Gemini. Cover architecture differences, benchmark performance, "
          "pricing, context window, and unique capabilities. Include specific benchmark scores.",
    processor="pro-fast",
    description="Create a structured comparison with a summary table. Include specific numbers and benchmarks."
)
```

### Research with Follow-Up Extraction

```python
from parallel_web import ParallelExtract

# Step 1: Research to find relevant sources
research_result = researcher.research(
    query="Most influential papers on attention mechanisms in 2024",
    processor="pro-fast",
)

# Step 2: Extract full content from the most relevant sources
extractor = ParallelExtract()

key_urls = [c["url"] for c in research_result["citations"][:5]]
for url in key_urls:
    extracted = extractor.extract(
        urls=[url],
        objective="Key methodology, results, and conclusions",
    )
```

---

## Performance Optimization

### Reducing Latency

1. **Use `-fast` processors**: 2-5x faster than standard
2. **Use `core-fast` for moderate queries**: Sub-2-minute for most questions
3. **Be specific in queries**: Vague queries require more exploration
4. **Set appropriate timeouts**: Don't over-wait

### Reducing Cost

1. **Start with `base-fast`**: Upgrade only if depth is insufficient
2. **Use `core-fast` for moderate complexity**: $0.025 vs $0.10 for pro
3. **Batch related queries**: One well-crafted query > multiple simple ones
4. **Cache results**: Store research output for reuse across sections

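Caching (tip 4) can be as simple as keying an on-disk store by query and processor. A sketch, assuming `run` is any callable with the `research(query=..., processor=...)` signature used elsewhere in this guide:

```python
import hashlib
import json
from pathlib import Path

def cached_research(query, processor, run, cache_dir=".research_cache"):
    """Return a cached result for (query, processor) if present; else run and store."""
    key = hashlib.sha256(f"{processor}:{query}".encode()).hexdigest()[:16]
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())   # cache hit: no API cost
    result = run(query=query, processor=processor)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Reusing the same cached output across sections avoids paying for the same `pro-fast` run twice.
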
### Maximizing Quality

1. **Use `pro-fast` or `ultra-fast`**: More sources = better synthesis
2. **Provide context**: "I'm writing a paper for Nature Medicine about..."
3. **Use `description` parameter**: Guide the output structure and focus
4. **Verify critical findings**: Cross-check with Search API or Extract

---

## Common Mistakes

| Mistake | Impact | Fix |
|---------|--------|-----|
| Query too vague | Scattered, unfocused results | Add specific aspects and time bounds |
| Query too long (>15K chars) | API rejection or degraded results | Summarize context, focus on key question |
| Wrong processor | Too slow or too shallow | Use decision matrix above |
| Not using `description` | Report structure not aligned with needs | Add description to guide output |
| Ignoring confidence levels | Using low-confidence data as fact | Check basis confidence before citing |
| Not verifying citations | Risk of outdated or misattributed data | Cross-check key citations with Extract |

---

## See Also

- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For quick web searches
- [Extraction Patterns](extraction_patterns.md) - For reading specific URLs
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns

338
scientific-skills/parallel-web/references/extraction_patterns.md
Normal file
@@ -0,0 +1,338 @@

# Extraction Patterns

Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.

---

## Overview

The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.

**Key capabilities:**
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing

---

## When to Use Extract vs Search

| Scenario | Use Extract | Use Search |
|----------|-------------|------------|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |

**Rule of thumb:** If you have a URL, use Extract. If you need to find URLs, use Search.

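The rule of thumb reduces to a one-line dispatch when routing tasks programmatically. A sketch with illustrative function names:

```python
import re

def choose_tool(task):
    """Apply the rule of thumb: have a URL -> Extract; need URLs -> Search."""
    return "extract" if re.match(r"https?://", task.strip()) else "search"

choose_tool("https://arxiv.org/abs/1706.03762")   # -> "extract"
choose_tool("recent CRISPR diagnostics reviews")  # -> "search"
```
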
---

## Excerpt Mode vs Full Content Mode

### Excerpt Mode (Default)

Returns focused content aligned to your objective. Smaller token footprint, higher relevance.

```python
from parallel_web import ParallelExtract

extractor = ParallelExtract()

result = extractor.extract(
    urls=["https://arxiv.org/abs/2301.12345"],
    objective="Key methodology and experimental results",
    excerpts=True,       # Default
    full_content=False,  # Default
)
```

**Best for:**
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points

### Full Content Mode

Returns the complete page content as clean markdown.

```python
result = extractor.extract(
    urls=["https://docs.example.com/api-reference"],
    objective="Complete API documentation",
    excerpts=False,
    full_content=True,
)
```

**Best for:**
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content

### Both Modes

You can request both excerpts and full content:

```python
result = extractor.extract(
    urls=["https://example.com/report"],
    objective="Executive summary and key recommendations",
    excerpts=True,
    full_content=True,
)

# Use excerpts for focused analysis
# Use full_content for complete reference
```

---

## Objective Writing for Extraction

The `objective` parameter focuses extraction on relevant content. It dramatically improves excerpt quality.

### Good Objectives

```python
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"

# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"

# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
```

### Poor Objectives

```python
# Too vague
objective="Tell me about this page"

# No objective at all (still works, but excerpts are less focused)
extractor.extract(urls=["https://..."])
```

### Objective Templates by Use Case

**Academic Paper:**
```python
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
```

**Product/Company Page:**
```python
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
```

**Technical Documentation:**
```python
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
```

**News Article:**
```python
objective="Main story, key quotes, data points, timeline of events, and named sources"
```

**Government/Policy Document:**
```python
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
```

---

## Batch Extraction

Extract from multiple URLs in a single call:

```python
result = extractor.extract(
    urls=[
        "https://nature.com/articles/s12345",
        "https://science.org/doi/full/10.1234/science.xyz",
        "https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext"
    ],
    objective="Key findings, sample sizes, and statistical results from each study",
)

# Results are returned in the same order as input URLs
for r in result["results"]:
    print(f"=== {r['title']} ===")
    print(f"URL: {r['url']}")
    for excerpt in r["excerpts"]:
        print(excerpt[:500])
```

**Batch limits:**
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the `errors` field without blocking successful ones

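Because failures land in `errors` without blocking the rest, it is convenient to re-join results and errors by input URL before processing. A sketch that assumes the response shape shown above plus an `errors` list of `{url, message}` entries (the exact error shape is an assumption):

```python
def index_batch_results(urls, response):
    """Map each requested URL to its extraction result or its error record."""
    ok = {r["url"]: r for r in response.get("results", [])}
    failed = {e["url"]: e for e in response.get("errors", [])}
    # Every input URL resolves to either a result or an error record
    return {u: ok.get(u, failed.get(u)) for u in urls}
```
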
---

## Handling Different Content Types

### Web Pages (HTML)

Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.

```python
# Standard web page
result = extractor.extract(
    urls=["https://example.com/article"],
    objective="Main article content",
)
```

### PDFs

PDFs are automatically detected and converted to text.

```python
# PDF extraction
result = extractor.extract(
    urls=["https://example.com/whitepaper.pdf"],
    objective="Executive summary and key recommendations",
)
```

### Documentation Sites

Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.

```python
result = extractor.extract(
    urls=["https://docs.example.com/getting-started"],
    objective="Installation instructions and quickstart guide",
    full_content=True,
)
```

---

## Common Extraction Patterns

### Pattern 1: Search Then Extract

Find relevant pages with Search, then extract full content from the best results.

```python
from parallel_web import ParallelSearch, ParallelExtract

searcher = ParallelSearch()
extractor = ParallelExtract()

# Step 1: Find relevant pages
search_result = searcher.search(
    objective="Find the original transformer paper and its key follow-up papers",
    search_queries=["attention is all you need paper", "transformer architecture paper"],
)

# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
    urls=top_urls,
    objective="Abstract, architecture description, key results, and ablation studies",
)
```

### Pattern 2: DOI Resolution and Paper Reading

```python
# Extract content from a DOI URL
result = extractor.extract(
    urls=["https://doi.org/10.1038/s41586-024-07487-w"],
    objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
```

### Pattern 3: Competitive Intelligence from Company Pages

```python
companies = [
    "https://openai.com/about",
    "https://anthropic.com/company",
    "https://deepmind.google/about/",
]

result = extractor.extract(
    urls=companies,
    objective="Company mission, team size, key products, recent announcements, and funding information",
)
```

### Pattern 4: Documentation Extraction for Reference

```python
result = extractor.extract(
    urls=["https://docs.parallel.ai/search/search-quickstart"],
    objective="Complete API usage guide including request format, response format, and code examples",
    full_content=True,
)
```

### Pattern 5: Metadata Verification

```python
# Verify citation metadata for a specific paper
result = extractor.extract(
    urls=["https://doi.org/10.1234/example-doi"],
    objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
```

---

## Error Handling

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| URL not accessible | Page requires authentication, is behind a paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |

### Checking for Errors

```python
result = extractor.extract(urls=["https://example.com/page"])

if not result["success"]:
    print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
    print(f"Some URLs failed: {result['errors']}")
else:
    print(f"Successfully extracted {len(result['results'])} pages")
```

---

## Tips and Best Practices

1. **Always provide an objective**: Even a general one improves excerpt quality significantly
2. **Use excerpts by default**: Full content is only needed when you truly need everything
3. **Batch related URLs**: One call with 5 URLs is better than 5 separate calls
4. **Check for errors**: Not all URLs are extractable (paywalls, auth, etc.)
5. **Combine with Search**: Search finds URLs, Extract reads them in detail
6. **Use for DOI resolution**: Extract handles DOI redirects automatically
7. **Prefer Extract over manual fetching**: Handles JavaScript, PDFs, and complex layouts

---

## See Also

- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For finding URLs to extract
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns

@@ -0,0 +1,297 @@

# Search API Best Practices

Comprehensive guide to getting the best results from Parallel's Search API.

---

## Core Concepts

The Search API returns ranked, LLM-optimized excerpts from web sources based on natural language objectives. Results are designed to serve directly as model input, enabling faster reasoning and higher-quality completions.

### Key Advantages Over Traditional Search

- **Context engineering for token efficiency**: Results are ranked by reasoning utility, not engagement
- **Single-hop resolution**: Complex multi-topic queries resolved in one request
- **Multi-hop efficiency**: Deep research workflows complete in fewer tool calls

---

## Crafting Effective Search Queries

### Provide Both `objective` AND `search_queries`

The `objective` describes your broader goal; `search_queries` ensures specific keywords are prioritized. Using both together gives significantly better results.

**Good:**
```python
searcher.search(
    objective="I'm writing a literature review on Alzheimer's treatments. Find peer-reviewed research papers and clinical trial results from the past 2 years on amyloid-beta targeted therapies.",
    search_queries=[
        "amyloid beta clinical trials 2024-2025",
        "Alzheimer's monoclonal antibody treatment results",
        "lecanemab donanemab trial outcomes"
    ],
)
```

**Poor:**
```python
# Too vague - no context about intent
searcher.search(objective="Alzheimer's treatment")

# Missing objective - no context for ranking
searcher.search(search_queries=["Alzheimer's drugs"])
```

### Objective Writing Tips

1. **State your broader task**: "I'm writing a research paper on...", "I'm analyzing the market for...", "I'm preparing a presentation about..."
2. **Be specific about source preferences**: "Prefer official government websites", "Focus on peer-reviewed journals", "From major news outlets"
3. **Include freshness requirements**: "From the past 6 months", "Published in 2024-2025", "Most recent data available"
4. **Specify content type**: "Technical documentation", "Clinical trial results", "Market analysis reports", "Product announcements"

### Example Objectives by Use Case

**Academic Research:**
```
"I'm writing a literature review on CRISPR gene editing applications in cancer therapy.
Find peer-reviewed papers from Nature, Science, Cell, and other high-impact journals
published in 2023-2025. Prefer clinical trial results and systematic reviews."
```

**Market Intelligence:**
```
"I'm preparing Q1 2025 investor materials for a fintech startup.
Find recent announcements from the Federal Reserve and SEC about digital asset
regulations and banking partnerships with crypto firms. Past 3 months only."
```

**Technical Documentation:**
```
"I'm designing a machine learning course. Find technical documentation and API guides
that explain how transformer attention mechanisms work, preferably from official
framework documentation like PyTorch or Hugging Face."
```

**Current Events:**
```
"I'm tracking AI regulation developments. Find official policy announcements,
legislative actions, and regulatory guidance from the EU, US, and UK governments
from the past month."
```

---

## Search Modes

Use the `mode` parameter to optimize for your workflow:

| Mode | Best For | Excerpt Style | Latency |
|------|----------|---------------|---------|
| `one-shot` (default) | Direct queries, single-request workflows | Comprehensive, longer | Lower |
| `agentic` | Multi-step reasoning loops, agent workflows | Concise, token-efficient | Slightly higher |
| `fast` | Real-time applications, UI auto-complete | Minimal, speed-optimized | ~1 second |

### When to Use Each Mode

**`one-shot`** (default):
- Single research question that needs a comprehensive answer
- Writing a section of a paper and need full context
- Background research before starting a document
- Any case where you'll make only one search call

**`agentic`**:
- Multi-step research workflows (search → analyze → search again)
- Agent loops where token efficiency matters
- Iterative refinement of research queries
- When integrating with other tools (search → extract → synthesize)

**`fast`**:
- Live autocomplete or suggestion systems
- Quick fact-checking during writing
- Real-time metadata lookups
- Any latency-sensitive application

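The guidance above collapses to a small selector. This sketch encodes one reasonable priority order (latency first, then multi-step); it is a reading of the lists, not an API rule:

```python
def pick_mode(latency_sensitive=False, multi_step=False):
    """Choose a Search API mode per the guidance above."""
    if latency_sensitive:
        return "fast"      # real-time, UI-facing lookups
    if multi_step:
        return "agentic"   # token-efficient excerpts for agent loops
    return "one-shot"      # comprehensive single-call research
```
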
---

## Source Policy

Control which domains are included or excluded from results:

```python
searcher.search(
    objective="Find clinical trial results for new cancer immunotherapy drugs",
    search_queries=["checkpoint inhibitor clinical trials 2025"],
    source_policy={
        "allow_domains": ["clinicaltrials.gov", "nejm.org", "thelancet.com", "nature.com"],
        "deny_domains": ["reddit.com", "quora.com"],
        "after_date": "2024-01-01"
    },
)
```

### Source Policy Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `allow_domains` | list[str] | Only include results from these domains |
| `deny_domains` | list[str] | Exclude results from these domains |
| `after_date` | str (YYYY-MM-DD) | Only include content published after this date |

### Domain Lists by Use Case

**Academic Research:**
```python
allow_domains = [
    "nature.com", "science.org", "cell.com", "thelancet.com",
    "nejm.org", "bmj.com", "pnas.org", "arxiv.org",
    "pubmed.ncbi.nlm.nih.gov", "scholar.google.com"
]
```

**Technology/AI:**
```python
allow_domains = [
    "arxiv.org", "openai.com", "anthropic.com", "deepmind.google",
    "huggingface.co", "pytorch.org", "tensorflow.org",
    "proceedings.neurips.cc", "proceedings.mlr.press"
]
```

**Market Intelligence:**
```python
deny_domains = [
    "reddit.com", "quora.com", "medium.com",
    "wikipedia.org"  # Good for facts, not for market data
]
```

**Government/Policy:**
```python
allow_domains = [
    "gov", "europa.eu", "who.int", "worldbank.org",
    "imf.org", "oecd.org", "un.org"
]
```

---

## Controlling Result Volume

### `max_results` Parameter

- Range: 1-20 (default: 10)
- More results = broader coverage but more tokens to process
- Fewer results = more focused but may miss relevant sources

**Recommendations:**
- Quick fact check: `max_results=3`
- Standard research: `max_results=10` (default)
- Comprehensive survey: `max_results=20`

### Excerpt Length Control

```python
searcher.search(
    objective="...",
    max_chars_per_result=10000,  # Default: 10000
)
```

- **Short excerpts (1000-3000)**: Quick summaries, metadata extraction
- **Medium excerpts (5000-10000)**: Standard research, balanced depth
- **Long excerpts (10000-50000)**: Full article content, deep analysis

---

## Common Patterns

### Pattern 1: Research Before Writing

```python
# Before writing each section, search for relevant information
result = searcher.search(
    objective="Find recent advances in transformer attention mechanisms for a NeurIPS paper introduction",
    search_queries=["attention mechanism innovations 2024", "efficient transformers"],
    max_results=10,
)

# Extract key findings for the section
for r in result["results"]:
    print(f"Source: {r['title']} ({r['url']})")
    # Use excerpts to inform writing
```

### Pattern 2: Fact Verification

```python
# Quick verification of a specific claim
result = searcher.search(
    objective="Verify: Did GPT-4 achieve 86.4% on MMLU benchmark?",
    search_queries=["GPT-4 MMLU benchmark score"],
    max_results=5,
)
```

### Pattern 3: Competitive Intelligence

```python
result = searcher.search(
    objective="Find recent product launches and funding announcements for AI coding assistants in 2025",
    search_queries=[
        "AI coding assistant funding 2025",
        "code generation tool launch",
        "AI developer tools new product"
    ],
    source_policy={"after_date": "2025-01-01"},
    max_results=15,
)
```

### Pattern 4: Multi-Language Research

```python
# Search includes multilingual results automatically
result = searcher.search(
    objective="Find global perspectives on AI regulation, including EU, China, and US approaches",
    search_queries=[
        "EU AI Act implementation 2025",
        "China AI regulation policy",
        "US AI executive order updates"
    ],
)
```

---

## Troubleshooting

### Few or No Results

- **Broaden your objective**: Remove overly specific constraints
- **Add more search queries**: Different phrasings of the same concept
- **Remove source policy**: Domain restrictions may be too narrow
- **Check date filters**: `after_date` may be too recent

### Irrelevant Results

- **Make objective more specific**: Add context about your task
- **Use source policy**: Allow only authoritative domains
- **Add negative context**: "Not about [unrelated topic]"
- **Refine search queries**: Use more precise keywords

### Too Many Tokens in Results

- **Reduce `max_results`**: From 10 to 5 or 3
- **Reduce excerpt length**: Lower `max_chars_per_result`
- **Use `agentic` mode**: More concise excerpts
- **Use `fast` mode**: Minimal excerpts

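The "few or no results" fixes can be applied programmatically by retrying with progressively looser constraints. A sketch, assuming a `search(...)` callable with the keyword arguments used throughout this guide; the loosening order is illustrative:

```python
def search_with_fallback(search, objective, queries, policies):
    """Retry with progressively looser source policies until results appear."""
    result = {}
    for policy in list(policies) + [None]:  # final attempt: no restrictions
        result = search(objective=objective, search_queries=queries,
                        source_policy=policy)
        if result.get("results"):
            return result
    return result  # last (unrestricted) response, even if empty
```
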
---

## See Also

- [API Reference](api_reference.md) - Complete API parameter reference
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Extraction Patterns](extraction_patterns.md) - For reading specific URLs
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns

456
scientific-skills/parallel-web/references/workflow_recipes.md
Normal file
@@ -0,0 +1,456 @@

# Workflow Recipes

Common multi-step patterns combining Parallel's Search, Extract, and Deep Research APIs for scientific writing tasks.

---

## Recipe Index

| Recipe | APIs Used | Time | Use Case |
|--------|-----------|------|----------|
| [Section Research Pipeline](#recipe-1-section-research-pipeline) | Research + Search | 2-5 min | Writing a paper section |
| [Citation Verification](#recipe-2-citation-verification) | Search + Extract | 1-2 min | Verifying paper metadata |
| [Literature Survey](#recipe-3-literature-survey) | Research + Search + Extract | 5-15 min | Comprehensive lit review |
| [Market Intelligence Report](#recipe-4-market-intelligence-report) | Research (multi-stage) | 10-30 min | Market/industry analysis |
| [Competitive Analysis](#recipe-5-competitive-analysis) | Search + Extract + Research | 5-10 min | Comparing companies/products |
| [Fact-Check Pipeline](#recipe-6-fact-check-pipeline) | Search + Extract | 1-3 min | Verifying claims |
| [Current Events Briefing](#recipe-7-current-events-briefing) | Search + Research | 3-5 min | News synthesis |
| [Technical Documentation Gathering](#recipe-8-technical-documentation-gathering) | Search + Extract | 2-5 min | API/framework docs |
| [Grant Background Research](#recipe-9-grant-background-research) | Research + Search | 5-10 min | Grant proposal background |

---

## Recipe 1: Section Research Pipeline

**Goal:** Gather research and citations for writing a single section of a scientific paper.

**APIs:** Deep Research (pro-fast) + Search

```bash
# Step 1: Deep research for comprehensive background
python scripts/parallel_web.py research \
    "Recent advances in federated learning for healthcare AI, focusing on privacy-preserving training methods, real-world deployments, and regulatory considerations (2023-2025)" \
    --processor pro-fast -o sources/section_background.md

# Step 2: Targeted search for specific citations
python scripts/parallel_web.py search \
    "Find peer-reviewed papers on federated learning in hospitals" \
    --queries "federated learning clinical deployment" "privacy preserving ML healthcare" \
    --max-results 10 -o sources/section_citations.txt
```

**Python version:**
```python
from parallel_web import ParallelDeepResearch, ParallelSearch

researcher = ParallelDeepResearch()
searcher = ParallelSearch()

# Step 1: Deep background research
background = researcher.research(
    query="Recent advances in federated learning for healthcare AI (2023-2025): "
          "privacy-preserving methods, real-world deployments, regulatory landscape",
    processor="pro-fast",
    description="Structure as: (1) Key approaches, (2) Clinical deployments, "
                "(3) Regulatory considerations, (4) Open challenges. Include statistics."
)

# Step 2: Find specific papers to cite
papers = searcher.search(
    objective="Find recent peer-reviewed papers on federated learning deployed in hospital settings",
    search_queries=[
        "federated learning hospital clinical study 2024",
        "privacy preserving machine learning healthcare deployment"
    ],
    source_policy={"allow_domains": ["nature.com", "thelancet.com", "arxiv.org", "pubmed.ncbi.nlm.nih.gov"]},
)

# Combine: use background for writing, papers for citations
```

**When to use:** Before writing each major section of a research paper, literature review, or grant proposal.

---

## Recipe 2: Citation Verification

**Goal:** Verify that a citation is real and get complete metadata (DOI, volume, pages, year).

**APIs:** Search + Extract

```bash
# Option A: Search for the paper
python scripts/parallel_web.py search \
    "Vaswani et al 2017 Attention is All You Need paper NeurIPS" \
    --queries "Attention is All You Need DOI" --max-results 5

# Option B: Extract metadata from a DOI
python scripts/parallel_web.py extract \
    "https://doi.org/10.48550/arXiv.1706.03762" \
    --objective "Complete citation: authors, title, venue, year, pages, DOI"
```

**Python version:**
|
||||
```python
|
||||
from parallel_web import ParallelSearch, ParallelExtract
|
||||
|
||||
searcher = ParallelSearch()
|
||||
extractor = ParallelExtract()
|
||||
|
||||
# Step 1: Find the paper
|
||||
result = searcher.search(
|
||||
objective="Find the exact citation details for the Attention Is All You Need paper by Vaswani et al.",
|
||||
search_queries=["Attention is All You Need Vaswani 2017 NeurIPS DOI"],
|
||||
max_results=5,
|
||||
)
|
||||
|
||||
# Step 2: Extract full metadata from the paper's page
|
||||
paper_url = result["results"][0]["url"]
|
||||
metadata = extractor.extract(
|
||||
urls=[paper_url],
|
||||
objective="Complete BibTeX citation: all authors, title, conference/journal, year, pages, DOI, volume",
|
||||
)
|
||||
```
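Once the metadata comes back, a quick completeness check can flag gaps before an entry goes into references.bib. A minimal sketch (the field names here are illustrative, not the Extract API's actual output schema):

```python
REQUIRED_FIELDS = ["authors", "title", "venue", "year", "doi"]

def missing_fields(entry: dict) -> list:
    """Return the required citation fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]

# Hypothetical entry assembled from extracted metadata
entry = {
    "authors": "Vaswani et al.",
    "title": "Attention Is All You Need",
    "venue": "NeurIPS",
    "year": 2017,
    "doi": "",  # empty string counts as missing
}
print(missing_fields(entry))  # -> ['doi']
```

Any entry with a non-empty `missing_fields` result goes back through Step 2 before it is cited.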

**When to use:** After writing a section, verify every citation in references.bib has correct and complete metadata.

---

## Recipe 3: Literature Survey

**Goal:** Comprehensive survey of a research field, identifying key papers, themes, and gaps.

**APIs:** Deep Research + Search + Extract

```python
from parallel_web import ParallelDeepResearch, ParallelSearch, ParallelExtract

researcher = ParallelDeepResearch()
searcher = ParallelSearch()
extractor = ParallelExtract()

topic = "CRISPR-based diagnostics for infectious diseases"

# Stage 1: Broad research overview
overview = researcher.research(
    query=f"Comprehensive review of {topic}: key developments, clinical applications, "
          f"regulatory status, commercial products, and future directions (2020-2025)",
    processor="ultra-fast",
    description="Structure as a literature review: (1) Historical development, "
                "(2) Current technologies, (3) Clinical applications, "
                "(4) Regulatory landscape, (5) Commercial products, "
                "(6) Limitations and future directions. Include key statistics and milestones."
)

# Stage 2: Find specific landmark papers
key_papers = searcher.search(
    objective=f"Find the most cited and influential papers on {topic} from Nature, Science, Cell, NEJM",
    search_queries=[
        "CRISPR diagnostics SHERLOCK DETECTR Nature",
        "CRISPR point-of-care testing clinical study",
        "nucleic acid detection CRISPR review"
    ],
    source_policy={
        "allow_domains": ["nature.com", "science.org", "cell.com", "nejm.org", "thelancet.com"],
    },
    max_results=15,
)

# Stage 3: Extract detailed content from top 5 papers
top_urls = [r["url"] for r in key_papers["results"][:5]]
detailed = extractor.extract(
    urls=top_urls,
    objective="Study design, key results, sensitivity/specificity data, and clinical implications",
)
```

**When to use:** Starting a literature review, systematic review, or comprehensive background section.

---

## Recipe 4: Market Intelligence Report

**Goal:** Generate a comprehensive market research report on an industry or product category.

**APIs:** Deep Research (multi-stage)

```python
researcher = ParallelDeepResearch()

industry = "AI-powered drug discovery"

# Stage 1: Market overview (ultra-fast for maximum depth)
market_overview = researcher.research(
    query=f"Comprehensive market analysis of {industry}: market size, growth rate, "
          f"key segments, geographic distribution, and forecast through 2030",
    processor="ultra-fast",
    description="Include specific dollar figures, CAGR percentages, and data sources. "
                "Break down by segment and geography."
)

# Stage 2: Competitive landscape
competitors = researcher.research_structured(
    query=f"Top 10 companies in {industry}: revenue, funding, key products, partnerships, and market position",
    processor="pro-fast",
)

# Stage 3: Technology and innovation trends
tech_trends = researcher.research(
    query=f"Technology trends and innovation landscape in {industry}: "
          f"emerging approaches, breakthrough technologies, patent landscape, and R&D investment",
    processor="pro-fast",
    description="Focus on specific technologies, quantify R&D spending, and identify emerging leaders."
)

# Stage 4: Regulatory and risk analysis
regulatory = researcher.research(
    query=f"Regulatory landscape and risk factors for {industry}: "
          f"FDA guidance, EMA requirements, compliance challenges, and market risks",
    processor="pro-fast",
)
```

**When to use:** Creating market research reports, investor presentations, or strategic analysis documents.

---

## Recipe 5: Competitive Analysis

**Goal:** Compare multiple companies, products, or technologies side-by-side.

**APIs:** Search + Extract + Research

```python
searcher = ParallelSearch()
extractor = ParallelExtract()
researcher = ParallelDeepResearch()

companies = ["OpenAI", "Anthropic", "Google DeepMind"]

# Step 1: Search for recent data on each company
for company in companies:
    result = searcher.search(
        objective=f"Latest product launches, funding, team size, and strategy for {company} in 2025",
        search_queries=[f"{company} product launch 2025", f"{company} funding valuation"],
        source_policy={"after_date": "2024-06-01"},
    )

# Step 2: Extract from company pages
company_pages = [
    "https://openai.com/about",
    "https://anthropic.com/company",
    "https://deepmind.google/about/",
]
company_data = extractor.extract(
    urls=company_pages,
    objective="Mission, key products, team size, founding date, and recent milestones",
)

# Step 3: Deep research for synthesis
comparison = researcher.research(
    query=f"Detailed comparison of {', '.join(companies)}: "
          f"products, pricing, technology approach, market position, strengths, weaknesses",
    processor="pro-fast",
    description="Create a structured comparison covering: "
                "(1) Product portfolio, (2) Technology approach, (3) Pricing, "
                "(4) Market position, (5) Strengths/weaknesses, (6) Future outlook. "
                "Include a summary comparison table."
)
```

---

## Recipe 6: Fact-Check Pipeline

**Goal:** Verify specific claims or statistics before including in a document.

**APIs:** Search + Extract

```python
searcher = ParallelSearch()
extractor = ParallelExtract()

claim = "The global AI market is expected to reach $1.8 trillion by 2030"

# Step 1: Search for corroborating sources
result = searcher.search(
    objective=f"Verify this claim: '{claim}'. Find authoritative sources that confirm or contradict this figure.",
    search_queries=["global AI market size 2030 forecast", "artificial intelligence market projection trillion"],
    max_results=8,
)

# Step 2: Extract specific figures from top sources
source_urls = [r["url"] for r in result["results"][:3]]
details = extractor.extract(
    urls=source_urls,
    objective="Specific market size figures, forecast years, CAGR, and methodology of the projection",
)

# Analyze: Do multiple authoritative sources agree?
```
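The closing comment asks whether the sources agree. One lightweight way to answer it is sketched below with plain string matching; the excerpt list is a hypothetical stand-in for whatever the Extract step actually returns:

```python
import re

def trillion_figures(text: str) -> list:
    """Pull '$X.Y trillion' style figures out of a text as floats."""
    return [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)\s*trillion", text)]

def sources_agreeing(excerpt_texts, claimed=1.8, tolerance=0.2):
    """Count how many excerpts quote a figure within `tolerance` of the claim."""
    return sum(
        any(abs(figure - claimed) <= tolerance for figure in trillion_figures(text))
        for text in excerpt_texts
    )

# Hypothetical excerpts from the three extracted sources
texts = [
    "The AI market is projected to reach $1.8 trillion by 2030.",
    "Analysts forecast $1.85 trillion in global AI spending by decade's end.",
    "A more conservative report cites $900 billion by 2028.",
]
print(sources_agreeing(texts))  # -> 2
```

Two of three sources within tolerance is reasonable corroboration; zero or one means the claim needs a better source or a hedge.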

**When to use:** Before including any specific statistic, market figure, or factual claim in a paper or report.

---

## Recipe 7: Current Events Briefing

**Goal:** Get an up-to-date synthesis of recent developments on a topic.

**APIs:** Search + Research

```python
searcher = ParallelSearch()
researcher = ParallelDeepResearch()

topic = "EU AI Act implementation"

# Step 1: Find the latest news
latest = searcher.search(
    objective=f"Latest news and developments on {topic} from the past month",
    search_queries=[f"{topic} 2025", f"{topic} latest updates"],
    source_policy={"after_date": "2025-01-15"},
    max_results=15,
)

# Step 2: Synthesize into a briefing
briefing = researcher.research(
    query=f"Summarize the latest developments in {topic} as of February 2025: "
          f"key milestones, compliance deadlines, industry reactions, and implications",
    processor="pro-fast",
    description="Write a concise 500-word executive briefing with a timeline of key events."
)
```

---

## Recipe 8: Technical Documentation Gathering

**Goal:** Collect and synthesize technical documentation for a framework or API.

**APIs:** Search + Extract

```python
searcher = ParallelSearch()
extractor = ParallelExtract()

# Step 1: Find documentation pages
docs = searcher.search(
    objective="Find official PyTorch documentation for implementing custom attention mechanisms",
    search_queries=["PyTorch attention mechanism tutorial", "PyTorch MultiheadAttention documentation"],
    source_policy={"allow_domains": ["pytorch.org", "github.com/pytorch"]},
)

# Step 2: Extract full content from documentation pages
doc_urls = [r["url"] for r in docs["results"][:3]]
full_docs = extractor.extract(
    urls=doc_urls,
    objective="Complete API reference, parameters, usage examples, and code snippets",
    full_content=True,
)
```

---

## Recipe 9: Grant Background Research

**Goal:** Build a comprehensive background section for a grant proposal with verified statistics.

**APIs:** Deep Research + Search

```python
researcher = ParallelDeepResearch()
searcher = ParallelSearch()

research_area = "AI-guided antibiotic discovery to combat antimicrobial resistance"

# Step 1: Significance and burden of disease
significance = researcher.research(
    query="Burden of antimicrobial resistance: mortality statistics, economic impact, "
          "WHO priority pathogens, and projections. Include specific numbers.",
    processor="pro-fast",
    description="Focus on statistics suitable for an NIH Significance section: "
                "deaths per year, economic cost, resistance trends, and urgency."
)

# Step 2: Innovation landscape
innovation = researcher.research(
    query=f"Current approaches to {research_area}: successes (halicin, etc.), "
          f"limitations of current methods, and what makes our approach novel",
    processor="pro-fast",
    description="Focus on the Innovation section: what has been tried, what gaps remain, "
                "and what new approaches are emerging."
)

# Step 3: Find specific papers for preliminary data context
papers = searcher.search(
    objective="Find landmark papers on AI-discovered antibiotics and ML approaches to drug discovery",
    search_queries=[
        "halicin AI antibiotic discovery Nature",
        "machine learning antibiotic resistance prediction",
        "deep learning drug discovery antibiotics"
    ],
    source_policy={"allow_domains": ["nature.com", "science.org", "cell.com", "pnas.org"]},
)
```

**When to use:** Writing Significance, Innovation, or Background sections for NIH, NSF, or other grant proposals.

---

## Combining with Other Skills

### With `research-lookup` (Academic Papers)

```python
# Use parallel-web for general research
researcher.research("Current state of quantum computing applications")

# Use research-lookup for academic paper search (auto-routes to Perplexity)
# python research_lookup.py "find papers on quantum error correction in Nature and Science"
```

### With `citation-management` (BibTeX)

```python
# Step 1: Find the paper with a parallel search
result = searcher.search(objective="Vaswani et al Attention Is All You Need paper")

# Step 2: Get the DOI from the results
doi = "10.48550/arXiv.1706.03762"

# Step 3: Convert to BibTeX with the citation-management skill
# python scripts/doi_to_bibtex.py 10.48550/arXiv.1706.03762
```

### With `scientific-schematics` (Diagrams)

```python
# Step 1: Research a process
result = researcher.research("How does the CRISPR-Cas9 gene editing mechanism work step by step")

# Step 2: Use the research to inform a schematic
# python scripts/generate_schematic.py "CRISPR-Cas9 gene editing workflow: guide RNA design -> Cas9 binding -> DNA cleavage -> repair pathway" -o figures/crispr_mechanism.png
```

---

## Performance Cheat Sheet

| Task | Processor | Expected Time | Approximate Cost |
|------|-----------|---------------|------------------|
| Quick fact lookup | `base-fast` | 15-50s | $0.01 |
| Section background | `pro-fast` | 30s-5min | $0.10 |
| Comprehensive report | `ultra-fast` | 1-10min | $0.30 |
| Web search (10 results) | Search API | 1-3s | $0.005 |
| URL extraction (1 URL) | Extract API | 1-20s | $0.001 |
| URL extraction (5 URLs) | Extract API | 5-30s | $0.005 |
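As a rule of thumb, the table rows can be encoded as a tiny helper for scripts that batch many queries. The mapping below simply mirrors the cheat sheet; the task category names and the default tier are illustrative choices:

```python
def pick_processor(task: str) -> str:
    """Map a task category to a processor tier, following the cheat sheet."""
    mapping = {
        "fact_lookup": "base-fast",            # quick facts, seconds
        "section_background": "pro-fast",      # section-level background
        "comprehensive_report": "ultra-fast",  # full multi-theme reports
    }
    return mapping.get(task, "pro-fast")       # middle tier as a sensible default

print(pick_processor("fact_lookup"))  # -> base-fast
```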
---

## See Also

- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - Effective search queries
- [Deep Research Guide](deep_research_guide.md) - Processor selection and output formats
- [Extraction Patterns](extraction_patterns.md) - URL content extraction

568	scientific-skills/parallel-web/scripts/parallel_web.py	Normal file
@@ -0,0 +1,568 @@
#!/usr/bin/env python3
"""
Parallel Web Systems API Client

Provides web search, URL content extraction, and deep research capabilities
using the Parallel Web Systems APIs (https://docs.parallel.ai).

Primary interface: Parallel Chat API (OpenAI-compatible) for search and research.
Secondary interface: Extract API for URL verification and special cases.

Main classes:
- ParallelChat: Core Chat API client (base/core models)
- ParallelSearch: Web search via Chat API (base model)
- ParallelDeepResearch: Deep research via Chat API (core model)
- ParallelExtract: URL content extraction (Extract API, verification only)

Environment variable required:
    PARALLEL_API_KEY - Your Parallel API key from https://platform.parallel.ai
"""

import os
import sys
import json
import argparse
from datetime import datetime
from typing import Any, Dict, List, Optional


def _get_api_key():
    """Validate and return the Parallel API key."""
    api_key = os.getenv("PARALLEL_API_KEY")
    if not api_key:
        raise ValueError(
            "PARALLEL_API_KEY environment variable not set.\n"
            "Get your key at https://platform.parallel.ai and set it:\n"
            "  export PARALLEL_API_KEY='your_key_here'"
        )
    return api_key


def _get_extract_client():
    """Create and return a Parallel SDK client for the Extract API."""
    try:
        from parallel import Parallel
    except ImportError:
        raise ImportError(
            "The 'parallel-web' package is required for extract. Install it with:\n"
            "  pip install parallel-web"
        )
    return Parallel(api_key=_get_api_key())

class ParallelChat:
    """Core client for the Parallel Chat API.

    OpenAI-compatible chat completions endpoint that performs web research
    and returns synthesized responses with citations.

    Models:
    - base : Standard research, factual queries (15-100s latency)
    - core : Complex research, multi-source synthesis (60s-5min latency)
    """

    CHAT_BASE_URL = "https://api.parallel.ai"

    def __init__(self):
        try:
            from openai import OpenAI
        except ImportError:
            raise ImportError(
                "The 'openai' package is required. Install it with:\n"
                "  pip install openai"
            )

        self.client = OpenAI(
            api_key=_get_api_key(),
            base_url=self.CHAT_BASE_URL,
        )

    def query(
        self,
        user_message: str,
        system_message: Optional[str] = None,
        model: str = "base",
    ) -> Dict[str, Any]:
        """Send a query to the Parallel Chat API.

        Args:
            user_message: The research query or question.
            system_message: Optional system prompt to guide response style.
            model: Chat model to use ('base' or 'core').

        Returns:
            Dict with 'content' (response text), 'sources' (citations), and metadata.
        """
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": user_message})

        try:
            print(f"[Parallel Chat] Querying model={model}...", file=sys.stderr)

            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                stream=False,
            )

            content = ""
            if response.choices and len(response.choices) > 0:
                content = response.choices[0].message.content or ""

            sources = self._extract_basis(response)

            return {
                "success": True,
                "content": content,
                "sources": sources,
                "citation_count": len(sources),
                "model": model,
                "timestamp": timestamp,
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "model": model,
                "timestamp": timestamp,
            }

    def _extract_basis(self, response) -> List[Dict[str, str]]:
        """Extract citation sources from the Chat API research basis."""
        sources = []
        basis = getattr(response, "basis", None)
        if not basis:
            return sources

        seen_urls = set()
        if isinstance(basis, list):
            for item in basis:
                citations = (
                    item.get("citations", []) if isinstance(item, dict)
                    else getattr(item, "citations", None) or []
                )
                for cit in citations:
                    url = cit.get("url", "") if isinstance(cit, dict) else getattr(cit, "url", "")
                    if url and url not in seen_urls:
                        seen_urls.add(url)
                        title = cit.get("title", "") if isinstance(cit, dict) else getattr(cit, "title", "")
                        excerpts = cit.get("excerpts", []) if isinstance(cit, dict) else getattr(cit, "excerpts", [])
                        sources.append({
                            "type": "source",
                            "url": url,
                            "title": title,
                            "excerpts": excerpts,
                        })

        return sources


class ParallelSearch:
    """Web search using the Parallel Chat API (base model).

    Sends a search query to the Chat API, which performs web research and
    returns a synthesized summary with cited sources.
    """

    SYSTEM_PROMPT = (
        "You are a web research assistant. Search the web and synthesize information "
        "about the user's query. Provide a clear, well-organized summary with:\n"
        "- Key facts, data points, and statistics\n"
        "- Specific names, dates, and numbers when available\n"
        "- Multiple perspectives if the topic is debated\n"
        "Cite your sources inline. Be comprehensive but concise."
    )

    def __init__(self):
        self.chat = ParallelChat()

    def search(
        self,
        objective: str,
        model: str = "base",
    ) -> Dict[str, Any]:
        """Execute a web search via the Chat API.

        Args:
            objective: Natural language description of the search goal.
            model: Chat model to use ('base' or 'core', default 'base').

        Returns:
            Dict with 'response' (synthesized text), 'sources', and metadata.
        """
        result = self.chat.query(
            user_message=objective,
            system_message=self.SYSTEM_PROMPT,
            model=model,
        )

        if not result["success"]:
            return {
                "success": False,
                "objective": objective,
                "error": result.get("error", "Unknown error"),
                "timestamp": result["timestamp"],
            }

        return {
            "success": True,
            "objective": objective,
            "response": result["content"],
            "sources": result["sources"],
            "citation_count": result["citation_count"],
            "model": result["model"],
            "backend": "parallel-chat",
            "timestamp": result["timestamp"],
        }


class ParallelExtract:
    """Extract clean content from URLs using Parallel's Extract API.

    Converts any public URL into clean, LLM-optimized markdown.
    Use for citation verification and special cases only.
    For general research, use ParallelSearch or ParallelDeepResearch instead.
    """

    def __init__(self):
        self.client = _get_extract_client()

    def extract(
        self,
        urls: List[str],
        objective: Optional[str] = None,
        excerpts: bool = True,
        full_content: bool = False,
    ) -> Dict[str, Any]:
        """Extract content from one or more URLs.

        Args:
            urls: List of URLs to extract content from.
            objective: Optional objective to focus extraction.
            excerpts: Whether to return focused excerpts (default True).
            full_content: Whether to return full page content (default False).

        Returns:
            Dict with 'results' list containing url, title, excerpts/content.
        """
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        kwargs = {
            "urls": urls,
            "excerpts": excerpts,
            "full_content": full_content,
        }
        if objective:
            kwargs["objective"] = objective

        try:
            response = self.client.beta.extract(**kwargs)

            results = []
            if hasattr(response, "results") and response.results:
                for r in response.results:
                    result = {
                        "url": getattr(r, "url", ""),
                        "title": getattr(r, "title", ""),
                        "publish_date": getattr(r, "publish_date", None),
                        "excerpts": getattr(r, "excerpts", []),
                        "full_content": getattr(r, "full_content", None),
                    }
                    results.append(result)

            errors = []
            if hasattr(response, "errors") and response.errors:
                errors = [str(e) for e in response.errors]

            return {
                "success": True,
                "urls": urls,
                "results": results,
                "errors": errors,
                "timestamp": timestamp,
                "extract_id": getattr(response, "extract_id", None),
            }

        except Exception as e:
            return {
                "success": False,
                "urls": urls,
                "error": str(e),
                "timestamp": timestamp,
            }


class ParallelDeepResearch:
    """Deep research using the Parallel Chat API (core model).

    Sends complex research queries to the Chat API, which performs
    multi-source web research and returns comprehensive reports with citations.
    """

    SYSTEM_PROMPT = (
        "You are a deep research analyst. Provide a comprehensive, well-structured "
        "research report on the user's topic. Include:\n"
        "- Executive summary of key findings\n"
        "- Detailed analysis organized by themes\n"
        "- Specific data, statistics, and quantitative evidence\n"
        "- Multiple authoritative sources\n"
        "- Implications and future outlook where relevant\n"
        "Use markdown formatting with clear section headers. "
        "Cite all sources inline."
    )

    def __init__(self):
        self.chat = ParallelChat()

    def research(
        self,
        query: str,
        model: str = "core",
        system_prompt: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Run deep research via the Chat API.

        Args:
            query: The research question or topic.
            model: Chat model to use ('base' or 'core', default 'core').
            system_prompt: Optional override for the system prompt.

        Returns:
            Dict with 'response' (markdown report), 'citations', and metadata.
        """
        result = self.chat.query(
            user_message=query,
            system_message=system_prompt or self.SYSTEM_PROMPT,
            model=model,
        )

        if not result["success"]:
            return {
                "success": False,
                "query": query,
                "error": result.get("error", "Unknown error"),
                "model": model,
                "timestamp": result["timestamp"],
            }

        return {
            "success": True,
            "query": query,
            "response": result["content"],
            "output": result["content"],
            "citations": result["sources"],
            "sources": result["sources"],
            "citation_count": result["citation_count"],
            "model": model,
            "backend": "parallel-chat",
            "timestamp": result["timestamp"],
        }


# ---------------------------------------------------------------------------
# CLI Interface
# ---------------------------------------------------------------------------

def _print_search_results(result: Dict[str, Any], output_file=None):
    """Print search results (synthesized summary + sources)."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return

    write(f"\n{'='*80}")
    write(f"Search: {result['objective']}")
    write(f"Model: {result['model']} | Time: {result['timestamp']}")
    write(f"{'='*80}\n")

    write(result.get("response", "No response received."))

    sources = result.get("sources", [])
    if sources:
        write(f"\n\n{'='*40} SOURCES {'='*40}")
        for i, src in enumerate(sources):
            title = src.get("title", "Untitled")
            url = src.get("url", "")
            write(f"  [{i+1}] {title}")
            if url:
                write(f"      {url}")


def _print_extract_results(result: Dict[str, Any], output_file=None):
    """Pretty-print extract results."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return

    write(f"\n{'='*80}")
    write(f"Extracted from: {', '.join(result['urls'])}")
    write(f"Time: {result['timestamp']}")
    write(f"{'='*80}")

    for i, r in enumerate(result["results"]):
        write(f"\n--- [{i+1}] {r['title']} ---")
        write(f"URL: {r['url']}")
        if r.get("full_content"):
            write(f"\n{r['full_content']}")
        elif r.get("excerpts"):
            for j, excerpt in enumerate(r["excerpts"]):
                write(f"\nExcerpt {j+1}:")
                write(excerpt[:2000] if len(excerpt) > 2000 else excerpt)

    if result.get("errors"):
        write(f"\nErrors: {result['errors']}")


def _print_research_results(result: Dict[str, Any], output_file=None):
    """Print deep research results (report + sources)."""
    def write(text):
        if output_file:
            output_file.write(text + "\n")
        else:
            print(text)

    if not result["success"]:
        write(f"Error: {result.get('error', 'Unknown error')}")
        return

    write(f"\n{'='*80}")
    query_display = result['query'][:100]
    if len(result['query']) > 100:
        query_display += "..."
    write(f"Research: {query_display}")
    write(f"Model: {result['model']} | Citations: {result.get('citation_count', 0)} | Time: {result['timestamp']}")
    write(f"{'='*80}\n")

    write(result.get("response", result.get("output", "No output received.")))

    citations = result.get("citations", result.get("sources", []))
    if citations:
        write(f"\n\n{'='*40} SOURCES {'='*40}")
        seen_urls = set()
        for cit in citations:
            url = cit.get("url", "")
            if url and url not in seen_urls:
                seen_urls.add(url)
                title = cit.get("title", "Untitled")
                write(f"  [{len(seen_urls)}] {title}")
                write(f"      {url}")


def main():
    parser = argparse.ArgumentParser(
        description="Parallel Web Systems API Client - Search, Extract, and Deep Research",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python parallel_web.py search "latest advances in quantum computing"
  python parallel_web.py search "climate policy 2025" --model core
  python parallel_web.py extract "https://example.com" --objective "key findings"
  python parallel_web.py research "comprehensive analysis of EV battery market"
  python parallel_web.py research "compare mRNA vs protein subunit vaccines" --model base
  python parallel_web.py research "AI regulation landscape 2025" -o report.md
""",
    )

    subparsers = parser.add_subparsers(dest="command", help="API command")

    # --- search subcommand ---
    search_parser = subparsers.add_parser("search", help="Web search via Chat API (synthesized results)")
    search_parser.add_argument("objective", help="Natural language search objective")
    search_parser.add_argument("--model", default="base", choices=["base", "core"],
                               help="Chat model to use (default: base)")
    search_parser.add_argument("-o", "--output", help="Write output to file")
    search_parser.add_argument("--json", action="store_true", help="Output as JSON")

    # --- extract subcommand ---
    extract_parser = subparsers.add_parser("extract", help="Extract content from URLs (verification only)")
    extract_parser.add_argument("urls", nargs="+", help="One or more URLs to extract")
    extract_parser.add_argument("--objective", help="Objective to focus extraction")
    extract_parser.add_argument("--full-content", action="store_true", help="Return full page content")
    extract_parser.add_argument("-o", "--output", help="Write output to file")
    extract_parser.add_argument("--json", action="store_true", help="Output as JSON")

    # --- research subcommand ---
    research_parser = subparsers.add_parser("research", help="Deep research via Chat API (comprehensive report)")
    research_parser.add_argument("query", help="Research question or topic")
    research_parser.add_argument("--model", default="core", choices=["base", "core"],
                                 help="Chat model to use (default: core)")
    research_parser.add_argument("-o", "--output", help="Write output to file")
    research_parser.add_argument("--json", action="store_true", help="Output as JSON")

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return 1

    output_file = None
    if hasattr(args, "output") and args.output:
        output_file = open(args.output, "w", encoding="utf-8")

    try:
        if args.command == "search":
            searcher = ParallelSearch()
            result = searcher.search(
                objective=args.objective,
                model=args.model,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_search_results(result, output_file)

        elif args.command == "extract":
            extractor = ParallelExtract()
            result = extractor.extract(
                urls=args.urls,
                objective=args.objective,
                full_content=args.full_content,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_extract_results(result, output_file)

        elif args.command == "research":
            researcher = ParallelDeepResearch()
            result = researcher.research(
                query=args.query,
                model=args.model,
            )
            if args.json:
                text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
                (output_file or sys.stdout).write(text + "\n")
            else:
                _print_research_results(result, output_file)

        return 0

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1

    finally:
        if output_file:
            output_file.close()


if __name__ == "__main__":
    sys.exit(main())