# Extraction Patterns

Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.

---

## Overview

The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.

**Key capabilities:**

- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing

---

## When to Use Extract vs Search

| Scenario | Use Extract | Use Search |
|----------|-------------|------------|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |

**Rule of thumb:** If you have a URL, use Extract. If you need to find URLs, use Search.
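The rule of thumb is simple enough to encode directly when routing programmatically; a minimal sketch (the function name is illustrative, not part of the API):

```python
def choose_tool(have_url: bool) -> str:
    """Apply the rule of thumb: Extract reads a known URL, Search finds URLs."""
    return "Extract" if have_url else "Search"


choose_tool(True)   # → "Extract"
choose_tool(False)  # → "Search"
```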
---

## Excerpt Mode vs Full Content Mode

### Excerpt Mode (Default)

Returns focused content aligned to your objective. Smaller token footprint, higher relevance.

```python
extractor = ParallelExtract()

result = extractor.extract(
    urls=["https://arxiv.org/abs/2301.12345"],
    objective="Key methodology and experimental results",
    excerpts=True,      # Default
    full_content=False  # Default
)
```

**Best for:**

- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points

### Full Content Mode

Returns the complete page content as clean markdown.

```python
result = extractor.extract(
    urls=["https://docs.example.com/api-reference"],
    objective="Complete API documentation",
    excerpts=False,
    full_content=True,
)
```

**Best for:**

- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content

### Both Modes

You can request both excerpts and full content:

```python
result = extractor.extract(
    urls=["https://example.com/report"],
    objective="Executive summary and key recommendations",
    excerpts=True,
    full_content=True,
)

# Use excerpts for focused analysis
# Use full_content for complete reference
```

---

## Objective Writing for Extraction

The `objective` parameter focuses extraction on relevant content. It dramatically improves excerpt quality.

### Good Objectives

```python
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"

# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"

# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
```

### Poor Objectives

```python
# Too vague
objective="Tell me about this page"

# No objective at all (still works, but excerpts are less focused)
extractor.extract(urls=["https://..."])
```

### Objective Templates by Use Case

**Academic Paper:**
```python
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
```

**Product/Company Page:**
```python
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
```

**Technical Documentation:**
```python
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
```

**News Article:**
```python
objective="Main story, key quotes, data points, timeline of events, and named sources"
```

**Government/Policy Document:**
```python
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
```

---

## Batch Extraction

Extract from multiple URLs in a single call:

```python
result = extractor.extract(
    urls=[
        "https://nature.com/articles/s12345",
        "https://science.org/doi/full/10.1234/science.xyz",
        "https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext",
    ],
    objective="Key findings, sample sizes, and statistical results from each study",
)

# Results are returned in the same order as input URLs
for r in result["results"]:
    print(f"=== {r['title']} ===")
    print(f"URL: {r['url']}")
    for excerpt in r["excerpts"]:
        print(excerpt[:500])
```

**Batch limits:**

- No hard limit on the number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the `errors` field without blocking successful ones
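Since large batches may be slow, it can help to chunk long URL lists client-side and merge the per-chunk results. A minimal sketch, assuming the `results`/`errors` shape shown above (the chunk size of 10 is an arbitrary choice, not an API limit):

```python
def chunked(urls, size=10):
    """Yield successive slices of the URL list; chunk size is an arbitrary choice."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]


def extract_in_chunks(extractor, urls, objective, size=10):
    """Call extract() once per chunk and merge results and errors, preserving input order."""
    merged = {"results": [], "errors": []}
    for chunk in chunked(urls, size):
        r = extractor.extract(urls=chunk, objective=objective)
        merged["results"].extend(r.get("results", []))
        merged["errors"].extend(r.get("errors", []))
    return merged
```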
---

## Handling Different Content Types

### Web Pages (HTML)

Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.

```python
# Standard web page
result = extractor.extract(
    urls=["https://example.com/article"],
    objective="Main article content",
)
```

### PDFs

PDFs are automatically detected and converted to text.

```python
# PDF extraction
result = extractor.extract(
    urls=["https://example.com/whitepaper.pdf"],
    objective="Executive summary and key recommendations",
)
```

### Documentation Sites

Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.

```python
result = extractor.extract(
    urls=["https://docs.example.com/getting-started"],
    objective="Installation instructions and quickstart guide",
    full_content=True,
)
```

---

## Common Extraction Patterns

### Pattern 1: Search Then Extract

Find relevant pages with Search, then extract full content from the best results.

```python
from parallel_web import ParallelSearch, ParallelExtract

searcher = ParallelSearch()
extractor = ParallelExtract()

# Step 1: Find relevant pages
search_result = searcher.search(
    objective="Find the original transformer paper and its key follow-up papers",
    search_queries=["attention is all you need paper", "transformer architecture paper"],
)

# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
    urls=top_urls,
    objective="Abstract, architecture description, key results, and ablation studies",
)
```

### Pattern 2: DOI Resolution and Paper Reading

```python
# Extract content from a DOI URL
result = extractor.extract(
    urls=["https://doi.org/10.1038/s41586-024-07487-w"],
    objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
```

### Pattern 3: Competitive Intelligence from Company Pages

```python
companies = [
    "https://openai.com/about",
    "https://anthropic.com/company",
    "https://deepmind.google/about/",
]

result = extractor.extract(
    urls=companies,
    objective="Company mission, team size, key products, recent announcements, and funding information",
)
```

### Pattern 4: Documentation Extraction for Reference

```python
result = extractor.extract(
    urls=["https://docs.parallel.ai/search/search-quickstart"],
    objective="Complete API usage guide including request format, response format, and code examples",
    full_content=True,
)
```

### Pattern 5: Metadata Verification

```python
# Verify citation metadata for a specific paper
result = extractor.extract(
    urls=["https://doi.org/10.1234/example-doi"],
    objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
```

---

## Error Handling

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| URL not accessible | Page requires authentication, is behind a paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
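For transient failures such as timeouts and rate limits, a simple retry with exponential backoff is usually enough. A minimal sketch (the retry count and delays are arbitrary assumptions, not API requirements):

```python
import time


def extract_with_retry(extract_fn, retries=3, base_delay=1.0):
    """Call extract_fn, retrying on exceptions with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(retries):
        try:
            return extract_fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: surface the last error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Usage: `extract_with_retry(lambda: extractor.extract(urls=[url], objective=objective))`.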
### Checking for Errors

```python
result = extractor.extract(urls=["https://example.com/page"])

if not result["success"]:
    print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
    print(f"Some URLs failed: {result['errors']}")
else:
    print(f"Successfully extracted {len(result['results'])} pages")
```
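When a batch partially succeeds, it is convenient to separate usable pages from failed entries before further processing. A minimal sketch, assuming the result shape shown above (the helper name is illustrative, not part of the API):

```python
def partition_results(result):
    """Split an extract() result into (extracted pages, failure entries)."""
    if not result.get("success", True):
        # Whole request failed: no pages, one top-level error
        return [], [result.get("error")]
    return result.get("results", []), result.get("errors", [])
```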
---

## Tips and Best Practices

1. **Always provide an objective**: Even a general one improves excerpt quality significantly
2. **Use excerpts by default**: Full content is only needed when you truly need everything
3. **Batch related URLs**: One call with 5 URLs is better than 5 separate calls
4. **Check for errors**: Not all URLs are extractable (paywalls, auth, etc.)
5. **Combine with Search**: Search finds URLs, Extract reads them in detail
6. **Use for DOI resolution**: Extract handles DOI redirects automatically
7. **Prefer Extract over manual fetching**: Handles JavaScript, PDFs, and complex layouts

---

## See Also

- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For finding URLs to extract
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns