Extraction Patterns
Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.
Overview
The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.
Key capabilities:
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing
When to Use Extract vs Search
| Scenario | Use Extract | Use Search |
|---|---|---|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |
Rule of thumb: If you have a URL, use Extract. If you need to find URLs, use Search.
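A minimal sketch of that rule as a routing helper (hypothetical; not part of the client library):

def choose_tool(task_input: str) -> str:
    # Already have a URL? Read it with Extract.
    if task_input.startswith(("http://", "https://")):
        return "extract"
    # Otherwise, find URLs first with Search.
    return "search"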
Excerpt Mode vs Full Content Mode
Excerpt Mode (Default)
Returns focused content aligned to your objective. Smaller token footprint, higher relevance.
from parallel_web import ParallelExtract

extractor = ParallelExtract()
result = extractor.extract(
urls=["https://arxiv.org/abs/2301.12345"],
objective="Key methodology and experimental results",
excerpts=True, # Default
full_content=False # Default
)
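Excerpts come back per result, in the same order as the input URLs (the response shape is the same one used in the batch example later in this guide):

# Print the focused excerpts returned for the paper
for r in result["results"]:
    print(r["title"])
    for excerpt in r["excerpts"]:
        print(excerpt[:300])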
Best for:
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points
Full Content Mode
Returns the complete page content as clean markdown.
result = extractor.extract(
urls=["https://docs.example.com/api-reference"],
objective="Complete API documentation",
excerpts=False,
full_content=True,
)
Best for:
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content
Both Modes
You can request both excerpts and full content:
result = extractor.extract(
urls=["https://example.com/report"],
objective="Executive summary and key recommendations",
excerpts=True,
full_content=True,
)
# Use excerpts for focused analysis
# Use full_content for complete reference
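A sketch of using the two together, reading excerpts for analysis and keeping the full text for reference (the name of the full-text field on each result is an assumption here; see the API Reference for the exact field):

for r in result["results"]:
    analysis_input = "\n\n".join(r["excerpts"])  # focused, token-efficient
    archive_text = r.get("content", "")          # full page markdown (field name assumed)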
Objective Writing for Extraction
The objective parameter focuses extraction on relevant content. It dramatically improves excerpt quality.
Good Objectives
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"
# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"
# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
Poor Objectives
# Too vague
objective="Tell me about this page"
# No objective at all (still works but excerpts are less focused)
extractor.extract(urls=["https://..."])
Objective Templates by Use Case
Academic Paper:
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
Product/Company Page:
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
Technical Documentation:
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
News Article:
objective="Main story, key quotes, data points, timeline of events, and named sources"
Government/Policy Document:
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
Batch Extraction
Extract from multiple URLs in a single call:
result = extractor.extract(
urls=[
"https://nature.com/articles/s12345",
"https://science.org/doi/full/10.1234/science.xyz",
"https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext"
],
objective="Key findings, sample sizes, and statistical results from each study",
)
# Results are returned in the same order as input URLs
for r in result["results"]:
print(f"=== {r['title']} ===")
print(f"URL: {r['url']}")
for excerpt in r["excerpts"]:
print(excerpt[:500])
Batch limits:
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the errors field without blocking successful ones
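A sketch of working through a partially failed batch (the shape of each errors entry is an assumption; treat it as opaque and log it):

result = extractor.extract(
    urls=["https://example.com/good-page", "https://example.com/paywalled-page"],
    objective="Main article content",
)
# Successful extractions, keyed by URL
extracted = {r["url"]: r for r in result["results"]}
# Failed URLs are reported separately and don't block the rest
for err in result.get("errors", []):
    print(f"Extraction failed: {err}")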
Handling Different Content Types
Web Pages (HTML)
Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.
# Standard web page
result = extractor.extract(
urls=["https://example.com/article"],
objective="Main article content",
)
PDFs
PDFs are automatically detected and converted to text.
# PDF extraction
result = extractor.extract(
urls=["https://example.com/whitepaper.pdf"],
objective="Executive summary and key recommendations",
)
Documentation Sites
Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.
result = extractor.extract(
urls=["https://docs.example.com/getting-started"],
objective="Installation instructions and quickstart guide",
full_content=True,
)
Common Extraction Patterns
Pattern 1: Search Then Extract
Find relevant pages with Search, then extract full content from the best results.
from parallel_web import ParallelSearch, ParallelExtract
searcher = ParallelSearch()
extractor = ParallelExtract()
# Step 1: Find relevant pages
search_result = searcher.search(
objective="Find the original transformer paper and its key follow-up papers",
search_queries=["attention is all you need paper", "transformer architecture paper"],
)
# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
urls=top_urls,
objective="Abstract, architecture description, key results, and ablation studies",
)
Pattern 2: DOI Resolution and Paper Reading
# Extract content from a DOI URL
result = extractor.extract(
urls=["https://doi.org/10.1038/s41586-024-07487-w"],
objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
Pattern 3: Competitive Intelligence from Company Pages
companies = [
"https://openai.com/about",
"https://anthropic.com/company",
"https://deepmind.google/about/",
]
result = extractor.extract(
urls=companies,
objective="Company mission, team size, key products, recent announcements, and funding information",
)
Pattern 4: Documentation Extraction for Reference
result = extractor.extract(
urls=["https://docs.parallel.ai/search/search-quickstart"],
objective="Complete API usage guide including request format, response format, and code examples",
full_content=True,
)
Pattern 5: Metadata Verification
# Verify citation metadata for a specific paper
result = extractor.extract(
urls=["https://doi.org/10.1234/example-doi"],
objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
Error Handling
Common Errors
| Error | Cause | Solution |
|---|---|---|
| URL not accessible | Page requires authentication, is behind a paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
Checking for Errors
result = extractor.extract(urls=["https://example.com/page"])
if not result["success"]:
print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
print(f"Some URLs failed: {result['errors']}")
else:
print(f"Successfully extracted {len(result['results'])} pages")
Tips and Best Practices
- Always provide an objective: Even a general one improves excerpt quality significantly
- Use excerpts by default: Full content is only needed when you truly need everything
- Batch related URLs: One call with 5 URLs is better than 5 separate calls
- Check for errors: Not all URLs are extractable (paywalls, auth, etc.)
- Combine with Search: Search finds URLs, Extract reads them in detail
- Use for DOI resolution: Extract handles DOI redirects automatically
- Prefer Extract over manual fetching: Handles JavaScript, PDFs, and complex layouts
See Also
- API Reference - Complete API parameter reference
- Search Best Practices - For finding URLs to extract
- Deep Research Guide - For comprehensive research tasks
- Workflow Recipes - Common multi-step patterns