claude-scientific-skills/scientific-skills/parallel-web/references/extraction_patterns.md
Vinayak Agarwal f72b7f4521 Added parallel-web skill
Refactor research lookup skill to enhance backend routing and update documentation. The skill now intelligently selects between the Parallel Chat API and Perplexity sonar-pro-search based on query type. Added compatibility notes, license information, and improved descriptions for clarity. Removed outdated example scripts to streamline the codebase.
2026-03-01 07:36:19 -08:00


# Extraction Patterns
Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.
---
## Overview
The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.
**Key capabilities:**
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing
---
## When to Use Extract vs Search
| Scenario | Use Extract | Use Search |
|----------|-------------|------------|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |
**Rule of thumb:** If you have a URL, use Extract. If you need to find URLs, use Search.
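That rule of thumb can be captured in a tiny routing helper (a hypothetical convenience function, not part of any Parallel SDK):

```python
def choose_tool(query: str) -> str:
    """Hypothetical router: a URL in hand means Extract; otherwise Search finds URLs."""
    if query.startswith(("http://", "https://")):
        return "extract"
    return "search"

choose_tool("https://arxiv.org/abs/2301.12345")  # -> "extract"
choose_tool("recent papers on protein folding")  # -> "search"
```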
---
## Excerpt Mode vs Full Content Mode
### Excerpt Mode (Default)
Returns focused content aligned to your objective. Smaller token footprint, higher relevance.
```python
from parallel_web import ParallelExtract

extractor = ParallelExtract()
result = extractor.extract(
    urls=["https://arxiv.org/abs/2301.12345"],
    objective="Key methodology and experimental results",
    excerpts=True,       # Default
    full_content=False,  # Default
)
```
**Best for:**
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points
### Full Content Mode
Returns the complete page content as clean markdown.
```python
result = extractor.extract(
    urls=["https://docs.example.com/api-reference"],
    objective="Complete API documentation",
    excerpts=False,
    full_content=True,
)
```
**Best for:**
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content
### Both Modes
You can request both excerpts and full content:
```python
result = extractor.extract(
    urls=["https://example.com/report"],
    objective="Executive summary and key recommendations",
    excerpts=True,
    full_content=True,
)

# Use excerpts for focused analysis
# Use full_content for complete reference
```
---
## Objective Writing for Extraction
The `objective` parameter focuses extraction on relevant content. It dramatically improves excerpt quality.
### Good Objectives
```python
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"

# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"

# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
```
### Poor Objectives
```python
# Too vague
objective="Tell me about this page"

# No objective at all (still works, but excerpts are less focused)
extractor.extract(urls=["https://..."])
```
### Objective Templates by Use Case
**Academic Paper:**
```python
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
```
**Product/Company Page:**
```python
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
```
**Technical Documentation:**
```python
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
```
**News Article:**
```python
objective="Main story, key quotes, data points, timeline of events, and named sources"
```
**Government/Policy Document:**
```python
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
```
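If you apply these templates repeatedly, one convenient pattern (purely illustrative, not an SDK feature) is to keep them in a dictionary keyed by content type and fall back to a generic objective:

```python
# Hypothetical registry of the objective templates above
OBJECTIVE_TEMPLATES = {
    "academic_paper": "Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions",
    "company_page": "Company overview, key products/services, pricing, founding date, leadership team, and recent announcements",
    "technical_docs": "API endpoints, authentication methods, request/response formats, rate limits, and code examples",
    "news_article": "Main story, key quotes, data points, timeline of events, and named sources",
    "policy_document": "Key policy provisions, effective dates, affected parties, compliance requirements, and penalties",
}

def objective_for(content_type: str) -> str:
    """Return a reusable objective, falling back to a generic one for unknown types."""
    return OBJECTIVE_TEMPLATES.get(content_type, "Main content and key facts of the page")
```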
---
## Batch Extraction
Extract from multiple URLs in a single call:
```python
result = extractor.extract(
    urls=[
        "https://nature.com/articles/s12345",
        "https://science.org/doi/full/10.1234/science.xyz",
        "https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext",
    ],
    objective="Key findings, sample sizes, and statistical results from each study",
)

# Results are returned in the same order as input URLs
for r in result["results"]:
    print(f"=== {r['title']} ===")
    print(f"URL: {r['url']}")
    for excerpt in r["excerpts"]:
        print(excerpt[:500])
```
**Batch limits:**
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the `errors` field without blocking successful ones
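Assuming the response shape used elsewhere in this guide (a `results` list plus an `errors` field for failed URLs; check the API reference for exact field names), a batch response can be split like this:

```python
def split_batch(result: dict) -> tuple[list, list]:
    """Separate successful extractions from per-URL failures in a batch response."""
    return result.get("results", []), result.get("errors", [])

# Illustrative response with one failed URL (simulated data, not a real API reply)
response = {
    "results": [
        {"url": "https://nature.com/articles/s12345", "title": "Example study", "excerpts": ["..."]},
    ],
    "errors": [
        {"url": "https://example.com/paywalled", "error": "URL not accessible"},
    ],
}
succeeded, failed = split_batch(response)
for f in failed:
    print(f"Retry or replace: {f['url']} ({f['error']})")
```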
---
## Handling Different Content Types
### Web Pages (HTML)
Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.
```python
# Standard web page
result = extractor.extract(
    urls=["https://example.com/article"],
    objective="Main article content",
)
```
### PDFs
PDFs are automatically detected and converted to text.
```python
# PDF extraction
result = extractor.extract(
    urls=["https://example.com/whitepaper.pdf"],
    objective="Executive summary and key recommendations",
)
```
### Documentation Sites
Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.
```python
result = extractor.extract(
    urls=["https://docs.example.com/getting-started"],
    objective="Installation instructions and quickstart guide",
    full_content=True,
)
```
---
## Common Extraction Patterns
### Pattern 1: Search Then Extract
Find relevant pages with Search, then extract full content from the best results.
```python
from parallel_web import ParallelSearch, ParallelExtract

searcher = ParallelSearch()
extractor = ParallelExtract()

# Step 1: Find relevant pages
search_result = searcher.search(
    objective="Find the original transformer paper and its key follow-up papers",
    search_queries=["attention is all you need paper", "transformer architecture paper"],
)

# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
    urls=top_urls,
    objective="Abstract, architecture description, key results, and ablation studies",
)
```
### Pattern 2: DOI Resolution and Paper Reading
```python
# Extract content from a DOI URL
result = extractor.extract(
    urls=["https://doi.org/10.1038/s41586-024-07487-w"],
    objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
```
### Pattern 3: Competitive Intelligence from Company Pages
```python
companies = [
    "https://openai.com/about",
    "https://anthropic.com/company",
    "https://deepmind.google/about/",
]
result = extractor.extract(
    urls=companies,
    objective="Company mission, team size, key products, recent announcements, and funding information",
)
```
### Pattern 4: Documentation Extraction for Reference
```python
result = extractor.extract(
    urls=["https://docs.parallel.ai/search/search-quickstart"],
    objective="Complete API usage guide including request format, response format, and code examples",
    full_content=True,
)
```
### Pattern 5: Metadata Verification
```python
# Verify citation metadata for a specific paper
result = extractor.extract(
    urls=["https://doi.org/10.1234/example-doi"],
    objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
```
---
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| URL not accessible | Page requires authentication, is behind paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
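For the rate-limit row in particular, a simple exponential-backoff wrapper is the usual fix. The sketch below is generic: it assumes your call raises an exception on rate limiting (the actual exception type depends on the client library you use):

```python
import time

def with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying with exponentially increasing waits on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical usage: with_backoff(lambda: extractor.extract(urls=my_urls))
```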
### Checking for Errors
```python
result = extractor.extract(urls=["https://example.com/page"])

if not result["success"]:
    print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
    print(f"Some URLs failed: {result['errors']}")
else:
    print(f"Successfully extracted {len(result['results'])} pages")
```
---
## Tips and Best Practices
1. **Always provide an objective**: Even a general one improves excerpt quality significantly
2. **Use excerpts by default**: Full content is only needed when you truly need everything
3. **Batch related URLs**: One call with 5 URLs is better than 5 separate calls
4. **Check for errors**: Not all URLs are extractable (paywalls, auth, etc.)
5. **Combine with Search**: Search finds URLs, Extract reads them in detail
6. **Use for DOI resolution**: Extract handles DOI redirects automatically
7. **Prefer Extract over manual fetching**: Handles JavaScript, PDFs, and complex layouts
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For finding URLs to extract
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns