claude-scientific-skills/scientific-skills/parallel-web/references/extraction_patterns.md
Vinayak Agarwal f72b7f4521 Added parallel-web skill
Refactor research lookup skill to enhance backend routing and update documentation. The skill now intelligently selects between the Parallel Chat API and Perplexity sonar-pro-search based on query type. Added compatibility notes, license information, and improved descriptions for clarity. Removed outdated example scripts to streamline the codebase.
2026-03-01 07:36:19 -08:00


# Extraction Patterns
Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.
---
## Overview
The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.
**Key capabilities:**
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing
---
## When to Use Extract vs Search
| Scenario | Use Extract | Use Search |
|----------|-------------|------------|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |
**Rule of thumb:** If you have a URL, use Extract. If you need to find URLs, use Search.
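That rule of thumb can be captured in a tiny routing helper (a hypothetical convenience function, not part of any Parallel SDK):

```python
def choose_tool(query: str) -> str:
    """Hypothetical router: a URL in hand means Extract; otherwise Search finds URLs."""
    if query.startswith(("http://", "https://")):
        return "extract"
    return "search"

choose_tool("https://arxiv.org/abs/2301.12345")  # -> "extract"
choose_tool("recent papers on protein folding")  # -> "search"
```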
---
## Excerpt Mode vs Full Content Mode
### Excerpt Mode (Default)
Returns focused content aligned to your objective. Smaller token footprint, higher relevance.
```python
from parallel_web import ParallelExtract

extractor = ParallelExtract()
result = extractor.extract(
    urls=["https://arxiv.org/abs/2301.12345"],
    objective="Key methodology and experimental results",
    excerpts=True,       # Default
    full_content=False,  # Default
)
```
**Best for:**
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points
### Full Content Mode
Returns the complete page content as clean markdown.
```python
result = extractor.extract(
    urls=["https://docs.example.com/api-reference"],
    objective="Complete API documentation",
    excerpts=False,
    full_content=True,
)
```
**Best for:**
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content
### Both Modes
You can request both excerpts and full content:
```python
result = extractor.extract(
    urls=["https://example.com/report"],
    objective="Executive summary and key recommendations",
    excerpts=True,
    full_content=True,
)

# Use excerpts for focused analysis
# Use full_content for complete reference
```
---
## Objective Writing for Extraction
The `objective` parameter focuses extraction on relevant content. It dramatically improves excerpt quality.
### Good Objectives
```python
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"

# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"

# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
```
### Poor Objectives
```python
# Too vague
objective="Tell me about this page"

# No objective at all (still works, but excerpts are less focused)
extractor.extract(urls=["https://..."])
```
### Objective Templates by Use Case
**Academic Paper:**
```python
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
```
**Product/Company Page:**
```python
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
```
**Technical Documentation:**
```python
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
```
**News Article:**
```python
objective="Main story, key quotes, data points, timeline of events, and named sources"
```
**Government/Policy Document:**
```python
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
```
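If you apply these templates repeatedly, one convenient pattern (purely illustrative, not an SDK feature) is to keep them in a dictionary keyed by content type and fall back to a generic objective:

```python
# Hypothetical registry of the objective templates above
OBJECTIVE_TEMPLATES = {
    "academic_paper": "Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions",
    "company_page": "Company overview, key products/services, pricing, founding date, leadership team, and recent announcements",
    "technical_docs": "API endpoints, authentication methods, request/response formats, rate limits, and code examples",
    "news_article": "Main story, key quotes, data points, timeline of events, and named sources",
    "policy_document": "Key policy provisions, effective dates, affected parties, compliance requirements, and penalties",
}

def objective_for(content_type: str) -> str:
    """Return a reusable objective, falling back to a generic one for unknown types."""
    return OBJECTIVE_TEMPLATES.get(content_type, "Main content and key facts of the page")
```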
---
## Batch Extraction
Extract from multiple URLs in a single call:
```python
result = extractor.extract(
    urls=[
        "https://nature.com/articles/s12345",
        "https://science.org/doi/full/10.1234/science.xyz",
        "https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext",
    ],
    objective="Key findings, sample sizes, and statistical results from each study",
)

# Results are returned in the same order as input URLs
for r in result["results"]:
    print(f"=== {r['title']} ===")
    print(f"URL: {r['url']}")
    for excerpt in r["excerpts"]:
        print(excerpt[:500])
```
**Batch limits:**
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the `errors` field without blocking successful ones
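Assuming the response shape used elsewhere in this guide (a `results` list plus an `errors` field for failed URLs; check the API reference for exact field names), a batch response can be split like this:

```python
def split_batch(result: dict) -> tuple[list, list]:
    """Separate successful extractions from per-URL failures in a batch response."""
    return result.get("results", []), result.get("errors", [])

# Illustrative response with one failed URL (simulated data, not a real API reply)
response = {
    "results": [
        {"url": "https://nature.com/articles/s12345", "title": "Example study", "excerpts": ["..."]},
    ],
    "errors": [
        {"url": "https://example.com/paywalled", "error": "URL not accessible"},
    ],
}
succeeded, failed = split_batch(response)
for f in failed:
    print(f"Retry or replace: {f['url']} ({f['error']})")
```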
---
## Handling Different Content Types
### Web Pages (HTML)
Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.
```python
# Standard web page
result = extractor.extract(
    urls=["https://example.com/article"],
    objective="Main article content",
)
```
### PDFs
PDFs are automatically detected and converted to text.
```python
# PDF extraction
result = extractor.extract(
    urls=["https://example.com/whitepaper.pdf"],
    objective="Executive summary and key recommendations",
)
```
### Documentation Sites
Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.
```python
result = extractor.extract(
    urls=["https://docs.example.com/getting-started"],
    objective="Installation instructions and quickstart guide",
    full_content=True,
)
```
---
## Common Extraction Patterns
### Pattern 1: Search Then Extract
Find relevant pages with Search, then extract full content from the best results.
```python
from parallel_web import ParallelSearch, ParallelExtract

searcher = ParallelSearch()
extractor = ParallelExtract()

# Step 1: Find relevant pages
search_result = searcher.search(
    objective="Find the original transformer paper and its key follow-up papers",
    search_queries=["attention is all you need paper", "transformer architecture paper"],
)

# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
    urls=top_urls,
    objective="Abstract, architecture description, key results, and ablation studies",
)
```
### Pattern 2: DOI Resolution and Paper Reading
```python
# Extract content from a DOI URL
result = extractor.extract(
    urls=["https://doi.org/10.1038/s41586-024-07487-w"],
    objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
```
### Pattern 3: Competitive Intelligence from Company Pages
```python
companies = [
    "https://openai.com/about",
    "https://anthropic.com/company",
    "https://deepmind.google/about/",
]
result = extractor.extract(
    urls=companies,
    objective="Company mission, team size, key products, recent announcements, and funding information",
)
```
### Pattern 4: Documentation Extraction for Reference
```python
result = extractor.extract(
    urls=["https://docs.parallel.ai/search/search-quickstart"],
    objective="Complete API usage guide including request format, response format, and code examples",
    full_content=True,
)
```
### Pattern 5: Metadata Verification
```python
# Verify citation metadata for a specific paper
result = extractor.extract(
    urls=["https://doi.org/10.1234/example-doi"],
    objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
```
---
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| URL not accessible | Page requires authentication, is behind paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
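For the rate-limit row in particular, a simple exponential-backoff wrapper is the usual fix. The sketch below is generic: it assumes your call raises an exception on rate limiting (the actual exception type depends on the client library you use):

```python
import time

def with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying with exponentially increasing waits on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical usage: with_backoff(lambda: extractor.extract(urls=my_urls))
```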
### Checking for Errors
```python
result = extractor.extract(urls=["https://example.com/page"])

if not result["success"]:
    print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
    print(f"Some URLs failed: {result['errors']}")
else:
    print(f"Successfully extracted {len(result['results'])} pages")
```
---
## Tips and Best Practices
1. **Always provide an objective**: Even a general one improves excerpt quality significantly
2. **Use excerpts by default**: Full content is only needed when you truly need everything
3. **Batch related URLs**: One call with 5 URLs is better than 5 separate calls
4. **Check for errors**: Not all URLs are extractable (paywalls, auth, etc.)
5. **Combine with Search**: Search finds URLs, Extract reads them in detail
6. **Use for DOI resolution**: Extract handles DOI redirects automatically
7. **Prefer Extract over manual fetching**: Handles JavaScript, PDFs, and complex layouts
---
## See Also
- [API Reference](api_reference.md) - Complete API parameter reference
- [Search Best Practices](search_best_practices.md) - For finding URLs to extract
- [Deep Research Guide](deep_research_guide.md) - For comprehensive research tasks
- [Workflow Recipes](workflow_recipes.md) - Common multi-step patterns