Extraction Patterns
Guide to using Parallel's Extract API for converting web pages into clean, LLM-optimized content.
Overview
The Extract API converts any public URL into clean markdown. It handles JavaScript-heavy pages, PDFs, and complex layouts that simple HTTP fetching cannot parse. Results are optimized for LLM consumption.
Key capabilities:
- JavaScript rendering (SPAs, dynamic content)
- PDF extraction to clean text
- Focused excerpts aligned to your objective
- Full page content extraction
- Multiple URL batch processing
When to Use Extract vs Search
| Scenario | Use Extract | Use Search |
|---|---|---|
| You have a specific URL | Yes | No |
| You need content from a known page | Yes | No |
| You want to find pages about a topic | No | Yes |
| You need to read a research paper URL | Yes | No |
| You need to verify information on a specific site | Yes | No |
| You're looking for information broadly | No | Yes |
| You found URLs from a search and want full content | Yes | No |
Rule of thumb: If you have a URL, use Extract. If you need to find URLs, use Search.
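A minimal sketch of that rule as a routing helper (hypothetical; not part of the client library):

def choose_tool(task_input: str) -> str:
    # Already have a URL? Read it with Extract.
    if task_input.startswith(("http://", "https://")):
        return "extract"
    # Otherwise, find URLs first with Search.
    return "search"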
Excerpt Mode vs Full Content Mode
Excerpt Mode (Default)
Returns focused content aligned to your objective. Smaller token footprint, higher relevance.
from parallel_web import ParallelExtract

extractor = ParallelExtract()
result = extractor.extract(
urls=["https://arxiv.org/abs/2301.12345"],
objective="Key methodology and experimental results",
excerpts=True, # Default
full_content=False # Default
)
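Excerpts come back per result, in the same order as the input URLs (the response shape is the same one used in the batch example later in this guide):

# Print the focused excerpts returned for the paper
for r in result["results"]:
    print(r["title"])
    for excerpt in r["excerpts"]:
        print(excerpt[:300])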
Best for:
- Extracting specific information from long pages
- Token-efficient processing
- When you know what you're looking for
- Reading papers for specific claims or data points
Full Content Mode
Returns the complete page content as clean markdown.
result = extractor.extract(
urls=["https://docs.example.com/api-reference"],
objective="Complete API documentation",
excerpts=False,
full_content=True,
)
Best for:
- Complete documentation pages
- Full article text needed for analysis
- When you need every detail, not just excerpts
- Archiving or converting web content
Both Modes
You can request both excerpts and full content:
result = extractor.extract(
urls=["https://example.com/report"],
objective="Executive summary and key recommendations",
excerpts=True,
full_content=True,
)
# Use excerpts for focused analysis
# Use full_content for complete reference
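A sketch of using the two together, reading excerpts for analysis and keeping the full text for reference (the name of the full-text field on each result is an assumption here; see the API Reference for the exact field):

for r in result["results"]:
    analysis_input = "\n\n".join(r["excerpts"])  # focused, token-efficient
    archive_text = r.get("content", "")          # full page markdown (field name assumed)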
Objective Writing for Extraction
The objective parameter focuses extraction on relevant content. It dramatically improves excerpt quality.
Good Objectives
# Specific and actionable
objective="Extract the methodology section, including sample size, statistical methods, and primary endpoints"
# Clear about what you need
objective="Find the pricing information, feature comparison table, and enterprise plan details"
# Targeted for your task
objective="Key findings, effect sizes, confidence intervals, and author conclusions from this clinical trial"
Poor Objectives
# Too vague
objective="Tell me about this page"
# No objective at all (still works but excerpts are less focused)
extractor.extract(urls=["https://..."])
Objective Templates by Use Case
Academic Paper:
objective="Abstract, key findings, methodology (sample size, design, statistical tests), results with effect sizes and p-values, and main conclusions"
Product/Company Page:
objective="Company overview, key products/services, pricing, founding date, leadership team, and recent announcements"
Technical Documentation:
objective="API endpoints, authentication methods, request/response formats, rate limits, and code examples"
News Article:
objective="Main story, key quotes, data points, timeline of events, and named sources"
Government/Policy Document:
objective="Key policy provisions, effective dates, affected parties, compliance requirements, and penalties"
Batch Extraction
Extract from multiple URLs in a single call:
result = extractor.extract(
urls=[
"https://nature.com/articles/s12345",
"https://science.org/doi/full/10.1234/science.xyz",
"https://thelancet.com/journals/lancet/article/PIIS0140-6736(24)12345/fulltext"
],
objective="Key findings, sample sizes, and statistical results from each study",
)
# Results are returned in the same order as input URLs
for r in result["results"]:
print(f"=== {r['title']} ===")
print(f"URL: {r['url']}")
for excerpt in r["excerpts"]:
print(excerpt[:500])
Batch limits:
- No hard limit on number of URLs per request
- Each URL counts as one extraction unit for billing
- Large batches may take longer to process
- Failed URLs are reported in the errors field without blocking successful ones
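A sketch of working through a partially failed batch (the shape of each errors entry is an assumption; treat it as opaque and log it):

result = extractor.extract(
    urls=["https://example.com/good-page", "https://example.com/paywalled-page"],
    objective="Main article content",
)
# Successful extractions, keyed by URL
extracted = {r["url"]: r for r in result["results"]}
# Failed URLs are reported separately and don't block the rest
for err in result.get("errors", []):
    print(f"Extraction failed: {err}")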
Handling Different Content Types
Web Pages (HTML)
Standard extraction. JavaScript is rendered, so SPAs and dynamic content work.
# Standard web page
result = extractor.extract(
urls=["https://example.com/article"],
objective="Main article content",
)
PDFs
PDFs are automatically detected and converted to text.
# PDF extraction
result = extractor.extract(
urls=["https://example.com/whitepaper.pdf"],
objective="Executive summary and key recommendations",
)
Documentation Sites
Single-page apps and documentation frameworks (Docusaurus, GitBook, ReadTheDocs) are fully rendered.
result = extractor.extract(
urls=["https://docs.example.com/getting-started"],
objective="Installation instructions and quickstart guide",
full_content=True,
)
Common Extraction Patterns
Pattern 1: Search Then Extract
Find relevant pages with Search, then extract full content from the best results.
from parallel_web import ParallelSearch, ParallelExtract
searcher = ParallelSearch()
extractor = ParallelExtract()
# Step 1: Find relevant pages
search_result = searcher.search(
objective="Find the original transformer paper and its key follow-up papers",
search_queries=["attention is all you need paper", "transformer architecture paper"],
)
# Step 2: Extract detailed content from top results
top_urls = [r["url"] for r in search_result["results"][:3]]
extract_result = extractor.extract(
urls=top_urls,
objective="Abstract, architecture description, key results, and ablation studies",
)
Pattern 2: DOI Resolution and Paper Reading
# Extract content from a DOI URL
result = extractor.extract(
urls=["https://doi.org/10.1038/s41586-024-07487-w"],
objective="Study design, patient population, primary endpoints, efficacy results, and safety data",
)
Pattern 3: Competitive Intelligence from Company Pages
companies = [
"https://openai.com/about",
"https://anthropic.com/company",
"https://deepmind.google/about/",
]
result = extractor.extract(
urls=companies,
objective="Company mission, team size, key products, recent announcements, and funding information",
)
Pattern 4: Documentation Extraction for Reference
result = extractor.extract(
urls=["https://docs.parallel.ai/search/search-quickstart"],
objective="Complete API usage guide including request format, response format, and code examples",
full_content=True,
)
Pattern 5: Metadata Verification
# Verify citation metadata for a specific paper
result = extractor.extract(
urls=["https://doi.org/10.1234/example-doi"],
objective="Complete citation metadata: authors, title, journal, volume, pages, year, DOI",
)
Error Handling
Common Errors
| Error | Cause | Solution |
|---|---|---|
| URL not accessible | Page requires authentication, is behind a paywall, or is down | Try a different URL or use Search instead |
| Timeout | Page takes too long to render | Retry or use a simpler URL |
| Empty content | Page is dynamically loaded in a way that can't be rendered | Try full_content mode or use Search |
| Rate limited | Too many requests | Wait and retry, or reduce batch size |
Checking for Errors
result = extractor.extract(urls=["https://example.com/page"])
if not result["success"]:
print(f"Extraction failed: {result['error']}")
elif result.get("errors"):
print(f"Some URLs failed: {result['errors']}")
else:
print(f"Successfully extracted {len(result['results'])} pages")
Tips and Best Practices
- Always provide an objective: Even a general one improves excerpt quality significantly
- Use excerpts by default: Full content is only needed when you truly need everything
- Batch related URLs: One call with 5 URLs is better than 5 separate calls
- Check for errors: Not all URLs are extractable (paywalls, auth, etc.)
- Combine with Search: Search finds URLs, Extract reads them in detail
- Use for DOI resolution: Extract handles DOI redirects automatically
- Prefer Extract over manual fetching: Handles JavaScript, PDFs, and complex layouts
See Also
- API Reference - Complete API parameter reference
- Search Best Practices - For finding URLs to extract
- Deep Research Guide - For comprehensive research tasks
- Workflow Recipes - Common multi-step patterns