Update all the latest writing skills

2026-03-29 07:43:46 +08:00 · 2025-12-12 11:42:41 -08:00
parent c85faf039a
commit cf1d4aac5d
30 changed files with 5895 additions and 2983 deletions
--- a/scientific-skills/markitdown/references/advanced_integrations.md
+++ b/scientific-skills/markitdown/references/advanced_integrations.md
@@ -1,538 +0,0 @@
-# Advanced Integrations Reference
-
-This document provides detailed information about advanced MarkItDown features including Azure Document Intelligence integration, LLM-powered descriptions, and plugin system.
-
-## Azure Document Intelligence Integration
-
-Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.
-
-### Setup
-
-**Prerequisites:**
-1. Azure subscription
-2. Document Intelligence resource created in Azure
-3. Endpoint URL and API key
-
-**Create Azure Resource:**
-```bash
-# Using Azure CLI
-az cognitiveservices account create \
-  --name my-doc-intelligence \
-  --resource-group my-resource-group \
-  --kind FormRecognizer \
-  --sku F0 \
-  --location eastus
-```
-
-### Basic Usage
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(
-    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
-    docintel_key="YOUR-API-KEY"
-)
-
-result = md.convert("complex_document.pdf")
-print(result.text_content)
-```
-
-### Configuration from Environment Variables
-
-```python
-import os
-from markitdown import MarkItDown
-
-# Set environment variables
-os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
-os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
-
-# Use without explicit credentials
-md = MarkItDown(
-    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
-    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
-)
-
-result = md.convert("document.pdf")
-```
-
-### When to Use Azure Document Intelligence
-
-**Use for:**
- Complex PDFs with sophisticated tables
- Multi-column layouts
- Forms and structured documents
- Scanned documents requiring OCR
- PDFs with mixed content types
- Documents with intricate formatting
-
-**Benefits over standard extraction:**
- **Superior table extraction** - Better handling of merged cells, complex layouts
- **Layout analysis** - Understands document structure (headers, footers, columns)
- **Form fields** - Extracts key-value pairs from forms
- **Reading order** - Maintains correct text flow in complex layouts
- **OCR quality** - High-quality text extraction from scanned documents
-
-### Comparison Example
-
-**Standard extraction:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("complex_table.pdf")
-# May struggle with complex tables
-```
-
-**Azure Document Intelligence:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(
-    docintel_endpoint="YOUR-ENDPOINT",
-    docintel_key="YOUR-KEY"
-)
-result = md.convert("complex_table.pdf")
-# Better table reconstruction and layout understanding
-```
-
-### Cost Considerations
-
-Azure Document Intelligence is a paid service:
- **Free tier**: 500 pages per month
- **Paid tiers**: Pay per page processed
- Monitor usage to control costs
- Use standard extraction for simple documents
-
-### Error Handling
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(
-    docintel_endpoint="YOUR-ENDPOINT",
-    docintel_key="YOUR-KEY"
-)
-
-try:
-    result = md.convert("document.pdf")
-    print(result.text_content)
-except Exception as e:
-    print(f"Document Intelligence error: {e}")
-    # Common issues: authentication, quota exceeded, unsupported file
-```
-
-## LLM-Powered Image Descriptions
-
-Generate detailed, contextual descriptions for images using large language models.
-
-### Setup with OpenAI
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-result = md.convert("image.jpg")
-print(result.text_content)
-```
-
-### Supported Use Cases
-
-**Images in documents:**
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-# PowerPoint with images
-result = md.convert("presentation.pptx")
-
-# Word documents with images
-result = md.convert("report.docx")
-
-# Standalone images
-result = md.convert("diagram.png")
-```
-
-### Custom Prompts
-
-Customize the LLM prompt for specific needs:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-
-# For diagrams
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
-)
-
-# For charts
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
-)
-
-# For UI screenshots
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
-)
-
-# For scientific figures
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
-)
-```
-
-### Model Selection
-
-**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image
-
-**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-
-# High quality (more expensive)
-md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-# Budget option (less expensive)
-md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
-```
-
-### Configuration from Environment
-
-```python
-import os
-from markitdown import MarkItDown
-from openai import OpenAI
-
-# Set API key in environment
-os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
-
-client = OpenAI()  # Uses env variable
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-```
-
-### Alternative LLM Providers
-
-**Anthropic Claude:**
-```python
-from markitdown import MarkItDown
-from anthropic import Anthropic
-
-# Note: Check current compatibility with MarkItDown
-client = Anthropic(api_key="YOUR-API-KEY")
-# May require adapter for MarkItDown compatibility
-```
-
-**Azure OpenAI:**
-```python
-from markitdown import MarkItDown
-from openai import AzureOpenAI
-
-client = AzureOpenAI(
-    api_key="YOUR-AZURE-KEY",
-    api_version="2024-02-01",
-    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
-)
-
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-```
-
-### Cost Management
-
-**Strategies to reduce LLM costs:**
-
-1. **Selective processing:**
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-
-# Only use LLM for important documents
-if is_important_document(file):
-    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-else:
-    md = MarkItDown()  # Standard processing
-
-result = md.convert(file)
-```
-
-2. **Image filtering:**
-```python
-# Pre-process to identify images that need descriptions
-# Only use LLM for complex/important images
-```
-
-3. **Batch processing:**
-```python
-# Process multiple images in batches
-# Monitor costs and set limits
-```
-
-4. **Model selection:**
-```python
-# Use gpt-4o-mini for simple images
-# Reserve gpt-4o for complex visualizations
-```
-
-### Performance Considerations
-
-**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images
-
-**Batch optimization:**
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-import concurrent.futures
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-def process_image(image_path):
-    return md.convert(image_path)
-
-# Process multiple images in parallel
-images = ["img1.jpg", "img2.jpg", "img3.jpg"]
-with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
-    results = list(executor.map(process_image, images))
-```
-
-## Combined Advanced Features
-
-### Azure Document Intelligence + LLM Descriptions
-
-Combine both for maximum quality:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    docintel_endpoint="YOUR-AZURE-ENDPOINT",
-    docintel_key="YOUR-AZURE-KEY"
-)
-
-# Best possible PDF conversion with image descriptions
-result = md.convert("complex_report.pdf")
-```
-
-**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data
-
-### Smart Document Processing Pipeline
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-import os
-
-def smart_convert(file_path):
-    """Intelligently choose processing method based on file type."""
-    client = OpenAI()
-    ext = os.path.splitext(file_path)[1].lower()
-
-    # PDFs with complex tables: Use Azure
-    if ext == '.pdf':
-        md = MarkItDown(
-            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
-            docintel_key=os.getenv('AZURE_KEY')
-        )
-
-    # Documents/presentations with images: Use LLM
-    elif ext in ['.pptx', '.docx']:
-        md = MarkItDown(
-            llm_client=client,
-            llm_model="gpt-4o"
-        )
-
-    # Simple formats: Standard processing
-    else:
-        md = MarkItDown()
-
-    return md.convert(file_path)
-
-# Use it
-result = smart_convert("document.pdf")
-```
-
-## Plugin System
-
-MarkItDown supports custom plugins for extending functionality.
-
-### Plugin Architecture
-
-Plugins are disabled by default for security:
-
-```python
-from markitdown import MarkItDown
-
-# Enable plugins
-md = MarkItDown(enable_plugins=True)
-```
-
-### Creating Custom Plugins
-
-**Plugin structure:**
-```python
-class CustomConverter:
-    """Custom converter plugin for MarkItDown."""
-
-    def can_convert(self, file_path):
-        """Check if this plugin can handle the file."""
-        return file_path.endswith('.custom')
-
-    def convert(self, file_path):
-        """Convert file to Markdown."""
-        # Your conversion logic here
-        return {
-            'text_content': '# Converted Content\n\n...'
-        }
-```
-
-### Plugin Registration
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(enable_plugins=True)
-
-# Register custom plugin
-md.register_plugin(CustomConverter())
-
-# Use normally
-result = md.convert("file.custom")
-```
-
-### Plugin Use Cases
-
-**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats
-
-**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing
-
-**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs
-
-### Plugin Security
-
-**Important security considerations:**
- Plugins run with full system access
- Only enable for trusted plugins
- Validate plugin code before use
- Disable plugins in production unless required
-
-## Error Handling for Advanced Features
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-def robust_convert(file_path):
-    """Convert with fallback strategies."""
-    try:
-        # Try with all advanced features
-        client = OpenAI()
-        md = MarkItDown(
-            llm_client=client,
-            llm_model="gpt-4o",
-            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
-            docintel_key=os.getenv('AZURE_KEY')
-        )
-        return md.convert(file_path)
-
-    except Exception as azure_error:
-        print(f"Azure failed: {azure_error}")
-
-        try:
-            # Fallback: LLM only
-            client = OpenAI()
-            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-            return md.convert(file_path)
-
-        except Exception as llm_error:
-            print(f"LLM failed: {llm_error}")
-
-            # Final fallback: Standard processing
-            md = MarkItDown()
-            return md.convert(file_path)
-
-# Use it
-result = robust_convert("document.pdf")
-```
-
-## Best Practices
-
-### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed
-
-### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible
- Handle API errors gracefully
-
-### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies
-
-### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access
--- a/scientific-skills/markitdown/references/api_reference.md
+++ b/scientific-skills/markitdown/references/api_reference.md
@@ -0,0 +1,399 @@
+# MarkItDown API Reference
+
+## Core Classes
+
+### MarkItDown
+
+The main class for converting files to Markdown.
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    llm_client=None,
+    llm_model=None,
+    llm_prompt=None,
+    docintel_endpoint=None,
+    enable_plugins=False
+)
+```
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `llm_client` | OpenAI client | `None` | OpenAI-compatible client for AI image descriptions |
+| `llm_model` | str | `None` | Model name (e.g., "anthropic/claude-sonnet-4.5") for image descriptions |
+| `llm_prompt` | str | `None` | Custom prompt for image description |
+| `docintel_endpoint` | str | `None` | Azure Document Intelligence endpoint |
+| `enable_plugins` | bool | `False` | Enable third-party plugins |
+
+#### Methods
+
+##### convert()
+
+Convert a file to Markdown.
+
+```python
+result = md.convert(
+    source,
+    file_extension=None
+)
+```
+
+**Parameters**:
+- `source` (str): Path to the file to convert
+- `file_extension` (str, optional): Override file extension detection
+
+**Returns**: `DocumentConverterResult` object
+
+**Example**:
+```python
+result = md.convert("document.pdf")
+print(result.text_content)
+```
+
+##### convert_stream()
+
+Convert from a file-like binary stream.
+
+```python
+result = md.convert_stream(
+    stream,
+    file_extension
+)
+```
+
+**Parameters**:
+- `stream` (BinaryIO): Binary file-like object (e.g., file opened in `"rb"` mode)
+- `file_extension` (str): File extension to determine conversion method (e.g., ".pdf")
+
+**Returns**: `DocumentConverterResult` object
+
+**Example**:
+```python
+with open("document.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+    print(result.text_content)
+```
+
+**Important**: The stream must be opened in binary mode (`"rb"`), not text mode.
+
+## Result Object
+
+### DocumentConverterResult
+
+The result of a conversion operation.
+
+#### Attributes
+
+| Attribute | Type | Description |
+|-----------|------|-------------|
+| `text_content` | str | The converted Markdown text |
+| `title` | str | Document title (if available) |
+
+#### Example
+
+```python
+result = md.convert("paper.pdf")
+
+# Access content
+content = result.text_content
+
+# Access title (if available)
+title = result.title
+```
+
+## Custom Converters
+
+You can create custom document converters by implementing the `DocumentConverter` interface.
+
+### DocumentConverter Interface
+
+```python
+from markitdown import DocumentConverter
+
+class CustomConverter(DocumentConverter):
+    def convert(self, stream, file_extension):
+        """
+        Convert a document from a binary stream.
+        
+        Parameters:
+            stream (BinaryIO): Binary file-like object
+            file_extension (str): File extension (e.g., ".custom")
+            
+        Returns:
+            DocumentConverterResult: Conversion result
+        """
+        # Your conversion logic here
+        pass
+```
+
+### Registering Custom Converters
+
+```python
+from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
+
+class MyCustomConverter(DocumentConverter):
+    def convert(self, stream, file_extension):
+        content = stream.read().decode('utf-8')
+        markdown_text = f"# Custom Format\n\n{content}"
+        return DocumentConverterResult(
+            text_content=markdown_text,
+            title="Custom Document"
+        )
+
+# Create MarkItDown instance
+md = MarkItDown()
+
+# Register custom converter for .custom files
+md.register_converter(".custom", MyCustomConverter())
+
+# Use it
+result = md.convert("myfile.custom")
+```
+
+## Plugin System
+
+### Finding Plugins
+
+Search GitHub for the `#markitdown-plugin` hashtag.
+
+### Using Plugins
+
+```python
+from markitdown import MarkItDown
+
+# Enable plugins
+md = MarkItDown(enable_plugins=True)
+result = md.convert("document.pdf")
+```
+
+### Creating Plugins
+
+Plugins are Python packages that register converters with MarkItDown.
+
+**Plugin Structure**:
+```
+my-markitdown-plugin/
+├── setup.py
+├── my_plugin/
+│   ├── __init__.py
+│   └── converter.py
+└── README.md
+```
+
+**setup.py**:
+```python
+from setuptools import setup
+
+setup(
+    name="markitdown-my-plugin",
+    version="0.1.0",
+    packages=["my_plugin"],
+    entry_points={
+        "markitdown.plugins": [
+            "my_plugin = my_plugin.converter:MyConverter",
+        ],
+    },
+)
+```
+
+**converter.py**:
+```python
+from markitdown import DocumentConverter, DocumentConverterResult
+
+class MyConverter(DocumentConverter):
+    def convert(self, stream, file_extension):
+        # Your conversion logic
+        content = stream.read()
+        markdown = self.process(content)
+        return DocumentConverterResult(
+            text_content=markdown,
+            title="My Document"
+        )
+    
+    def process(self, content):
+        # Process content
+        return "# Converted Content\n\n..."
+```
+
+## AI-Enhanced Conversions
+
+### Using OpenRouter for Image Descriptions
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# Initialize OpenRouter client (OpenAI-compatible API)
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+# Create MarkItDown with AI support
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
+    llm_prompt="Describe this image in detail for scientific documentation"
+)
+
+# Convert files with images
+result = md.convert("presentation.pptx")
+```
+
+### Available Models via OpenRouter
+
+Popular models with vision support:
+- `anthropic/claude-sonnet-4.5` - **Recommended for scientific vision**
+- `anthropic/claude-opus-4.5` - Advanced vision model
+- `openai/gpt-4o` - GPT-4 Omni
+- `openai/gpt-4-vision` - GPT-4 Vision
+- `google/gemini-pro-vision` - Gemini Pro Vision
+
+See https://openrouter.ai/models for the complete list.
+
+### Custom Prompts
+
+```python
+# For scientific diagrams
+scientific_prompt = """
+Analyze this scientific diagram or chart. Describe:
+1. The type of visualization (graph, chart, diagram, etc.)
+2. Key data points or trends
+3. Labels and axes
+4. Scientific significance
+Be precise and technical.
+"""
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",
+    llm_prompt=scientific_prompt
+)
+```
+
+## Azure Document Intelligence
+
+### Setup
+
+1. Create an Azure Document Intelligence resource
+2. Get the resource's endpoint URL
+3. Configure authentication (see Authentication below)
+
+### Usage
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
+)
+
+result = md.convert("complex_document.pdf")
+```
+
+### Authentication
+
+Set the API key in an environment variable:
+```bash
+export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
+```
+
+Or pass credentials programmatically.
+
+## Error Handling
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except FileNotFoundError:
+    print("File not found")
+except ValueError as e:
+    print(f"Invalid file format: {e}")
+except Exception as e:
+    print(f"Conversion error: {e}")
+```
+
+## Performance Tips
+
+### 1. Reuse MarkItDown Instance
+
+```python
+# Good: Create once, use many times
+md = MarkItDown()
+
+for file in files:
+    result = md.convert(file)
+    process(result)
+```
+
+### 2. Use Streaming for Large Files
+
+```python
+# For large files
+with open("large_file.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+```
+
+### 3. Batch Processing
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from markitdown import MarkItDown
+
+md = MarkItDown()
+file_list = ["paper.pdf", "slides.pptx"]  # placeholder paths
+
+with ThreadPoolExecutor(max_workers=4) as executor:
+    # list() consumes the lazy iterator, so any conversion error is raised here
+    results = list(executor.map(md.convert, file_list))
+```
+
+## Breaking Changes (v0.0.1 to v0.1.0)
+
+1. **Dependencies**: Now organized into optional feature groups
+   ```bash
+   # Old
+   pip install markitdown
+   
+   # New
+   pip install 'markitdown[all]'
+   ```
+
+2. **convert_stream()**: Now requires a binary file-like object
+   ```python
+   # Old (also accepted text)
+   with open("file.pdf", "r") as f:  # text mode
+       result = md.convert_stream(f)
+   
+   # New (binary only)
+   with open("file.pdf", "rb") as f:  # binary mode
+       result = md.convert_stream(f, file_extension=".pdf")
+   ```
+
+3. **DocumentConverter Interface**: Changed to read from streams instead of file paths
+   - No temporary files created
+   - More memory efficient
+   - Plugins need updating
+
+## Version Compatibility
+
+- **Python**: 3.10 or higher required
+- **Dependencies**: Check the package's `pyproject.toml` for version constraints
+- **OpenAI**: Compatible with OpenAI Python SDK v1.0+
+
+## Environment Variables
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `OPENROUTER_API_KEY` | OpenRouter API key for image descriptions | `sk-or-v1-...` |
+| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure DI authentication | `key123...` |
+| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure DI endpoint | `https://...` |
+
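Note on the "Batch Processing" tip in `api_reference.md` above: with `executor.map`, a single failing file raises as soon as its result is consumed, and the remaining results are lost. The sketch below is one possible way to isolate per-file failures. The helper name `convert_many` and the stand-in `fake_convert` are hypothetical and not part of MarkItDown. The converter is passed in as a plain callable so the example runs without the library installed. It also assumes the converter can safely be called from several threads, which the reference does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def convert_many(
    paths: Iterable[str],
    convert: Callable[[str], object],
    max_workers: int = 4,
) -> tuple[dict[str, object], dict[str, Exception]]:
    """Run `convert` on every path and keep successes and failures apart.

    `convert` is any callable that takes a path. With MarkItDown it would
    presumably be `MarkItDown().convert` (thread safety is an assumption).
    """
    paths = list(paths)

    def safe_convert(path: str):
        # Capture the exception instead of raising, so one bad file
        # does not abort the whole batch.
        try:
            return path, convert(path), None
        except Exception as exc:
            return path, None, exc

    results: dict[str, object] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for path, result, exc in executor.map(safe_convert, paths):
            if exc is None:
                results[path] = result
            else:
                errors[path] = exc
    return results, errors


if __name__ == "__main__":
    # Stand-in converter (hypothetical) so the sketch runs on its own.
    def fake_convert(path: str) -> str:
        if path.endswith(".bad"):
            raise ValueError(f"unsupported format: {path}")
        return f"# {path}"

    ok, failed = convert_many(["a.pdf", "b.bad", "c.docx"], fake_convert)
    print("converted:", sorted(ok))
    print("failed:", sorted(failed))
```

In real use the call would look like `convert_many(file_list, md.convert)`, after which the `errors` mapping can be logged or the failed files retried with a different configuration.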
--- a/scientific-skills/markitdown/references/document_conversion.md
+++ b/scientific-skills/markitdown/references/document_conversion.md
@@ -1,273 +0,0 @@
-# Document Conversion Reference
-
-This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
-
-## PDF Files
-
-PDF conversion extracts text, tables, and structure from PDF documents.
-
-### Basic PDF Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("document.pdf")
-print(result.text_content)
-```
-
-### PDF with Azure Document Intelligence
-
-For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(
-    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
-    docintel_key="YOUR-API-KEY"
-)
-result = md.convert("complex_table.pdf")
-print(result.text_content)
-```
-
-**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
-
-### PDF Handling Notes
-
- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible
-
-## Word Documents (DOCX)
-
-Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
-
-### Basic DOCX Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("document.docx")
-print(result.text_content)
-```
-
-### DOCX Structure Preservation
-
-MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
-
-### Command-Line Usage
-
-```bash
-# Basic conversion
-markitdown report.docx -o report.md
-
-# With output directory
-markitdown report.docx -o output/report.md
-```
-
-### DOCX with Images
-
-To generate descriptions for images in Word documents, use LLM integration:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-result = md.convert("document_with_images.docx")
-```
-
-## PowerPoint Presentations (PPTX)
-
-PowerPoint conversion extracts text from slides while preserving structure.
-
-### Basic PPTX Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("presentation.pptx")
-print(result.text_content)
-```
-
-### PPTX Structure
-
-MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
-
-### PPTX with Image Descriptions
-
-Presentations often contain important visual information. Use LLM integration to describe images:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this slide image in detail, focusing on key information"
-)
-result = md.convert("presentation.pptx")
-```
-
-**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"
-
-## Excel Spreadsheets (XLSX, XLS)
-
-Excel conversion formats spreadsheet data as Markdown tables.
-
-### Basic XLSX Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.xlsx")
-print(result.text_content)
-```
-
-### Multi-Sheet Workbooks
-
-For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)
-
-### XLSX Conversion Details
-
-**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)
-
-**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes
-
-### Large Spreadsheets
-
-For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
-
-### XLS (Legacy Excel) Files
-
-Legacy `.xls` files are supported but require additional dependencies:
-
-```bash
-pip install 'markitdown[xls]'
-```
-
-Then use normally:
-```python
-md = MarkItDown()
-result = md.convert("legacy_data.xls")
-```
-
-## Common Document Conversion Patterns
-
-### Batch Document Processing
-
-```python
-from markitdown import MarkItDown
-import os
-
-md = MarkItDown()
-
-# Process all documents in a directory
-for filename in os.listdir("documents"):
-    if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
-        result = md.convert(f"documents/{filename}")
-
-        # Save to output directory
-        output_name = os.path.splitext(filename)[0] + ".md"
-        with open(f"markdown/{output_name}", "w") as f:
-            f.write(result.text_content)
-```
-
-### Document with Mixed Content
-
-For documents containing multiple types of content (text, tables, images):
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-# Use LLM for image descriptions + Azure for complex tables
-client = OpenAI()
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    docintel_endpoint="YOUR-ENDPOINT",
-    docintel_key="YOUR-KEY"
-)
-
-result = md.convert("complex_report.pdf")
-```
-
-### Error Handling
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("document.pdf")
-    print(result.text_content)
-except Exception as e:
-    print(f"Conversion failed: {e}")
-    # Handle specific errors (file not found, unsupported format, etc.)
-```
-
-## Output Quality Tips
-
-**For best results:**
-1. Use Azure Document Intelligence for PDFs with complex tables
-2. Enable LLM descriptions for documents with important visual content
-3. Ensure source documents are well-structured (proper headings, etc.)
-4. For scanned documents, ensure good scan quality for OCR accuracy
-5. Test with sample documents to verify output quality
-
-## Performance Considerations
-
-**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources
-
-**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents
--- a/scientific-skills/markitdown/references/file_formats.md
+++ b/scientific-skills/markitdown/references/file_formats.md
@@ -0,0 +1,542 @@
+# File Format Support
+
+This document provides detailed information about each file format supported by MarkItDown.
+
+## Document Formats
+
+### PDF (.pdf)
+
+**Capabilities**:
+- Text extraction
+- Table detection
+- Metadata extraction
+- OCR for scanned documents (with dependencies)
+
+**Dependencies**:
+```bash
+pip install 'markitdown[pdf]'
+```
+
+**Best For**:
+- Scientific papers
+- Reports
+- Books
+- Forms
+
+**Limitations**:
+- Formatting of complex layouts may not be preserved exactly
+- Scanned PDFs require OCR setup
+- Some PDF features (annotations, forms) may not convert
+
+**Example**:
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("research_paper.pdf")
+print(result.text_content)
+```
+
+**Enhanced with Azure Document Intelligence**:
+```python
+md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
+result = md.convert("complex_layout.pdf")
+```
+
+---
+
+### Microsoft Word (.docx)
+
+**Capabilities**:
+- Text extraction
+- Table conversion
+- Heading hierarchy
+- List formatting
+- Basic text formatting (bold, italic)
+
+**Dependencies**:
+```bash
+pip install 'markitdown[docx]'
+```
+
+**Best For**:
+- Research papers
+- Reports
+- Documentation
+- Manuscripts
+
+**Preserved Elements**:
+- Headings (converted to Markdown headers)
+- Tables (converted to Markdown tables)
+- Lists (bulleted and numbered)
+- Basic formatting (bold, italic)
+- Paragraphs
+
+**Example**:
+```python
+result = md.convert("manuscript.docx")
+```
+
+---
+
+### PowerPoint (.pptx)
+
+**Capabilities**:
+- Slide content extraction
+- Speaker notes
+- Table extraction
+- Image descriptions (with AI)
+
+**Dependencies**:
+```bash
+pip install 'markitdown[pptx]'
+```
+
+**Best For**:
+- Presentations
+- Lecture slides
+- Conference talks
+
+**Output Format**:
+```markdown
+# Slide 1: Title
+
+Content from slide 1...
+
+**Notes**: Speaker notes appear here
+
+---
+
+# Slide 2: Next Topic
+
+...
+```
+
+**With AI Image Descriptions**:
+```python
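+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")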
+result = md.convert("presentation.pptx")
+```
+
+---
+
+### Excel (.xlsx, .xls)
+
+**Capabilities**:
+- Sheet extraction
+- Table formatting
+- Data preservation
+- Formula values (calculated)
+
+**Dependencies**:
+```bash
+pip install 'markitdown[xlsx]'  # Modern Excel
+pip install 'markitdown[xls]'   # Legacy Excel
+```
+
+**Best For**:
+- Data tables
+- Research data
+- Statistical results
+- Experimental data
+
+**Output Format**:
+```markdown
+# Sheet: Results
+
+| Sample | Control | Treatment | P-value |
+|--------|---------|-----------|---------|
+| 1      | 10.2    | 12.5      | 0.023   |
+| 2      | 9.8     | 11.9      | 0.031   |
+```
+
+**Example**:
+```python
+result = md.convert("experimental_data.xlsx")
+```
+
+---
+
+## Image Formats
+
+### Images (.jpg, .jpeg, .png, .gif, .webp)
+
+**Capabilities**:
+- EXIF metadata extraction
+- OCR text extraction
+- AI-powered image descriptions
+
+**Dependencies**:
+```bash
+pip install 'markitdown[all]'  # Includes image support
+```
+
+**Best For**:
+- Scanned documents
+- Charts and graphs
+- Scientific diagrams
+- Photographs with text
+
+**Output Without AI**:
+```markdown
+![Image](image.jpg)
+
+**EXIF Data**:
+- Camera: Canon EOS 5D
+- Date: 2024-01-15
+- Resolution: 4000x3000
+```
+
+**With AI Image Descriptions**:
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this scientific diagram in detail"
+)
+result = md.convert("graph.png")
+```
+
+**OCR for Text Extraction**:
+Requires Tesseract OCR:
+```bash
+# macOS
+brew install tesseract
+
+# Ubuntu
+sudo apt-get install tesseract-ocr
+```
+
+---
+
+## Audio Formats
+
+### Audio (.wav, .mp3)
+
+**Capabilities**:
+- Metadata extraction
+- Speech-to-text transcription
+- Duration and technical info
+
+**Dependencies**:
+```bash
+pip install 'markitdown[audio-transcription]'
+```
+
+**Best For**:
+- Lecture recordings
+- Interviews
+- Podcasts
+- Meeting recordings
+
+**Output Format**:
+```markdown
+# Audio: interview.mp3
+
+**Metadata**:
+- Duration: 45:32
+- Bitrate: 320kbps
+- Sample Rate: 44100Hz
+
+**Transcription**:
+[Transcribed text appears here...]
+```
+
+**Example**:
+```python
+result = md.convert("lecture.mp3")
+```
+
+---
+
+## Web Formats
+
+### HTML (.html, .htm)
+
+**Capabilities**:
+- Clean HTML to Markdown conversion
+- Link preservation
+- Table conversion
+- List formatting
+
+**Best For**:
+- Web pages
+- Documentation
+- Blog posts
+- Online articles
+
+**Output Format**: Clean Markdown with preserved links and structure
+
+**Example**:
+```python
+result = md.convert("webpage.html")
+```
+
+---
+
+### YouTube URLs
+
+**Capabilities**:
+- Fetch video transcripts
+- Extract video metadata
+- Caption download
+
+**Dependencies**:
+```bash
+pip install 'markitdown[youtube-transcription]'
+```
+
+**Best For**:
+- Educational videos
+- Lectures
+- Talks
+- Tutorials
+
+**Example**:
+```python
+result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
+```
+
+---
+
+## Data Formats
+
+### CSV (.csv)
+
+**Capabilities**:
+- Automatic table conversion
+- Delimiter detection
+- Header preservation
+
+**Output Format**: Markdown tables
+
+**Example**:
+```python
+result = md.convert("data.csv")
+```
+
+**Output**:
+```markdown
+| Column1 | Column2 | Column3 |
+|---------|---------|---------|
+| Value1  | Value2  | Value3  |
+```
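A minimal sketch for reading a Markdown table like the one above back into Python rows, for example to spot-check a conversion. The helper name `parse_markdown_table` is hypothetical and not part of MarkItDown; it assumes a simple pipe table with one header row and one separator row, and it does not handle escaped `|` characters inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse the first simple pipe table in `markdown` into a list of row dicts."""
    # Keep only lines that look like table rows.
    lines = [line.strip() for line in markdown.splitlines() if line.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]
```

Passing `result.text_content` from a CSV conversion to this helper would yield one dictionary per data row, keyed by column header.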
+
+---
+
+### JSON (.json)
+
+**Capabilities**:
+- Structured representation
+- Pretty formatting
+- Nested data visualization
+
+**Best For**:
+- API responses
+- Configuration files
+- Data exports
+
+**Example**:
+```python
+result = md.convert("data.json")
+```
+
+---
+
+### XML (.xml)
+
+**Capabilities**:
+- Structure preservation
+- Attribute extraction
+- Formatted output
+
+**Best For**:
+- Configuration files
+- Data interchange
+- Structured documents
+
+**Example**:
+```python
+result = md.convert("config.xml")
+```
+
+---
+
+## Archive Formats
+
+### ZIP (.zip)
+
+**Capabilities**:
+- Iterates through archive contents
+- Converts each file individually
+- Maintains directory structure in output
+
+**Best For**:
+- Document collections
+- Project archives
+- Batch conversions
+
+**Output Format**:
+```markdown
+# Archive: documents.zip
+
+## File: document1.pdf
+[Content from document1.pdf...]
+
+---
+
+## File: document2.docx
+[Content from document2.docx...]
+```
+
+**Example**:
+```python
+result = md.convert("archive.zip")
+```
+
+---
+
+## E-book Formats
+
+### EPUB (.epub)
+
+**Capabilities**:
+- Full text extraction
+- Chapter structure
+- Metadata extraction
+
+**Best For**:
+- E-books
+- Digital publications
+- Long-form content
+
+**Output Format**: Markdown with preserved chapter structure
+
+**Example**:
+```python
+result = md.convert("book.epub")
+```
+
+---
+
+## Other Formats
+
+### Outlook Messages (.msg)
+
+**Capabilities**:
+- Email content extraction
+- Attachment listing
+- Metadata (from, to, subject, date)
+
+**Dependencies**:
+```bash
+pip install 'markitdown[outlook]'
+```
+
+**Best For**:
+- Email archives
+- Communication records
+
+**Example**:
+```python
+result = md.convert("message.msg")
+```
+
+---
+
+## Format-Specific Tips
+
+### PDF Best Practices
+
+1. **Use Azure Document Intelligence for complex layouts**:
+   ```python
+   md = MarkItDown(docintel_endpoint="endpoint_url")
+   ```
+
+2. **For scanned PDFs, ensure OCR is set up**:
+   ```bash
+   brew install tesseract  # macOS
+   ```
+
+3. **Split very large PDFs before conversion** for better performance
+
+### PowerPoint Best Practices
+
+1. **Use AI for visual content**:
+   ```python
+   md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+   ```
+
+2. **Check speaker notes** - they're included in output
+
+3. **Complex animations won't be captured** - static content only
+
+### Excel Best Practices
+
+1. **Large spreadsheets** may take time to convert
+
+2. **Formulas are converted to their calculated values**
+
+3. **Multiple sheets** are all included in output
+
+4. **Charts become text descriptions** (use AI for better descriptions)
+
+### Image Best Practices
+
+1. **Use AI for meaningful descriptions**:
+   ```python
+   md = MarkItDown(
+       llm_client=client,
+       llm_model="gpt-4o",
+       llm_prompt="Describe this scientific figure in detail"
+   )
+   ```
+
+2. **For text-heavy images, ensure OCR dependencies** are installed
+
+3. **High-resolution images** may take longer to process
+
+### Audio Best Practices
+
+1. **Clear audio** produces better transcriptions
+
+2. **Long recordings** may take significant time
+
+3. **Consider splitting long audio files** for faster processing
+
+---
+
+## Unsupported Formats
+
+If you need to convert an unsupported format:
+
+1. **Create a custom converter** (see `api_reference.md`)
+2. **Look for plugins** on GitHub (#markitdown-plugin)
+3. **Pre-convert to a supported format** (e.g., convert .rtf to .docx)
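A small pre-check can separate files that need one of the options above from those that can be converted directly. The extension set below is copied from the section headings in this file and may not match the installed MarkItDown version; `partition_by_support` is a hypothetical helper.

```python
from pathlib import Path

# Extensions documented in this reference; adjust to the installed version.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".jpg", ".jpeg", ".png", ".gif", ".webp",
    ".wav", ".mp3", ".html", ".htm",
    ".csv", ".json", ".xml", ".zip", ".epub", ".msg",
}


def partition_by_support(paths):
    """Split paths into (supported, unsupported) lists by file extension."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        # Compare case-insensitively so "REPORT.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

Files in the second list are candidates for a custom converter, a plugin, or pre-conversion.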
+
+---
+
+## Format Detection
+
+MarkItDown automatically detects the format from:
+
+1. **File extension** (primary method)
+2. **MIME type** (fallback)
+3. **File signature** (magic bytes, fallback)
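The detection order listed above can be illustrated with a standard-library sketch. This is not MarkItDown's actual implementation; `guess_extension` and the small signature table are assumptions made for illustration only.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Leading bytes ("magic numbers") for a few formats covered in this document.
MAGIC_SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
    b"GIF8": ".gif",
}


def guess_extension(filename: str, head: bytes = b"", mime_type: Optional[str] = None) -> Optional[str]:
    """Guess an extension: file extension first, then MIME type, then magic bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix:
        return suffix
    if mime_type:
        guessed = mimetypes.guess_extension(mime_type)
        if guessed:
            return guessed
    for signature, extension in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return extension
    return None
```

Magic bytes alone cannot tell .docx, .xlsx, .pptx and .epub apart, because all four are ZIP containers that start with the same signature; that is one reason an explicit override is useful.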
+
+**Override detection**:
+```python
+# Force specific format
+result = md.convert("file_without_extension", file_extension=".pdf")
+
+# With streams
+with open("file", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+```
+
--- a/scientific-skills/markitdown/references/media_processing.md
+++ b/scientific-skills/markitdown/references/media_processing.md
@@ -1,365 +0,0 @@
-# Media Processing Reference
-
-This document provides detailed information about processing images and audio files with MarkItDown.
-
-## Image Processing
-
-MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
-
-### Basic Image Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("photo.jpg")
-print(result.text_content)
-```
-
-### Image Processing Features
-
-**What's extracted:**
-1. **EXIF Metadata** - Camera settings, date, location, etc.
-2. **OCR Text** - Text detected in the image (requires tesseract)
-3. **Image Description** - AI-generated description (with LLM integration)
-
-### EXIF Metadata Extraction
-
-Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("IMG_1234.jpg")
-print(result.text_content)
-```
-
-**Example output includes:**
-- Camera make and model
-- Capture date and time
-- GPS coordinates (if available)
-- Exposure settings (ISO, shutter speed, aperture)
-- Image dimensions
-- Orientation
-
-### OCR (Optical Character Recognition)
-
-Extract text from images containing text (screenshots, scanned documents, photos of text):
-
-**Requirements:**
-- Install tesseract OCR engine:
-  ```bash
-  # macOS
-  brew install tesseract
-
-  # Ubuntu/Debian
-  apt-get install tesseract-ocr
-
-  # Windows
-  # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
-  ```
-
-**Usage:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("screenshot.png")
-print(result.text_content)  # Contains OCR'd text
-```
-
-**Best practices for OCR:**
-- Use high-resolution images for better accuracy
-- Ensure good contrast between text and background
-- Straighten skewed text if possible
-- Use well-lit, clear images
-
-### LLM-Generated Image Descriptions
-
-Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-result = md.convert("diagram.png")
-print(result.text_content)
-```
-
-**Custom prompts for specific needs:**
-
-```python
-# For diagrams
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
-)
-
-# For charts
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Analyze this chart and provide key data points and trends"
-)
-
-# For UI screenshots
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this user interface, listing all visible elements and their layout"
-)
-```
-
-### Supported Image Formats
-
-MarkItDown supports all common image formats:
-- JPEG/JPG
-- PNG
-- GIF
-- BMP
-- TIFF
-- WebP
-- HEIC (requires additional libraries on some platforms)
-
-## Audio Processing
-
-MarkItDown can transcribe audio files to text using speech recognition.
-
-### Basic Audio Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("recording.wav")
-print(result.text_content)  # Transcribed speech
-```
-
-### Audio Transcription Setup
-
-**Installation:**
-```bash
-pip install 'markitdown[audio]'
-```
-
-This installs the `speech_recognition` library and dependencies.
-
-### Supported Audio Formats
-
-- WAV
-- AIFF
-- FLAC
-- MP3 (requires ffmpeg or libav)
-- OGG (requires ffmpeg or libav)
-- Other formats supported by speech_recognition
-
-### Audio Transcription Engines
-
-MarkItDown uses the `speech_recognition` library, which supports multiple backends:
-
-**Default (Google Speech Recognition):**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("audio.wav")
-```
-
-**Note:** Default Google Speech Recognition requires internet connection.
-
-### Audio Quality Considerations
-
-For best transcription accuracy:
-- Use clear audio with minimal background noise
-- Prefer WAV or FLAC for better quality
-- Ensure speech is clear and at good volume
-- Avoid multiple overlapping speakers
-- Use mono audio when possible
-
-### Audio Preprocessing Tips
-
-For better results, consider preprocessing audio:
-
-```python
-# Example: If you have pydub installed
-from pydub import AudioSegment
-from pydub.effects import normalize
-
-# Load and normalize audio
-audio = AudioSegment.from_file("recording.mp3")
-audio = normalize(audio)
-audio.export("normalized.wav", format="wav")
-
-# Then convert with MarkItDown
-from markitdown import MarkItDown
-md = MarkItDown()
-result = md.convert("normalized.wav")
-```
-
-## Combined Media Workflows
-
-### Processing Multiple Images in Batch
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-import os
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-# Process all images in directory
-for filename in os.listdir("images"):
-    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
-        result = md.convert(f"images/{filename}")
-
-        # Save markdown with same name
-        output = filename.rsplit('.', 1)[0] + '.md'
-        with open(f"output/{output}", "w") as f:
-            f.write(result.text_content)
-```
-
-### Screenshot Analysis Pipeline
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
-)
-
-screenshots = ["screen1.png", "screen2.png", "screen3.png"]
-analysis = []
-
-for screenshot in screenshots:
-    result = md.convert(screenshot)
-    analysis.append({
-        'file': screenshot,
-        'content': result.text_content
-    })
-
-# Now ready for further processing
-```
-
-### Document Images with OCR
-
-For scanned documents or photos of documents:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Process scanned pages
-pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
-full_text = []
-
-for page in pages:
-    result = md.convert(page)
-    full_text.append(result.text_content)
-
-# Combine into single document
-document = "\n\n---\n\n".join(full_text)
-print(document)
-```
-
-### Presentation Slide Images
-
-When you have presentation slides as images:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(
-    llm_client=client,
-    llm_model="gpt-4o",
-    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
-)
-
-# Process slide images
-for i in range(1, 21):  # 20 slides
-    result = md.convert(f"slides/slide_{i}.png")
-    print(f"## Slide {i}\n\n{result.text_content}\n\n")
-```
-
-## Error Handling
-
-### Image Processing Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("image.jpg")
-    print(result.text_content)
-except FileNotFoundError:
-    print("Image file not found")
-except Exception as e:
-    print(f"Error processing image: {e}")
-```
-
-### Audio Processing Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("audio.mp3")
-    print(result.text_content)
-except Exception as e:
-    print(f"Transcription failed: {e}")
-    # Common issues: format not supported, no speech detected, network error
-```
-
-## Performance Optimization
-
-### Image Processing
-
-- **LLM descriptions**: Slower but more informative
-- **OCR only**: Faster for text extraction
-- **EXIF only**: Fastest, metadata only
-- **Batch processing**: Process multiple images in parallel
-
-### Audio Processing
-
-- **File size**: Larger files take longer
-- **Audio length**: Transcription time scales with duration
-- **Format conversion**: WAV/FLAC are faster than MP3/OGG
-- **Network dependency**: Default transcription requires internet
-
-## Use Cases
-
-### Document Digitization
-Convert scanned documents or photos of documents to searchable text.
-
-### Meeting Notes
-Transcribe audio recordings of meetings to text for analysis.
-
-### Presentation Analysis
-Extract content from presentation slide images.
-
-### Screenshot Documentation
-Generate descriptions of UI screenshots for documentation.
-
-### Image Archiving
-Extract metadata and content from photo collections.
-
-### Accessibility
-Generate alt-text descriptions for images using LLM integration.
-
-### Data Extraction
-OCR text from images containing tables, forms, or structured data.
--- a/scientific-skills/markitdown/references/structured_data.md
+++ b/scientific-skills/markitdown/references/structured_data.md
@@ -1,575 +0,0 @@
-# Structured Data Handling Reference
-
-This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
-
-## CSV Files
-
-Convert CSV (Comma-Separated Values) files to Markdown tables.
-
-### Basic CSV Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.csv")
-print(result.text_content)
-```
-
-### CSV to Markdown Table
-
-CSV files are automatically converted to Markdown table format:
-
-**Input CSV (`data.csv`):**
-```csv
-Name,Age,City
-Alice,30,New York
-Bob,25,Los Angeles
-Charlie,35,Chicago
-```
-
-**Output Markdown:**
-```markdown
-| Name    | Age | City        |
-|---------|-----|-------------|
-| Alice   | 30  | New York    |
-| Bob     | 25  | Los Angeles |
-| Charlie | 35  | Chicago     |
-```
-
-### CSV Conversion Features
-
-**What's preserved:**
-- All column headers
-- All data rows
-- Cell values (text and numbers)
-- Column structure
-
-**Formatting:**
-- Headers are bolded (Markdown table format)
-- Columns are aligned
-- Empty cells are preserved
-- Special characters are escaped
-
-### Large CSV Files
-
-For large CSV files:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Convert large CSV
-result = md.convert("large_dataset.csv")
-
-# Save to file instead of printing
-with open("output.md", "w") as f:
-    f.write(result.text_content)
-```
-
-**Performance considerations:**
-- Very large files may take time to process
-- Consider previewing first few rows for testing
-- Memory usage scales with file size
-- Very wide tables may not display well in all Markdown viewers
-
-### CSV with Special Characters
-
-CSV files containing special characters are handled automatically:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Handles UTF-8, special characters, quotes, etc.
-result = md.convert("international_data.csv")
-```
-
-### CSV Delimiters
-
-Standard CSV delimiters are supported:
-- Comma (`,`) - standard
-- Semicolon (`;`) - common in European formats
-- Tab (`\t`) - TSV files
-
-### Command-Line CSV Conversion
-
-```bash
-# Basic conversion
-markitdown data.csv -o data.md
-
-# Multiple CSV files
-for file in *.csv; do
-    markitdown "$file" -o "${file%.csv}.md"
-done
-```
-
-## JSON Files
-
-Convert JSON data to readable Markdown format.
-
-### Basic JSON Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.json")
-print(result.text_content)
-```
-
-### JSON Formatting
-
-JSON is converted to a readable, structured Markdown format:
-
-**Input JSON (`config.json`):**
-```json
-{
-  "name": "MyApp",
-  "version": "1.0.0",
-  "dependencies": {
-    "library1": "^2.0.0",
-    "library2": "^3.1.0"
-  },
-  "features": ["auth", "api", "database"]
-}
-```
-
-**Output Markdown:**
-```markdown
-## Configuration
-
-**name:** MyApp
-**version:** 1.0.0
-
-### dependencies
-- **library1:** ^2.0.0
-- **library2:** ^3.1.0
-
-### features
-- auth
-- api
-- database
-```
-
-### JSON Array Handling
-
-JSON arrays are converted to lists or tables:
-
-**Array of objects:**
-```json
-[
-  {"id": 1, "name": "Alice", "active": true},
-  {"id": 2, "name": "Bob", "active": false}
-]
-```
-
-**Converted to table:**
-```markdown
-| id | name  | active |
-|----|-------|--------|
-| 1  | Alice | true   |
-| 2  | Bob   | false  |
-```
-
-### Nested JSON Structures
-
-Nested JSON is converted with appropriate indentation and hierarchy:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Handles deeply nested structures
-result = md.convert("complex_config.json")
-print(result.text_content)
-```
-
-### JSON Lines (JSONL)
-
-For JSON Lines format (one JSON object per line):
-
-```python
-from markitdown import MarkItDown
-import json
-
-md = MarkItDown()
-
-# Read JSONL file
-with open("data.jsonl", "r") as f:
-    for line in f:
-        obj = json.loads(line)
-
-        # Convert to JSON temporarily
-        with open("temp.json", "w") as temp:
-            json.dump(obj, temp)
-
-        result = md.convert("temp.json")
-        print(result.text_content)
-        print("\n---\n")
-```
-
-### Large JSON Files
-
-For large JSON files:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Convert large JSON
-result = md.convert("large_data.json")
-
-# Save to file
-with open("output.md", "w") as f:
-    f.write(result.text_content)
-```
-
-## XML Files
-
-Convert XML documents to structured Markdown.
-
-### Basic XML Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.xml")
-print(result.text_content)
-```
-
-### XML Structure Preservation
-
-XML is converted to Markdown maintaining hierarchical structure:
-
-**Input XML (`book.xml`):**
-```xml
-<?xml version="1.0"?>
-<book>
-  <title>Example Book</title>
-  <author>John Doe</author>
-  <chapters>
-    <chapter id="1">
-      <title>Introduction</title>
-      <content>Chapter 1 content...</content>
-    </chapter>
-    <chapter id="2">
-      <title>Background</title>
-      <content>Chapter 2 content...</content>
-    </chapter>
-  </chapters>
-</book>
-```
-
-**Output Markdown:**
-```markdown
-# book
-
-## title
-Example Book
-
-## author
-John Doe
-
-## chapters
-
-### chapter (id: 1)
-#### title
-Introduction
-
-#### content
-Chapter 1 content...
-
-### chapter (id: 2)
-#### title
-Background
-
-#### content
-Chapter 2 content...
-```
-
-### XML Attributes
-
-XML attributes are preserved in the conversion:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.xml")
-# Attributes shown as (attr: value) in headings
-```
-
-### XML Namespaces
-
-XML namespaces are handled:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Handles xmlns and namespaced elements
-result = md.convert("namespaced.xml")
-```
-
-### XML Use Cases
-
-**Configuration files:**
-- Convert XML configs to readable format
-- Document system configurations
-- Compare configuration files
-
-**Data interchange:**
-- Convert XML APIs responses
-- Process XML data feeds
-- Transform between formats
-
-**Document processing:**
-- Convert DocBook to Markdown
-- Process SVG descriptions
-- Extract structured data
-
-## Structured Data Workflows
-
-### CSV Data Analysis Pipeline
-
-```python
-from markitdown import MarkItDown
-import pandas as pd
-
-md = MarkItDown()
-
-# Read CSV for analysis
-df = pd.read_csv("data.csv")
-
-# Do analysis
-summary = df.describe()
-
-# Convert both to Markdown
-original = md.convert("data.csv")
-
-# Save summary as CSV then convert
-summary.to_csv("summary.csv")
-summary_md = md.convert("summary.csv")
-
-print("## Original Data\n")
-print(original.text_content)
-print("\n## Statistical Summary\n")
-print(summary_md.text_content)
-```
-
-### JSON API Documentation
-
-```python
-from markitdown import MarkItDown
-import requests
-import json
-
-md = MarkItDown()
-
-# Fetch JSON from API
-response = requests.get("https://api.example.com/data")
-data = response.json()
-
-# Save as JSON
-with open("api_response.json", "w") as f:
-    json.dump(data, f, indent=2)
-
-# Convert to Markdown
-result = md.convert("api_response.json")
-
-# Create documentation
-doc = f"""# API Response Documentation
-
-## Endpoint
-GET https://api.example.com/data
-
-## Response
-{result.text_content}
-"""
-
-with open("api_docs.md", "w") as f:
-    f.write(doc)
-```
-
-### XML to Markdown Documentation
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Convert XML documentation
-xml_files = ["config.xml", "schema.xml", "data.xml"]
-
-for xml_file in xml_files:
-    result = md.convert(xml_file)
-
-    output_name = xml_file.replace('.xml', '.md')
-    with open(f"docs/{output_name}", "w") as f:
-        f.write(result.text_content)
-```
-
-### Multi-Format Data Processing
-
-```python
-from markitdown import MarkItDown
-import os
-
-md = MarkItDown()
-
-def convert_structured_data(directory):
-    """Convert all structured data files in directory."""
-    extensions = {'.csv', '.json', '.xml'}
-
-    for filename in os.listdir(directory):
-        ext = os.path.splitext(filename)[1]
-
-        if ext in extensions:
-            input_path = os.path.join(directory, filename)
-            result = md.convert(input_path)
-
-            # Save Markdown
-            output_name = filename.replace(ext, '.md')
-            output_path = os.path.join("markdown", output_name)
-
-            with open(output_path, 'w') as f:
-                f.write(result.text_content)
-
-            print(f"Converted: {filename} → {output_name}")
-
-# Process all structured data
-convert_structured_data("data")
-```
-
-### CSV to JSON to Markdown
-
-```python
-import pandas as pd
-from markitdown import MarkItDown
-import json
-
-md = MarkItDown()
-
-# Read CSV
-df = pd.read_csv("data.csv")
-
-# Convert to JSON
-json_data = df.to_dict(orient='records')
-with open("temp.json", "w") as f:
-    json.dump(json_data, f, indent=2)
-
-# Convert JSON to Markdown
-result = md.convert("temp.json")
-print(result.text_content)
-```
-
-### Database Export to Markdown
-
-```python
-from markitdown import MarkItDown
-import sqlite3
-import csv
-
-md = MarkItDown()
-
-# Export database query to CSV
-conn = sqlite3.connect("database.db")
-cursor = conn.execute("SELECT * FROM users")
-
-with open("users.csv", "w", newline='') as f:
-    writer = csv.writer(f)
-    writer.writerow([description[0] for description in cursor.description])
-    writer.writerows(cursor.fetchall())
-
-# Convert to Markdown
-result = md.convert("users.csv")
-print(result.text_content)
-```
-
-## Error Handling
-
-### CSV Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("data.csv")
-    print(result.text_content)
-except FileNotFoundError:
-    print("CSV file not found")
-except Exception as e:
-    print(f"CSV conversion error: {e}")
-    # Common issues: encoding problems, malformed CSV, delimiter issues
-```
-
-### JSON Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("data.json")
-    print(result.text_content)
-except Exception as e:
-    print(f"JSON conversion error: {e}")
-    # Common issues: invalid JSON syntax, encoding issues
-```
-
-### XML Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("data.xml")
-    print(result.text_content)
-except Exception as e:
-    print(f"XML conversion error: {e}")
-    # Common issues: malformed XML, encoding problems, namespace issues
-```
-
-## Best Practices
-
-### CSV Processing
-- Check delimiter before conversion
-- Verify encoding (UTF-8 recommended)
-- Handle large files with streaming if needed
-- Preview output for very wide tables
-
-### JSON Processing
-- Validate JSON before conversion
-- Consider pretty-printing complex structures
-- Handle circular references appropriately
-- Be aware of large array performance
-
-### XML Processing
-- Validate XML structure first
-- Handle namespaces consistently
-- Consider XPath for selective extraction
-- Be mindful of very deep nesting
-
-### Data Quality
-- Clean data before conversion when possible
-- Handle missing values appropriately
-- Verify special character handling
-- Test with representative samples
-
-### Performance
-- Process large files in batches
-- Use streaming for very large datasets
-- Monitor memory usage
-- Cache converted results when appropriate
--- a/scientific-skills/markitdown/references/web_content.md
+++ b/scientific-skills/markitdown/references/web_content.md
@@ -1,478 +0,0 @@
-# Web Content Extraction Reference
-
-This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
-
-## HTML Conversion
-
-Convert HTML files and web pages to clean Markdown format.
-
-### Basic HTML Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("webpage.html")
-print(result.text_content)
-```
-
-### HTML Processing Features
-
-**What's preserved:**
-- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
-- Paragraphs and text formatting
-- Links (`<a>` → `[text](url)`)
-- Lists (ordered and unordered)
-- Tables → Markdown tables
-- Code blocks and inline code
-- Emphasis (bold, italic)
-
-**What's removed:**
-- Scripts and styles
-- Navigation elements
-- Advertising content
-- Boilerplate markup
-- HTML comments
-
-### HTML from URLs
-
-Convert web pages directly from URLs:
-
-```python
-from markitdown import MarkItDown
-import requests
-
-md = MarkItDown()
-
-# Fetch and convert web page
-response = requests.get("https://example.com/article")
-with open("temp.html", "wb") as f:
-    f.write(response.content)
-
-result = md.convert("temp.html")
-print(result.text_content)
-```
-
-### Clean Web Article Extraction
-
-For extracting main content from web articles:
-
-```python
-from markitdown import MarkItDown
-import requests
-from readability import Document  # pip install readability-lxml
-
-md = MarkItDown()
-
-# Fetch page
-url = "https://example.com/article"
-response = requests.get(url)
-
-# Extract main content
-doc = Document(response.content)
-html_content = doc.summary()
-
-# Save and convert
-with open("article.html", "w") as f:
-    f.write(html_content)
-
-result = md.convert("article.html")
-print(result.text_content)
-```
-
-### HTML with Images
-
-HTML files containing images can be enhanced with LLM descriptions:
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-result = md.convert("page_with_images.html")
-```
-
-## YouTube Transcripts
-
-Extract video transcripts from YouTube videos.
-
-### Basic YouTube Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
-print(result.text_content)
-```
-
-### YouTube Installation
-
-```bash
-pip install 'markitdown[youtube]'
-```
-
-This installs the `youtube-transcript-api` dependency.
-
-### YouTube URL Formats
-
-MarkItDown supports various YouTube URL formats:
-- `https://www.youtube.com/watch?v=VIDEO_ID`
-- `https://youtu.be/VIDEO_ID`
-- `https://www.youtube.com/embed/VIDEO_ID`
-- `https://m.youtube.com/watch?v=VIDEO_ID`
-
-### YouTube Transcript Features
-
-**What's included:**
-- Full video transcript text
-- Timestamps (optional, depending on availability)
-- Video metadata (title, description)
-- Captions in available languages
-
-**Transcript languages:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Get transcript in specific language (if available)
-# Language codes: 'en', 'es', 'fr', 'de', etc.
-result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
-```
-
-### YouTube Playlist Processing
-
-Process multiple videos from a playlist:
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-video_ids = [
-    "VIDEO_ID_1",
-    "VIDEO_ID_2",
-    "VIDEO_ID_3"
-]
-
-transcripts = []
-for vid_id in video_ids:
-    url = f"https://youtube.com/watch?v={vid_id}"
-    result = md.convert(url)
-    transcripts.append({
-        'video_id': vid_id,
-        'transcript': result.text_content
-    })
-```
-
-### YouTube Use Cases
-
-**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases
-
-**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content
-
-**Accessibility:**
- Generate text versions of video content
- Create searchable video archives
-
-### YouTube Limitations
-
- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing
-
-## EPUB Books
-
-Convert EPUB e-books to Markdown format.
-
-### Basic EPUB Conversion
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("book.epub")
-print(result.text_content)
-```
-
-### EPUB Processing Features
-
-**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Tables of contents
- Footnotes and references
-
-**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables
-
-### EPUB with Images
-
-EPUB files often contain images (covers, diagrams, illustrations):
-
-```python
-from markitdown import MarkItDown
-from openai import OpenAI
-
-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-result = md.convert("illustrated_book.epub")
-```
-
-### EPUB Use Cases
-
-**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries
-
-**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts
-
-**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech
-
-## RSS Feeds
-
-Process RSS feeds to extract article content.
-
-### Basic RSS Processing
-
-```python
-from markitdown import MarkItDown
-import feedparser
-
-md = MarkItDown()
-
-# Parse RSS feed
-feed = feedparser.parse("https://example.com/feed.xml")
-
-# Convert each entry
-for entry in feed.entries:
-    # Save the entry's HTML summary (empty string if the feed omits it)
-    with open("temp.html", "w", encoding="utf-8") as f:
-        f.write(entry.get("summary", ""))
-
-    result = md.convert("temp.html")
-    print(f"## {entry.title}\n\n{result.text_content}\n\n")
-```
-
-## Combined Web Content Workflows
-
-### Web Scraping Pipeline
-
-```python
-from markitdown import MarkItDown
-import requests
-from bs4 import BeautifulSoup
-
-md = MarkItDown()
-
-def scrape_and_convert(url):
-    """Scrape a webpage and convert its main content to Markdown.
-
-    Returns None when the page has no <article> or <main> element.
-    """
-    response = requests.get(url, timeout=30)
-    response.raise_for_status()  # Raise on 4xx/5xx responses
-    soup = BeautifulSoup(response.content, 'html.parser')
-
-    # Extract main content
-    main_content = soup.find('article') or soup.find('main')
-
-    if main_content:
-        # Save HTML
-        with open("temp.html", "w", encoding="utf-8") as f:
-            f.write(str(main_content))
-
-        # Convert to Markdown
-        result = md.convert("temp.html")
-        return result.text_content
-
-    return None
-
-# Use it
-markdown = scrape_and_convert("https://example.com/article")
-print(markdown)
-```
-
-### YouTube Learning Content Extraction
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-# Course videos
-course_videos = [
-    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
-    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
-    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
-]
-
-course_content = []
-for url, title in course_videos:
-    result = md.convert(url)
-    course_content.append(f"# {title}\n\n{result.text_content}")
-
-# Combine into course document
-full_course = "\n\n---\n\n".join(course_content)
-with open("course_transcript.md", "w", encoding="utf-8") as f:
-    f.write(full_course)
-```
-
-### Documentation Scraping
-
-```python
-from markitdown import MarkItDown
-import requests
-from urllib.parse import urljoin
-
-md = MarkItDown()
-
-def scrape_documentation(base_url, page_urls):
-    """Scrape multiple documentation pages."""
-    docs = []
-
-    for page_url in page_urls:
-        full_url = urljoin(base_url, page_url)
-
-        # Fetch page
-        response = requests.get(full_url, timeout=30)
-        response.raise_for_status()  # Skip saving error pages
-        with open("temp.html", "wb") as f:
-            f.write(response.content)
-
-        # Convert
-        result = md.convert("temp.html")
-        docs.append({
-            'url': full_url,
-            'content': result.text_content
-        })
-
-    return docs
-
-# Example usage
-base = "https://docs.example.com/"
-pages = ["intro.html", "getting-started.html", "api.html"]
-documentation = scrape_documentation(base, pages)
-```
-
-### EPUB Library Processing
-
-```python
-from markitdown import MarkItDown
-import os
-
-md = MarkItDown()
-
-def process_epub_library(library_path, output_path):
-    """Convert all EPUB books in a directory."""
-    # Create the output directory if it does not exist yet
-    os.makedirs(output_path, exist_ok=True)
-
-    for filename in os.listdir(library_path):
-        if filename.endswith('.epub'):
-            epub_path = os.path.join(library_path, filename)
-
-            try:
-                result = md.convert(epub_path)
-
-                # Save markdown, swapping only the file extension
-                output_file = os.path.splitext(filename)[0] + '.md'
-                output_full = os.path.join(output_path, output_file)
-
-                with open(output_full, 'w', encoding='utf-8') as f:
-                    f.write(result.text_content)
-
-                print(f"Converted: {filename}")
-            except Exception as e:
-                print(f"Failed to convert {filename}: {e}")
-
-# Process library
-process_epub_library("books", "markdown_books")
-```
-
-## Error Handling
-
-### HTML Conversion Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("webpage.html")
-    print(result.text_content)
-except FileNotFoundError:
-    print("HTML file not found")
-except Exception as e:
-    print(f"Conversion error: {e}")
-```
-
-### YouTube Transcript Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
-    print(result.text_content)
-except Exception as e:
-    print(f"Failed to get transcript: {e}")
-    # Common issues: No transcript available, video unavailable, network error
-```
-
-### EPUB Conversion Errors
-
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-
-try:
-    result = md.convert("book.epub")
-    print(result.text_content)
-except Exception as e:
-    print(f"EPUB processing error: {e}")
-    # Common issues: Corrupted file, unsupported DRM, invalid format
-```
-
-## Best Practices
-
-### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately
- Remove unnecessary markup
-
-### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching
- Respect YouTube's terms of service
-
-### EPUB Processing
- Skip DRM-protected EPUBs; they cannot be processed
- Expect higher memory use for large EPUBs
- Review the output, since some formatting may not translate perfectly
- Test with representative samples first
-
-### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service