Update all the latest writing skills

This commit is contained in:
Vinayak Agarwal
2025-12-12 11:42:41 -08:00
parent c85faf039a
commit cf1d4aac5d
30 changed files with 5895 additions and 2983 deletions


@@ -1,538 +0,0 @@
# Advanced Integrations Reference
This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.
## Azure Document Intelligence Integration
Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.
### Setup
**Prerequisites:**
1. Azure subscription
2. Document Intelligence resource created in Azure
3. Endpoint URL and API key
**Create Azure Resource:**
```bash
# Using Azure CLI
az cognitiveservices account create \
--name my-doc-intelligence \
--resource-group my-resource-group \
--kind FormRecognizer \
--sku F0 \
--location eastus
```
### Basic Usage
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_document.pdf")
print(result.text_content)
```
### Configuration from Environment Variables
```python
import os
from markitdown import MarkItDown
# Set environment variables
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
# Use without explicit credentials
md = MarkItDown(
docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)
result = md.convert("document.pdf")
```
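Reading credentials with `os.getenv` silently yields `None` when a variable is unset. A minimal sketch of an early check, assuming the variable and parameter names used above; the helper `docintel_kwargs` is hypothetical and not part of MarkItDown:

```python
import os


def docintel_kwargs(env=None):
    """Build MarkItDown keyword arguments from environment variables.

    Hypothetical helper (not part of MarkItDown). It fails early with a
    clear message instead of passing None credentials along.
    """
    env = os.environ if env is None else env
    names = {
        "docintel_endpoint": "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
        "docintel_key": "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    }
    # Collect every variable that is unset or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {param: env[var] for param, var in names.items()}
```

With the variables exported in the shell, `MarkItDown(**docintel_kwargs())` could stand in for the explicit `os.getenv` calls.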
### When to Use Azure Document Intelligence
**Use for:**
- Complex PDFs with sophisticated tables
- Multi-column layouts
- Forms and structured documents
- Scanned documents requiring OCR
- PDFs with mixed content types
- Documents with intricate formatting
**Benefits over standard extraction:**
- **Superior table extraction** - Better handling of merged cells, complex layouts
- **Layout analysis** - Understands document structure (headers, footers, columns)
- **Form fields** - Extracts key-value pairs from forms
- **Reading order** - Maintains correct text flow in complex layouts
- **OCR quality** - High-quality text extraction from scanned documents
### Comparison Example
**Standard extraction:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables
```
**Azure Document Intelligence:**
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="YOUR-ENDPOINT",
docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding
```
### Cost Considerations
Azure Document Intelligence is a metered service with a limited free tier:
- **Free tier**: 500 pages per month
- **Paid tiers**: Pay per page processed
- Monitor usage to control costs
- Use standard extraction for simple documents
### Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="YOUR-ENDPOINT",
docintel_key="YOUR-KEY"
)
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
```
## LLM-Powered Image Descriptions
Generate detailed, contextual descriptions for images using large language models.
### Setup with OpenAI
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("image.jpg")
print(result.text_content)
```
### Supported Use Cases
**Images in documents:**
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# PowerPoint with images
result = md.convert("presentation.pptx")
# Word documents with images
result = md.convert("report.docx")
# Standalone images
result = md.convert("diagram.png")
```
### Custom Prompts
Customize the LLM prompt for specific needs:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# For diagrams
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)
# For charts
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)
# For UI screenshots
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)
# For scientific figures
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
```
### Model Selection
**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image
**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
```
### Configuration from Environment
```python
import os
from markitdown import MarkItDown
from openai import OpenAI
# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
client = OpenAI() # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
### Alternative LLM Providers
**Anthropic Claude:**
```python
from markitdown import MarkItDown
from anthropic import Anthropic
# Note: Check current compatibility with MarkItDown
client = Anthropic(api_key="YOUR-API-KEY")
# May require adapter for MarkItDown compatibility
```
**Azure OpenAI:**
```python
from markitdown import MarkItDown
from openai import AzureOpenAI
client = AzureOpenAI(
api_key="YOUR-AZURE-KEY",
api_version="2024-02-01",
azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
### Cost Management
**Strategies to reduce LLM costs:**
1. **Selective processing:**
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# Only use LLM for important documents
# (is_important_document and file are placeholders you supply yourself)
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing
result = md.convert(file)
```
2. **Image filtering:**
```python
# Pre-process to identify images that need descriptions
# Only use LLM for complex/important images
```
3. **Batch processing:**
```python
# Process multiple images in batches
# Monitor costs and set limits
```
4. **Model selection:**
```python
# Use gpt-4o-mini for simple images
# Reserve gpt-4o for complex visualizations
```
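Strategies 2 and 4 can be combined into a simple routing rule. The sketch below uses file size as a stand-in for image complexity; that heuristic, the 200 KB threshold, and the function name `choose_model` are illustrative assumptions, not MarkItDown behavior:

```python
import os


def choose_model(image_path, threshold_bytes=200_000):
    """Return a model name for an image, or None to skip the LLM entirely.

    Assumption: file size is only a rough proxy for visual complexity.
    """
    size = os.path.getsize(image_path)
    if size == 0:
        return None           # empty file: nothing to describe
    if size < threshold_bytes:
        return "gpt-4o-mini"  # cheaper model for small images
    return "gpt-4o"           # reserve the larger model for big images
```

The returned name can be passed as `llm_model`; a `None` result means falling back to `MarkItDown()` without an LLM client.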
### Performance Considerations
**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images
**Batch optimization:**
```python
from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
def process_image(image_path):
    return md.convert(image_path)
# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
```
## Combined Advanced Features
### Azure Document Intelligence + LLM Descriptions
Combine both for maximum quality:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
docintel_endpoint="YOUR-AZURE-ENDPOINT",
docintel_key="YOUR-AZURE-KEY"
)
# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")
```
**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data
### Smart Document Processing Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
def smart_convert(file_path):
    """Choose a processing method based on file type."""
    ext = os.path.splitext(file_path)[1].lower()
    # PDFs with complex tables: Use Azure
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
    # Documents/presentations with images: Use LLM
    elif ext in ['.pptx', '.docx']:
        # Create the client only here, so other formats work without an OpenAI key
        md = MarkItDown(
            llm_client=OpenAI(),
            llm_model="gpt-4o"
        )
    # Simple formats: Standard processing
    else:
        md = MarkItDown()
    return md.convert(file_path)
# Use it
result = smart_convert("document.pdf")
```
## Plugin System
MarkItDown supports custom plugins for extending functionality.
### Plugin Architecture
Plugins are disabled by default for security:
```python
from markitdown import MarkItDown
# Enable plugins
md = MarkItDown(enable_plugins=True)
```
### Creating Custom Plugins
**Plugin structure:**
```python
class CustomConverter:
    """Custom converter plugin for MarkItDown."""
    def can_convert(self, file_path):
        """Check if this plugin can handle the file."""
        return file_path.endswith('.custom')
    def convert(self, file_path):
        """Convert file to Markdown."""
        # Your conversion logic here
        return {
            'text_content': '# Converted Content\n\n...'
        }
```
### Plugin Registration
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True)
# Register custom plugin
md.register_plugin(CustomConverter())
# Use normally
result = md.convert("file.custom")
```
### Plugin Use Cases
**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats
**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing
**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs
### Plugin Security
**Important security considerations:**
- Plugins run with full system access
- Only enable for trusted plugins
- Validate plugin code before use
- Disable plugins in production unless required
## Error Handling for Advanced Features
```python
import os
from markitdown import MarkItDown
from openai import OpenAI
def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
        return md.convert(file_path)
    except Exception as azure_error:
        print(f"Azure failed: {azure_error}")
        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)
        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")
            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)
# Use it
result = robust_convert("document.pdf")
```
## Best Practices
### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed
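Handling quota limits gracefully usually means retrying with a growing delay before falling back. A minimal sketch, assuming any exception is worth retrying (the source does not name the exception types raised on quota errors); `retry_with_backoff` is a hypothetical helper, not part of MarkItDown:

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**n seconds, then retry.

    Re-raises the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

A call such as `retry_with_backoff(lambda: md.convert("document.pdf"))` would wrap a single conversion; the caller can still catch the final exception and fall back to `MarkItDown()` without Azure.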
### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible
- Handle API errors gracefully
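Caching can be as simple as keying the converted Markdown on a hash of the input file. The sketch below is one possible approach under stated assumptions: `cached_convert`, the cache directory name, and the `convert` callable are all hypothetical, and the key ignores the model and prompt, so the cache should be cleared when either changes:

```python
import hashlib
from pathlib import Path


def cached_convert(path, convert, cache_dir=".markitdown_cache"):
    """Return Markdown for path, reusing a cached copy when the file is unchanged.

    convert is any callable that takes a path and returns Markdown text.
    """
    data = Path(path).read_bytes()
    key = hashlib.sha256(data).hexdigest()  # content hash, not file name
    cache_file = Path(cache_dir) / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = convert(path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

For example, `cached_convert("diagram.png", lambda p: md.convert(p).text_content)` would call the LLM only the first time a given image is seen.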
### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies
### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access


@@ -0,0 +1,399 @@
# MarkItDown API Reference
## Core Classes
### MarkItDown
The main class for converting files to Markdown.
```python
from markitdown import MarkItDown
md = MarkItDown(
llm_client=None,
llm_model=None,
llm_prompt=None,
docintel_endpoint=None,
enable_plugins=False
)
```
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `llm_client` | OpenAI client | `None` | OpenAI-compatible client for AI image descriptions |
| `llm_model` | str | `None` | Model name (e.g., "anthropic/claude-sonnet-4.5") for image descriptions |
| `llm_prompt` | str | `None` | Custom prompt for image description |
| `docintel_endpoint` | str | `None` | Azure Document Intelligence endpoint |
| `enable_plugins` | bool | `False` | Enable 3rd-party plugins |
#### Methods
##### convert()
Convert a file to Markdown.
```python
result = md.convert(
source,
file_extension=None
)
```
**Parameters**:
- `source` (str): Path to the file to convert
- `file_extension` (str, optional): Override file extension detection
**Returns**: `DocumentConverterResult` object
**Example**:
```python
result = md.convert("document.pdf")
print(result.text_content)
```
##### convert_stream()
Convert from a file-like binary stream.
```python
result = md.convert_stream(
stream,
file_extension
)
```
**Parameters**:
- `stream` (BinaryIO): Binary file-like object (e.g., file opened in `"rb"` mode)
- `file_extension` (str): File extension to determine conversion method (e.g., ".pdf")
**Returns**: `DocumentConverterResult` object
**Example**:
```python
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
print(result.text_content)
```
**Important**: The stream must be opened in binary mode (`"rb"`), not text mode.
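A text-mode stream can be detected before it reaches the converter. A minimal sketch using only the standard library; `ensure_binary` is a hypothetical helper, not a MarkItDown function:

```python
import io


def ensure_binary(stream):
    """Return stream unchanged, or raise TypeError if it is a text-mode stream."""
    # Files opened with "r" and io.StringIO both derive from io.TextIOBase
    if isinstance(stream, io.TextIOBase):
        raise TypeError("expected a binary stream; open the file with mode 'rb'")
    return stream
```

It can wrap the argument directly, as in `md.convert_stream(ensure_binary(f), file_extension=".pdf")`.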
## Result Object
### DocumentConverterResult
The result of a conversion operation.
#### Attributes
| Attribute | Type | Description |
|-----------|------|-------------|
| `text_content` | str | The converted Markdown text |
| `title` | str | Document title (if available) |
#### Example
```python
result = md.convert("paper.pdf")
# Access content
content = result.text_content
# Access title (if available)
title = result.title
```
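Because `title` is only present when the source document provides one, code that uses it should handle the empty case. A small sketch, assuming a missing title shows up as `None` or an empty string; `save_markdown` is a hypothetical helper that works with any object exposing `text_content` and `title`:

```python
from pathlib import Path


def save_markdown(result, out_path):
    """Write result.text_content to out_path, adding the title as a heading if set."""
    body = result.text_content
    title = getattr(result, "title", None)
    # Only prepend a heading when there is a title and the body has none yet
    if title and not body.lstrip().startswith("#"):
        body = f"# {title}\n\n{body}"
    Path(out_path).write_text(body, encoding="utf-8")
    return out_path
```

Usage would look like `save_markdown(md.convert("paper.pdf"), "paper.md")`.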
## Custom Converters
You can create custom document converters by implementing the `DocumentConverter` interface.
### DocumentConverter Interface
```python
from markitdown import DocumentConverter
class CustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        """
        Convert a document from a binary stream.
        Parameters:
            stream (BinaryIO): Binary file-like object
            file_extension (str): File extension (e.g., ".custom")
        Returns:
            DocumentConverterResult: Conversion result
        """
        # Your conversion logic here
        pass
```
### Registering Custom Converters
```python
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
class MyCustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        content = stream.read().decode('utf-8')
        markdown_text = f"# Custom Format\n\n{content}"
        return DocumentConverterResult(
            text_content=markdown_text,
            title="Custom Document"
        )
# Create MarkItDown instance
md = MarkItDown()
# Register custom converter for .custom files
md.register_converter(".custom", MyCustomConverter())
# Use it
result = md.convert("myfile.custom")
```
## Plugin System
### Finding Plugins
Search GitHub for the `#markitdown-plugin` hashtag.
### Using Plugins
```python
from markitdown import MarkItDown
# Enable plugins
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```
### Creating Plugins
Plugins are Python packages that register converters with MarkItDown.
**Plugin Structure**:
```
my-markitdown-plugin/
├── setup.py
├── my_plugin/
│   ├── __init__.py
│   └── converter.py
└── README.md
```
**setup.py**:
```python
from setuptools import setup
setup(
name="markitdown-my-plugin",
version="0.1.0",
packages=["my_plugin"],
entry_points={
"markitdown.plugins": [
"my_plugin = my_plugin.converter:MyConverter",
],
},
)
```
**converter.py**:
```python
from markitdown import DocumentConverter, DocumentConverterResult
class MyConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        # Your conversion logic
        content = stream.read()
        markdown = self.process(content)
        return DocumentConverterResult(
            text_content=markdown,
            title="My Document"
        )
    def process(self, content):
        # Process content
        return "# Converted Content\n\n..."
```
## AI-Enhanced Conversions
### Using OpenRouter for Image Descriptions
```python
from markitdown import MarkItDown
from openai import OpenAI
# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
api_key="your-openrouter-api-key",
base_url="https://openrouter.ai/api/v1"
)
# Create MarkItDown with AI support
md = MarkItDown(
llm_client=client,
llm_model="anthropic/claude-sonnet-4.5", # recommended for scientific vision
llm_prompt="Describe this image in detail for scientific documentation"
)
# Convert files with images
result = md.convert("presentation.pptx")
```
### Available Models via OpenRouter
Popular models with vision support:
- `anthropic/claude-sonnet-4.5` - **Recommended for scientific vision**
- `anthropic/claude-opus-4.5` - Advanced vision model
- `openai/gpt-4o` - GPT-4 Omni
- `openai/gpt-4-vision` - GPT-4 Vision
- `google/gemini-pro-vision` - Gemini Pro Vision
See https://openrouter.ai/models for the complete list.
### Custom Prompts
```python
# For scientific diagrams
scientific_prompt = """
Analyze this scientific diagram or chart. Describe:
1. The type of visualization (graph, chart, diagram, etc.)
2. Key data points or trends
3. Labels and axes
4. Scientific significance
Be precise and technical.
"""
md = MarkItDown(
llm_client=client,
llm_model="anthropic/claude-sonnet-4.5",
llm_prompt=scientific_prompt
)
```
## Azure Document Intelligence
### Setup
1. Create Azure Document Intelligence resource
2. Get endpoint URL
3. Set authentication
### Usage
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
)
result = md.convert("complex_document.pdf")
```
### Authentication
Set environment variables:
```bash
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
```
Or pass credentials programmatically.
## Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid file format: {e}")
except Exception as e:
    print(f"Conversion error: {e}")
```
## Performance Tips
### 1. Reuse MarkItDown Instance
```python
# Good: Create once, use many times
md = MarkItDown()
for file in files:
    result = md.convert(file)
    process(result)
```
### 2. Use Streaming for Large Files
```python
# For large files
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```
### 3. Batch Processing
```python
from concurrent.futures import ThreadPoolExecutor
md = MarkItDown()
def convert_file(filepath):
    return md.convert(filepath)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(convert_file, file_list)
```
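With `executor.map`, the first exception raised by a worker surfaces while iterating over the results and stops the loop, so one bad file can hide the rest. The sketch below records failures per file instead. `convert_many` is a hypothetical helper, and sharing one converter across threads is an assumption the source does not confirm:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_many(paths, convert, max_workers=4):
    """Convert paths concurrently; return (results, errors) dicts keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(convert, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

A call such as `results, errors = convert_many(file_list, md.convert)` leaves the successful conversions in `results` and the exceptions in `errors`.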
## Breaking Changes (v0.0.1 to v0.1.0)
1. **Dependencies**: Now organized into optional feature groups
```bash
# Old
pip install markitdown
# New
pip install 'markitdown[all]'
```
2. **convert_stream()**: Now requires binary file-like object
```python
# Old (also accepted text)
with open("file.pdf", "r") as f:  # text mode
    result = md.convert_stream(f)
# New (binary only)
with open("file.pdf", "rb") as f:  # binary mode
    result = md.convert_stream(f, file_extension=".pdf")
```
3. **DocumentConverter Interface**: Changed to read from streams instead of file paths
- No temporary files created
- More memory efficient
- Plugins need updating
## Version Compatibility
- **Python**: 3.10 or higher required
- **Dependencies**: Check `setup.py` for version constraints
- **OpenAI**: Compatible with OpenAI Python SDK v1.0+
## Environment Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `OPENROUTER_API_KEY` | OpenRouter API key for image descriptions | `sk-or-v1-...` |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure DI authentication | `key123...` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure DI endpoint | `https://...` |
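The OpenAI SDK does not pick up `OPENROUTER_API_KEY` on its own, so the key has to be read and passed explicitly. A minimal sketch, assuming the base URL shown in the OpenRouter example of this reference; `openrouter_client_kwargs` is a hypothetical helper:

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_client_kwargs(env=None):
    """Return keyword arguments for an OpenAI-compatible client pointed at OpenRouter."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {"api_key": api_key, "base_url": OPENROUTER_BASE_URL}
```

The client can then be created with `OpenAI(**openrouter_client_kwargs())` and passed to `MarkItDown` as `llm_client`.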


@@ -1,273 +0,0 @@
# Document Conversion Reference
This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
## PDF Files
PDF conversion extracts text, tables, and structure from PDF documents.
### Basic PDF Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
### PDF with Azure Document Intelligence
For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```
**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
### PDF Handling Notes
- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible
## Word Documents (DOCX)
Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
### Basic DOCX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```
### DOCX Structure Preservation
MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
### Command-Line Usage
```bash
# Basic conversion
markitdown report.docx -o report.md
# With output directory
markitdown report.docx -o output/report.md
```
### DOCX with Images
To generate descriptions for images in Word documents, use LLM integration:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```
## PowerPoint Presentations (PPTX)
PowerPoint conversion extracts text from slides while preserving structure.
### Basic PPTX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```
### PPTX Structure
MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
### PPTX with Image Descriptions
Presentations often contain important visual information. Use LLM integration to describe images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```
**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"
## Excel Spreadsheets (XLSX, XLS)
Excel conversion formats spreadsheet data as Markdown tables.
### Basic XLSX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```
### Multi-Sheet Workbooks
For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)
### XLSX Conversion Details
**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)
**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes
### Large Spreadsheets
For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
### XLS (Legacy Excel) Files
Legacy `.xls` files are supported but require additional dependencies:
```bash
pip install 'markitdown[xls]'
```
Then use normally:
```python
md = MarkItDown()
result = md.convert("legacy_data.xls")
```
## Common Document Conversion Patterns
### Batch Document Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
# Process all documents in a directory
os.makedirs("markdown", exist_ok=True)  # make sure the output directory exists
for filename in os.listdir("documents"):
    if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(f"documents/{filename}")
        # Save to output directory
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(f"markdown/{output_name}", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
### Document with Mixed Content
For documents containing multiple types of content (text, tables, images):
```python
from markitdown import MarkItDown
from openai import OpenAI
# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
docintel_endpoint="YOUR-ENDPOINT",
docintel_key="YOUR-KEY"
)
result = md.convert("complex_report.pdf")
```
### Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```
## Output Quality Tips
**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality
## Performance Considerations
**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources
**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents


@@ -0,0 +1,542 @@
# File Format Support
This document provides detailed information about each file format supported by MarkItDown.
## Document Formats
### PDF (.pdf)
**Capabilities**:
- Text extraction
- Table detection
- Metadata extraction
- OCR for scanned documents (with dependencies)
**Dependencies**:
```bash
pip install 'markitdown[pdf]'
```
**Best For**:
- Scientific papers
- Reports
- Books
- Forms
**Limitations**:
- Complex layouts may not preserve perfect formatting
- Scanned PDFs require OCR setup
- Some PDF features (annotations, forms) may not convert
**Example**:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
```
**Enhanced with Azure Document Intelligence**:
```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
```
---
### Microsoft Word (.docx)
**Capabilities**:
- Text extraction
- Table conversion
- Heading hierarchy
- List formatting
- Basic text formatting (bold, italic)
**Dependencies**:
```bash
pip install 'markitdown[docx]'
```
**Best For**:
- Research papers
- Reports
- Documentation
- Manuscripts
**Preserved Elements**:
- Headings (converted to Markdown headers)
- Tables (converted to Markdown tables)
- Lists (bulleted and numbered)
- Basic formatting (bold, italic)
- Paragraphs
**Example**:
```python
result = md.convert("manuscript.docx")
```
---
### PowerPoint (.pptx)
**Capabilities**:
- Slide content extraction
- Speaker notes
- Table extraction
- Image descriptions (with AI)
**Dependencies**:
```bash
pip install 'markitdown[pptx]'
```
**Best For**:
- Presentations
- Lecture slides
- Conference talks
**Output Format**:
```markdown
# Slide 1: Title
Content from slide 1...
**Notes**: Speaker notes appear here
---
# Slide 2: Next Topic
...
```
**With AI Image Descriptions**:
```python
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
```
---
### Excel (.xlsx, .xls)
**Capabilities**:
- Sheet extraction
- Table formatting
- Data preservation
- Formula values (calculated)
**Dependencies**:
```bash
pip install 'markitdown[xlsx]' # Modern Excel
pip install 'markitdown[xls]' # Legacy Excel
```
**Best For**:
- Data tables
- Research data
- Statistical results
- Experimental data
**Output Format**:
```markdown
# Sheet: Results
| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |
```
**Example**:
```python
result = md.convert("experimental_data.xlsx")
```
---
## Image Formats
### Images (.jpg, .jpeg, .png, .gif, .webp)
**Capabilities**:
- EXIF metadata extraction
- OCR text extraction
- AI-powered image descriptions
**Dependencies**:
```bash
pip install 'markitdown[all]' # Includes image support
```
**Best For**:
- Scanned documents
- Charts and graphs
- Scientific diagrams
- Photographs with text
**Output Without AI**:
```markdown
![Image](image.jpg)
**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
```
**Output With AI**:
```python
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
```
**OCR for Text Extraction**:
Requires Tesseract OCR:
```bash
# macOS
brew install tesseract
# Ubuntu
sudo apt-get install tesseract-ocr
```
---
## Audio Formats
### Audio (.wav, .mp3)
**Capabilities**:
- Metadata extraction
- Speech-to-text transcription
- Duration and technical info
**Dependencies**:
```bash
pip install 'markitdown[audio-transcription]'
```
**Best For**:
- Lecture recordings
- Interviews
- Podcasts
- Meeting recordings
**Output Format**:
```markdown
# Audio: interview.mp3
**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz
**Transcription**:
[Transcribed text appears here...]
```
**Example**:
```python
result = md.convert("lecture.mp3")
```
---
## Web Formats
### HTML (.html, .htm)
**Capabilities**:
- Clean HTML to Markdown conversion
- Link preservation
- Table conversion
- List formatting
**Best For**:
- Web pages
- Documentation
- Blog posts
- Online articles
**Output Format**: Clean Markdown with preserved links and structure
**Example**:
```python
result = md.convert("webpage.html")
```
---
### YouTube URLs
**Capabilities**:
- Fetch video transcriptions
- Extract video metadata
- Caption download
**Dependencies**:
```bash
pip install 'markitdown[youtube-transcription]'
```
**Best For**:
- Educational videos
- Lectures
- Talks
- Tutorials
**Example**:
```python
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
```
---
## Data Formats
### CSV (.csv)
**Capabilities**:
- Automatic table conversion
- Delimiter detection
- Header preservation
**Output Format**: Markdown tables
**Example**:
```python
result = md.convert("data.csv")
```
**Output**:
```markdown
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
```
---
### JSON (.json)
**Capabilities**:
- Structured representation
- Pretty formatting
- Nested data visualization
**Best For**:
- API responses
- Configuration files
- Data exports
**Example**:
```python
result = md.convert("data.json")
```
---
### XML (.xml)
**Capabilities**:
- Structure preservation
- Attribute extraction
- Formatted output
**Best For**:
- Configuration files
- Data interchange
- Structured documents
**Example**:
```python
result = md.convert("config.xml")
```
---
## Archive Formats
### ZIP (.zip)
**Capabilities**:
- Iterates through archive contents
- Converts each file individually
- Maintains directory structure in output
**Best For**:
- Document collections
- Project archives
- Batch conversions
**Output Format**:
```markdown
# Archive: documents.zip
## File: document1.pdf
[Content from document1.pdf...]
---
## File: document2.docx
[Content from document2.docx...]
```
**Example**:
```python
result = md.convert("archive.zip")
```
---
## E-book Formats
### EPUB (.epub)
**Capabilities**:
- Full text extraction
- Chapter structure
- Metadata extraction
**Best For**:
- E-books
- Digital publications
- Long-form content
**Output Format**: Markdown with preserved chapter structure
**Example**:
```python
result = md.convert("book.epub")
```
---
## Other Formats
### Outlook Messages (.msg)
**Capabilities**:
- Email content extraction
- Attachment listing
- Metadata (from, to, subject, date)
**Dependencies**:
```bash
pip install 'markitdown[outlook]'
```
**Best For**:
- Email archives
- Communication records
**Example**:
```python
result = md.convert("message.msg")
```
---
## Format-Specific Tips
### PDF Best Practices
1. **Use Azure Document Intelligence for complex layouts**:
```python
md = MarkItDown(docintel_endpoint="endpoint_url")
```
2. **For scanned PDFs, ensure OCR is set up**:
```bash
brew install tesseract # macOS
```
3. **Split very large PDFs before conversion** for better performance
### PowerPoint Best Practices
1. **Use AI for visual content**:
```python
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
2. **Check speaker notes** - they're included in output
3. **Complex animations won't be captured** - static content only
### Excel Best Practices
1. **Large spreadsheets** may take time to convert
2. **Formulas are converted to their calculated values**
3. **Multiple sheets** are all included in output
4. **Charts become text descriptions** (use AI for better descriptions)
### Image Best Practices
1. **Use AI for meaningful descriptions**:
```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)
```
2. **For text-heavy images, ensure OCR dependencies** are installed
3. **High-resolution images** may take longer to process
### Audio Best Practices
1. **Clear audio** produces better transcriptions
2. **Long recordings** may take significant time
3. **Consider splitting long audio files** for faster processing
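Splitting a long recording can be sketched with the standard-library `wave` module. This is a minimal sketch that assumes uncompressed WAV input; the helper name `split_wav`, the `prefix` argument, and the 5-minute default are illustrative and not part of MarkItDown.

```python
import wave


def split_wav(path, chunk_seconds=300, prefix="chunk"):
    """Split a WAV file into consecutive chunks and return the chunk filenames."""
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            out_name = f"{prefix}_{index:03d}.wav"
            with wave.open(out_name, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, and rate
                dst.writeframes(frames)  # header frame count is fixed up on close
            outputs.append(out_name)
            index += 1
    return outputs
```

Each returned filename can then be passed to `md.convert(...)` on its own, and the transcripts concatenated in order.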
---
## Unsupported Formats
If you need to convert an unsupported format:
1. **Create a custom converter** (see `api_reference.md`)
2. **Look for plugins** on GitHub (#markitdown-plugin)
3. **Pre-convert to supported format** (e.g., convert .rtf to .docx)
---
## Format Detection
MarkItDown automatically detects format from:
1. **File extension** (primary method)
2. **MIME type** (fallback)
3. **File signature** (magic bytes, fallback)
**Override detection**:
```python
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")
# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```
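For files without an extension, the override value can be chosen by inspecting the file signature yourself. The sketch below is an illustration only: `guess_extension` and the small `MAGIC_SIGNATURES` table are hypothetical names, the table covers just a few well-known signatures, and it does not reproduce MarkItDown's own detection logic.

```python
# A few well-known file signatures (magic bytes) mapped to extensions.
MAGIC_SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"PK\x03\x04", ".zip"),  # also the container for DOCX/XLSX/PPTX/EPUB
]


def guess_extension(header: bytes, default=None):
    """Return an extension for the leading bytes of a file, or `default`."""
    for signature, extension in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return extension
    return default
```

In practice you would read the first 16 bytes or so of the file in binary mode, call `guess_extension` on them, and pass the result as `file_extension`.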


@@ -1,365 +0,0 @@
# Media Processing Reference
This document provides detailed information about processing images and audio files with MarkItDown.
## Image Processing
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
### Basic Image Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```
### Image Processing Features
**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)
### EXIF Metadata Extraction
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```
**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
### OCR (Optical Character Recognition)
Extract text from images containing text (screenshots, scanned documents, photos of text):
**Requirements:**
- Install tesseract OCR engine:
```bash
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```
**Usage:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content) # Contains OCR'd text
```
**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
### LLM-Generated Image Descriptions
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```
**Custom prompts for specific needs:**
```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```
### Supported Image Formats
MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
## Audio Processing
MarkItDown can transcribe audio files to text using speech recognition.
### Basic Audio Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content) # Transcribed speech
```
### Audio Transcription Setup
**Installation:**
```bash
pip install 'markitdown[audio-transcription]'
```
This installs the `speech_recognition` library and dependencies.
### Supported Audio Formats
- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
### Audio Transcription Engines
MarkItDown uses the `speech_recognition` library, which supports multiple backends:
**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
```
**Note:** The default Google Speech Recognition backend requires an internet connection.
### Audio Quality Considerations
For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
### Audio Preprocessing Tips
For better results, consider preprocessing audio:
```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize
# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")
# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```
## Combined Media Workflows
### Processing Multiple Images in Batch
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Process all images in directory
os.makedirs("output", exist_ok=True)  # ensure the output directory exists
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")
        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
### Screenshot Analysis Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []
for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })
# Now ready for further processing
```
### Document Images with OCR
For scanned documents or photos of documents:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []
for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)
# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```
### Presentation Slide Images
When you have presentation slides as images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)
# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```
## Error Handling
### Image Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```
### Audio Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```
## Performance Optimization
### Image Processing
- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel
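Parallel batch processing can be sketched with a thread pool. This is a hedged sketch: `convert_many` is a hypothetical helper, and it assumes the converter's `convert` method is safe to call from several threads, which this reference does not confirm. If in doubt, create one `MarkItDown` instance per worker.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(convert, paths, max_workers=4):
    """Run `convert` over `paths` concurrently.

    Returns a dict mapping each path to its Markdown text, or to the
    exception raised while converting it, so one failure does not
    abort the whole batch.
    """
    def safe_convert(path):
        try:
            return convert(path).text_content
        except Exception as exc:  # keep going; report the error per file
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(safe_convert, paths)))
```

A call such as `convert_many(md.convert, ["a.png", "b.png"])` would return the converted text keyed by filename. Threads help most when the time is spent waiting on an LLM or network call.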
### Audio Processing
- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: WAV/FLAC are read directly, while MP3/OGG must be decoded first (via ffmpeg or libav), so they are slower
- **Network dependency**: Default transcription requires internet
## Use Cases
### Document Digitization
Convert scanned documents or photos of documents to searchable text.
### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.
### Presentation Analysis
Extract content from presentation slide images.
### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.
### Image Archiving
Extract metadata and content from photo collections.
### Accessibility
Generate alt-text descriptions for images using LLM integration.
### Data Extraction
OCR text from images containing tables, forms, or structured data.


@@ -1,575 +0,0 @@
# Structured Data Handling Reference
This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
## CSV Files
Convert CSV (Comma-Separated Values) files to Markdown tables.
### Basic CSV Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)
```
### CSV to Markdown Table
CSV files are automatically converted to Markdown table format:
**Input CSV (`data.csv`):**
```csv
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```
**Output Markdown:**
```markdown
| Name | Age | City |
|---------|-----|-------------|
| Alice | 30 | New York |
| Bob | 25 | Los Angeles |
| Charlie | 35 | Chicago |
```
### CSV Conversion Features
**What's preserved:**
- All column headers
- All data rows
- Cell values (text and numbers)
- Column structure
**Formatting:**
- Headers become the table's header row (rendered bold by most Markdown viewers)
- Columns are aligned
- Empty cells are preserved
- Special characters are escaped
### Large CSV Files
For large CSV files:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert large CSV
result = md.convert("large_dataset.csv")
# Save to file instead of printing
with open("output.md", "w") as f:
    f.write(result.text_content)
```
**Performance considerations:**
- Very large files may take time to process
- Consider previewing first few rows for testing
- Memory usage scales with file size
- Very wide tables may not display well in all Markdown viewers
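Previewing the first few rows can be done by writing a truncated copy of the CSV and converting that instead. The helper below is a minimal sketch using only the standard library; `write_csv_preview` and its parameters are hypothetical names, and it assumes a UTF-8 file whose first row is the header.

```python
import csv
from itertools import islice


def write_csv_preview(src_path, dst_path, max_rows=20):
    """Copy the header plus the first `max_rows` data rows to `dst_path`.

    Returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        count = 0
        # +1 so the header row does not count against max_rows
        for row in islice(csv.reader(src), max_rows + 1):
            writer.writerow(row)
            count += 1
    return max(count - 1, 0)
```

The preview file can then be passed to `md.convert(...)` to check the table layout before converting the full dataset.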
### CSV with Special Characters
CSV files containing special characters are handled automatically:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles UTF-8, special characters, quotes, etc.
result = md.convert("international_data.csv")
```
### CSV Delimiters
Standard CSV delimiters are supported:
- Comma (`,`) - standard
- Semicolon (`;`) - common in European formats
- Tab (`\t`) - TSV files
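If you are unsure which delimiter a file uses, you can check it before conversion. The sketch below uses the standard library's `csv.Sniffer` to detect the delimiter and rewrites the file as comma-separated; `normalize_delimiter` is a hypothetical helper, the candidate set is an assumption, and `csv.Sniffer` raises `csv.Error` when it cannot decide.

```python
import csv


def normalize_delimiter(src_path, dst_path, candidates=",;\t"):
    """Detect the delimiter of `src_path` and write a comma-separated copy.

    Returns the detected delimiter.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        sample = src.read(4096)  # a small sample is enough for sniffing
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
        src.seek(0)
        rows = list(csv.reader(src, dialect))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(rows)
    return dialect.delimiter
```

This loads the whole file into memory, so it suits small and medium files; for very large files, stream the rows instead of building a list.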
### Command-Line CSV Conversion
```bash
# Basic conversion
markitdown data.csv -o data.md
# Multiple CSV files
for file in *.csv; do
    markitdown "$file" -o "${file%.csv}.md"
done
```
## JSON Files
Convert JSON data to readable Markdown format.
### Basic JSON Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.json")
print(result.text_content)
```
### JSON Formatting
JSON is converted to a readable, structured Markdown format:
**Input JSON (`config.json`):**
```json
{
  "name": "MyApp",
  "version": "1.0.0",
  "dependencies": {
    "library1": "^2.0.0",
    "library2": "^3.1.0"
  },
  "features": ["auth", "api", "database"]
}
```
**Output Markdown:**
```markdown
## Configuration
**name:** MyApp
**version:** 1.0.0
### dependencies
- **library1:** ^2.0.0
- **library2:** ^3.1.0
### features
- auth
- api
- database
```
### JSON Array Handling
JSON arrays are converted to lists or tables:
**Array of objects:**
```json
[
{"id": 1, "name": "Alice", "active": true},
{"id": 2, "name": "Bob", "active": false}
]
```
**Converted to table:**
```markdown
| id | name | active |
|----|-------|--------|
| 1 | Alice | true |
| 2 | Bob | false |
```
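The mapping from an array of objects to a table can be illustrated with a few lines of standard-library Python. This is a sketch of the idea only, not MarkItDown's implementation; `records_to_markdown_table` is a hypothetical helper that assumes flat objects (no nested values).

```python
import json


def records_to_markdown_table(records):
    """Render a list of flat dicts as a Markdown table."""
    # Column order follows the order in which keys are first seen.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    def cell(value):
        # Strings are shown as-is; other values keep their JSON spelling
        # (true, false, null, numbers). Pipes are escaped for table safety.
        text = value if isinstance(value, str) else json.dumps(value)
        return text.replace("|", "\\|")

    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for record in records:
        row = " | ".join(cell(record.get(column, "")) for column in columns)
        lines.append("| " + row + " |")
    return "\n".join(lines)
```

Missing keys become empty cells, which keeps rows aligned when objects do not all share the same fields.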
### Nested JSON Structures
Nested JSON is converted with appropriate indentation and hierarchy:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles deeply nested structures
result = md.convert("complex_config.json")
print(result.text_content)
```
### JSON Lines (JSONL)
For JSON Lines format (one JSON object per line):
```python
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read JSONL file
with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # Write each object to a temporary JSON file
        with open("temp.json", "w", encoding="utf-8") as temp:
            json.dump(obj, temp)
        result = md.convert("temp.json")
        print(result.text_content)
        print("\n---\n")
```
### Large JSON Files
For large JSON files:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert large JSON
result = md.convert("large_data.json")
# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)
```
## XML Files
Convert XML documents to structured Markdown.
### Basic XML Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
print(result.text_content)
```
### XML Structure Preservation
XML is converted to Markdown maintaining hierarchical structure:
**Input XML (`book.xml`):**
```xml
<?xml version="1.0"?>
<book>
  <title>Example Book</title>
  <author>John Doe</author>
  <chapters>
    <chapter id="1">
      <title>Introduction</title>
      <content>Chapter 1 content...</content>
    </chapter>
    <chapter id="2">
      <title>Background</title>
      <content>Chapter 2 content...</content>
    </chapter>
  </chapters>
</book>
```
**Output Markdown:**
```markdown
# book
## title
Example Book
## author
John Doe
## chapters
### chapter (id: 1)
#### title
Introduction
#### content
Chapter 1 content...
### chapter (id: 2)
#### title
Background
#### content
Chapter 2 content...
```
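The element-to-heading mapping shown above can be sketched with `xml.etree.ElementTree`. This is an illustration of the idea under stated assumptions, not MarkItDown's own converter: `element_to_markdown` is a hypothetical helper, heading depth is capped at six levels, and text that follows a child element (tail text) is ignored.

```python
import xml.etree.ElementTree as ET


def element_to_markdown(element, depth=1):
    """Return Markdown lines for `element` and its descendants."""
    attributes = ", ".join(f"{k}: {v}" for k, v in element.attrib.items())
    heading = "#" * min(depth, 6) + " " + element.tag
    if attributes:
        heading += f" ({attributes})"  # attributes shown as (attr: value)
    lines = [heading]
    text = (element.text or "").strip()
    if text:
        lines.append(text)
    for child in element:  # nesting level becomes heading level
        lines.extend(element_to_markdown(child, depth + 1))
    return lines


def xml_to_markdown(xml_text):
    """Convert an XML string to Markdown text."""
    return "\n".join(element_to_markdown(ET.fromstring(xml_text)))
```

Parsing only trusted XML is advisable here, since the standard-library parser is not hardened against maliciously constructed input.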
### XML Attributes
XML attributes are preserved in the conversion:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
# Attributes shown as (attr: value) in headings
```
### XML Namespaces
XML namespaces are handled:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles xmlns and namespaced elements
result = md.convert("namespaced.xml")
```
### XML Use Cases
**Configuration files:**
- Convert XML configs to readable format
- Document system configurations
- Compare configuration files
**Data interchange:**
- Convert XML API responses
- Process XML data feeds
- Transform between formats
**Document processing:**
- Convert DocBook to Markdown
- Process SVG descriptions
- Extract structured data
## Structured Data Workflows
### CSV Data Analysis Pipeline
```python
from markitdown import MarkItDown
import pandas as pd
md = MarkItDown()
# Read CSV for analysis
df = pd.read_csv("data.csv")
# Do analysis
summary = df.describe()
# Convert both to Markdown
original = md.convert("data.csv")
# Save summary as CSV then convert
summary.to_csv("summary.csv")
summary_md = md.convert("summary.csv")
print("## Original Data\n")
print(original.text_content)
print("\n## Statistical Summary\n")
print(summary_md.text_content)
```
### JSON API Documentation
```python
from markitdown import MarkItDown
import requests
import json
md = MarkItDown()
# Fetch JSON from API
response = requests.get("https://api.example.com/data", timeout=30)
response.raise_for_status()  # stop early on HTTP errors
data = response.json()
# Save as JSON
with open("api_response.json", "w") as f:
    json.dump(data, f, indent=2)
# Convert to Markdown
result = md.convert("api_response.json")
# Create documentation
doc = f"""# API Response Documentation
## Endpoint
GET https://api.example.com/data
## Response
{result.text_content}
"""
with open("api_docs.md", "w") as f:
    f.write(doc)
```
### XML to Markdown Documentation
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert XML documentation
xml_files = ["config.xml", "schema.xml", "data.xml"]
for xml_file in xml_files:
    result = md.convert(xml_file)
    output_name = xml_file.replace('.xml', '.md')
    with open(f"docs/{output_name}", "w") as f:
        f.write(result.text_content)
```
### Multi-Format Data Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
def convert_structured_data(directory):
    """Convert all structured data files in directory."""
    extensions = {'.csv', '.json', '.xml'}
    os.makedirs("markdown", exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(directory):
        stem, ext = os.path.splitext(filename)
        if ext.lower() in extensions:
            input_path = os.path.join(directory, filename)
            result = md.convert(input_path)
            # Save Markdown
            output_name = stem + '.md'
            output_path = os.path.join("markdown", output_name)
            with open(output_path, 'w') as f:
                f.write(result.text_content)
            print(f"Converted: {filename} → {output_name}")

# Process all structured data
convert_structured_data("data")
```
### CSV to JSON to Markdown
```python
import pandas as pd
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read CSV
df = pd.read_csv("data.csv")
# Convert to JSON
json_data = df.to_dict(orient='records')
with open("temp.json", "w") as f:
    json.dump(json_data, f, indent=2)
# Convert JSON to Markdown
result = md.convert("temp.json")
print(result.text_content)
```
### Database Export to Markdown
```python
from markitdown import MarkItDown
import sqlite3
import csv
md = MarkItDown()
# Export database query to CSV
conn = sqlite3.connect("database.db")
cursor = conn.execute("SELECT * FROM users")
with open("users.csv", "w", newline='') as f:
    writer = csv.writer(f)
    # Header row from the cursor's column names
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor.fetchall())
conn.close()
# Convert to Markdown
result = md.convert("users.csv")
print(result.text_content)
```
## Error Handling
### CSV Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.csv")
    print(result.text_content)
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"CSV conversion error: {e}")
    # Common issues: encoding problems, malformed CSV, delimiter issues
```
### JSON Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.json")
    print(result.text_content)
except Exception as e:
    print(f"JSON conversion error: {e}")
    # Common issues: invalid JSON syntax, encoding issues
```
### XML Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.xml")
    print(result.text_content)
except Exception as e:
    print(f"XML conversion error: {e}")
    # Common issues: malformed XML, encoding problems, namespace issues
```
## Best Practices
### CSV Processing
- Check delimiter before conversion
- Verify encoding (UTF-8 recommended)
- Handle large files with streaming if needed
- Preview output for very wide tables
### JSON Processing
- Validate JSON before conversion
- Consider pretty-printing complex structures
- Resolve circular references before serializing in-memory objects to JSON (a JSON file itself cannot contain them)
- Be aware of large array performance
### XML Processing
- Validate XML structure first
- Handle namespaces consistently
- Consider XPath for selective extraction
- Be mindful of very deep nesting
### Data Quality
- Clean data before conversion when possible
- Handle missing values appropriately
- Verify special character handling
- Test with representative samples
### Performance
- Process large files in batches
- Use streaming for very large datasets
- Monitor memory usage
- Cache converted results when appropriate
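Caching converted results can be sketched as a small on-disk cache keyed by the file's content hash, so a file is reconverted only when its bytes change. The names `convert_cached` and `md_cache` are hypothetical, and the sketch assumes the converter returns an object with a `text_content` attribute, as in the examples above.

```python
import hashlib
from pathlib import Path


def convert_cached(convert, path, cache_dir="md_cache"):
    """Return Markdown for `path`, reusing a cached copy when one exists."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")  # cache hit
    text = convert(path).text_content                   # cache miss: convert
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Usage would look like `convert_cached(md.convert, "data.csv")`. Hashing reads the whole file into memory, so for very large inputs a chunked hash is a better fit.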


@@ -1,478 +0,0 @@
# Web Content Extraction Reference
This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
## HTML Conversion
Convert HTML files and web pages to clean Markdown format.
### Basic HTML Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```
### HTML Processing Features
**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)
**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments
### HTML from URLs
Convert web pages directly from URLs:
```python
from markitdown import MarkItDown
import requests
md = MarkItDown()
# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)
result = md.convert("temp.html")
print(result.text_content)
```
### Clean Web Article Extraction
For extracting main content from web articles:
```python
from markitdown import MarkItDown
import requests
from readability import Document # pip install readability-lxml
md = MarkItDown()
# Fetch page
url = "https://example.com/article"
response = requests.get(url)
# Extract main content
doc = Document(response.content)
html_content = doc.summary()
# Save and convert
with open("article.html", "w", encoding="utf-8") as f:
    f.write(html_content)
result = md.convert("article.html")
print(result.text_content)
```
### HTML with Images
HTML files containing images can be enhanced with LLM descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```
## YouTube Transcripts
Extract video transcripts from YouTube videos.
### Basic YouTube Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
### YouTube Installation
```bash
pip install 'markitdown[youtube-transcription]'
```
This installs the `youtube-transcript-api` dependency.
### YouTube URL Formats
MarkItDown supports various YouTube URL formats:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`
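In batch pipelines it can help to normalize these variants to a single canonical form before conversion. The sketch below extracts the video ID with `urllib.parse`; `extract_video_id` and `canonical_watch_url` are hypothetical helpers, and the sketch does not claim that MarkItDown needs or performs this normalization itself.

```python
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/embed/"):
            return parsed.path.split("/")[2] or None
    return None


def canonical_watch_url(url):
    """Rewrite any recognized YouTube URL to the standard watch form."""
    video_id = extract_video_id(url)
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None
```

The canonical URL also makes a convenient key for storing transcripts, since different URL forms of the same video map to one entry.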
### YouTube Transcript Features
**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages
**Transcript languages:**
```python
from markitdown import MarkItDown
md = MarkItDown()
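# Request a transcript in a specific language (if available).
# Language codes: 'en', 'es', 'fr', 'de', etc.
# Assumption: `youtube_transcript_languages` is the option read by MarkItDown's
# YouTube converter in recent releases; verify it against your installed version.
result = md.convert(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    youtube_transcript_languages=["es", "en"],
)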
```
### YouTube Playlist Processing
Process multiple videos from a playlist:
```python
from markitdown import MarkItDown
md = MarkItDown()
video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]
transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```
### YouTube Use Cases
**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases
**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content
**Accessibility:**
- Generate text versions of video content
- Create searchable video archives
### YouTube Limitations
- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing
## EPUB Books
Convert EPUB e-books to Markdown format.
### Basic EPUB Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```
### EPUB Processing Features
**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Tables of contents
- Footnotes and references
**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables
### EPUB with Images
EPUB files often contain images (covers, diagrams, illustrations):
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```
### EPUB Use Cases
**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries
**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts
**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech
## RSS Feeds
Process RSS feeds to extract article content.
### Basic RSS Processing
```python
from markitdown import MarkItDown
import feedparser
md = MarkItDown()
# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")
# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(entry.summary)
    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```
## Combined Web Content Workflows
### Web Scraping Pipeline
```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup
md = MarkItDown()
def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract main content
    main_content = soup.find('article') or soup.find('main')
    if main_content:
        # Save HTML
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(str(main_content))
        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content
    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```
### YouTube Learning Content Extraction
```python
from markitdown import MarkItDown
md = MarkItDown()
# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]
course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")
# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w") as f:
    f.write(full_course)
```
### Documentation Scraping
```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin, urlparse
md = MarkItDown()
def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []
    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)
        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })
    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```
### EPUB Library Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)
            try:
                result = md.convert(epub_path)
                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)
                with open(output_full, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)
                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```
## Error Handling
### HTML Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```
### YouTube Transcript Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: No transcript available, video unavailable, network error
```
### EPUB Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: Corrupted file, unsupported DRM, invalid format
```
## Best Practices
### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately
- Remove unnecessary markup
### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching
- Respect YouTube's terms of service
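Handling rate limits gracefully usually means retrying with increasing delays. The helper below is a minimal sketch of exponential backoff; `convert_with_retry` is a hypothetical name, the delay values are assumptions to tune, and it retries on any exception because this reference does not list the specific errors raised on rate limiting.

```python
import time


def convert_with_retry(convert, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `convert(url)` with exponential backoff between failed attempts.

    Returns the converted text, or re-raises the last error once all
    attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return convert(url).text_content
        except Exception as exc:  # broad on purpose; narrow it if you know the error types
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise last_error
```

Combined with stored transcripts, this keeps bulk jobs from re-fetching videos that already succeeded. The `sleep` parameter exists so the delay can be replaced in tests.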
### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first
### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service
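The first three points can be sketched with the standard library alone. This is an illustrative sketch, not a complete crawler: `allowed_urls`, `polite_iter`, and the `USER_AGENT` string are hypothetical, and the robots.txt rules are passed in as lines so the example runs without network access (in practice `RobotFileParser.set_url` and `read` would fetch them).

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; use a name and contact address of your own.
USER_AGENT = "example-docs-bot/0.1 (contact@example.com)"


def allowed_urls(robots_lines, urls, user_agent=USER_AGENT):
    """Return only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_iter(urls, delay_seconds=1.0, sleep=time.sleep):
    """Yield URLs one at a time, pausing between consecutive items."""
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # no delay before the first request
        yield url
```

When fetching with `requests`, the same identifier can be sent via `headers={"User-Agent": USER_AGENT}` so site operators can see who is making the requests.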