Update all the latest writing skills

2026-01-26 16:58:56 +08:00 · 2025-12-12 11:42:41 -08:00
parent c85faf039a
commit cf1d4aac5d
30 changed files with 5895 additions and 2983 deletions
--- a/scientific-skills/markitdown/SKILL.md
+++ b/scientific-skills/markitdown/SKILL.md
@@ -1,241 +1,486 @@
 ---
 name: markitdown
-description: Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.
+description: "Convert files and office documents to Markdown. Supports PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPUBs and more."
+allowed-tools: [Read, Write, Edit, Bash]
+license: MIT
+source: https://github.com/microsoft/markitdown
 ---

-# MarkItDown
+# MarkItDown - File to Markdown Conversion

 ## Overview

-MarkItDown is a Python utility that converts various file formats into Markdown format, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.
+MarkItDown is a Python tool developed by Microsoft for converting various file formats to Markdown. It is particularly useful for converting documents into an LLM-friendly text format, as Markdown is token-efficient and well understood by modern language models.

-## When to Use This Skill
+**Key Benefits**:
+- Convert documents to clean, structured Markdown
+- Token-efficient format for LLM processing
+- Supports 15+ file formats
+- Optional AI-enhanced image descriptions
+- OCR for images and scanned documents
+- Speech transcription for audio files

-Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems
+## Visual Enhancement with Scientific Schematics

-## Core Capabilities
+**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

-### 1. Document Conversion
+If your document does not already contain schematics or diagrams:
+- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
+- Simply describe your desired diagram in natural language
+- Nano Banana Pro will automatically generate, review, and refine the schematic

-Convert Office documents and PDFs to Markdown while preserving structure.
+**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

-**Supported formats:**
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)
+**How to generate schematics:**
+```bash
+python scripts/generate_schematic.py "your diagram description" -o figures/output.png
+```
+
+The AI will automatically:
+- Create publication-quality images with proper formatting
+- Review and refine through multiple iterations
+- Ensure accessibility (colorblind-friendly, high contrast)
+- Save outputs in the figures/ directory
+
+**When to add schematics:**
+- Document conversion workflow diagrams
+- File format architecture illustrations
+- OCR processing pipeline diagrams
+- Integration workflow visualizations
+- System architecture diagrams
+- Data flow diagrams
+- Any complex concept that benefits from visualization
+
+For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
+
+---
+
+## Supported Formats
+
+| Format | Description | Notes |
+|--------|-------------|-------|
+| **PDF** | Portable Document Format | Full text extraction |
+| **DOCX** | Microsoft Word | Tables, formatting preserved |
+| **PPTX** | PowerPoint | Slides with notes |
+| **XLSX** | Excel spreadsheets | Tables and data |
+| **Images** | JPEG, PNG, GIF, WebP | EXIF metadata + OCR |
+| **Audio** | WAV, MP3 | Metadata + transcription |
+| **HTML** | Web pages | Clean conversion |
+| **CSV** | Comma-separated values | Table format |
+| **JSON** | JSON data | Structured representation |
+| **XML** | XML documents | Structured format |
+| **ZIP** | Archive files | Iterates contents |
+| **EPUB** | E-books | Full text extraction |
+| **YouTube** | Video URLs | Fetch transcriptions |
+
+## Quick Start
+
+### Installation
+
+```bash
+# Install with all features
+pip install 'markitdown[all]'
+
+# Or from source
+git clone https://github.com/microsoft/markitdown.git
+cd markitdown
+pip install -e 'packages/markitdown[all]'
+```
+
+### Command-Line Usage
+
+```bash
+# Basic conversion
+markitdown document.pdf > output.md
+
+# Specify output file
+markitdown document.pdf -o output.md
+
+# Pipe content
+cat document.pdf | markitdown > output.md
+
+# Enable plugins
+markitdown --list-plugins  # List available plugins
+markitdown --use-plugins document.pdf -o output.md
+```
+
+### Python API

-**Basic usage:**
 ```python
 from markitdown import MarkItDown

+# Basic usage
 md = MarkItDown()
 result = md.convert("document.pdf")
 print(result.text_content)
+
+# Convert from stream
+with open("document.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+    print(result.text_content)
 ```

-**Command-line:**
-```bash
-markitdown document.pdf -o output.md
-```
+## Advanced Features

-See `references/document_conversion.md` for detailed documentation on document-specific features.
+### 1. AI-Enhanced Image Descriptions

-### 2. Media Processing
+Use LLMs via OpenRouter to generate detailed image descriptions (for PPTX and image files):

-Extract text from images using OCR and transcribe audio files to text.
-
-**Supported formats:**
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)
-
-**Image with OCR:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("image.jpg")
-print(result.text_content)  # Includes EXIF metadata and OCR text
-```
-
-**Audio transcription:**
-```python
-result = md.convert("audio.wav")
-print(result.text_content)  # Transcribed speech
-```
-
-See `references/media_processing.md` for advanced media handling options.
-
-### 3. Web Content Extraction
-
-Convert web-based content and e-books to Markdown.
-
-**Supported formats:**
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds
-
-**YouTube transcript:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
-print(result.text_content)
-```
-
-See `references/web_content.md` for web extraction details.
-
-### 4. Structured Data Handling
-
-Convert structured data formats to readable Markdown tables.
-
-**Supported formats:**
- CSV files
- JSON files
- XML files
-
-**CSV to Markdown table:**
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown()
-result = md.convert("data.csv")
-print(result.text_content)  # Formatted as Markdown table
-```
-
-See `references/structured_data.md` for format-specific options.
-
-### 5. Advanced Integrations
-
-Enhance conversion quality with AI-powered features.
-
-**Azure Document Intelligence:**
-For enhanced PDF processing with better table extraction and layout analysis:
-```python
-from markitdown import MarkItDown
-
-md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
-result = md.convert("complex.pdf")
-```
-
-**LLM-Powered Image Descriptions:**
-Generate detailed image descriptions using GPT-4o:
 ```python
 from markitdown import MarkItDown
 from openai import OpenAI

-client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
-result = md.convert("presentation.pptx")  # Images described with LLM
+# Initialize OpenRouter client (OpenAI-compatible API)
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
+    llm_prompt="Describe this image in detail for scientific documentation"
+)
+
+result = md.convert("presentation.pptx")
+print(result.text_content)
 ```

-See `references/advanced_integrations.md` for integration details.
+### 2. Azure Document Intelligence

-### 6. Batch Processing
+For enhanced PDF conversion with Microsoft Document Intelligence:

-Process multiple files or entire ZIP archives at once.
+```bash
+# Command line
+markitdown document.pdf -o output.md -d -e "<document_intelligence_endpoint>"
+```

-**ZIP file processing:**
 ```python
+# Python API
 from markitdown import MarkItDown

-md = MarkItDown()
-result = md.convert("archive.zip")
-print(result.text_content)  # All files converted and concatenated
+md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
+result = md.convert("complex_document.pdf")
+print(result.text_content)
 ```

-**Batch script:**
-Use the provided batch processing script for directory conversion:
+### 3. Plugin System
+
+MarkItDown supports third-party plugins for extending functionality:
+
 ```bash
-python scripts/batch_convert.py /path/to/documents /path/to/output
+# List installed plugins
+markitdown --list-plugins
+
+# Enable plugins
+markitdown --use-plugins file.pdf -o output.md
 ```

-See `scripts/batch_convert.py` for implementation details.
+Find plugins on GitHub with hashtag: `#markitdown-plugin`

-## Installation
+## Optional Dependencies
+
+Install only the dependencies for the file formats you need:

-**Full installation (all features):**
 ```bash
-uv pip install 'markitdown[all]'
+# Install specific formats
+pip install 'markitdown[pdf, docx, pptx]'
+
+# All available options:
+# [all]                  - All optional dependencies
+# [pptx]                 - PowerPoint files
+# [docx]                 - Word documents
+# [xlsx]                 - Excel spreadsheets
+# [xls]                  - Older Excel files
+# [pdf]                  - PDF documents
+# [outlook]              - Outlook messages
+# [az-doc-intel]         - Azure Document Intelligence
+# [audio-transcription]  - WAV and MP3 transcription
+# [youtube-transcription] - YouTube video transcription
 ```

-**Modular installation (specific features):**
-```bash
-uv pip install 'markitdown[pdf]'           # PDF support
-uv pip install 'markitdown[docx]'          # Word support
-uv pip install 'markitdown[pptx]'          # PowerPoint support
-uv pip install 'markitdown[xlsx]'          # Excel support
-uv pip install 'markitdown[audio]'         # Audio transcription
-uv pip install 'markitdown[youtube]'       # YouTube transcripts
-```
+## Common Use Cases

-**Requirements:**
- Python 3.10 or higher
+### 1. Convert Scientific Papers to Markdown

-## Output Format
-
-MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)
-
-## Common Workflows
-
-**Preparing documents for RAG:**
 ```python
 from markitdown import MarkItDown

 md = MarkItDown()

-# Convert knowledge base documents
-docs = ["manual.pdf", "guide.docx", "faq.html"]
-markdown_content = []
-
-for doc in docs:
-    result = md.convert(doc)
-    markdown_content.append(result.text_content)
-
-# Now ready for embedding and indexing
+# Convert PDF paper
+result = md.convert("research_paper.pdf")
+with open("paper.md", "w", encoding="utf-8") as f:
+    f.write(result.text_content)
 ```

-**Document analysis pipeline:**
-```bash
-# Convert all PDFs in directory
-for file in documents/*.pdf; do
-    markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
-done
-```
-
-## Plugin System
-
-MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:
+### 2. Extract Data from Excel for Analysis

 ```python
 from markitdown import MarkItDown

-# Enable plugins if needed
-md = MarkItDown(enable_plugins=True)
+md = MarkItDown()
+result = md.convert("data.xlsx")
+
+# Result will be in Markdown table format
+print(result.text_content)
 ```

+### 3. Process Multiple Documents
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Process all PDFs in a directory
+pdf_dir = Path("papers/")
+output_dir = Path("markdown_output/")
+output_dir.mkdir(exist_ok=True)
+
+for pdf_file in pdf_dir.glob("*.pdf"):
+    result = md.convert(str(pdf_file))
+    output_file = output_dir / f"{pdf_file.stem}.md"
+    output_file.write_text(result.text_content, encoding="utf-8")
+    print(f"Converted: {pdf_file.name}")
+```
+
+### 4. Convert PowerPoint with AI Descriptions
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# Use OpenRouter for access to multiple AI models
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",  # recommended for presentations
+    llm_prompt="Describe this slide image in detail, focusing on key visual elements and data"
+)
+
+result = md.convert("presentation.pptx")
+with open("presentation.md", "w", encoding="utf-8") as f:
+    f.write(result.text_content)
+```
+
+### 5. Batch Convert with Different Formats
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Files to convert
+files = [
+    "document.pdf",
+    "spreadsheet.xlsx",
+    "presentation.pptx",
+    "notes.docx"
+]
+
+for file in files:
+    try:
+        result = md.convert(file)
+        output = Path(file).stem + ".md"
+        with open(output, "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+        print(f"✓ Converted {file}")
+    except Exception as e:
+        print(f"✗ Error converting {file}: {e}")
+```
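
The try/except loop above prints each failure as it happens. A minimal sketch of a variant that collects successes and failures so a batch run can end with a summary follows. The name `convert_all` and its `convert` parameter are hypothetical and not part of MarkItDown; `convert` is assumed to be any callable that takes a path string and returns Markdown text.

```python
from typing import Callable


def convert_all(
    files: list[str], convert: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Run `convert` on every file and separate results from errors.

    `convert` is a hypothetical stand-in for whatever produces Markdown,
    so the helper itself has no dependency on a specific library.
    """
    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for name in files:
        try:
            converted[name] = convert(name)
        except Exception as exc:  # keep going; one bad file should not stop the batch
            failed[name] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

With MarkItDown installed, the callable could plausibly be `lambda path: md.convert(path).text_content`, reusing the `md` object from the example above; the returned `failed` mapping can then be logged or retried separately.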
+
+### 6. Extract YouTube Video Transcription
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert YouTube video to transcript
+result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
+print(result.text_content)
+```
+
+## Docker Usage
+
+```bash
+# Build image
+docker build -t markitdown:latest .
+
+# Run conversion
+docker run --rm -i markitdown:latest < ~/document.pdf > output.md
+```
+
+## Best Practices
+
+### 1. Choose the Right Conversion Method
+
+- **Simple documents**: Use basic `MarkItDown()`
+- **Complex PDFs**: Use Azure Document Intelligence
+- **Visual content**: Enable AI image descriptions
+- **Scanned documents**: Ensure OCR dependencies are installed
+
+### 2. Handle Errors Gracefully
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except FileNotFoundError:
+    print("File not found")
+except Exception as e:
+    print(f"Conversion error: {e}")
+```
+
+### 3. Process Large Files Efficiently
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# For large files, pass an already-open binary stream
+with open("large_file.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+
+# The Markdown is returned as a single string; save it directly
+with open("output.md", "w", encoding="utf-8") as out:
+    out.write(result.text_content)
+```
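
Once a large document has been converted, the resulting Markdown string can be split into sections before further processing. The sketch below is one possible approach, assuming the output uses ATX headings (`#`, `##`); `split_by_heading` is a hypothetical helper, not a MarkItDown function, and it does not try to skip `#` lines inside fenced code blocks.

```python
import re


def split_by_heading(markdown: str, level: int = 2) -> list[str]:
    """Split Markdown into chunks that each start at a heading of depth <= level.

    Text before the first heading becomes its own chunk. Deeper headings
    (for example ### when level=2) stay inside their parent chunk.
    """
    pattern = re.compile(rf"^#{{1,{level}}} ", re.MULTILINE)
    starts = [match.start() for match in pattern.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    chunks = (markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then be written to its own file or sent to a model separately, which keeps individual requests small even when the source PDF is long.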
+
+### 4. Optimize for Token Efficiency
+
+Markdown output is already token-efficient, but you can:
+- Remove excessive whitespace
+- Consolidate similar sections
+- Strip metadata if not needed
+
+```python
+from markitdown import MarkItDown
+import re
+
+md = MarkItDown()
+result = md.convert("document.pdf")
+
+# Clean up extra whitespace
+clean_text = re.sub(r'\n{3,}', '\n\n', result.text_content)
+clean_text = clean_text.strip()
+
+print(clean_text)
+```
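
One caveat with the regex above is that it also collapses blank lines inside fenced code blocks. A minimal sketch of a fence-aware cleanup follows, assuming fences are marked with triple backticks; `tidy_markdown` is a hypothetical helper written for illustration, not part of MarkItDown.

```python
def tidy_markdown(text: str) -> str:
    """Collapse blank-line runs and trailing spaces outside fenced code blocks."""
    fence = "`" * 3  # triple-backtick fence marker
    out = []
    in_fence = False
    blank_run = 0
    for line in text.splitlines():
        if line.lstrip().startswith(fence):
            in_fence = not in_fence  # entering or leaving a code block
            blank_run = 0
            out.append(line.rstrip())
            continue
        if in_fence:
            out.append(line)  # keep code verbatim, including blank lines
            continue
        stripped = line.rstrip()
        if stripped == "":
            blank_run += 1
            if blank_run > 1:
                continue  # drop every blank line after the first in a run
        else:
            blank_run = 0
        out.append(stripped)
    return "\n".join(out).strip()
```

Like the regex version, this keeps at most one blank line between paragraphs, so the prose output should be the same while code samples in the converted document stay intact.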
+
+## Integration with Scientific Workflows
+
+### Convert Literature for Review
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Convert all papers in literature folder
+papers_dir = Path("literature/pdfs")
+output_dir = Path("literature/markdown")
+output_dir.mkdir(parents=True, exist_ok=True)
+
+for paper in papers_dir.glob("*.pdf"):
+    result = md.convert(str(paper))
+    
+    # Save with metadata
+    output_file = output_dir / f"{paper.stem}.md"
+    content = f"# {paper.stem}\n\n"
+    content += f"**Source**: {paper.name}\n\n"
+    content += "---\n\n"
+    content += result.text_content
+    
+    output_file.write_text(content, encoding="utf-8")
+
+# For AI-enhanced conversion with figures
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md_ai = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",
+    llm_prompt="Describe scientific figures with technical precision"
+)
+```
+
+### Extract Tables for Analysis
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data_tables.xlsx")
+
+# Markdown tables can be parsed or used directly
+print(result.text_content)
+```
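
To illustrate the "can be parsed" comment above, here is a minimal sketch that turns pipe-delimited Markdown tables into a list of row dictionaries. It assumes the converted output uses the common `| a | b |` layout with a `| --- | --- |` separator row; `parse_markdown_table` is a hypothetical helper, and escaped pipes inside cells are not handled.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse pipe tables into row dicts keyed by the header cells."""
    rows = []
    header = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            header = None  # a non-table line ends the current table
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first row of a table holds the column names
        elif all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as | --- | :---: |
        else:
            rows.append(dict(zip(header, cells)))
    return rows
```

All values come back as strings, so numeric columns still need an explicit conversion (for example `int(row["Count"])`) before analysis.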
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Missing dependencies**: Install feature-specific packages
+   ```bash
+   pip install 'markitdown[pdf]'  # For PDF support
+   ```
+
+2. **Binary file errors**: Ensure files are opened in binary mode
+   ```python
+   with open("file.pdf", "rb") as f:  # Note the "rb"
+       result = md.convert_stream(f, file_extension=".pdf")
+   ```
+
+3. **OCR not working**: Install tesseract
+   ```bash
+   # macOS
+   brew install tesseract
+   
+   # Ubuntu
+   sudo apt-get install tesseract-ocr
+   ```
+
+## Performance Considerations
+
+- **PDF files**: Large PDFs may take time; consider page ranges if supported
+- **Image OCR**: OCR processing is CPU-intensive
+- **Audio transcription**: Requires additional compute resources
+- **AI image descriptions**: Requires API calls (costs may apply)
+
+## Next Steps
+
+- See `references/api_reference.md` for complete API documentation
+- Check `references/file_formats.md` for format-specific details
+- Review `scripts/batch_convert.py` for automation examples
+- Explore `scripts/convert_with_ai.py` for AI-enhanced conversions
+
 ## Resources

-This skill includes comprehensive reference documentation for each capability:
+- **MarkItDown GitHub**: https://github.com/microsoft/markitdown
+- **PyPI**: https://pypi.org/project/markitdown/
+- **OpenRouter**: https://openrouter.ai (for AI-enhanced conversions)
+- **OpenRouter API Keys**: https://openrouter.ai/keys
+- **OpenRouter Models**: https://openrouter.ai/models
+- **MCP Server**: markitdown-mcp (for Claude Desktop integration)
+- **Plugin Development**: See `packages/markitdown-sample-plugin`

- **references/document_conversion.md** - Detailed PDF, DOCX, PPTX, XLSX conversion options
- **references/media_processing.md** - Image OCR and audio transcription details
- **references/web_content.md** - HTML, YouTube, and EPUB extraction
- **references/structured_data.md** - CSV, JSON, XML conversion formats
- **references/advanced_integrations.md** - Azure Document Intelligence and LLM integration
- **scripts/batch_convert.py** - Batch processing utility for directories