Mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git (synced 2026-01-26 16:58:56 +08:00)
Update all the latest writing skills
@@ -1,11 +1,29 @@
---
name: generate-image
description: Generate or edit images using AI models (FLUX, Gemini). Use for scientific illustrations, diagrams, schematics, infographics, concept visualizations, and artistic images. Supports image editing to modify existing images (change colors, add/remove elements, style transfer). Useful for figures, posters, and visual explanations.
description: Generate or edit images using AI models (FLUX, Gemini). Use for general-purpose image generation including photos, illustrations, artwork, visual assets, concept art, and any image that isn't a technical diagram or schematic. For flowcharts, circuits, pathways, and technical diagrams, use the scientific-schematics skill instead.
---

# Generate Image

Generate and edit high-quality images using OpenRouter's image generation models including FLUX.2 Pro and Nano Banana Pro (Gemini 3 Pro).
Generate and edit high-quality images using OpenRouter's image generation models including FLUX.2 Pro and Gemini 3 Pro.

## When to Use This Skill

**Use generate-image for:**
- Photos and photorealistic images
- Artistic illustrations and artwork
- Concept art and visual concepts
- Visual assets for presentations or documents
- Image editing and modifications
- Any general-purpose image generation needs

**Use scientific-schematics instead for:**
- Flowcharts and process diagrams
- Circuit diagrams and electrical schematics
- Biological pathways and signaling cascades
- System architecture diagrams
- CONSORT diagrams and methodology flowcharts
- Any technical/schematic diagrams

## Quick Start

@@ -46,8 +64,8 @@ The script will automatically detect the `.env` file and provide clear error mes
- `black-forest-labs/flux.2-flex` - Fast and cheap, but not as high quality as pro

Select based on:
- **Quality**: Use gemini-3-pro or flux.2-flex
- **Editing**: Use gemini-3-pro or flux.2-flex (both support image editing)
- **Quality**: Use gemini-3-pro or flux.2-pro
- **Editing**: Use gemini-3-pro or flux.2-pro (both support image editing)
- **Cost**: Use flux.2-flex for generation only

## Common Usage Patterns

@@ -97,6 +115,35 @@ python scripts/generate_image.py "Image 2 description" --output image2.png
- `--output` or `-o`: Output file path (default: generated_image.png)
- `--api-key`: OpenRouter API key (overrides .env file)
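
When no `.env` file is present (for example, in a one-off shell session or CI), the key can be passed explicitly. A minimal sketch, assuming the key is exported as `OPENROUTER_API_KEY` (the prompt and filename are illustrative):

```bash
# Override the .env lookup with an explicit key for a single run
python scripts/generate_image.py "A watercolor hummingbird in flight" \
    --output hummingbird.png \
    --api-key "$OPENROUTER_API_KEY"
```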

## Example Use Cases

### For Scientific Documents
```bash
# Generate a conceptual illustration for a paper
python scripts/generate_image.py "Microscopic view of cancer cells being attacked by immunotherapy agents, scientific illustration style" --output figures/immunotherapy_concept.png

# Create a visual for a presentation
python scripts/generate_image.py "DNA double helix structure with highlighted mutation site, modern scientific visualization" --output slides/dna_mutation.png
```

### For Presentations and Posters
```bash
# Title slide background
python scripts/generate_image.py "Abstract blue and white background with subtle molecular patterns, professional presentation style" --output slides/background.png

# Poster hero image
python scripts/generate_image.py "Laboratory setting with modern equipment, photorealistic, well-lit" --output poster/hero.png
```

### For General Visual Content
```bash
# Website or documentation images
python scripts/generate_image.py "Professional team collaboration around a digital whiteboard, modern office" --output docs/team_collaboration.png

# Marketing materials
python scripts/generate_image.py "Futuristic AI brain concept with glowing neural networks" --output marketing/ai_concept.png
```

## Error Handling

The script provides clear error messages for:
@@ -122,3 +169,10 @@ If the script fails, read the error message and address the issue before retryin
- Reference specific elements in the image when possible
- For best results, use clear and detailed editing instructions
- Both Gemini 3 Pro and FLUX.2 Pro support image editing through OpenRouter

## Integration with Other Skills

- **scientific-schematics**: Use for technical diagrams, flowcharts, circuits, pathways
- **generate-image**: Use for photos, illustrations, artwork, visual concepts
- **scientific-slides**: Combine with generate-image for visually rich presentations
- **latex-posters**: Use generate-image for poster visuals and hero images

@@ -1,6 +1,7 @@
---
name: hypothesis-generation
description: "Generate testable hypotheses. Formulate from observations, design experiments, explore competing explanations, develop predictions, propose mechanisms, for scientific inquiry across domains."
allowed-tools: [Read, Write, Edit, Bash]
---

# Scientific Hypothesis Generation
@@ -19,6 +20,43 @@ This skill should be used when:
- Conducting literature-based hypothesis generation
- Planning mechanistic studies across scientific domains

## Visual Enhancement with Scientific Schematics

**⚠️ MANDATORY: Every hypothesis generation report MUST include at least 1-2 AI-generated figures using the scientific-schematics skill.**

This is not optional. Hypothesis reports without visual elements are incomplete. Before finalizing any document:
1. Generate at minimum ONE schematic or diagram (e.g., hypothesis framework showing competing explanations)
2. Prefer 2-3 figures for comprehensive reports (mechanistic pathway, experimental design flowchart, prediction decision tree)

**How to generate figures:**
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language, as in the command below
- Nano Banana Pro will automatically generate, review, and refine the schematic

```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Hypothesis framework diagrams showing competing explanations
- Experimental design flowcharts
- Mechanistic pathway diagrams
- Prediction decision trees
- Causal relationship diagrams
- Theoretical model visualizations
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Workflow

Follow this systematic process to generate robust scientific hypotheses:
@@ -122,15 +160,106 @@ For each hypothesis, generate specific, quantitative predictions:

### 8. Present Structured Output

Use the template in `assets/hypothesis_output_template.md` to present hypotheses in a clear, consistent format:
Generate a professional LaTeX document using the template in `assets/hypothesis_report_template.tex`. The report should be well-formatted, with colored boxes for visual organization, and divided into a concise main text with comprehensive appendices.

**Standard structure:**
1. **Background & Context** - Phenomenon and literature summary
2. **Competing Hypotheses** - Enumerated hypotheses with mechanistic explanations
3. **Quality Assessment** - Evaluation of each hypothesis
4. **Experimental Designs** - Proposed tests for each hypothesis
5. **Testable Predictions** - Specific, measurable predictions
6. **Critical Comparisons** - How to distinguish between hypotheses
**Document Structure:**

**Main Text (Maximum 4 pages):**
1. **Executive Summary** - Brief overview in summary box (0.5-1 page)
2. **Competing Hypotheses** - Each hypothesis in its own colored box with brief mechanistic explanation and key evidence (2-2.5 pages for 3-5 hypotheses)
   - **IMPORTANT:** Use `\newpage` before each hypothesis box to prevent content overflow
   - Each box should be ≤0.6 pages maximum
3. **Testable Predictions** - Key predictions in amber boxes (0.5-1 page)
4. **Critical Comparisons** - Priority comparison boxes (0.5-1 page)

Keep main text highly concise - only the most essential information. All details go to appendices.

**Page Break Strategy:**
- Always use `\newpage` before hypothesis boxes to ensure they start on fresh pages
- This prevents content from overflowing off page boundaries
- LaTeX boxes (tcolorbox) do not automatically break across pages

**Appendices (Comprehensive, Detailed):**
- **Appendix A:** Comprehensive literature review with extensive citations
- **Appendix B:** Detailed experimental designs with full protocols
- **Appendix C:** Quality assessment tables and detailed evaluations
- **Appendix D:** Supplementary evidence and analogous systems

**Colored Box Usage:**

Use the custom box environments from `hypothesis_generation.sty`:

- `hypothesisbox1` through `hypothesisbox5` - For each competing hypothesis (blue, green, purple, teal, orange)
- `predictionbox` - For testable predictions (amber)
- `comparisonbox` - For critical comparisons (steel gray)
- `evidencebox` - For supporting evidence highlights (light blue)
- `summarybox` - For executive summary (blue)

**Each hypothesis box should contain (keep concise for 4-page limit):**
- **Mechanistic Explanation:** 1-2 brief paragraphs (6-10 sentences max) explaining HOW and WHY
- **Key Supporting Evidence:** 2-3 bullet points with citations (most important evidence only)
- **Core Assumptions:** 1-2 critical assumptions

All detailed explanations, additional evidence, and comprehensive discussions belong in the appendices.

**Critical Overflow Prevention:**
- Insert `\newpage` before each hypothesis box to start it on a fresh page
- Keep each complete hypothesis box to ≤0.6 pages (approximately 15-20 lines of content)
- If content exceeds this, move additional details to Appendix A
- Never let boxes overflow off page boundaries - this creates unreadable PDFs

**Citation Requirements:**

Aim for extensive citation to support all claims:
- **Main text:** 10-15 key citations for the most important evidence only (keep concise for the 4-page limit)
- **Appendix A:** 40-70+ comprehensive citations covering all relevant literature
- **Total target:** 50+ references in the bibliography

Main text citations should be selective - cite only the most critical papers. All comprehensive citations and detailed literature discussion belong in the appendices. Use `\citep{author2023}` for parenthetical citations.

**LaTeX Compilation:**

The template requires XeLaTeX or LuaLaTeX for proper rendering:

```bash
xelatex hypothesis_report.tex
bibtex hypothesis_report
xelatex hypothesis_report.tex
xelatex hypothesis_report.tex
```
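
If `latexmk` is available, the same compile-bibtex-compile cycle can be run with a single command (an optional convenience; the template does not require it):

```bash
# latexmk re-runs xelatex and bibtex until cross-references stabilize
latexmk -xelatex hypothesis_report.tex
```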

**Required packages:** The `hypothesis_generation.sty` style package must be in the same directory or LaTeX path. It requires: tcolorbox, xcolor, fontspec, fancyhdr, titlesec, enumitem, booktabs, natbib.

**Page Overflow Prevention:**

To keep content from overflowing page boundaries, follow these critical guidelines:

1. **Monitor Box Content Length:** Each hypothesis box should fit comfortably on a single page. If content exceeds ~0.7 pages, it will likely overflow.

2. **Use Strategic Page Breaks:** Insert `\newpage` before boxes that contain substantial content:
   ```latex
   \newpage
   \begin{hypothesisbox1}[Hypothesis 1: Title]
   % Long content here
   \end{hypothesisbox1}
   ```

3. **Keep Main Text Boxes Concise:** For the 4-page main text limit:
   - Each hypothesis box: Maximum 0.5-0.6 pages
   - Mechanistic explanation: 1-2 brief paragraphs only (6-10 sentences max)
   - Key evidence: 2-3 bullet points only
   - Core assumptions: 1-2 items only
   - If content is longer, move details to appendices

4. **Break Long Content:** If a hypothesis requires extensive explanation, split it across the main text and an appendix:
   - Main text box: Brief mechanistic overview + 2-3 key evidence points
   - Appendix A: Detailed mechanism explanation, comprehensive evidence, extended discussion

5. **Test Page Boundaries:** Before each new box, consider whether the remaining page space is sufficient. If less than 0.6 pages remain, use `\newpage` to start the box on a fresh page.

6. **Appendix Page Management:** In appendices, use `\newpage` between major sections to avoid overflow in detailed content areas.

**Quick Reference:** See `assets/FORMATTING_GUIDE.md` for detailed examples of all box types, color schemes, and common formatting patterns.

## Quality Standards

@@ -152,4 +281,6 @@ Ensure all generated hypotheses meet these standards:

### assets/

- `hypothesis_output_template.md` - Structured format for presenting hypotheses consistently with all required sections
- `hypothesis_generation.sty` - LaTeX style package providing colored boxes, professional formatting, and custom environments for hypothesis reports
- `hypothesis_report_template.tex` - Complete LaTeX template with main text structure and comprehensive appendix sections
- `FORMATTING_GUIDE.md` - Quick reference guide with examples of all box types, color schemes, citation practices, and troubleshooting tips

@@ -0,0 +1,672 @@
# Hypothesis Generation Report - Formatting Quick Reference

## Overview

This guide provides a quick reference for using the hypothesis generation LaTeX template and style package. For complete documentation, see `SKILL.md`.

## Quick Start

```latex
% !TEX program = xelatex
\documentclass[11pt,letterpaper]{article}
\usepackage{hypothesis_generation}
\usepackage{natbib}

\title{Your Phenomenon Name}
\begin{document}
\maketitle
% Your content
\end{document}
```

**Compilation:** Use XeLaTeX or LuaLaTeX for best results:
```bash
xelatex your_document.tex
bibtex your_document
xelatex your_document.tex
xelatex your_document.tex
```

## Color Scheme Reference

### Hypothesis Colors
- **Hypothesis 1**: Deep Blue (RGB: 0, 102, 153) - Use for first hypothesis
- **Hypothesis 2**: Forest Green (RGB: 0, 128, 96) - Use for second hypothesis
- **Hypothesis 3**: Royal Purple (RGB: 102, 51, 153) - Use for third hypothesis
- **Hypothesis 4**: Teal (RGB: 0, 128, 128) - Use for fourth hypothesis (if needed)
- **Hypothesis 5**: Burnt Orange (RGB: 204, 85, 0) - Use for fifth hypothesis (if needed)

### Utility Colors
- **Predictions**: Amber (RGB: 255, 191, 0) - For testable predictions
- **Evidence**: Light Blue (RGB: 102, 178, 204) - For supporting evidence
- **Comparisons**: Steel Gray (RGB: 108, 117, 125) - For critical comparisons
- **Limitations**: Coral Red (RGB: 220, 53, 69) - For limitations/challenges
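
Because `hypothesis_generation.sty` defines each of these names with `\definecolor`, they can be used directly with standard `xcolor` commands anywhere in the document; a minimal sketch:

```latex
% Inline use of the predefined color names
\textcolor{hypothesis1}{Deep blue text for Hypothesis 1 call-outs}
\colorbox{predictioncolor!20}{text on a light amber tint}
```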

## Custom Box Environments

### 1. Executive Summary Box

```latex
\begin{summarybox}[Executive Summary]
Content here
\end{summarybox}
```

**Use for:** High-level overview at the beginning of the document

---

### 2. Hypothesis Boxes (5 variants)

```latex
\begin{hypothesisbox1}[Hypothesis 1: Title]
\textbf{Mechanistic Explanation:}
[1-2 brief paragraphs explaining HOW and WHY]

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item Evidence point 1 \citep{ref1}
\item Evidence point 2 \citep{ref2}
\end{itemize}

\textbf{Core Assumptions:}
\begin{enumerate}
\item Assumption 1
\item Assumption 2
\end{enumerate}
\end{hypothesisbox1}
```

**Available boxes:** `hypothesisbox1`, `hypothesisbox2`, `hypothesisbox3`, `hypothesisbox4`, `hypothesisbox5`

**Use for:** Presenting each competing hypothesis with its mechanism, evidence, and assumptions

**Best practices for 4-page main text:**
- Keep mechanistic explanations to 1-2 brief paragraphs only (6-10 sentences max)
- Include 2-3 most essential evidence points with citations
- List 1-2 most critical assumptions
- Ensure each hypothesis is genuinely distinct
- All detailed explanations go to Appendix A
- **Use `\newpage` before each hypothesis box to prevent overflow**
- Each complete hypothesis box should be ≤0.6 pages

---

### 3. Prediction Box

```latex
\begin{predictionbox}[Predictions: Hypothesis 1]
\textbf{Prediction 1.1:} [Specific prediction]
\begin{itemize}
\item \textbf{Conditions:} When/where this applies
\item \textbf{Expected Outcome:} Specific measurable result
\item \textbf{Falsification:} What would disprove it
\end{itemize}
\end{predictionbox}
```

**Use for:** Testable predictions derived from each hypothesis

**Best practices for 4-page main text:**
- Make predictions specific and quantitative when possible
- Clearly state the conditions under which each prediction should hold
- Always specify falsification criteria
- Include only 1-2 most critical predictions per hypothesis in main text
- Additional predictions go to appendices

---

### 4. Evidence Box

```latex
\begin{evidencebox}[Supporting Evidence]
Content discussing supporting evidence
\end{evidencebox}
```

**Use for:** Highlighting key supporting evidence or literature synthesis

**Best practices:**
- Use sparingly in main text (detailed evidence goes in Appendix A)
- Include citations for all evidence
- Focus on most compelling evidence

---

### 5. Comparison Box

```latex
\begin{comparisonbox}[H1 vs. H2: Key Distinction]
\textbf{Fundamental Difference:}
[Description of core difference]

\textbf{Discriminating Experiment:}
[Description of experiment]

\textbf{Outcome Interpretation:}
\begin{itemize}
\item \textbf{If [Result A]:} H1 supported
\item \textbf{If [Result B]:} H2 supported
\end{itemize}
\end{comparisonbox}
```

**Use for:** Explaining how to distinguish between competing hypotheses

**Best practices:**
- Focus on fundamental mechanistic differences
- Propose clear, feasible discriminating experiments
- Specify concrete outcome interpretations
- Create comparisons for all major hypothesis pairs

---

### 6. Limitation Box

```latex
\begin{limitationbox}[Limitations \& Challenges]
Discussion of limitations
\end{limitationbox}
```

**Use for:** Highlighting important limitations or challenges

**Best practices:**
- Use when limitations are particularly important
- Be honest about challenges
- Suggest how limitations might be addressed

---

## Document Structure

### Main Text (Maximum 4 Pages - Highly Concise)

1. **Executive Summary** (0.5-1 page)
   - Use `summarybox`
   - Brief phenomenon overview
   - List all hypotheses in 1 sentence each
   - Recommended approach

2. **Competing Hypotheses** (2-2.5 pages)
   - Use `hypothesisbox1`, `hypothesisbox2`, etc.
   - One box per hypothesis
   - Brief mechanistic explanation (1-2 paragraphs) + essential evidence (2-3 points) + key assumptions (1-2)
   - Target: 3-5 hypotheses
   - Keep highly concise - details go to appendices

3. **Testable Predictions** (0.5-1 page)
   - Use `predictionbox` for each hypothesis
   - 1-2 most critical predictions per hypothesis only
   - Very brief - full predictions in appendices

4. **Critical Comparisons** (0.5-1 page)
   - Use `comparisonbox` for the highest-priority comparison only
   - Show how to distinguish top hypotheses
   - Additional comparisons in appendices

**Main text total: Maximum 4 pages - be extremely selective about what goes here**

### Appendices (Comprehensive, Detailed)

**Appendix A: Comprehensive Literature Review**
- Detailed background (extensive citations)
- Current understanding
- Evidence for each hypothesis (detailed)
- Conflicting findings
- Knowledge gaps
- **Target: 40-60+ citations**

**Appendix B: Detailed Experimental Designs**
- Full protocols for each hypothesis
- Methods, controls, sample sizes
- Statistical approaches
- Feasibility assessments
- Timeline and resource requirements

**Appendix C: Quality Assessment**
- Detailed evaluation tables
- Strengths and weaknesses analysis
- Comparative scoring
- Recommendations

**Appendix D: Supplementary Evidence**
- Analogous mechanisms
- Preliminary data
- Theoretical frameworks
- Historical context

**References**
- **Target: 50+ total references**

## Citation Best Practices

### In Main Text
- Cite 10-15 key papers
- Use `\citep{author2023}` for parenthetical citations
- Use `\citet{author2023}` for textual citations
- Focus on most important/recent evidence
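
With an author-year bibliography style such as `plainnat`, the two commands render differently; a quick illustration (the reference key is hypothetical, and the exact rendering depends on the entry's authors):

```latex
\citet{smith2023} reported a twofold increase.           % e.g., Smith et al. (2023) reported ...
A twofold increase has been reported \citep{smith2023}.  % e.g., ... (Smith et al., 2023)
```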

### In Appendices
- Cite 40-60+ papers total
- Comprehensive coverage of relevant literature
- Include reviews, primary research, theoretical papers
- Cite every claim and piece of evidence

### Citation Density Guidelines
- Main hypothesis boxes: 2-3 citations per box (most essential only)
- Main text total: 10-15 citations maximum (keep concise)
- Appendix A literature sections: 8-15 citations per subsection
- Experimental designs: 2-5 citations for methods/precedents
- Quality assessments: Citations as needed for evaluation criteria
- Total document: 50+ citations (vast majority in appendices)

## Tables

### Professional Table Formatting

```latex
\begin{hypotable}{Caption}
\begin{tabular}{|l|l|}
\hline
\tableheadercolor
\textcolor{white}{\textbf{Header 1}} & \textcolor{white}{\textbf{Header 2}} \\
\hline
Data row 1 & Data \\
\hline
\tablerowcolor % Alternating gray background
Data row 2 & Data \\
\hline
\end{tabular}
\caption{Your caption}
\end{hypotable}
```

**Best practices:**
- Use `\tableheadercolor` for header rows
- Alternate `\tablerowcolor` for tables >3 rows
- Keep tables readable (not too wide)
- Use for quality assessments, comparisons

## Common Formatting Patterns

### Hypothesis Section Pattern

```latex
% Use \newpage before hypothesis box to prevent overflow
\newpage
\subsection*{Hypothesis N: [Concise Title]}

\begin{hypothesisboxN}[Hypothesis N: [Title]]

\textbf{Mechanistic Explanation:}

[1-2 brief paragraphs of explanation - 6-10 sentences max]

\vspace{0.3cm}

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item [Evidence 1] \citep{ref1}
\item [Evidence 2] \citep{ref2}
\item [Evidence 3] \citep{ref3}
\end{itemize}

\vspace{0.3cm}

\textbf{Core Assumptions:}
\begin{enumerate}
\item [Assumption 1]
\item [Assumption 2]
\end{enumerate}

\end{hypothesisboxN}

\vspace{0.5cm}
```

**Note:** The `\newpage` before the hypothesis box ensures it starts on a fresh page, preventing overflow. This is especially important when boxes contain substantial content.

### Prediction Section Pattern

```latex
\subsection*{Predictions from Hypothesis N}

\begin{predictionbox}[Predictions: Hypothesis N]

\textbf{Prediction N.1:} [Statement]
\begin{itemize}
\item \textbf{Conditions:} [Conditions]
\item \textbf{Expected Outcome:} [Outcome]
\item \textbf{Falsification:} [Falsification]
\end{itemize}

\vspace{0.2cm}

\textbf{Prediction N.2:} [Statement]
[... continue ...]

\end{predictionbox}
```

### Comparison Section Pattern

```latex
\subsection*{Distinguishing Hypothesis X vs. Hypothesis Y}

\begin{comparisonbox}[HX vs. HY: Key Distinction]

\textbf{Fundamental Difference:}

[Description of core difference]

\vspace{0.3cm}

\textbf{Discriminating Experiment:}

[Experiment description]

\vspace{0.3cm}

\textbf{Outcome Interpretation:}
\begin{itemize}
\item \textbf{If [Result A]:} HX supported
\item \textbf{If [Result B]:} HY supported
\item \textbf{If [Result C]:} Both/neither supported
\end{itemize}

\end{comparisonbox}
```

## Spacing and Layout

### Vertical Spacing
- `\vspace{0.3cm}` - Between elements within boxes
- `\vspace{0.5cm}` - Between major sections or boxes
- `\vspace{1cm}` - After title, before main content

### Page Breaks and Overflow Prevention

**CRITICAL: Prevent Content Overflow**

LaTeX boxes (tcolorbox environments) do not automatically break across pages. Content that exceeds the remaining page space will overflow and cause formatting issues. Follow these guidelines:

1. **Strategic Page Breaks Before Long Boxes:**
   ```latex
   \newpage % Start on fresh page if box will be long
   \begin{hypothesisbox1}[Hypothesis 1: Title]
   % Substantial content here
   \end{hypothesisbox1}
   ```

2. **Monitor Box Content Length:**
   - Each hypothesis box should be ≤0.7 pages maximum
   - If mechanistic explanation + evidence + assumptions exceeds ~0.6 pages, content is too long
   - Solution: Move detailed content to appendices, keep only essentials in main text boxes

3. **When to Use `\newpage`:**
   - Before any hypothesis box with >3 subsections or >15 lines of content
   - Before comparison boxes with extensive experimental descriptions
   - Between major appendix sections
   - If less than 0.6 pages remain on the current page before starting a new box

4. **Content Length Guidelines for Main Text:**
   - Executive summary box: 0.5-0.8 pages max
   - Each hypothesis box: 0.4-0.6 pages max
   - Each prediction box: 0.3-0.5 pages max
   - Each comparison box: 0.4-0.6 pages max

5. **Breaking Up Long Content:**
   ```latex
   % GOOD: Concise main text with page break
   \newpage
   \begin{hypothesisbox1}[Hypothesis 1: Brief Title]
   \textbf{Mechanistic Explanation:}
   Brief overview in 1-2 paragraphs (6-10 sentences).

   \textbf{Key Supporting Evidence:}
   \begin{itemize}
   \item Evidence 1 \citep{ref1}
   \item Evidence 2 \citep{ref2}
   \end{itemize}

   \textbf{Core Assumptions:}
   \begin{enumerate}
   \item Assumption 1
   \end{enumerate}

   See Appendix A for detailed mechanism and comprehensive evidence.
   \end{hypothesisbox1}
   ```

   ```latex
   % BAD: Overly long content that will overflow
   \begin{hypothesisbox1}[Hypothesis 1]
   \subsection{Very Long Section}
   Multiple paragraphs...
   \subsection{Another Long Section}
   More paragraphs...
   \subsection{Even More Content}
   [Content continues beyond page boundary → OVERFLOW!]
   \end{hypothesisbox1}
   ```

6. **Page Break Commands:**
   - `\newpage` - Force new page (recommended before long boxes)
   - `\clearpage` - Force new page and flush floats (use before appendices)

### Section Spacing
Already handled by the style package, but you can adjust:
```latex
\vspace{0.5cm} % Add extra space if needed
```

## Troubleshooting

### Common Issues

**Issue: "File hypothesis_generation.sty not found"**
- Solution: Ensure the .sty file is in the same directory as your .tex file, or in your LaTeX path

**Issue: Boxes don't have colors**
- Solution: Compile with XeLaTeX or LuaLaTeX, not pdfLaTeX
- Command: `xelatex yourfile.tex`

**Issue: Citations show as [?]**
- Solution: Run bibtex after the first xelatex compilation:
```bash
xelatex yourfile.tex
bibtex yourfile
xelatex yourfile.tex
xelatex yourfile.tex
```

**Issue: Fonts not found**
- Solution: Comment out font lines in the .sty file if custom fonts aren't installed
- Lines to comment: `\setmainfont{...}` and `\setsansfont{...}`

**Issue: Box titles overlap with content**
- Solution: Add more vertical space with `\vspace{0.3cm}` after titles

**Issue: Tables too wide**
- Solution: Use `\small` or `\footnotesize` before tabular, or use `p{width}` column specs
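
For example, `p{width}` columns wrap long cell text instead of letting the table grow wider (the widths below are arbitrary):

```latex
\small
\begin{tabular}{|p{3cm}|p{8cm}|}
\hline
\textbf{Criterion} & \textbf{Assessment} \\
\hline
Testability & Long explanatory text wraps inside the 8cm column instead of widening the table. \\
\hline
\end{tabular}
```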

**Issue: Content overflowing off the page**
- **Cause:** Boxes (tcolorbox environments) are too long to fit in the remaining page space
- **Solution 1:** Add `\newpage` before the box to start it on a fresh page
- **Solution 2:** Reduce box content - move detailed information to appendices
- **Solution 3:** Break content into multiple smaller boxes
- **Prevention:** Keep each hypothesis box to 0.4-0.6 pages maximum; use `\newpage` liberally before boxes with substantial content

**Issue: Main text exceeds 4 pages**
- **Cause:** Boxes contain too much detailed information
- **Solution:** Aggressively move content to appendices - main text boxes should contain only:
  - Brief mechanistic overview (1-2 paragraphs)
  - 2-3 key evidence bullets
  - 1-2 core assumptions
- All detailed explanations, additional evidence, and comprehensive discussions belong in Appendix A

### Package Requirements

Ensure these packages are installed:
- `tcolorbox` (with `most` option)
- `xcolor`
- `fontspec` (for XeLaTeX/LuaLaTeX)
- `fancyhdr`
- `titlesec`
- `enumitem`
- `booktabs`
- `natbib`

Install missing packages:
```bash
# For TeX Live
tlmgr install tcolorbox xcolor fontspec fancyhdr titlesec enumitem booktabs natbib

# For MiKTeX (Windows)
# Use MiKTeX Package Manager GUI
```

## Style Consistency Tips

1. **Color Usage**
   - Always use the same color for each hypothesis throughout the document
   - H1 = blue, H2 = green, H3 = purple, etc.
   - Don't mix colors for the same hypothesis

2. **Box Usage**
   - Main text: Hypothesis boxes, prediction boxes, comparison boxes
   - Appendix: Can use evidence boxes, limitation boxes as needed
   - Don't overuse boxes - reserve them for key content

3. **Citation Style**
   - Consistent citation format throughout
   - Use `\citep{}` for most citations
   - Group multiple citations: `\citep{ref1, ref2, ref3}`

4. **Hypothesis Numbering**
   - Number hypotheses consistently (H1, H2, H3, etc.)
   - Use the same numbering in predictions (P1.1, P1.2 for H1)
   - Use the same numbering in comparisons (H1 vs. H2)

5. **Language**
   - Be precise and specific
   - Avoid vague language ("may", "could", "possibly")
   - Use active voice when possible
   - Make predictions quantitative when feasible

## Quick Checklist

Before finalizing your document:

- [ ] Title page has phenomenon name
- [ ] **Main text is 4 pages maximum**
- [ ] Executive summary is concise (0.5-1 page)
- [ ] Each hypothesis in its own colored box
- [ ] 3-5 hypotheses presented (not more)
- [ ] Each hypothesis has brief mechanistic explanation (1-2 paragraphs)
- [ ] Each hypothesis has 2-3 most essential evidence points with citations
- [ ] Each hypothesis has 1-2 most critical assumptions
- [ ] Prediction boxes with 1-2 key predictions per hypothesis
- [ ] Priority comparison box in main text (others in appendix)
- [ ] Priority experiments identified
- [ ] **Page breaks (`\newpage`) used before long boxes to prevent overflow**
- [ ] **No content overflows off page boundaries (check PDF carefully)**
- [ ] **Each hypothesis box is ≤0.6 pages (if longer, move details to appendix)**
- [ ] Appendix A has comprehensive literature review with detailed evidence
- [ ] Appendix B has detailed experimental protocols
- [ ] Appendix C has quality assessment tables
- [ ] Appendix D has supplementary evidence
- [ ] 10-15 citations in main text (selective)
- [ ] 50+ total citations in full document
- [ ] All boxes use correct colors
- [ ] Document compiles without errors
- [ ] References formatted correctly
- [ ] **Compiled PDF checked visually for overflow issues**

## Example Minimal Document

```latex
% !TEX program = xelatex
\documentclass[11pt,letterpaper]{article}
\usepackage{hypothesis_generation}
\usepackage{natbib}

\title{Role of X in Y}

\begin{document}
\maketitle

\section*{Executive Summary}
\begin{summarybox}[Executive Summary]
Brief overview of phenomenon and hypotheses.
\end{summarybox}

\section{Competing Hypotheses}

% Use \newpage before each hypothesis box to prevent overflow
\newpage
\subsection*{Hypothesis 1: Title}
\begin{hypothesisbox1}[Hypothesis 1: Title]
\textbf{Mechanistic Explanation:}
Brief explanation in 1-2 paragraphs.

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item Evidence point \citep{ref1}
\end{itemize}
\end{hypothesisbox1}

\newpage
\subsection*{Hypothesis 2: Title}
\begin{hypothesisbox2}[Hypothesis 2: Title]
\textbf{Mechanistic Explanation:}
Brief explanation in 1-2 paragraphs.

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item Evidence point \citep{ref2}
\end{itemize}
\end{hypothesisbox2}

\section{Testable Predictions}

\subsection*{Predictions from Hypothesis 1}
\begin{predictionbox}[Predictions: Hypothesis 1]
Predictions here.
\end{predictionbox}

\section{Critical Comparisons}

\subsection*{H1 vs. H2}
\begin{comparisonbox}[H1 vs. H2]
Comparison here.
\end{comparisonbox}

% Force new page before appendices
\appendix
\newpage
\appendixsection{Appendix A: Literature Review}
Detailed literature review here.

\newpage
\bibliographystyle{plainnat}
\bibliography{references}

\end{document}
```

**Key Points:**
- `\newpage` used before each hypothesis box to ensure they start on fresh pages
- This prevents content overflow issues
- Main text boxes kept concise (1-2 paragraphs + bullet points)
- Detailed content goes to appendices

## Additional Resources

- See `hypothesis_report_template.tex` for complete annotated template
- See `SKILL.md` for workflow and methodology guidance
- See `references/hypothesis_quality_criteria.md` for evaluation framework
- See `references/experimental_design_patterns.md` for design guidance
- See treatment-plans skill for additional LaTeX styling examples

@@ -0,0 +1,307 @@
% hypothesis_generation.sty
% Professional Scientific Hypothesis Generation Report Style
% Provides modern, color-coded styling for hypothesis generation documents

\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{hypothesis_generation}[2025/11/17 Hypothesis Generation Report Style]

% Required packages
\RequirePackage[margin=1in, top=1.2in, bottom=1.2in]{geometry}
\RequirePackage{graphicx}
\RequirePackage{xcolor}
\RequirePackage[most]{tcolorbox}
\RequirePackage{tikz}
\RequirePackage{fontspec}
\RequirePackage{fancyhdr}
\RequirePackage{titlesec}
\RequirePackage{enumitem}
\RequirePackage{booktabs}
\RequirePackage{longtable}
\RequirePackage{array}
\RequirePackage{colortbl}
\RequirePackage{hyperref}
\RequirePackage{natbib}

% Color scheme - Distinct colors for each hypothesis plus utility colors
\definecolor{hypothesis1}{RGB}{0, 102, 153}       % Deep Blue
\definecolor{hypothesis2}{RGB}{0, 128, 96}        % Forest Green
\definecolor{hypothesis3}{RGB}{102, 51, 153}      % Royal Purple
\definecolor{hypothesis4}{RGB}{0, 128, 128}       % Teal
\definecolor{hypothesis5}{RGB}{204, 85, 0}        % Burnt Orange
\definecolor{predictioncolor}{RGB}{255, 191, 0}   % Amber
\definecolor{evidencecolor}{RGB}{102, 178, 204}   % Light Blue
\definecolor{comparisoncolor}{RGB}{108, 117, 125} % Steel Gray
\definecolor{limitationcolor}{RGB}{220, 53, 69}   % Coral Red
\definecolor{darkgray}{RGB}{64, 64, 64}           % Dark gray for text
\definecolor{lightgray}{RGB}{245, 245, 245}       % Light background

% Fonts (if using XeLaTeX/LuaLaTeX)
% Comment these out if fonts are not available
% \setmainfont{Lato}
% \setsansfont{Roboto}

% Hyperlink setup
\hypersetup{
  colorlinks=true,
  linkcolor=hypothesis1,
  citecolor=hypothesis1,
  urlcolor=evidencecolor,
  pdfborder={0 0 0}
}

% Header and footer styling
\setlength{\headheight}{22pt}
\pagestyle{fancy}
\fancyhf{}
\fancyhead[L]{\color{hypothesis1}\sffamily\small\textbf{Hypothesis Generation Report}}
\fancyhead[R]{\color{darkgray}\sffamily\small\thepage}
\fancyfoot[C]{\color{darkgray}\small Generated: \today}
\renewcommand{\headrulewidth}{2pt}
\renewcommand{\headrule}{\hbox to\headwidth{\color{hypothesis1}\leaders\hrule height \headrulewidth\hfill}}
\renewcommand{\footrulewidth}{0.5pt}
\renewcommand{\footrule}{\hbox to\headwidth{\color{lightgray}\leaders\hrule height \footrulewidth\hfill}}

% Section styling
\titleformat{\section}
  {\color{hypothesis1}\Large\sffamily\bfseries}
  {\thesection}{1em}{}
  [\color{hypothesis1}\titlerule]

\titleformat{\subsection}
  {\color{evidencecolor}\large\sffamily\bfseries}
  {\thesubsection}{1em}{}

\titleformat{\subsubsection}
  {\color{darkgray}\normalsize\sffamily\bfseries}
  {\thesubsubsection}{1em}{}

% Title page styling
\renewcommand{\maketitle}{
  \begin{tcolorbox}[
    enhanced,
    colback=hypothesis1,
    colframe=hypothesis1,
    arc=0mm,
    boxrule=0pt,
    left=20pt,
    right=20pt,
    top=30pt,
    bottom=30pt,
    width=\textwidth
  ]
  \color{white}
  \begin{center}
  {\Huge\sffamily\bfseries Scientific Hypothesis\\Generation Report}\\[10pt]
  {\Large\sffamily\@title}\\[15pt]
  {\large\sffamily Evidence-Based Competing Hypotheses}\\[8pt]
  {\normalsize\sffamily\color{evidencecolor}\today}
  \end{center}
  \end{tcolorbox}
  \vspace{1cm}
}

% Custom boxes for hypotheses (5 different colors)
\newtcolorbox{hypothesisbox1}[1][Hypothesis 1]{
  enhanced,
  colback=hypothesis1!5,
  colframe=hypothesis1,
  arc=3mm,
  boxrule=2pt,
  left=12pt,
  right=12pt,
  top=12pt,
  bottom=12pt,
  title=#1,
  fonttitle=\sffamily\bfseries\large,
  coltitle=white,
  colbacktitle=hypothesis1,
  attach boxed title to top left={yshift=-3mm, xshift=5mm},
  boxed title style={arc=2mm}
}

\newtcolorbox{hypothesisbox2}[1][Hypothesis 2]{
  enhanced,
  colback=hypothesis2!5,
  colframe=hypothesis2,
  arc=3mm,
  boxrule=2pt,
  left=12pt,
  right=12pt,
  top=12pt,
  bottom=12pt,
  title=#1,
  fonttitle=\sffamily\bfseries\large,
  coltitle=white,
  colbacktitle=hypothesis2,
  attach boxed title to top left={yshift=-3mm, xshift=5mm},
  boxed title style={arc=2mm}
}

\newtcolorbox{hypothesisbox3}[1][Hypothesis 3]{
  enhanced,
  colback=hypothesis3!5,
  colframe=hypothesis3,
  arc=3mm,
  boxrule=2pt,
  left=12pt,
  right=12pt,
  top=12pt,
  bottom=12pt,
  title=#1,
  fonttitle=\sffamily\bfseries\large,
  coltitle=white,
  colbacktitle=hypothesis3,
  attach boxed title to top left={yshift=-3mm, xshift=5mm},
  boxed title style={arc=2mm}
}

\newtcolorbox{hypothesisbox4}[1][Hypothesis 4]{
  enhanced,
  colback=hypothesis4!5,
  colframe=hypothesis4,
  arc=3mm,
  boxrule=2pt,
  left=12pt,
  right=12pt,
  top=12pt,
  bottom=12pt,
  title=#1,
  fonttitle=\sffamily\bfseries\large,
  coltitle=white,
  colbacktitle=hypothesis4,
  attach boxed title to top left={yshift=-3mm, xshift=5mm},
  boxed title style={arc=2mm}
}

\newtcolorbox{hypothesisbox5}[1][Hypothesis 5]{
  enhanced,
  colback=hypothesis5!5,
  colframe=hypothesis5,
  arc=3mm,
  boxrule=2pt,
  left=12pt,
  right=12pt,
  top=12pt,
  bottom=12pt,
  title=#1,
  fonttitle=\sffamily\bfseries\large,
  coltitle=white,
  colbacktitle=hypothesis5,
  attach boxed title to top left={yshift=-3mm, xshift=5mm},
  boxed title style={arc=2mm}
}

% Prediction box (amber)
\newtcolorbox{predictionbox}[1][Testable Predictions]{
  enhanced,
  colback=predictioncolor!10,
  colframe=predictioncolor!80!black,
  arc=3mm,
  boxrule=1.5pt,
  left=10pt,
  right=10pt,
  top=10pt,
  bottom=10pt,
  title=#1,
  fonttitle=\sffamily\bfseries,
  coltitle=black,
  colbacktitle=predictioncolor
}

% Evidence/Support box (light blue)
\newtcolorbox{evidencebox}[1][Supporting Evidence]{
  enhanced,
  colback=evidencecolor!8,
  colframe=evidencecolor,
  arc=3mm,
  boxrule=1.5pt,
  left=10pt,
  right=10pt,
  top=10pt,
  bottom=10pt,
  title=#1,
  fonttitle=\sffamily\bfseries,
  coltitle=white,
  colbacktitle=evidencecolor
}

% Comparison box (steel gray)
\newtcolorbox{comparisonbox}[1][Critical Comparison]{
  enhanced,
  colback=comparisoncolor!8,
  colframe=comparisoncolor,
  arc=3mm,
  boxrule=1.5pt,
  left=10pt,
  right=10pt,
  top=10pt,
  bottom=10pt,
  title=#1,
  fonttitle=\sffamily\bfseries,
  coltitle=white,
  colbacktitle=comparisoncolor
}

% Limitation box (coral red)
\newtcolorbox{limitationbox}[1][Limitations \& Challenges]{
  enhanced,
  colback=limitationcolor!8,
  colframe=limitationcolor,
  arc=3mm,
  boxrule=1.5pt,
  left=10pt,
  right=10pt,
  top=10pt,
  bottom=10pt,
  title=#1,
  fonttitle=\sffamily\bfseries,
  coltitle=white,
  colbacktitle=limitationcolor
}

% Executive summary box (using evidence color for consistency)
\newtcolorbox{summarybox}[1][Executive Summary]{
  enhanced,
  colback=evidencecolor!15,
  colframe=hypothesis1,
  arc=3mm,
  boxrule=2pt,
  left=15pt,
  right=15pt,
  top=15pt,
  bottom=15pt,
  title=#1,
  fonttitle=\sffamily\bfseries\Large,
  coltitle=white,
  colbacktitle=hypothesis1
}

% Table styling
\newcommand{\tableheadercolor}{\rowcolor{hypothesis1}}
\newcommand{\tablerowcolor}{\rowcolor{lightgray}}

% Custom table environment
% (note: the argument is currently unused; set the caption with \caption inside the environment)
\newenvironment{hypotable}[1]{
  \begin{table}[h]
  \centering
  \small\sffamily
  \renewcommand{\arraystretch}{1.3}
}{
  \end{table}
}

% Custom list styling
\setlist[itemize,1]{label=\textcolor{hypothesis1}{\textbullet}, leftmargin=*, itemsep=3pt}
\setlist[enumerate,1]{label=\textcolor{hypothesis1}{\arabic*.}, leftmargin=*, itemsep=3pt}

% Appendix styling
\newcommand{\appendixsection}[1]{
  \section*{#1}
  \addcontentsline{toc}{section}{#1}
}

% Citation styling helper: renders a parenthetical citation in the evidence color
\newcommand{\citehighlight}[1]{\textcolor{evidencecolor}{\citep{#1}}}

\endinput

@@ -1,302 +0,0 @@
# Scientific Hypothesis Generation: [Phenomenon Name]

## 1. Background & Context

### Phenomenon Description
[Clear description of the observation, pattern, or question that requires explanation. Include:
- What was observed or what question needs answering
- The specific context or system in which it occurs
- Any relevant constraints or boundary conditions
- Why this phenomenon is interesting or important]

### Current Understanding
[Synthesis of existing literature, including:
- What is already known about this phenomenon
- Established mechanisms or theories that may be relevant
- Key findings from recent research
- Gaps or limitations in current understanding
- Conflicting findings or unresolved debates

Include citations to key papers (Author et al., Year, Journal)]

### Knowledge Gaps
[Specific aspects that remain unexplained or poorly understood:
- What aspects of the phenomenon lack clear explanation?
- What contradictions exist in current understanding?
- What questions remain unanswered?]

---

## 2. Competing Hypotheses

### Hypothesis 1: [Concise Title]

**Mechanistic Explanation:**
[Detailed explanation of the proposed mechanism. This should explain HOW and WHY the phenomenon occurs, not just describe WHAT occurs. Include:
- Specific molecular, cellular, physiological, or population-level mechanisms
- Causal chain from initial trigger to observed outcome
- Key components, pathways, or factors involved
- Scale or level of explanation (molecular, cellular, organ, organism, population)]

**Supporting Evidence:**
[Evidence from literature that supports this hypothesis:
- Analogous mechanisms in related systems
- Direct evidence from relevant studies
- Theoretical frameworks that align with this hypothesis
- Include citations]

**Key Assumptions:**
[Explicit statement of assumptions underlying this hypothesis:
- What must be true for this hypothesis to hold?
- What conditions or contexts does it require?]

---

### Hypothesis 2: [Concise Title]

**Mechanistic Explanation:**
[Detailed mechanistic explanation distinct from Hypothesis 1]

**Supporting Evidence:**
[Evidence supporting this alternative explanation]

**Key Assumptions:**
[Assumptions underlying this hypothesis]

---

### Hypothesis 3: [Concise Title]

**Mechanistic Explanation:**
[Detailed mechanistic explanation distinct from previous hypotheses]

**Supporting Evidence:**
[Evidence supporting this explanation]

**Key Assumptions:**
[Assumptions underlying this hypothesis]

---

[Continue for Hypothesis 4, 5, etc. if applicable]

---

## 3. Quality Assessment

### Evaluation Against Core Criteria

| Criterion | Hypothesis 1 | Hypothesis 2 | Hypothesis 3 | [H4] | [H5] |
|-----------|--------------|--------------|--------------|------|------|
| **Testability** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |
| **Falsifiability** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |
| **Parsimony** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |
| **Explanatory Power** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |
| **Scope** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |
| **Consistency** | [Rating & brief note] | [Rating & brief note] | [Rating & brief note] | | |

**Rating scale:** Strong / Moderate / Weak

### Detailed Evaluation

#### Hypothesis 1
**Strengths:**
- [Specific strength 1]
- [Specific strength 2]

**Weaknesses:**
- [Specific weakness 1]
- [Specific weakness 2]

**Overall Assessment:**
[Brief summary of hypothesis quality and viability]

#### Hypothesis 2
[Similar structure]

#### Hypothesis 3
[Similar structure]

---

## 4. Experimental Designs

### Testing Hypothesis 1: [Title]

**Experiment 1A: [Brief title]**

*Design Type:* [e.g., In vitro dose-response / In vivo knockout / Clinical RCT / Observational cohort / Computational model]

*Objective:* [What specific aspect of the hypothesis does this test?]

*Methods:*
- **System/Model:** [What system, organism, or population?]
- **Intervention/Manipulation:** [What is varied or manipulated?]
- **Measurements:** [What outcomes are measured?]
- **Controls:** [What control conditions?]
- **Sample Size:** [Estimated n, with justification if possible]
- **Analysis:** [Statistical or analytical approach]

*Expected Timeline:* [Rough estimate]

*Feasibility:* [High/Medium/Low, with brief justification]

**Experiment 1B: [Brief title - alternative or complementary approach]**
[Similar structure to 1A]

---

### Testing Hypothesis 2: [Title]

**Experiment 2A: [Brief title]**
[Structure as above]

**Experiment 2B: [Brief title]**
[Structure as above]

---

### Testing Hypothesis 3: [Title]

**Experiment 3A: [Brief title]**
[Structure as above]

---

## 5. Testable Predictions

### Predictions from Hypothesis 1

1. **Prediction 1.1:** [Specific, measurable prediction]
   - **Conditions:** [Under what conditions should this be observed?]
   - **Magnitude:** [Expected effect size or direction, if quantifiable]
   - **Falsification:** [What observation would falsify this prediction?]

2. **Prediction 1.2:** [Specific, measurable prediction]
   - **Conditions:** [Conditions]
   - **Magnitude:** [Expected effect]
   - **Falsification:** [Falsifying observation]

3. **Prediction 1.3:** [Additional prediction]

---

### Predictions from Hypothesis 2

1. **Prediction 2.1:** [Specific, measurable prediction]
   - **Conditions:** [Conditions]
   - **Magnitude:** [Expected effect]
   - **Falsification:** [Falsifying observation]

2. **Prediction 2.2:** [Additional prediction]

---

### Predictions from Hypothesis 3

1. **Prediction 3.1:** [Specific, measurable prediction]
   - **Conditions:** [Conditions]
   - **Magnitude:** [Expected effect]
   - **Falsification:** [Falsifying observation]

---

## 6. Critical Comparisons

### Distinguishing Between Hypotheses

**Comparison: Hypothesis 1 vs. Hypothesis 2**

*Key Distinguishing Feature:*
[What is the fundamental difference in mechanism or prediction?]

*Discriminating Experiment:*
[What experiment or observation would clearly favor one over the other?]

*Outcome Interpretation:*
- If [Result A], then Hypothesis 1 is supported
- If [Result B], then Hypothesis 2 is supported
- If [Result C], then both/neither are supported

---

**Comparison: Hypothesis 1 vs. Hypothesis 3**
[Similar structure]

---

**Comparison: Hypothesis 2 vs. Hypothesis 3**
[Similar structure]

---

### Priority Experiments

**Highest Priority Test:**
[Which experiment would most efficiently distinguish between hypotheses or most definitively test a hypothesis?]

**Justification:**
[Why is this the highest priority? Consider informativeness, feasibility, and cost]

**Secondary Priority Tests:**
1. [Second most important experiment]
2. [Third most important]

---

## 7. Summary & Recommendations

### Summary of Hypotheses

[Brief paragraph summarizing the competing hypotheses and their relationships]

### Recommended Testing Sequence

**Phase 1 (Initial Tests):**
[Which experiments should be done first? Why?]

**Phase 2 (Contingent on Phase 1 results):**
[What follow-up experiments depend on initial results?]

**Phase 3 (Validation and Extension):**
[How to validate findings and extend to broader contexts?]

### Expected Outcomes and Implications

**If Hypothesis 1 is supported:**
[What would this mean for the field? What new questions arise?]

**If Hypothesis 2 is supported:**
[Implications and new questions]

**If Hypothesis 3 is supported:**
[Implications and new questions]

**If multiple hypotheses are partially supported:**
[How might mechanisms combine or interact?]

### Open Questions

[What questions remain even after these hypotheses are tested?]

---

## References

[List key papers cited in the document, formatted consistently]

1. Author1, A.B., & Author2, C.D. (Year). Title of paper. *Journal Name*, Volume(Issue), pages. DOI or URL

2. [Continue for all citations]

---

## Notes on Using This Template

- Replace all bracketed instructions with actual content
- Not all sections are mandatory - adapt to your specific hypothesis generation task
- For simpler phenomena, 3 hypotheses may be sufficient; complex phenomena may warrant 4-5
- Experimental designs should be detailed enough to be actionable but can be refined later
- Predictions should be as specific and quantitative as possible
- The template emphasizes both generating hypotheses and planning how to test them
- Citation format can be adjusted to field-specific standards
@@ -0,0 +1,572 @@

% !TEX program = xelatex
\documentclass[11pt,letterpaper]{article}
\usepackage{hypothesis_generation}
\usepackage{natbib}

% Document metadata
\title{[Phenomenon Name]}
\author{Scientific Hypothesis Generation}
\date{\today}

\begin{document}

\maketitle

% ============================================================================
% EXECUTIVE SUMMARY
% ============================================================================
% NOTE: Keep main text to 4 pages maximum. All details go to appendices.
% Executive Summary: 0.5-1 page

\section*{Executive Summary}
\addcontentsline{toc}{section}{Executive Summary}

\begin{summarybox}[Executive Summary]
\textbf{Phenomenon:} [One paragraph: What was observed? Why is it important?]

\vspace{0.2cm}
\textbf{Key Question:} [Single sentence stating the central question]

\vspace{0.2cm}
\textbf{Competing Hypotheses:}
\begin{enumerate}
\item \textbf{[H1 Title]:} [One sentence mechanistic summary]
\item \textbf{[H2 Title]:} [One sentence mechanistic summary]
\item \textbf{[H3 Title]:} [One sentence mechanistic summary]
\item \textbf{[Add H4 \& H5 if applicable]}
\end{enumerate}

\vspace{0.2cm}
\textbf{Recommended Approach:} [One sentence on priority experiments]

\end{summarybox}

\vspace{0.3cm}

% ============================================================================
% COMPETING HYPOTHESES
% ============================================================================
% NOTE: Keep this section to 2-2.5 pages for 3-5 hypotheses
% Each hypothesis: 1-2 brief paragraphs + 2-3 key evidence points + 1-2 assumptions
% Detailed explanations and additional evidence go to Appendix A

\section{Competing Hypotheses}

This section presents [3-5] distinct mechanistic hypotheses. Detailed literature review and comprehensive evidence are in Appendix A.

\subsection*{Hypothesis 1: [Concise Descriptive Title]}

\begin{hypothesisbox1}[Hypothesis 1: [Title]]

\textbf{Mechanistic Explanation:}

[Provide a BRIEF mechanistic explanation (1-2 paragraphs) of HOW and WHY. Keep concise - main text is limited to 4 pages total. Include only the essential mechanism. All detailed explanations go to Appendix A.

Example: "This hypothesis proposes that [mechanism X] operates through [pathway Y], resulting in [outcome Z]. The process initiates when [trigger], activating [component A] and ultimately producing the observed [phenomenon] \citep{key-ref}."
]

\vspace{0.2cm}

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item [Most essential evidence point 1 \citep{author2023}]
\item [Most essential evidence point 2 \citep{author2022}]
\item [Most essential evidence point 3 \citep{author2021}]
\end{itemize}

\vspace{0.2cm}

\textbf{Core Assumptions:}
\begin{enumerate}
\item [Most critical assumption 1]
\item [Most critical assumption 2]
\end{enumerate}

\end{hypothesisbox1}

\vspace{0.3cm}

\subsection*{Hypothesis 2: [Concise Descriptive Title]}

\begin{hypothesisbox2}[Hypothesis 2: [Title]]

\textbf{Mechanistic Explanation:}

[BRIEF mechanistic explanation (1-2 paragraphs) distinct from Hypothesis 1. Keep concise.]

\vspace{0.2cm}

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item [Essential evidence point 1 with citation]
\item [Essential evidence point 2 with citation]
\item [Essential evidence point 3 with citation]
\end{itemize}

\vspace{0.2cm}

\textbf{Core Assumptions:}
\begin{enumerate}
\item [Critical assumption 1]
\item [Critical assumption 2]
\end{enumerate}

\end{hypothesisbox2}

\vspace{0.3cm}

\subsection*{Hypothesis 3: [Concise Descriptive Title]}

\begin{hypothesisbox3}[Hypothesis 3: [Title]]

\textbf{Mechanistic Explanation:}

[BRIEF mechanistic explanation (1-2 paragraphs) distinct from previous hypotheses.]

\vspace{0.2cm}

\textbf{Key Supporting Evidence:}
\begin{itemize}
\item [Essential evidence point 1 with citation]
\item [Essential evidence point 2 with citation]
\item [Essential evidence point 3 with citation]
\end{itemize}

\vspace{0.2cm}

\textbf{Core Assumptions:}
\begin{enumerate}
\item [Critical assumption 1]
\item [Critical assumption 2]
\end{enumerate}

\end{hypothesisbox3}

\vspace{0.3cm}

% Optional: Include Hypothesis 4 and 5 if needed
% \subsection*{Hypothesis 4: [Title]}
% \begin{hypothesisbox4}[Hypothesis 4: [Title]]
% [Content following same structure]
% \end{hypothesisbox4}

% \subsection*{Hypothesis 5: [Title]}
% \begin{hypothesisbox5}[Hypothesis 5: [Title]]
% [Content following same structure]
% \end{hypothesisbox5}

% ============================================================================
% TESTABLE PREDICTIONS
% ============================================================================
% NOTE: Keep this section to 0.5-1 page
% Include only 1-2 most critical predictions per hypothesis
% Additional predictions go to Appendix B with experimental designs

\section{Testable Predictions}

Key predictions from each hypothesis. Full prediction details and additional predictions in Appendix B.

\subsection*{Predictions from Hypothesis 1}

\begin{predictionbox}[Predictions: Hypothesis 1]

\textbf{Prediction 1.1:} [Most critical prediction]
\begin{itemize}
\item \textbf{Expected Outcome:} [Specific result with magnitude if possible]
\item \textbf{Falsification:} [What would disprove it]
\end{itemize}

\vspace{0.15cm}

\textbf{Prediction 1.2:} [Second most critical prediction]
\begin{itemize}
\item \textbf{Expected Outcome:} [Specific result]
\item \textbf{Falsification:} [What would disprove it]
\end{itemize}

\end{predictionbox}

\vspace{0.3cm}

\subsection*{Predictions from Hypothesis 2}

\begin{predictionbox}[Predictions: Hypothesis 2]

\textbf{Prediction 2.1:} [Most critical prediction]
\begin{itemize}
\item \textbf{Expected Outcome:} [Specific result]
\item \textbf{Falsification:} [What would disprove it]
\end{itemize}

\vspace{0.15cm}

\textbf{Prediction 2.2:} [Second most critical prediction]
\begin{itemize}
\item \textbf{Expected Outcome:} [Specific result]
\item \textbf{Falsification:} [What would disprove it]
\end{itemize}

\end{predictionbox}

\vspace{0.3cm}

\subsection*{Predictions from Hypothesis 3}

\begin{predictionbox}[Predictions: Hypothesis 3]

[1-2 most critical predictions only, following same brief structure]

\end{predictionbox}

% Add prediction boxes for Hypotheses 4 and 5 if applicable

% ============================================================================
% CRITICAL COMPARISONS
% ============================================================================
% NOTE: Keep this section to 0.5-1 page
% Include only the HIGHEST PRIORITY comparison
% Additional comparisons go to Appendix B

\section{Critical Comparisons}

Highest priority comparison for distinguishing hypotheses. Additional comparisons in Appendix B.

\subsection*{Priority Comparison: Hypothesis 1 vs. Hypothesis 2}

\begin{comparisonbox}[H1 vs. H2: Key Distinction]

\textbf{Fundamental Difference:} [One sentence on core mechanistic difference]

\vspace{0.2cm}

\textbf{Discriminating Experiment:} [Brief description of key experiment to distinguish them]

\vspace{0.2cm}

\textbf{Outcome Interpretation:}
\begin{itemize}
\item \textbf{If [Result A]:} H1 supported
\item \textbf{If [Result B]:} H2 supported
\end{itemize}

\end{comparisonbox}

\vspace{0.3cm}

\textbf{Highest Priority Test:} [Name of single most important experiment]

\textbf{Justification:} [2-3 sentences on why this is highest priority considering informativeness and feasibility. Full experimental details in Appendix B.]

% ============================================================================
% APPENDICES
% ============================================================================
\newpage
\appendix

% ============================================================================
% APPENDIX A: COMPREHENSIVE LITERATURE REVIEW
% ============================================================================
\appendixsection{Appendix A: Comprehensive Literature Review}

This appendix provides detailed synthesis of existing literature, extensive background context, and comprehensive citations supporting the hypotheses presented in this report.

\subsection*{A.1 Phenomenon Background and Context}

[Provide extensive background on the phenomenon. This section should be comprehensive, including:
\begin{itemize}
\item Historical context and when the phenomenon was first observed
\item Detailed description of what is known about the phenomenon
\item Why this phenomenon is scientifically important
\item Practical or clinical implications if applicable
\item Current debates or controversies in the field
\end{itemize}

Include extensive citations throughout. Aim for 10-15 citations in this subsection alone.]

\subsection*{A.2 Current Understanding and Established Mechanisms}

[Synthesize what is currently understood about this phenomenon:
\begin{itemize}
\item Established theories or frameworks that may apply
\item Known mechanisms from related systems or analogous phenomena
\item Molecular, cellular, or systemic processes that are well-characterized
\item Population-level patterns that have been documented
\item Computational or theoretical models that have been proposed
\end{itemize}

Include 15-20 citations covering recent reviews, primary research papers, and foundational studies.]

\subsection*{A.3 Evidence Supporting Hypothesis 1}

[Provide detailed discussion of all evidence supporting Hypothesis 1. This goes beyond the brief bullet points in the main text:
\begin{itemize}
\item Detailed findings from key papers
\item Mechanistic studies showing relevant pathways
\item Data from analogous systems
\item Theoretical support
\item Any preliminary or indirect evidence
\end{itemize}

Include 8-12 citations specific to this hypothesis.]

\subsection*{A.4 Evidence Supporting Hypothesis 2}

[Same structure as A.3, focused on Hypothesis 2. Include 8-12 citations.]

\subsection*{A.5 Evidence Supporting Hypothesis 3}

[Same structure as A.3, focused on Hypothesis 3. Include 8-12 citations.]

% Add subsections for Hypotheses 4 and 5 if applicable, renumbering A.6 and A.7 below accordingly

\subsection*{A.6 Conflicting Findings and Unresolved Debates}

[Discuss contradictions in the literature:
\begin{itemize}
\item Studies with conflicting results
\item Ongoing debates about mechanisms
\item Alternative interpretations of existing data
\item Methodological issues that complicate interpretation
\item Areas where consensus has not been reached
\end{itemize}

Include 5-10 citations highlighting key controversies.]

\subsection*{A.7 Knowledge Gaps and Limitations}

[Identify what is still unknown:
\begin{itemize}
\item Aspects of the phenomenon that lack clear explanation
\item Missing data or unstudied conditions
\item Limitations of current methods or approaches
\item Questions that remain unanswered
\item Assumptions that have not been tested
\end{itemize}

Include 3-5 citations discussing limitations or identifying gaps.]

% ============================================================================
% APPENDIX B: DETAILED EXPERIMENTAL DESIGNS
% ============================================================================
\newpage
\appendixsection{Appendix B: Detailed Experimental Designs}

This appendix provides comprehensive experimental protocols for testing each hypothesis, including methods, controls, sample sizes, statistical approaches, and feasibility assessments.

\subsection*{B.1 Experiments for Testing Hypothesis 1}

\subsubsection*{Experiment 1A: [Descriptive Title]}

\textbf{Design Type:} [e.g., In vitro dose-response / In vivo knockout / Clinical RCT / Observational cohort / Computational model]

\textbf{Objective:} [What specific aspect of Hypothesis 1 does this experiment test? What question does it answer?]

\textbf{Detailed Methods:}
\begin{itemize}
\item \textbf{System/Model:} [What system, organism, cell type, or population will be studied? Include species, strains, patient populations, etc.]
\item \textbf{Intervention/Manipulation:} [What will be varied or manipulated? Include specific treatments, genetic modifications, interventions, etc.]
\item \textbf{Measurements:} [What outcomes will be measured? Include primary and secondary endpoints, measurement techniques, timing of measurements]
\item \textbf{Controls:} [What control conditions will be included? Negative controls, positive controls, vehicle controls, sham procedures, etc.]
\item \textbf{Sample Size:} [Estimated n per group with power analysis justification if possible. Include assumptions about effect size and variability.]
\item \textbf{Randomization \& Blinding:} [How will subjects be randomized? Who will be blinded?]
\item \textbf{Statistical Analysis:} [Specific statistical tests planned, correction for multiple comparisons, significance thresholds]
\end{itemize}

\textbf{Expected Timeline:} [Rough estimate of duration from start to completion]

\textbf{Resource Requirements:}
\begin{itemize}
\item \textbf{Equipment:} [Specialized equipment needed]
\item \textbf{Materials:} [Key reagents, animals, human subjects]
\item \textbf{Expertise:} [Specialized skills or training required]
\item \textbf{Estimated Cost:} [Rough cost estimate if applicable]
\end{itemize}

\textbf{Feasibility Assessment:} [High/Medium/Low with justification. Consider technical challenges, resource availability, ethical considerations]

\textbf{Potential Confounds and Mitigation:}
\begin{itemize}
\item [Confound 1 and how to address it]
\item [Confound 2 and how to address it]
\item [Confound 3 and how to address it]
\end{itemize}

\vspace{0.5cm}

\subsubsection*{Experiment 1B: [Alternative or Complementary Approach]}

[Follow same detailed structure as Experiment 1A. This should be an alternative method to test the same aspect of Hypothesis 1, or a complementary experiment that tests a different aspect.]

\vspace{0.5cm}

\subsection*{B.2 Experiments for Testing Hypothesis 2}

\subsubsection*{Experiment 2A: [Descriptive Title]}

[Follow same detailed structure as above]

\subsubsection*{Experiment 2B: [Alternative or Complementary Approach]}

[Follow same detailed structure as above]

\vspace{0.5cm}

\subsection*{B.3 Experiments for Testing Hypothesis 3}

[Continue with same structure for all hypotheses]

\vspace{0.5cm}

\subsection*{B.4 Discriminating Experiments}

[Provide detailed protocols for the priority experiments identified in Section 3 (Critical Comparisons) that distinguish between hypotheses]

% ============================================================================
% APPENDIX C: QUALITY ASSESSMENT
% ============================================================================
\newpage
\appendixsection{Appendix C: Quality Assessment}

This appendix provides detailed evaluation of each hypothesis against established quality criteria.

\subsection*{C.1 Comparative Quality Assessment}

\begin{hypotable}{Hypothesis Quality Criteria Evaluation}
\begin{tabular}{|p{2.5cm}|p{3cm}|p{3cm}|p{3cm}|}
\hline
\tableheadercolor
\textcolor{white}{\textbf{Criterion}} & \textcolor{white}{\textbf{Hypothesis 1}} & \textcolor{white}{\textbf{Hypothesis 2}} & \textcolor{white}{\textbf{Hypothesis 3}} \\
\hline
\textbf{Testability} & [Strong/Moderate/Weak] [Brief note: why?] & [Rating \& note] & [Rating \& note] \\
\hline
\tablerowcolor
\textbf{Falsifiability} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\textbf{Parsimony} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\tablerowcolor
\textbf{Explanatory Power} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\textbf{Scope} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\tablerowcolor
\textbf{Consistency} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\textbf{Novelty} & [Rating \& note] & [Rating \& note] & [Rating \& note] \\
\hline
\end{tabular}
\caption{Comparative assessment of hypotheses across quality criteria. Strong = meets criterion very well; Moderate = partially meets criterion; Weak = does not meet criterion well.}
\end{hypotable}

\subsection*{C.2 Detailed Evaluation: Hypothesis 1}

\textbf{Strengths:}
\begin{enumerate}
\item [Specific strength 1 with explanation of why this is advantageous]
\item [Specific strength 2]
\item [Specific strength 3]
\item [Additional strengths as applicable]
\end{enumerate}

\textbf{Weaknesses:}
\begin{enumerate}
\item [Specific weakness 1 with explanation of the limitation]
\item [Specific weakness 2]
\item [Specific weakness 3]
\item [Additional weaknesses as applicable]
\end{enumerate}

\textbf{Overall Assessment:}

[Provide a comprehensive 1-2 paragraph assessment of Hypothesis 1's quality and viability. Consider:
\begin{itemize}
\item How well does it balance the various quality criteria?
\item What are the key trade-offs?
\item Under what conditions would this be the most promising hypothesis?
\item What are the major challenges to testing or validating it?
\item How does it compare overall to competing hypotheses?
\end{itemize}]

\subsection*{C.3 Detailed Evaluation: Hypothesis 2}

[Follow same structure as C.2]

\subsection*{C.4 Detailed Evaluation: Hypothesis 3}

[Follow same structure as C.2]

% Add evaluations for Hypotheses 4 and 5 if applicable, renumbering the Recommendations subsection below accordingly

\subsection*{C.5 Recommendations Based on Quality Assessment}

[Synthesize the quality assessments to provide recommendations:
\begin{itemize}
\item Which hypothesis appears most promising overall?
\item Which hypothesis should be tested first? Why?
\item Are there scenarios where different hypotheses would be preferred?
\item Could multiple hypotheses be partially correct?
\item What would need to be true for each hypothesis to be viable?
\end{itemize}]

% ============================================================================
% APPENDIX D: SUPPLEMENTARY EVIDENCE
% ============================================================================
\newpage
\appendixsection{Appendix D: Supplementary Evidence}

This appendix provides additional supporting information, including analogous mechanisms, relevant data, and context that further informs the hypotheses.

\subsection*{D.1 Analogous Mechanisms in Related Systems}

[Discuss similar mechanisms or phenomena in related systems that provide insight:
\begin{itemize}
\item How do analogous systems behave?
\item What mechanisms operate in those systems?
\item How might lessons from related systems apply here?
\item What similarities and differences exist?
\end{itemize}

Include citations to relevant comparative studies.]

\subsection*{D.2 Preliminary Data or Observations}

[If applicable, discuss any preliminary data, pilot studies, or anecdotal observations that informed hypothesis generation but weren't formally published or well-documented.]

\subsection*{D.3 Theoretical Frameworks}

[Discuss broader theoretical frameworks that relate to the hypotheses:
\begin{itemize}
\item What general principles or theories apply?
\item How do the hypotheses fit within established frameworks?
\item Are there mathematical or computational models that support any hypothesis?
\end{itemize}]

\subsection*{D.4 Historical Context and Evolution of Ideas}

[Provide historical perspective on how thinking about this phenomenon has evolved, what previous hypotheses have been proposed and tested, and what lessons have been learned from past attempts to explain the phenomenon.]

% ============================================================================
% REFERENCES
% ============================================================================
\newpage
\bibliographystyle{plainnat}
\bibliography{references}

% Alternatively, manually format references if not using BibTeX:
% \begin{thebibliography}{99}
%
% \bibitem{author2023}
% Author1, A.B., \& Author2, C.D. (2023).
% Title of paper.
% \textit{Journal Name}, \textit{Volume}(Issue), pages.
% DOI or URL
%
% \bibitem{author2022}
% [Continue with all references...]
%
% [Target: 50+ references covering all citations in main text and appendices]
%
% \end{thebibliography}

\end{document}

@@ -4,6 +4,8 @@

This reference provides patterns and frameworks for designing experiments across scientific domains. Use these patterns to develop rigorous tests for generated hypotheses.

**Note on Report Structure:** When generating hypothesis reports, mention only the key experimental approach (e.g., "in vivo knockout study" or "prospective cohort design") in the main text hypothesis boxes. Include comprehensive experimental protocols with full methods, controls, sample sizes, statistical approaches, feasibility assessments, and resource requirements in **Appendix B: Detailed Experimental Designs**.

## Design Selection Framework

Choose experimental approaches based on:

@@ -4,6 +4,8 @@

Use these criteria to assess the quality and rigor of generated hypotheses. A robust hypothesis should score well across multiple dimensions.

**Note on Report Structure:** When generating hypothesis reports, provide a brief quality assessment summary in the main text (comparative table with ratings), and include detailed evaluation with strengths, weaknesses, and comprehensive analysis in **Appendix C: Quality Assessment**.

## Core Criteria

### 1. Testability

@@ -367,6 +367,36 @@ Use WebSearch for:
- What analogies exist in other systems?
- What methods are commonly used?

### Citation Organization for Hypothesis Reports

**For report structure:** Organize citations for two audiences:

**Main Text (15-20 key citations):**
- Most influential papers (highly cited, seminal studies)
- Recent definitive evidence (last 2-3 years)
- Key papers directly supporting each hypothesis (3-5 per hypothesis)
- Major reviews synthesizing the field

**Appendix A: Comprehensive Literature Review (40-60+ citations):**
- **Historical context:** Foundational papers establishing field
- **Current understanding:** Recent reviews and meta-analyses
- **Hypothesis-specific evidence:** 8-15 papers per hypothesis covering:
  - Direct supporting evidence
  - Analogous mechanisms in related systems
  - Methodological precedents
  - Theoretical framework papers
- **Conflicting findings:** Papers representing different viewpoints
- **Knowledge gaps:** Papers identifying limitations or unanswered questions

**Target citation density:** Aim for 50+ total references to provide comprehensive support for all claims and demonstrate thorough literature grounding.

**Grouping strategy for Appendix A:**
1. Background and context papers
2. Current understanding and established mechanisms
3. Evidence supporting each hypothesis (separate subsections)
4. Contradictory or alternative findings
5. Methodological and technical papers

## Practical Search Workflow

### Step-by-Step Process

scientific-skills/markitdown/INSTALLATION_GUIDE.md (Normal file, 318 lines)
@@ -0,0 +1,318 @@

# MarkItDown Installation Guide

## Prerequisites

- Python 3.10 or higher
- pip package manager
- Virtual environment (recommended)

## Basic Installation

### Install All Features (Recommended)

```bash
pip install 'markitdown[all]'
```

This installs support for all file formats and features.

### Install Specific Features

If you only need certain file formats, you can install specific dependencies:

```bash
# PDF support only
pip install 'markitdown[pdf]'

# Office documents
pip install 'markitdown[docx,pptx,xlsx]'

# Multiple formats
pip install 'markitdown[pdf,docx,pptx,xlsx,audio-transcription]'
```

### Install from Source

```bash
git clone https://github.com/microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
```

## Optional Dependencies

| Feature | Installation | Use Case |
|---------|--------------|----------|
| All formats | `pip install 'markitdown[all]'` | Everything |
| PDF | `pip install 'markitdown[pdf]'` | PDF documents |
| Word | `pip install 'markitdown[docx]'` | DOCX files |
| PowerPoint | `pip install 'markitdown[pptx]'` | PPTX files |
| Excel (new) | `pip install 'markitdown[xlsx]'` | XLSX files |
| Excel (old) | `pip install 'markitdown[xls]'` | XLS files |
| Outlook | `pip install 'markitdown[outlook]'` | MSG files |
| Azure DI | `pip install 'markitdown[az-doc-intel]'` | Enhanced PDF |
| Audio | `pip install 'markitdown[audio-transcription]'` | WAV/MP3 |
| YouTube | `pip install 'markitdown[youtube-transcription]'` | YouTube videos |

## System Dependencies

### OCR Support (for scanned documents and images)

#### macOS
```bash
brew install tesseract
```

#### Ubuntu/Debian
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

#### Windows
Download from: https://github.com/UB-Mannheim/tesseract/wiki

### Poppler Utils (for advanced PDF operations)

#### macOS
```bash
brew install poppler
```

#### Ubuntu/Debian
```bash
sudo apt-get install poppler-utils
```

## Verification

Test your installation:

```bash
# Verify that the package imports
python -c "import markitdown; print('MarkItDown installed successfully')"

# Test basic conversion
echo "Test" > test.txt
markitdown test.txt
rm test.txt
```

## Virtual Environment Setup

### Using venv

```bash
# Create virtual environment
python -m venv markitdown-env

# Activate (macOS/Linux)
source markitdown-env/bin/activate

# Activate (Windows)
markitdown-env\Scripts\activate

# Install
pip install 'markitdown[all]'
```

### Using conda

```bash
# Create environment
conda create -n markitdown python=3.12

# Activate
conda activate markitdown

# Install
pip install 'markitdown[all]'
```

### Using uv

```bash
# Create virtual environment
uv venv --python=3.12 .venv

# Activate
source .venv/bin/activate

# Install
uv pip install 'markitdown[all]'
```

## AI Enhancement Setup (Optional)

For AI-powered image descriptions using OpenRouter:

### OpenRouter API

OpenRouter provides unified access to multiple AI models (GPT-4, Claude, Gemini, etc.) through a single API.

```bash
# Install the OpenAI SDK (required; it is installed automatically with markitdown)
pip install openai

# Get API key from https://openrouter.ai/keys

# Set API key
export OPENROUTER_API_KEY="sk-or-v1-..."

# Add to shell profile for persistence
echo 'export OPENROUTER_API_KEY="sk-or-v1-..."' >> ~/.bashrc  # Linux
echo 'export OPENROUTER_API_KEY="sk-or-v1-..."' >> ~/.zshrc   # macOS
```

**Why OpenRouter?**
- Access to 100+ AI models through one API
- Choose between GPT-4, Claude, Gemini, and more
- Competitive pricing
- No vendor lock-in
- Simple OpenAI-compatible interface

**Popular Models for Image Description:**
- `anthropic/claude-sonnet-4.5` - **Recommended** - Best for scientific vision
- `anthropic/claude-opus-4.5` - Excellent technical analysis
- `openai/gpt-4o` - Good vision understanding
- `google/gemini-pro-vision` - Cost-effective option

See https://openrouter.ai/models for complete model list and pricing.

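Before wiring the key into MarkItDown, a quick connectivity check can save debugging time. A minimal sketch, assuming the OpenAI Python SDK v1 interface and one of the vision models listed above:

```python
import os
from openai import OpenAI

# Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

# A tiny completion confirms the key and endpoint are working
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.5",
    messages=[{"role": "user", "content": "Reply with OK"}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```
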
## Azure Document Intelligence Setup (Optional)

For enhanced PDF conversion:

1. Create Azure Document Intelligence resource in Azure Portal
2. Get endpoint and key
3. Set environment variables:

```bash
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://your-endpoint.cognitiveservices.azure.com/"
```

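With the variables set, MarkItDown can be pointed at the resource via its `docintel_endpoint` parameter; a minimal sketch (the endpoint and file name are placeholders, and `QUICK_REFERENCE.md` shows the same call):

```python
from markitdown import MarkItDown

# Route conversion of complex layouts through Azure Document Intelligence
md = MarkItDown(docintel_endpoint="https://your-endpoint.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
print(result.text_content)
```
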
## Docker Installation (Alternative)

```bash
# Clone repository
git clone https://github.com/microsoft/markitdown.git
cd markitdown

# Build image
docker build -t markitdown:latest .

# Run
docker run --rm -i markitdown:latest < input.pdf > output.md
```

## Troubleshooting

### Import Error
```
ModuleNotFoundError: No module named 'markitdown'
```

**Solution**: Ensure you're in the correct virtual environment and markitdown is installed:
```bash
pip install 'markitdown[all]'
```

### Missing Feature
```
Error: PDF conversion not supported
```

**Solution**: Install the specific feature:
```bash
pip install 'markitdown[pdf]'
```

### OCR Not Working

**Solution**: Install Tesseract OCR (see System Dependencies above)

### Permission Errors

**Solution**: Use a virtual environment or install with the `--user` flag:
```bash
pip install --user 'markitdown[all]'
```

## Upgrading

```bash
# Upgrade to latest version
pip install --upgrade 'markitdown[all]'

# Check installed version
pip show markitdown
```

## Uninstallation

```bash
pip uninstall markitdown
```

## Next Steps

After installation:
1. Read `QUICK_REFERENCE.md` for basic usage
2. See `SKILL.md` for the comprehensive guide
3. Try example scripts in the `scripts/` directory
4. Check `assets/example_usage.md` for practical examples

## Skill Scripts Setup

To use the skill scripts:

```bash
# Navigate to scripts directory
cd /Users/vinayak/Documents/claude-scientific-writer/.claude/skills/markitdown/scripts

# Scripts are already executable, just run them
python batch_convert.py --help
python convert_with_ai.py --help
python convert_literature.py --help
```

## Testing Installation

Create a test file to verify everything works:

```python
# test_markitdown.py
import os

from markitdown import MarkItDown


def test_basic():
    md = MarkItDown()

    # Create a simple test file
    with open("test.txt", "w") as f:
        f.write("Hello MarkItDown!")

    # Convert it
    result = md.convert("test.txt")
    print("✓ Basic conversion works")
    print(result.text_content)

    # Cleanup
    os.remove("test.txt")


if __name__ == "__main__":
    test_basic()
```

Run it:
```bash
python test_markitdown.py
```

## Getting Help

- **Documentation**: See `SKILL.md` and `README.md`
- **GitHub Issues**: https://github.com/microsoft/markitdown/issues
- **Examples**: `assets/example_usage.md`
- **API Reference**: `references/api_reference.md`

scientific-skills/markitdown/LICENSE.txt (Normal file, 22 lines)
@@ -0,0 +1,22 @@

MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

scientific-skills/markitdown/OPENROUTER_INTEGRATION.md (Normal file, 359 lines)
@@ -0,0 +1,359 @@

# OpenRouter Integration for MarkItDown

## Overview

This MarkItDown skill has been configured to use **OpenRouter** instead of direct OpenAI API access. OpenRouter provides a unified API gateway to access 100+ AI models from different providers through a single, OpenAI-compatible interface.

## Why OpenRouter?

### Benefits

1. **Multiple Model Access**: Access GPT-4, Claude, Gemini, and 100+ other models through one API
2. **No Vendor Lock-in**: Switch between models without code changes
3. **Competitive Pricing**: Often better rates than going direct
4. **Simple Migration**: OpenAI-compatible API means minimal code changes
5. **Flexible Choice**: Choose the best model for each task

### Popular Models for Image Description

| Model | Provider | Use Case | Vision Support |
|-------|----------|----------|----------------|
| `anthropic/claude-sonnet-4.5` | Anthropic | **Recommended** - Best overall for scientific analysis | ✅ |
| `anthropic/claude-opus-4.5` | Anthropic | Excellent technical analysis | ✅ |
| `openai/gpt-4o` | OpenAI | Strong vision understanding | ✅ |
| `openai/gpt-4-vision` | OpenAI | GPT-4 with vision | ✅ |
| `google/gemini-pro-vision` | Google | Cost-effective option | ✅ |

See https://openrouter.ai/models for the complete list.

## Getting Started

### 1. Get an API Key

1. Visit https://openrouter.ai/keys
2. Sign up or log in
3. Create a new API key
4. Copy the key (starts with `sk-or-v1-...`)

### 2. Set Environment Variable

```bash
# Add to your environment
export OPENROUTER_API_KEY="sk-or-v1-..."

# Make it permanent
echo 'export OPENROUTER_API_KEY="sk-or-v1-..."' >> ~/.zshrc   # macOS
echo 'export OPENROUTER_API_KEY="sk-or-v1-..."' >> ~/.bashrc  # Linux

# Reload shell
source ~/.zshrc  # or: source ~/.bashrc
```

### 3. Use in Python

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client (OpenAI-compatible)
client = OpenAI(
    api_key="your-openrouter-api-key",  # or read from the env var
    base_url="https://openrouter.ai/api/v1"
)

# Create MarkItDown with AI support
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5"  # choose your model
)

# Convert with AI-enhanced descriptions
result = md.convert("presentation.pptx")
print(result.text_content)
```

## Using the Scripts

All skill scripts have been updated to use OpenRouter:

### convert_with_ai.py

```bash
# Set API key
export OPENROUTER_API_KEY="sk-or-v1-..."

# Convert with the default model (advanced vision model)
python scripts/convert_with_ai.py paper.pdf output.md --prompt-type scientific

# Use GPT-4o as an alternative
python scripts/convert_with_ai.py paper.pdf output.md \
    --model openai/gpt-4o \
    --prompt-type scientific

# Use Gemini Pro Vision (cost-effective)
python scripts/convert_with_ai.py slides.pptx output.md \
    --model google/gemini-pro-vision \
    --prompt-type presentation

# List available prompt types
python scripts/convert_with_ai.py --list-prompts
```

### Choosing the Right Model

```bash
# For scientific papers - advanced vision model for technical analysis
python scripts/convert_with_ai.py research.pdf output.md \
    --model anthropic/claude-sonnet-4.5 \
    --prompt-type scientific

# For presentations
python scripts/convert_with_ai.py slides.pptx output.md \
    --model anthropic/claude-sonnet-4.5 \
    --prompt-type presentation

# For data visualizations
python scripts/convert_with_ai.py charts.pdf output.md \
    --model anthropic/claude-sonnet-4.5 \
    --prompt-type data_viz

# For medical images - detailed analysis
python scripts/convert_with_ai.py xray.jpg output.md \
    --model anthropic/claude-sonnet-4.5 \
    --prompt-type medical
```

## Code Examples

### Basic Usage

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

# Initialize OpenRouter client
client = OpenAI(
    api_key=os.environ.get("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1"
)

# Use an advanced vision model for image descriptions
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5"
)

result = md.convert("document.pptx")
print(result.text_content)
```

### Switching Models Dynamically

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1"
)

# Use different models for different file types
def convert_with_best_model(filepath):
    if filepath.endswith('.pdf'):
        # Advanced vision model for technical PDFs
        md = MarkItDown(
            llm_client=client,
            llm_model="anthropic/claude-sonnet-4.5",
            llm_prompt="Describe scientific figures with technical precision"
        )
    elif filepath.endswith('.pptx'):
        # Fast vision model for presentations
        md = MarkItDown(
            llm_client=client,
            llm_model="openai/gpt-4o",
            llm_prompt="Describe slide content and visual elements"
        )
    else:
        # Cost-effective default for everything else
        md = MarkItDown(
            llm_client=client,
            llm_model="google/gemini-pro-vision"
        )

    return md.convert(filepath)

# Use it
result = convert_with_best_model("paper.pdf")
```

### Custom Prompts per Model

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Scientific analysis with advanced vision model
scientific_prompt = """
Analyze this scientific figure. Provide:
1. Type of visualization and methodology
2. Quantitative data points and trends
3. Statistical significance
4. Technical interpretation
Be precise and use scientific terminology.
"""

md_scientific = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=scientific_prompt
)

# Visual analysis with advanced vision model
visual_prompt = """
Describe this image comprehensively:
1. Main visual elements and composition
2. Colors, layout, and design
3. Text and labels
4. Overall message
"""

md_visual = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=visual_prompt
)
```

## Model Comparison

### For Scientific Content

**Recommended: anthropic/claude-sonnet-4.5**
- Excellent at technical analysis
- Superior reasoning capabilities
- Best at understanding scientific figures
- Most detailed and accurate explanations
- Advanced vision capabilities

**Alternative: openai/gpt-4o**
- Good vision understanding
- Fast processing
- Good at charts and graphs

### For Presentations

**Recommended: anthropic/claude-sonnet-4.5**
- Superior vision capabilities
- Excellent at understanding slide layouts
- Fast and reliable
- Best technical comprehension

### For Cost-Effectiveness

**Recommended: google/gemini-pro-vision**
- Lower cost per request
- Good quality
- Fast processing

## Pricing Considerations

OpenRouter pricing varies by model. Check current rates at https://openrouter.ai/models.

**Tips for Cost Optimization:**
1. Use advanced vision models for best quality on complex scientific content
2. Use cheaper models (Gemini) for simple images
3. Batch process similar content with the same model
4. Use appropriate prompts to get better results in fewer retries

## Troubleshooting

### API Key Issues

```bash
# Check if key is set
echo $OPENROUTER_API_KEY

# Should show: sk-or-v1-...
# If empty, set it:
export OPENROUTER_API_KEY="sk-or-v1-..."
```

### Model Not Found

If you get a "model not found" error, check:
1. Model name format: `provider/model-name`
2. Model availability: https://openrouter.ai/models
3. Vision support: Ensure model supports vision for image description

### Rate Limits

OpenRouter has rate limits. If you hit them:
1. Add delays between requests (see the sketch below)
2. Use batch processing scripts with `--workers` parameter
3. Consider upgrading your OpenRouter plan

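For the first option, a minimal sketch of spacing out conversions (the one-second delay and paths are illustrative placeholders, not OpenRouter-documented values):

```python
import time
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()  # configure llm_client/llm_model as shown above

Path("markdown_output").mkdir(exist_ok=True)
for path in Path("images/").glob("*.png"):
    result = md.convert(str(path))
    (Path("markdown_output") / f"{path.stem}.md").write_text(result.text_content)
    time.sleep(1.0)  # fixed pause between requests to stay under rate limits
```
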
## Migration Notes

This skill was updated from direct OpenAI API to OpenRouter. Key changes (a before/after sketch follows the list):

1. **Environment Variable**: `OPENAI_API_KEY` → `OPENROUTER_API_KEY`
2. **Client Initialization**: Added `base_url="https://openrouter.ai/api/v1"`
3. **Model Names**: `gpt-4o` → `openai/gpt-4o` (with provider prefix)
4. **Script Updates**: All scripts now use OpenRouter by default

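In code, the change amounts to a different environment variable, an extra `base_url`, and a provider-prefixed model name (the "before" lines reflect a typical direct-OpenAI setup, not this skill's exact prior code):

```python
import os
from openai import OpenAI

# Before: direct OpenAI access
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# llm_model = "gpt-4o"

# After: OpenRouter gateway with a provider-prefixed model name
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)
llm_model = "openai/gpt-4o"
```
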
## Resources

- **OpenRouter Website**: https://openrouter.ai
- **Get API Keys**: https://openrouter.ai/keys
- **Model List**: https://openrouter.ai/models
- **Pricing**: https://openrouter.ai/models (click on model for details)
- **Documentation**: https://openrouter.ai/docs
- **Support**: https://openrouter.ai/discord

## Example Workflow

Here's a complete workflow using OpenRouter:

```bash
# 1. Set up API key
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"

# 2. Convert a scientific paper with Claude
python scripts/convert_with_ai.py \
    research_paper.pdf \
    output.md \
    --model anthropic/claude-opus-4.5 \
    --prompt-type scientific

# 3. Convert presentation with GPT-4o
python scripts/convert_with_ai.py \
    talk_slides.pptx \
    slides.md \
    --model openai/gpt-4o \
    --prompt-type presentation

# 4. Batch convert with cost-effective model
python scripts/batch_convert.py \
    images/ \
    markdown_output/ \
    --extensions .jpg .png
```

## Support

For OpenRouter-specific issues:
- Discord: https://openrouter.ai/discord
- Email: support@openrouter.ai

For MarkItDown skill issues:
- Check documentation in this skill directory
- Review examples in `assets/example_usage.md`

scientific-skills/markitdown/QUICK_REFERENCE.md (Normal file, 309 lines)
@@ -0,0 +1,309 @@

# MarkItDown Quick Reference

## Installation

```bash
# All features
pip install 'markitdown[all]'

# Specific formats
pip install 'markitdown[pdf,docx,pptx,xlsx]'
```

## Basic Usage

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("file.pdf")
print(result.text_content)
```

## Command Line

```bash
# Simple conversion
markitdown input.pdf > output.md
markitdown input.pdf -o output.md

# With plugins
markitdown --use-plugins file.pdf -o output.md
```

## Common Tasks

### Convert PDF
```python
md = MarkItDown()
result = md.convert("paper.pdf")
```

### Convert with AI
```python
from openai import OpenAI

# Use OpenRouter for multiple model access
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5"  # recommended for vision
)
result = md.convert("slides.pptx")
```

### Batch Convert
```bash
python scripts/batch_convert.py input/ output/ --extensions .pdf .docx
```

### Literature Conversion
```bash
python scripts/convert_literature.py papers/ markdown/ --create-index
```

## Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| PDF | `.pdf` | Full text + OCR |
| Word | `.docx` | Tables, formatting |
| PowerPoint | `.pptx` | Slides + notes |
| Excel | `.xlsx`, `.xls` | Tables |
| Images | `.jpg`, `.png`, `.gif`, `.webp` | EXIF + OCR |
| Audio | `.wav`, `.mp3` | Transcription |
| HTML | `.html`, `.htm` | Clean conversion |
| Data | `.csv`, `.json`, `.xml` | Structured |
| Archives | `.zip` | Iterates contents |
| E-books | `.epub` | Full text |
| YouTube | URLs | Transcripts |

## Optional Dependencies

```bash
[all]                    # All features
[pdf]                    # PDF support
[docx]                   # Word documents
[pptx]                   # PowerPoint
[xlsx]                   # Excel
[xls]                    # Old Excel
[outlook]                # Outlook messages
[az-doc-intel]           # Azure Document Intelligence
[audio-transcription]    # Audio files
[youtube-transcription]  # YouTube videos
```

## AI-Enhanced Conversion

### Scientific Papers
```python
from openai import OpenAI

# Initialize OpenRouter client
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe scientific figures with technical precision"
)
result = md.convert("paper.pdf")
```

### Custom Prompts
```python
prompt = """
Analyze this data visualization. Describe:
- Type of chart/graph
- Key trends and patterns
- Notable data points
"""

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=prompt
)
```

### Available Models via OpenRouter
- `anthropic/claude-sonnet-4.5` - **Recommended for scientific vision**
- `anthropic/claude-opus-4.5` - Advanced vision model
- `openai/gpt-4o` - GPT-4 Omni (vision)
- `openai/gpt-4-vision` - GPT-4 Vision
- `google/gemini-pro-vision` - Gemini Pro Vision

See https://openrouter.ai/models for the full list.

## Azure Document Intelligence

```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
```

## Batch Processing

### Python
```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

for file in Path("input/").glob("*.pdf"):
    result = md.convert(str(file))
    output = Path("output") / f"{file.stem}.md"
    output.write_text(result.text_content)
```

### Script
```bash
# Parallel conversion
python scripts/batch_convert.py input/ output/ --workers 8

# Recursive
python scripts/batch_convert.py input/ output/ -r
```

## Error Handling

```python
try:
    result = md.convert("file.pdf")
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Error: {e}")
```

## Streaming

```python
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```

## Common Prompts

### Scientific
```
Analyze this scientific figure. Describe:
- Type of visualization
- Key data points and trends
- Axes, labels, and legends
- Scientific significance
```

### Medical
```
Describe this medical image. Include:
- Type of imaging (X-ray, MRI, CT, etc.)
- Anatomical structures visible
- Notable findings
- Clinical relevance
```

### Data Visualization
```
Analyze this data visualization:
- Chart type
- Variables and axes
- Data ranges
- Key patterns and outliers
```

## Performance Tips

1. **Reuse instance**: Create once, use many times
2. **Parallel processing**: Use ThreadPoolExecutor for multiple files (see the sketch below)
3. **Stream large files**: Use `convert_stream()` for big files
4. **Choose right format**: Install only needed dependencies

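A minimal sketch combining tips 1 and 2, assuming the shared instance is safe to call from multiple threads (if it is not, create one instance per worker as in the parallel example later in this document; directory names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()  # tip 1: create once, reuse for every file

def convert(path: Path) -> str:
    result = md.convert(str(path))
    out = Path("output") / f"{path.stem}.md"
    out.write_text(result.text_content)
    return path.name

Path("output").mkdir(exist_ok=True)
files = list(Path("input").glob("*.pdf"))

with ThreadPoolExecutor(max_workers=4) as pool:  # tip 2: parallel conversion
    for name in pool.map(convert, files):
        print(f"Converted {name}")
```
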
## Environment Variables

```bash
# OpenRouter for AI-enhanced conversions
export OPENROUTER_API_KEY="sk-or-v1-..."

# Azure Document Intelligence (optional)
export AZURE_DOCUMENT_INTELLIGENCE_KEY="key..."
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://..."
```

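A short sketch of picking up the key from Python when building the OpenRouter client (the variable name matches the export above; this raises `KeyError` if the key is unset):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1"
)
```
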
## Scripts Quick Reference

### batch_convert.py
```bash
python scripts/batch_convert.py INPUT OUTPUT [OPTIONS]

Options:
  --extensions .pdf .docx   File types to convert
  --recursive, -r           Search subdirectories
  --workers 4               Parallel workers
  --verbose, -v             Detailed output
  --plugins, -p             Enable plugins
```

### convert_with_ai.py
```bash
python scripts/convert_with_ai.py INPUT OUTPUT [OPTIONS]

Options:
  --api-key KEY          OpenRouter API key
  --model MODEL          Model name (default: anthropic/claude-sonnet-4.5)
  --prompt-type TYPE     Preset prompt (scientific, medical, etc.)
  --custom-prompt TEXT   Custom prompt
  --list-prompts         Show available prompts
```

### convert_literature.py
```bash
python scripts/convert_literature.py INPUT OUTPUT [OPTIONS]

Options:
  --organize-by-year, -y   Organize by year
  --create-index, -i       Create index file
  --recursive, -r          Search subdirectories
```

## Troubleshooting

### Missing Dependencies
```bash
pip install 'markitdown[pdf]'  # Install PDF support
```

### Binary File Error
```python
# Wrong
with open("file.pdf", "r") as f:

# Correct
with open("file.pdf", "rb") as f:  # Binary mode
```

### OCR Not Working
```bash
# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr
```

## More Information

- **Full Documentation**: See `SKILL.md`
- **API Reference**: See `references/api_reference.md`
- **Format Details**: See `references/file_formats.md`
- **Examples**: See `assets/example_usage.md`
- **GitHub**: https://github.com/microsoft/markitdown

scientific-skills/markitdown/README.md (new file, 184 lines)
@@ -0,0 +1,184 @@

# MarkItDown Skill

This skill provides comprehensive support for converting various file formats to Markdown using Microsoft's MarkItDown tool.

## Overview

MarkItDown is a Python tool that converts files and office documents to Markdown format. This skill includes:

- Complete API documentation
- Format-specific conversion guides
- Utility scripts for batch processing
- AI-enhanced conversion examples
- Integration with scientific workflows

## Contents

### Main Skill File
- **SKILL.md** - Complete guide to using MarkItDown with quick start, examples, and best practices

### References
- **api_reference.md** - Detailed API documentation, class references, and method signatures
- **file_formats.md** - Format-specific details for all supported file types

### Scripts
- **batch_convert.py** - Batch convert multiple files with parallel processing
- **convert_with_ai.py** - AI-enhanced conversion with custom prompts
- **convert_literature.py** - Scientific literature conversion with metadata extraction

### Assets
- **example_usage.md** - Practical examples for common use cases

## Installation

```bash
# Install with all features
pip install 'markitdown[all]'

# Or install specific features
pip install 'markitdown[pdf,docx,pptx,xlsx]'
```

## Quick Start

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

## Supported Formats

- **Documents**: PDF, DOCX, PPTX, XLSX, EPUB
- **Images**: JPEG, PNG, GIF, WebP (with OCR)
- **Audio**: WAV, MP3 (with transcription)
- **Web**: HTML, YouTube URLs
- **Data**: CSV, JSON, XML
- **Archives**: ZIP files

## Key Features

### 1. AI-Enhanced Conversions
Use AI models via OpenRouter to generate detailed image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

# OpenRouter provides access to 100+ AI models
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5"  # recommended for vision
)
result = md.convert("presentation.pptx")
```

### 2. Batch Processing
Convert multiple files efficiently:

```bash
python scripts/batch_convert.py papers/ output/ --extensions .pdf .docx
```

### 3. Scientific Literature
Convert and organize research papers:

```bash
python scripts/convert_literature.py papers/ output/ --organize-by-year --create-index
```

### 4. Azure Document Intelligence
Enhanced PDF conversion with Microsoft Document Intelligence:

```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_document.pdf")
```

## Use Cases

### Literature Review
Convert research papers to Markdown for easier analysis and note-taking.

### Data Extraction
Extract tables from Excel files into Markdown format.

### Presentation Processing
Convert PowerPoint slides with AI-generated descriptions.

### Document Analysis
Process documents for LLM consumption with token-efficient Markdown.

### YouTube Transcripts
Fetch and convert YouTube video transcriptions.

## Scripts Usage

### Batch Convert
```bash
# Convert all PDFs in a directory
python scripts/batch_convert.py input_dir/ output_dir/ --extensions .pdf

# Recursive with multiple formats
python scripts/batch_convert.py docs/ markdown/ --extensions .pdf .docx .pptx -r
```

### AI-Enhanced Conversion
```bash
# Convert with AI descriptions via OpenRouter
export OPENROUTER_API_KEY="sk-or-v1-..."
python scripts/convert_with_ai.py paper.pdf output.md --prompt-type scientific

# Use different models
python scripts/convert_with_ai.py image.png output.md --model anthropic/claude-sonnet-4.5

# Use custom prompt
python scripts/convert_with_ai.py image.png output.md --custom-prompt "Describe this diagram"
```

### Literature Conversion
```bash
# Convert papers with metadata extraction
python scripts/convert_literature.py papers/ markdown/ --organize-by-year --create-index
```

## Integration with Scientific Writer

This skill integrates seamlessly with the Scientific Writer CLI for:
- Converting source materials for paper writing
- Processing literature for reviews
- Extracting data from various document formats
- Preparing documents for LLM analysis

## Resources

- **MarkItDown GitHub**: https://github.com/microsoft/markitdown
- **PyPI**: https://pypi.org/project/markitdown/
- **OpenRouter**: https://openrouter.ai (AI model access)
- **OpenRouter API Keys**: https://openrouter.ai/keys
- **OpenRouter Models**: https://openrouter.ai/models
- **License**: MIT

## Requirements

- Python 3.10+
- Optional dependencies based on formats needed
- OpenRouter API key (for AI-enhanced conversions) - get one at https://openrouter.ai/keys
- Azure subscription (optional, for Document Intelligence)

## Examples

See `assets/example_usage.md` for comprehensive examples covering:
- Basic conversions
- Scientific workflows
- AI-enhanced processing
- Batch operations
- Error handling
- Integration patterns

@@ -1,241 +1,486 @@
---
name: markitdown
description: Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.
description: "Convert files and office documents to Markdown. Supports PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs and more."
allowed-tools: [Read, Write, Edit, Bash]
license: MIT
source: https://github.com/microsoft/markitdown
---

# MarkItDown
# MarkItDown - File to Markdown Conversion

## Overview

MarkItDown is a Python utility that converts various file formats into Markdown format, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.
MarkItDown is a Python tool developed by Microsoft for converting various file formats to Markdown. It's particularly useful for converting documents into LLM-friendly text format, as Markdown is token-efficient and well-understood by modern language models.

## When to Use This Skill
**Key Benefits**:
- Convert documents to clean, structured Markdown
- Token-efficient format for LLM processing
- Supports 15+ file formats
- Optional AI-enhanced image descriptions
- OCR for images and scanned documents
- Speech transcription for audio files

Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems

## Visual Enhancement with Scientific Schematics

## Core Capabilities
**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

### 1. Document Conversion
If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic

Convert Office documents and PDFs to Markdown while preserving structure.
**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

**Supported formats:**
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)

**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Document conversion workflow diagrams
- File format architecture illustrations
- OCR processing pipeline diagrams
- Integration workflow visualizations
- System architecture diagrams
- Data flow diagrams
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Supported Formats

| Format | Description | Notes |
|--------|-------------|-------|
| **PDF** | Portable Document Format | Full text extraction |
| **DOCX** | Microsoft Word | Tables, formatting preserved |
| **PPTX** | PowerPoint | Slides with notes |
| **XLSX** | Excel spreadsheets | Tables and data |
| **Images** | JPEG, PNG, GIF, WebP | EXIF metadata + OCR |
| **Audio** | WAV, MP3 | Metadata + transcription |
| **HTML** | Web pages | Clean conversion |
| **CSV** | Comma-separated values | Table format |
| **JSON** | JSON data | Structured representation |
| **XML** | XML documents | Structured format |
| **ZIP** | Archive files | Iterates contents |
| **EPUB** | E-books | Full text extraction |
| **YouTube** | Video URLs | Fetch transcriptions |

## Quick Start

### Installation

```bash
# Install with all features
pip install 'markitdown[all]'

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
```

### Command-Line Usage

```bash
# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe content
cat document.pdf | markitdown > output.md

# Enable plugins
markitdown --list-plugins  # List available plugins
markitdown --use-plugins document.pdf -o output.md
```

### Python API

**Basic usage:**
```python
from markitdown import MarkItDown

# Basic usage
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

**Command-line:**
```bash
markitdown document.pdf -o output.md
```

See `references/document_conversion.md` for detailed documentation on document-specific features.

### 2. Media Processing

Extract text from images using OCR and transcribe audio files to text.

**Supported formats:**
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)

**Image with OCR:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```

**Audio transcription:**
```python
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```

See `references/media_processing.md` for advanced media handling options.

### 3. Web Content Extraction

Convert web-based content and e-books to Markdown.

**Supported formats:**
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds

**YouTube transcript:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")

# Convert from stream
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

print(result.text_content)
```

See `references/web_content.md` for web extraction details.
## Advanced Features

### 4. Structured Data Handling
### 1. AI-Enhanced Image Descriptions

Convert structured data formats to readable Markdown tables.
Use LLMs via OpenRouter to generate detailed image descriptions (for PPTX and image files):

**Supported formats:**
- CSV files
- JSON files
- XML files

**CSV to Markdown table:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```

See `references/structured_data.md` for format-specific options.

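JSON and XML files go through the same `convert()` call; a minimal sketch with a hypothetical `data.json`:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.json")  # nested structure rendered as readable Markdown
print(result.text_content)
```
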
### 5. Advanced Integrations

Enhance conversion quality with AI-powered features.

**Azure Document Intelligence:**
For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
result = md.convert("complex.pdf")
```

**LLM-Powered Image Descriptions:**
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM

# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)

result = md.convert("presentation.pptx")
print(result.text_content)
```

See `references/advanced_integrations.md` for integration details.
### 2. Azure Document Intelligence

### 6. Batch Processing
For enhanced PDF conversion with Microsoft Document Intelligence:

Process multiple files or entire ZIP archives at once.
```bash
# Command line
markitdown document.pdf -o output.md -d -e "<document_intelligence_endpoint>"
```

**ZIP file processing:**
```python
# Python API
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("complex_document.pdf")
print(result.text_content)
```

**Batch script:**
Use the provided batch processing script for directory conversion:
### 3. Plugin System

MarkItDown supports 3rd-party plugins for extending functionality:

```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
# List installed plugins
markitdown --list-plugins

# Enable plugins
markitdown --use-plugins file.pdf -o output.md
```

See `scripts/batch_convert.py` for implementation details.
Find plugins on GitHub with hashtag: `#markitdown-plugin`

## Installation
## Optional Dependencies

Control which file formats you support:

**Full installation (all features):**
```bash
uv pip install 'markitdown[all]'
# Install specific formats
pip install 'markitdown[pdf, docx, pptx]'

# All available options:
# [all] - All optional dependencies
# [pptx] - PowerPoint files
# [docx] - Word documents
# [xlsx] - Excel spreadsheets
# [xls] - Older Excel files
# [pdf] - PDF documents
# [outlook] - Outlook messages
# [az-doc-intel] - Azure Document Intelligence
# [audio-transcription] - WAV and MP3 transcription
# [youtube-transcription] - YouTube video transcription
```

**Modular installation (specific features):**
```bash
uv pip install 'markitdown[pdf]'      # PDF support
uv pip install 'markitdown[docx]'     # Word support
uv pip install 'markitdown[pptx]'     # PowerPoint support
uv pip install 'markitdown[xlsx]'     # Excel support
uv pip install 'markitdown[audio]'    # Audio transcription
uv pip install 'markitdown[youtube]'  # YouTube transcripts
```
## Common Use Cases

**Requirements:**
- Python 3.10 or higher
### 1. Convert Scientific Papers to Markdown

## Output Format

MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)

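As used elsewhere in this document, the conversion result exposes the Markdown body via `text_content` and, where the source provides one, a `title`; a small sketch:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

print(result.title)              # document title, or None when unavailable
print(len(result.text_content))  # size of the converted Markdown
```
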
## Common Workflows

**Preparing documents for RAG:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []

for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing

# Convert PDF paper
result = md.convert("research_paper.pdf")
with open("paper.md", "w") as f:
    f.write(result.text_content)
```

**Document analysis pipeline:**
```bash
# Convert all PDFs in directory
for file in documents/*.pdf; do
    markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```

## Plugin System

MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:
### 2. Extract Data from Excel for Analysis

```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
md = MarkItDown()
result = md.convert("data.xlsx")

# Result will be in Markdown table format
print(result.text_content)
```

### 3. Process Multiple Documents

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Process all PDFs in a directory
pdf_dir = Path("papers/")
output_dir = Path("markdown_output/")
output_dir.mkdir(exist_ok=True)

for pdf_file in pdf_dir.glob("*.pdf"):
    result = md.convert(str(pdf_file))
    output_file = output_dir / f"{pdf_file.stem}.md"
    output_file.write_text(result.text_content)
    print(f"Converted: {pdf_file.name}")
```

### 4. Convert PowerPoint with AI Descriptions

```python
from markitdown import MarkItDown
from openai import OpenAI

# Use OpenRouter for access to multiple AI models
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for presentations
    llm_prompt="Describe this slide image in detail, focusing on key visual elements and data"
)

result = md.convert("presentation.pptx")
with open("presentation.md", "w") as f:
    f.write(result.text_content)
```

### 5. Batch Convert with Different Formats

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Files to convert
files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx"
]

for file in files:
    try:
        result = md.convert(file)
        output = Path(file).stem + ".md"
        with open(output, "w") as f:
            f.write(result.text_content)
        print(f"✓ Converted {file}")
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
```

### 6. Extract YouTube Video Transcription

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert YouTube video to transcript
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

## Docker Usage

```bash
# Build image
docker build -t markitdown:latest .

# Run conversion
docker run --rm -i markitdown:latest < ~/document.pdf > output.md
```

## Best Practices

### 1. Choose the Right Conversion Method

- **Simple documents**: Use basic `MarkItDown()`
- **Complex PDFs**: Use Azure Document Intelligence (see the sketch below)
- **Visual content**: Enable AI image descriptions
- **Scanned documents**: Ensure OCR dependencies are installed

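A minimal sketch of the first two rules as code, falling back to the basic converter when no Azure endpoint is configured (the environment variable name follows the Azure section of this skill):

```python
import os

from markitdown import MarkItDown

def make_converter() -> MarkItDown:
    # Prefer Azure Document Intelligence for complex layouts when configured
    endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
    if endpoint:
        return MarkItDown(docintel_endpoint=endpoint)
    # Otherwise the basic converter handles simple documents
    return MarkItDown()

md = make_converter()
result = md.convert("document.pdf")
```
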
### 2. Handle Errors Gracefully

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Conversion error: {e}")
```

### 3. Process Large Files Efficiently

```python
from markitdown import MarkItDown

md = MarkItDown()

# For large files, use streaming
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

# Process in chunks or save directly
with open("output.md", "w") as out:
    out.write(result.text_content)
```

### 4. Optimize for Token Efficiency

Markdown output is already token-efficient, but you can:
- Remove excessive whitespace
- Consolidate similar sections
- Strip metadata if not needed

```python
from markitdown import MarkItDown
import re

md = MarkItDown()
result = md.convert("document.pdf")

# Clean up extra whitespace
clean_text = re.sub(r'\n{3,}', '\n\n', result.text_content)
clean_text = clean_text.strip()

print(clean_text)
```

## Integration with Scientific Workflows

### Convert Literature for Review

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Convert all papers in literature folder
papers_dir = Path("literature/pdfs")
output_dir = Path("literature/markdown")
output_dir.mkdir(exist_ok=True)

for paper in papers_dir.glob("*.pdf"):
    result = md.convert(str(paper))

    # Save with metadata
    output_file = output_dir / f"{paper.stem}.md"
    content = f"# {paper.stem}\n\n"
    content += f"**Source**: {paper.name}\n\n"
    content += "---\n\n"
    content += result.text_content

    output_file.write_text(content)

# For AI-enhanced conversion with figures
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md_ai = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt="Describe scientific figures with technical precision"
)
```

### Extract Tables for Analysis

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data_tables.xlsx")

# Markdown tables can be parsed or used directly
print(result.text_content)
```

## Troubleshooting

### Common Issues

1. **Missing dependencies**: Install feature-specific packages
   ```bash
   pip install 'markitdown[pdf]'  # For PDF support
   ```

2. **Binary file errors**: Ensure files are opened in binary mode
   ```python
   with open("file.pdf", "rb") as f:  # Note the "rb"
       result = md.convert_stream(f, file_extension=".pdf")
   ```

3. **OCR not working**: Install tesseract
   ```bash
   # macOS
   brew install tesseract

   # Ubuntu
   sudo apt-get install tesseract-ocr
   ```

## Performance Considerations

- **PDF files**: Large PDFs may take time; consider page ranges if supported (see the timing sketch below)
- **Image OCR**: OCR processing is CPU-intensive
- **Audio transcription**: Requires additional compute resources
- **AI image descriptions**: Requires API calls (costs may apply)

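One way to act on these notes is to time each conversion and flag slow inputs; a minimal sketch (the directory and threshold are placeholders):

```python
import time
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()

for path in Path("documents").glob("*.pdf"):
    start = time.perf_counter()
    md.convert(str(path))
    elapsed = time.perf_counter() - start
    if elapsed > 30:  # arbitrary threshold; tune for your workload
        print(f"Slow conversion ({elapsed:.1f}s): {path.name}")
```
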
## Next Steps

- See `references/api_reference.md` for complete API documentation
- Check `references/file_formats.md` for format-specific details
- Review `scripts/batch_convert.py` for automation examples
- Explore `scripts/convert_with_ai.py` for AI-enhanced conversions

## Resources

This skill includes comprehensive reference documentation for each capability:
- **MarkItDown GitHub**: https://github.com/microsoft/markitdown
- **PyPI**: https://pypi.org/project/markitdown/
- **OpenRouter**: https://openrouter.ai (for AI-enhanced conversions)
- **OpenRouter API Keys**: https://openrouter.ai/keys
- **OpenRouter Models**: https://openrouter.ai/models
- **MCP Server**: markitdown-mcp (for Claude Desktop integration)
- **Plugin Development**: See `packages/markitdown-sample-plugin`

- **references/document_conversion.md** - Detailed PDF, DOCX, PPTX, XLSX conversion options
- **references/media_processing.md** - Image OCR and audio transcription details
- **references/web_content.md** - HTML, YouTube, and EPUB extraction
- **references/structured_data.md** - CSV, JSON, XML conversion formats
- **references/advanced_integrations.md** - Azure Document Intelligence and LLM integration
- **scripts/batch_convert.py** - Batch processing utility for directories

scientific-skills/markitdown/SKILL_SUMMARY.md (new file, 307 lines)
@@ -0,0 +1,307 @@

# MarkItDown Skill - Creation Summary

## Overview

A comprehensive skill for using Microsoft's MarkItDown tool has been created for the Claude Scientific Writer. This skill enables conversion of 15+ file formats to Markdown, optimized for LLM processing and scientific workflows.

## What Was Created

### Core Documentation

1. **SKILL.md** (Main skill file)
   - Complete guide to MarkItDown
   - Quick start examples
   - All supported formats
   - Advanced features (AI, Azure DI)
   - Best practices
   - Use cases and examples

2. **README.md**
   - Skill overview
   - Key features
   - Quick reference
   - Integration guide

3. **QUICK_REFERENCE.md**
   - Cheat sheet for common tasks
   - Quick syntax reference
   - Common commands
   - Troubleshooting tips

4. **INSTALLATION_GUIDE.md**
   - Step-by-step installation
   - System dependencies
   - Virtual environment setup
   - Optional features
   - Troubleshooting

### Reference Documentation

Located in `references/`:

1. **api_reference.md**
   - Complete API documentation
   - Class and method references
   - Custom converter development
   - Plugin system
   - Error handling
   - Breaking changes guide

2. **file_formats.md**
   - Detailed format-specific guides
   - 15+ supported formats
   - Format capabilities and limitations
   - Best practices per format
   - Example outputs

### Utility Scripts

Located in `scripts/`:

1. **batch_convert.py**
   - Parallel batch conversion
   - Multi-format support
   - Recursive directory search
   - Progress tracking
   - Error reporting
   - Command-line interface

2. **convert_with_ai.py**
   - AI-enhanced conversions
   - Predefined prompt types (scientific, medical, data viz, etc.)
   - Custom prompt support
   - Multiple model support
   - OpenRouter integration (advanced vision models)

3. **convert_literature.py**
   - Scientific literature conversion
   - Metadata extraction from filenames
   - Year-based organization
   - Automatic index generation
   - JSON catalog creation
   - Front matter support

### Assets

Located in `assets/`:

1. **example_usage.md**
   - 20+ practical examples
   - Basic conversions
   - Scientific workflows
   - AI-enhanced processing
   - Batch operations
   - Error handling patterns
   - Integration examples

### License

- **LICENSE.txt** - MIT License from Microsoft

## Skill Structure

```
.claude/skills/markitdown/
├── SKILL.md               # Main skill documentation
├── README.md              # Skill overview
├── QUICK_REFERENCE.md     # Quick reference guide
├── INSTALLATION_GUIDE.md  # Installation instructions
├── SKILL_SUMMARY.md       # This file
├── LICENSE.txt            # MIT License
├── references/
│   ├── api_reference.md   # Complete API docs
│   └── file_formats.md    # Format-specific guides
├── scripts/
│   ├── batch_convert.py       # Batch conversion utility
│   ├── convert_with_ai.py     # AI-enhanced conversion
│   └── convert_literature.py  # Literature conversion
└── assets/
    └── example_usage.md   # Practical examples
```

## Capabilities

### File Format Support

- **Documents**: PDF, DOCX, PPTX, XLSX, XLS, EPUB
- **Images**: JPEG, PNG, GIF, WebP (with OCR)
- **Audio**: WAV, MP3 (with transcription)
- **Web**: HTML, YouTube URLs
- **Data**: CSV, JSON, XML
- **Archives**: ZIP files
- **Email**: Outlook MSG files

### Advanced Features

1. **AI Enhancement via OpenRouter**
   - Access to 100+ AI models through OpenRouter
   - Multiple preset prompts (scientific, medical, data viz)
   - Custom prompt support
   - Default model: anthropic/claude-sonnet-4.5 (recommended for scientific vision)
   - Choose the best model for each task

2. **Azure Integration**
   - Azure Document Intelligence for complex PDFs
   - Enhanced layout understanding
   - Better table extraction

3. **Batch Processing**
   - Parallel conversion with configurable workers
   - Recursive directory processing
   - Progress tracking and error reporting
   - Format-specific organization

4. **Scientific Workflows**
   - Literature conversion with metadata
   - Automatic index generation
   - Year-based organization
   - Citation-friendly output

## Integration with Scientific Writer

The skill has been added to the Scientific Writer's skill catalog:

- **Location**: `.claude/skills/markitdown/`
- **Skill Number**: #5 in Document Manipulation Skills
- **SKILLS.md**: Updated with complete skill description

### Usage Examples

```
> Convert all PDFs in the literature folder to Markdown
> Convert this PowerPoint presentation to Markdown with AI-generated descriptions
> Extract tables from this Excel file
> Transcribe this lecture recording
```

## Scripts Usage

### Batch Convert
```bash
python scripts/batch_convert.py input_dir/ output_dir/ --extensions .pdf .docx --workers 4
```

### AI-Enhanced Convert
```bash
export OPENROUTER_API_KEY="sk-or-v1-..."
python scripts/convert_with_ai.py paper.pdf output.md \
    --model anthropic/claude-sonnet-4.5 \
    --prompt-type scientific
```

### Literature Convert
```bash
python scripts/convert_literature.py papers/ markdown/ --organize-by-year --create-index
```

## Key Features

1. **Token-Efficient Output**: Markdown optimized for LLM processing
2. **Comprehensive Format Support**: 15+ file types
3. **AI Enhancement**: Detailed image descriptions via OpenRouter
4. **OCR Support**: Extract text from scanned documents
5. **Audio Transcription**: Speech-to-text for audio files
6. **YouTube Support**: Video transcript extraction
7. **Plugin System**: Extensible architecture
8. **Batch Processing**: Efficient parallel conversion
9. **Error Handling**: Robust error management
10. **Scientific Focus**: Optimized for research workflows

## Installation

```bash
# Full installation
pip install 'markitdown[all]'

# Selective installation
pip install 'markitdown[pdf,docx,pptx,xlsx]'
```

## Quick Start

```python
from markitdown import MarkItDown

# Basic usage
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

# With AI via OpenRouter
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5"  # or openai/gpt-4o
)
result = md.convert("presentation.pptx")
```

## Documentation Files

| File | Purpose | Lines |
|------|---------|-------|
| SKILL.md | Main documentation | 400+ |
| api_reference.md | API documentation | 500+ |
| file_formats.md | Format guides | 600+ |
| example_usage.md | Practical examples | 500+ |
| batch_convert.py | Batch conversion | 200+ |
| convert_with_ai.py | AI conversion | 200+ |
| convert_literature.py | Literature conversion | 250+ |
| QUICK_REFERENCE.md | Quick reference | 300+ |
| INSTALLATION_GUIDE.md | Installation guide | 300+ |

**Total**: ~3,000+ lines of documentation and code

## Use Cases

1. **Literature Review**: Convert research papers to Markdown for analysis
2. **Data Extraction**: Extract tables from Excel/PDF for processing
3. **Presentation Processing**: Convert slides with AI descriptions
4. **Document Analysis**: Prepare documents for LLM consumption
5. **Lecture Transcription**: Convert audio recordings to text
6. **YouTube Analysis**: Extract video transcripts
7. **Archive Processing**: Batch convert document collections

## Next Steps

1. Install MarkItDown: `pip install 'markitdown[all]'`
2. Read `QUICK_REFERENCE.md` for common tasks
3. Try the example scripts in the `scripts/` directory
4. Explore `SKILL.md` for the comprehensive guide
5. Check `example_usage.md` for practical examples

## Resources

- **MarkItDown GitHub**: https://github.com/microsoft/markitdown
- **PyPI**: https://pypi.org/project/markitdown/
- **OpenRouter**: https://openrouter.ai (AI model access)
- **OpenRouter API Keys**: https://openrouter.ai/keys
- **OpenRouter Models**: https://openrouter.ai/models
- **License**: MIT (Microsoft Corporation)
- **Python**: 3.10+ required
- **Skill Location**: `.claude/skills/markitdown/`

## Success Criteria

✅ Comprehensive skill documentation created
✅ Complete API reference provided
✅ Format-specific guides included
✅ Utility scripts implemented
✅ Practical examples documented
✅ Installation guide created
✅ Quick reference guide added
✅ Integration with Scientific Writer complete
✅ SKILLS.md updated
✅ Scripts made executable
✅ MIT License included

## Skill Status

**Status**: ✅ Complete and Ready to Use

The MarkItDown skill is fully integrated into the Claude Scientific Writer and ready for use. All documentation, scripts, and examples are in place.

scientific-skills/markitdown/assets/example_usage.md (new file, 463 lines)
@@ -0,0 +1,463 @@

# MarkItDown Example Usage

This document provides practical examples of using MarkItDown in various scenarios.

## Basic Examples

### 1. Simple File Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert a PDF
result = md.convert("research_paper.pdf")
print(result.text_content)

# Convert a Word document
result = md.convert("manuscript.docx")
print(result.text_content)

# Convert a PowerPoint
result = md.convert("presentation.pptx")
print(result.text_content)
```

### 2. Save to File

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)
```

### 3. Convert from Stream

```python
from markitdown import MarkItDown

md = MarkItDown()

with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
    print(result.text_content)
```

## Scientific Workflows

### Convert Research Papers

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Convert all papers in a directory
papers_dir = Path("research_papers/")
output_dir = Path("markdown_papers/")
output_dir.mkdir(exist_ok=True)

for paper in papers_dir.glob("*.pdf"):
    result = md.convert(str(paper))

    # Save with original filename
    output_file = output_dir / f"{paper.stem}.md"
    output_file.write_text(result.text_content)

    print(f"Converted: {paper.name}")
```

### Extract Tables from Excel

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Excel to Markdown tables
result = md.convert("experimental_data.xlsx")

# The result contains Markdown-formatted tables
print(result.text_content)

# Save for further processing
with open("data_tables.md", "w") as f:
    f.write(result.text_content)
```

### Process Presentation Slides

```python
from markitdown import MarkItDown
from openai import OpenAI

# With AI descriptions for images (OpenRouter client, matching the model below)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt="Describe this scientific slide, focusing on data and key findings"
)

result = md.convert("conference_talk.pptx")

# Save with metadata
output = f"""# Conference Talk

{result.text_content}
"""

with open("talk_notes.md", "w") as f:
    f.write(output)
```

## AI-Enhanced Conversions

### Detailed Image Descriptions

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Scientific diagram analysis
scientific_prompt = """
Analyze this scientific figure. Describe:
- Type of visualization (graph, microscopy, diagram, etc.)
- Key data points and trends
- Axes, labels, and legends
- Scientific significance
Be technical and precise.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt=scientific_prompt
)

# Convert paper with figures
result = md.convert("paper_with_figures.pdf")
print(result.text_content)
```

### Different Prompts for Different Files

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Scientific papers - technical analysis prompt
scientific_md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt="Describe scientific figures with technical precision"
)

# Presentations - visual summary prompt
presentation_md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt="Summarize slide content and key visual elements"
)

# Use the appropriate instance for each file
paper_result = scientific_md.convert("research.pdf")
slides_result = presentation_md.convert("talk.pptx")
```

## Batch Processing

### Process Multiple Files

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

files_to_convert = [
    "paper1.pdf",
    "data.xlsx",
    "presentation.pptx",
    "notes.docx"
]

for file in files_to_convert:
    try:
        result = md.convert(file)
        output = Path(file).stem + ".md"

        with open(output, "w") as f:
            f.write(result.text_content)

        print(f"✓ {file} -> {output}")
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
```

### Parallel Processing

```python
from markitdown import MarkItDown
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def convert_file(filepath):
    md = MarkItDown()
    result = md.convert(filepath)

    output = Path(filepath).stem + ".md"
    with open(output, "w") as f:
        f.write(result.text_content)

    return filepath, output

files = list(Path("documents/").glob("*.pdf"))

with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(convert_file, [str(f) for f in files])

for input_file, output_file in results:
    print(f"Converted: {input_file} -> {output_file}")
```

## Integration Examples

### Literature Review Pipeline

```python
from markitdown import MarkItDown
from pathlib import Path
import json

md = MarkItDown()

# Convert papers and create metadata
papers_dir = Path("literature/")
output_dir = Path("literature_markdown/")
output_dir.mkdir(exist_ok=True)

catalog = []

for paper in papers_dir.glob("*.pdf"):
    result = md.convert(str(paper))

    # Save Markdown
    md_file = output_dir / f"{paper.stem}.md"
    md_file.write_text(result.text_content)

    # Store metadata
    catalog.append({
        "title": result.title or paper.stem,
        "source": paper.name,
        "markdown": str(md_file),
        "word_count": len(result.text_content.split())
    })

# Save catalog
with open(output_dir / "catalog.json", "w") as f:
    json.dump(catalog, f, indent=2)
```

### Data Extraction Pipeline

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Excel data to Markdown
result = md.convert("experimental_results.xlsx")

# Extract tables (Markdown table rows start with |)
tables = []
current_table = []
in_table = False

for line in result.text_content.split('\n'):
    if line.strip().startswith('|'):
        in_table = True
        current_table.append(line)
    elif in_table:
        if current_table:
            tables.append('\n'.join(current_table))
            current_table = []
        in_table = False

# Flush a table that runs to the last line of the content
if current_table:
    tables.append('\n'.join(current_table))

# Process each table
for i, table in enumerate(tables):
    print(f"Table {i+1}:")
    print(table)
    print("\n" + "="*50 + "\n")
```

### YouTube Transcript Analysis

```python
from markitdown import MarkItDown

md = MarkItDown()

# Get transcript
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"
result = md.convert(video_url)

# Save transcript
with open("lecture_transcript.md", "w") as f:
    f.write("# Lecture Transcript\n\n")
    f.write(f"**Source**: {video_url}\n\n")
    f.write(result.text_content)
```

## Error Handling

### Robust Conversion

```python
from markitdown import MarkItDown
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

md = MarkItDown()

def safe_convert(filepath):
    """Convert file with error handling."""
    try:
        result = md.convert(filepath)
        output = Path(filepath).stem + ".md"

        with open(output, "w") as f:
            f.write(result.text_content)

        logger.info(f"Successfully converted {filepath}")
        return True

    except FileNotFoundError:
        logger.error(f"File not found: {filepath}")
        return False

    except ValueError as e:
        logger.error(f"Invalid file format for {filepath}: {e}")
        return False

    except Exception as e:
        logger.error(f"Unexpected error converting {filepath}: {e}")
        return False

# Use it
files = ["paper.pdf", "data.xlsx", "slides.pptx"]
results = [safe_convert(f) for f in files]

print(f"Successfully converted {sum(results)}/{len(files)} files")
```

## Advanced Use Cases

### Custom Metadata Extraction

```python
from markitdown import MarkItDown
import re
from datetime import datetime

md = MarkItDown()

def convert_with_metadata(filepath):
    result = md.convert(filepath)

    # Extract metadata from content
    metadata = {
        "file": filepath,
        "title": result.title,
        "converted_at": datetime.now().isoformat(),
        "word_count": len(result.text_content.split()),
        "char_count": len(result.text_content)
    }

    # Try to find author
    author_match = re.search(r'(?:Author|By):\s*(.+?)(?:\n|$)', result.text_content)
    if author_match:
        metadata["author"] = author_match.group(1).strip()

    # Create formatted output
    output = f"""---
title: {metadata['title']}
author: {metadata.get('author', 'Unknown')}
source: {metadata['file']}
converted: {metadata['converted_at']}
words: {metadata['word_count']}
---

{result.text_content}
"""

    return output, metadata

# Use it
content, meta = convert_with_metadata("paper.pdf")
print(meta)
```

### Format-Specific Processing

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

def process_by_format(filepath):
    path = Path(filepath)
    result = md.convert(filepath)

    if path.suffix == '.pdf':
        # Add PDF-specific metadata
        output = f"# PDF Document: {path.stem}\n\n"
        output += result.text_content

    elif path.suffix == '.xlsx':
        # Add table count
        table_count = result.text_content.count('|---')
        output = f"# Excel Data: {path.stem}\n\n"
        output += f"**Tables**: {table_count}\n\n"
        output += result.text_content

    elif path.suffix == '.pptx':
        # Add slide count
        slide_count = result.text_content.count('## Slide')
        output = f"# Presentation: {path.stem}\n\n"
        output += f"**Slides**: {slide_count}\n\n"
        output += result.text_content

    else:
        output = result.text_content

    return output

# Use it
content = process_by_format("presentation.pptx")
print(content)
```

@@ -1,538 +0,0 @@
# Advanced Integrations Reference

This document provides detailed information about advanced MarkItDown features including Azure Document Intelligence integration, LLM-powered descriptions, and the plugin system.

## Azure Document Intelligence Integration

Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.

### Setup

**Prerequisites:**
1. Azure subscription
2. Document Intelligence resource created in Azure
3. Endpoint URL and API key

**Create Azure Resource:**
```bash
# Using Azure CLI
az cognitiveservices account create \
    --name my-doc-intelligence \
    --resource-group my-resource-group \
    --kind FormRecognizer \
    --sku F0 \
    --location eastus
```

### Basic Usage

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_document.pdf")
print(result.text_content)
```

### Configuration from Environment Variables

```python
import os
from markitdown import MarkItDown

# Set environment variables (normally done in your shell, shown here for completeness)
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'

# Read the credentials from the environment instead of hard-coding them
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)

result = md.convert("document.pdf")
```

### When to Use Azure Document Intelligence
|
||||
|
||||
**Use for:**
|
||||
- Complex PDFs with sophisticated tables
|
||||
- Multi-column layouts
|
||||
- Forms and structured documents
|
||||
- Scanned documents requiring OCR
|
||||
- PDFs with mixed content types
|
||||
- Documents with intricate formatting
|
||||
|
||||
**Benefits over standard extraction:**
|
||||
- **Superior table extraction** - Better handling of merged cells, complex layouts
|
||||
- **Layout analysis** - Understands document structure (headers, footers, columns)
|
||||
- **Form fields** - Extracts key-value pairs from forms
|
||||
- **Reading order** - Maintains correct text flow in complex layouts
|
||||
- **OCR quality** - High-quality text extraction from scanned documents
|
||||
|
||||
### Comparison Example
|
||||
|
||||
**Standard extraction:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("complex_table.pdf")
|
||||
# May struggle with complex tables
|
||||
```
|
||||
|
||||
**Azure Document Intelligence:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(
|
||||
docintel_endpoint="YOUR-ENDPOINT",
|
||||
docintel_key="YOUR-KEY"
|
||||
)
|
||||
result = md.convert("complex_table.pdf")
|
||||
# Better table reconstruction and layout understanding
|
||||
```
|
||||
|
||||
### Cost Considerations
|
||||
|
||||
Azure Document Intelligence is a paid service:
|
||||
- **Free tier**: 500 pages per month
|
||||
- **Paid tiers**: Pay per page processed
|
||||
- Monitor usage to control costs
|
||||
- Use standard extraction for simple documents
|
||||
|
||||
### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
```

## LLM-Powered Image Descriptions

Generate detailed, contextual descriptions for images using large language models.

### Setup with OpenAI

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("image.jpg")
print(result.text_content)
```

### Supported Use Cases

**Images in documents:**
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# PowerPoint with images
result = md.convert("presentation.pptx")

# Word documents with images
result = md.convert("report.docx")

# Standalone images
result = md.convert("diagram.png")
```

### Custom Prompts

Customize the LLM prompt for specific needs:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)

# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
```

### Model Selection

**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image

**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
```

### Configuration from Environment

```python
import os
from markitdown import MarkItDown
from openai import OpenAI

# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'

client = OpenAI()  # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Alternative LLM Providers

**Anthropic Claude:**
```python
from markitdown import MarkItDown
from anthropic import Anthropic

# Note: Check current compatibility with MarkItDown
client = Anthropic(api_key="YOUR-API-KEY")
# May require an adapter for MarkItDown compatibility
```

**Azure OpenAI:**
```python
from markitdown import MarkItDown
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)

md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Cost Management

**Strategies to reduce LLM costs:**

1. **Selective processing:**
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# Only use the LLM for important documents
# (is_important_document() and file are placeholders for your own selection logic)
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing

result = md.convert(file)
```

2. **Image filtering:** pre-process to identify images that need descriptions, and only send complex or important images to the LLM.

3. **Batch processing:** process images in batches; monitor costs and set spending limits.

4. **Model selection:** use gpt-4o-mini for simple images and reserve gpt-4o for complex visualizations. A combined sketch of strategies 3 and 4 follows below.

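The following is a minimal sketch of strategies 3 and 4 together. The size-based complexity heuristic and the per-image cost estimates are illustrative placeholders, not MarkItDown APIs or real pricing:

```python
import os
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# Illustrative per-image cost estimates (assumed values, not real pricing)
EST_COST = {"gpt-4o": 0.01, "gpt-4o-mini": 0.002}
BUDGET = 1.00  # stop once the estimated spend reaches this limit

def pick_model(image_path):
    # Placeholder heuristic: treat large files as "complex" images
    return "gpt-4o" if os.path.getsize(image_path) > 500_000 else "gpt-4o-mini"

spent = 0.0
for image_path in ["img1.jpg", "img2.png", "img3.png"]:
    model = pick_model(image_path)
    if spent + EST_COST[model] > BUDGET:
        print(f"Budget reached, skipping {image_path}")
        continue
    md = MarkItDown(llm_client=client, llm_model=model)
    result = md.convert(image_path)
    spent += EST_COST[model]
    print(f"{image_path} ({model}): {len(result.text_content)} chars")
```
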
### Performance Considerations

**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images

**Batch optimization:**
```python
from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def process_image(image_path):
    return md.convert(image_path)

# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
```

## Combined Advanced Features

### Azure Document Intelligence + LLM Descriptions

Combine both for maximum quality:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)

# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")
```

**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data

### Smart Document Processing Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

def smart_convert(file_path):
    """Intelligently choose processing method based on file type."""
    ext = os.path.splitext(file_path)[1].lower()

    # PDFs with complex tables: Use Azure
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )

    # Documents/presentations with images: Use LLM
    elif ext in ['.pptx', '.docx']:
        md = MarkItDown(
            llm_client=OpenAI(),  # only create the client when it is needed
            llm_model="gpt-4o"
        )

    # Simple formats: Standard processing
    else:
        md = MarkItDown()

    return md.convert(file_path)

# Use it
result = smart_convert("document.pdf")
```

## Plugin System

MarkItDown supports custom plugins for extending functionality.

### Plugin Architecture

Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)
```

### Creating Custom Plugins

**Plugin structure:**
```python
class CustomConverter:
    """Custom converter plugin for MarkItDown."""

    def can_convert(self, file_path):
        """Check if this plugin can handle the file."""
        return file_path.endswith('.custom')

    def convert(self, file_path):
        """Convert the file to Markdown."""
        # Your conversion logic here
        return {
            'text_content': '# Converted Content\n\n...'
        }
```

### Plugin Registration

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

# Register custom plugin
md.register_plugin(CustomConverter())

# Use normally
result = md.convert("file.custom")
```

### Plugin Use Cases

**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats

**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing

**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs

### Plugin Security

**Important security considerations:**
- Plugins run with full system access
- Only enable trusted plugins
- Validate plugin code before use
- Disable plugins in production unless required

## Error Handling for Advanced Features

```python
import os

from markitdown import MarkItDown
from openai import OpenAI

def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
        return md.convert(file_path)

    except Exception as azure_error:
        print(f"Azure failed: {azure_error}")

        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)

        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")

            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)

# Use it
result = robust_convert("document.pdf")
```

## Best Practices

### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed

### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible (see the sketch below)
- Handle API errors gracefully

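A minimal caching sketch: results are keyed by a hash of the file's bytes, so repeated conversions of unchanged files skip the LLM call entirely. The cache directory and hashing scheme here are illustrative choices, not part of MarkItDown:

```python
import hashlib
from pathlib import Path

from markitdown import MarkItDown
from openai import OpenAI

CACHE_DIR = Path(".markitdown_cache")  # illustrative location
CACHE_DIR.mkdir(exist_ok=True)

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def convert_cached(file_path):
    # Key the cache on the file contents, not the name
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.md"

    if cache_file.exists():
        return cache_file.read_text()

    result = md.convert(file_path)
    cache_file.write_text(result.text_content)
    return result.text_content

print(convert_cached("report.docx"))
```
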
### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies

### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access

# MarkItDown API Reference

## Core Classes

### MarkItDown

The main class for converting files to Markdown.

```python
from markitdown import MarkItDown

md = MarkItDown(
    llm_client=None,
    llm_model=None,
    llm_prompt=None,
    docintel_endpoint=None,
    enable_plugins=False
)
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `llm_client` | OpenAI client | `None` | OpenAI-compatible client for AI image descriptions |
| `llm_model` | str | `None` | Model name (e.g., "anthropic/claude-sonnet-4.5") for image descriptions |
| `llm_prompt` | str | `None` | Custom prompt for image description |
| `docintel_endpoint` | str | `None` | Azure Document Intelligence endpoint |
| `enable_plugins` | bool | `False` | Enable 3rd-party plugins |

#### Methods

##### convert()

Convert a file to Markdown.

```python
result = md.convert(
    source,
    file_extension=None
)
```

**Parameters**:
- `source` (str): Path to the file to convert
- `file_extension` (str, optional): Override file extension detection

**Returns**: `DocumentConverterResult` object

**Example**:
```python
result = md.convert("document.pdf")
print(result.text_content)
```

##### convert_stream()

Convert from a file-like binary stream.

```python
result = md.convert_stream(
    stream,
    file_extension
)
```

**Parameters**:
- `stream` (BinaryIO): Binary file-like object (e.g., file opened in `"rb"` mode)
- `file_extension` (str): File extension to determine conversion method (e.g., ".pdf")

**Returns**: `DocumentConverterResult` object

**Example**:
```python
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
print(result.text_content)
```

**Important**: The stream must be opened in binary mode (`"rb"`), not text mode.

## Result Object

### DocumentConverterResult

The result of a conversion operation.

#### Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `text_content` | str | The converted Markdown text |
| `title` | str | Document title (if available) |

#### Example

```python
result = md.convert("paper.pdf")

# Access content
content = result.text_content

# Access title (if available)
title = result.title
```

## Custom Converters

You can create custom document converters by implementing the `DocumentConverter` interface.

### DocumentConverter Interface

```python
from markitdown import DocumentConverter

class CustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        """
        Convert a document from a binary stream.

        Parameters:
            stream (BinaryIO): Binary file-like object
            file_extension (str): File extension (e.g., ".custom")

        Returns:
            DocumentConverterResult: Conversion result
        """
        # Your conversion logic here
        pass
```

### Registering Custom Converters

```python
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult

class MyCustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        content = stream.read().decode('utf-8')
        markdown_text = f"# Custom Format\n\n{content}"
        return DocumentConverterResult(
            text_content=markdown_text,
            title="Custom Document"
        )

# Create MarkItDown instance
md = MarkItDown()

# Register custom converter for .custom files
md.register_converter(".custom", MyCustomConverter())

# Use it
result = md.convert("myfile.custom")
```

## Plugin System

### Finding Plugins

Search GitHub for the `#markitdown-plugin` tag.

### Using Plugins

```python
from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```

### Creating Plugins

Plugins are Python packages that register converters with MarkItDown.

**Plugin Structure**:
```
my-markitdown-plugin/
├── setup.py
├── my_plugin/
│   ├── __init__.py
│   └── converter.py
└── README.md
```

**setup.py**:
```python
from setuptools import setup

setup(
    name="markitdown-my-plugin",
    version="0.1.0",
    packages=["my_plugin"],
    entry_points={
        "markitdown.plugins": [
            "my_plugin = my_plugin.converter:MyConverter",
        ],
    },
)
```

**converter.py**:
```python
from markitdown import DocumentConverter, DocumentConverterResult

class MyConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        # Your conversion logic
        content = stream.read()
        markdown = self.process(content)
        return DocumentConverterResult(
            text_content=markdown,
            title="My Document"
        )

    def process(self, content):
        # Process content
        return "# Converted Content\n\n..."
```

## AI-Enhanced Conversions

### Using OpenRouter for Image Descriptions

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Create MarkItDown with AI support
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)

# Convert files with images
result = md.convert("presentation.pptx")
```

### Available Models via OpenRouter

Popular models with vision support:
- `anthropic/claude-sonnet-4.5` - **Recommended for scientific vision**
- `anthropic/claude-opus-4.5` - Advanced vision model
- `openai/gpt-4o` - GPT-4 Omni
- `openai/gpt-4-vision` - GPT-4 Vision
- `google/gemini-pro-vision` - Gemini Pro Vision

See https://openrouter.ai/models for the complete list.

### Custom Prompts

```python
# For scientific diagrams
scientific_prompt = """
Analyze this scientific diagram or chart. Describe:
1. The type of visualization (graph, chart, diagram, etc.)
2. Key data points or trends
3. Labels and axes
4. Scientific significance
Be precise and technical.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=scientific_prompt
)
```

## Azure Document Intelligence

### Setup

1. Create an Azure Document Intelligence resource
2. Get the endpoint URL
3. Set authentication

### Usage

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
)

result = md.convert("complex_document.pdf")
```

### Authentication

Set environment variables:
```bash
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
```

Or pass credentials programmatically.

## Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid file format: {e}")
except Exception as e:
    print(f"Conversion error: {e}")
```

## Performance Tips

### 1. Reuse MarkItDown Instance

```python
# Good: Create once, use many times
md = MarkItDown()

for file in files:
    result = md.convert(file)
    process(result)
```

### 2. Use Streaming for Large Files

```python
# For large files
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```

### 3. Batch Processing

```python
from concurrent.futures import ThreadPoolExecutor

md = MarkItDown()

def convert_file(filepath):
    return md.convert(filepath)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(convert_file, file_list)
```

## Breaking Changes (v0.0.1 to v0.1.0)

1. **Dependencies**: Now organized into optional feature groups
```bash
# Old
pip install markitdown

# New
pip install 'markitdown[all]'
```

2. **convert_stream()**: Now requires a binary file-like object
```python
# Old (also accepted text)
with open("file.pdf", "r") as f:  # text mode
    result = md.convert_stream(f)

# New (binary only)
with open("file.pdf", "rb") as f:  # binary mode
    result = md.convert_stream(f, file_extension=".pdf")
```

3. **DocumentConverter Interface**: Changed to read from streams instead of file paths
   - No temporary files created
   - More memory efficient
   - Plugins need updating

## Version Compatibility

- **Python**: 3.10 or higher required
- **Dependencies**: Check `setup.py` for version constraints
- **OpenAI**: Compatible with OpenAI Python SDK v1.0+

## Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `OPENROUTER_API_KEY` | OpenRouter API key for image descriptions | `sk-or-v1-...` |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure DI authentication | `key123...` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure DI endpoint | `https://...` |

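As a sketch of how these variables are typically consumed (the client construction mirrors the OpenRouter example above; the model choice is illustrative):

```python
import os
from openai import OpenAI
from markitdown import MarkItDown

# Build the OpenRouter client from the environment instead of hard-coding keys
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(llm_client=client, llm_model="openai/gpt-4o")
```
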
# Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

## PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

### Basic PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```

**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents

### PDF Handling Notes

- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported; decrypt them first (see the sketch below)
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible

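Since password-protected PDFs are not supported, one option is to decrypt a copy first. A minimal sketch using the third-party pikepdf library (an assumption here, not a MarkItDown dependency):

```python
import pikepdf  # pip install pikepdf
from markitdown import MarkItDown

# Decrypt a protected PDF to a plain copy, then convert the copy
with pikepdf.open("protected.pdf", password="YOUR-PASSWORD") as pdf:
    pdf.save("decrypted.pdf")

md = MarkItDown()
result = md.convert("decrypted.pdf")
print(result.text_content)
```
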
## Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

### Basic DOCX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```

### DOCX Structure Preservation

MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)

### Command-Line Usage

```bash
# Basic conversion
markitdown report.docx -o report.md

# With output directory
markitdown report.docx -o output/report.md
```

### DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```

## PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

### Basic PPTX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```

### PPTX Structure

MarkItDown processes presentations as follows:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present

### PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```

**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"

## Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

### Basic XLSX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```

### Multi-Sheet Workbooks

For workbooks with multiple sheets (example output below):
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)

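For instance, a workbook with two populated sheets might convert to something like this (sheet names and values are illustrative):

```markdown
# Sheet: Samples

| ID | Concentration |
|----|---------------|
| 1  | 0.52          |
| 2  | 0.61          |

# Sheet: Summary

| Metric | Value |
|--------|-------|
| Mean   | 0.565 |
```
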
### XLSX Conversion Details

**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)

**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes

### Large Spreadsheets

For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible (a sketch follows below)

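A minimal preprocessing sketch, assuming pandas is available (it is not a MarkItDown dependency) and that the column names exist in your sheet: trim the workbook to the columns and rows you need, write a smaller file, then convert that.

```python
import pandas as pd  # pip install pandas openpyxl
from markitdown import MarkItDown

# Keep only the columns of interest and the first 1,000 rows
df = pd.read_excel("large_dataset.xlsx", usecols=["Sample", "Control", "Treatment"])
df.head(1000).to_excel("trimmed.xlsx", index=False)

md = MarkItDown()
result = md.convert("trimmed.xlsx")
print(result.text_content)
```
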
### XLS (Legacy Excel) Files

Legacy `.xls` files are supported but require additional dependencies:

```bash
pip install 'markitdown[xls]'
```

Then use normally:
```python
md = MarkItDown()
result = md.convert("legacy_data.xls")
```

## Common Document Conversion Patterns

### Batch Document Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

# Process all documents in a directory
for filename in os.listdir("documents"):
    if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(f"documents/{filename}")

        # Save to the output directory (assumes markdown/ already exists)
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(f"markdown/{output_name}", "w") as f:
            f.write(result.text_content)
```

### Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

```python
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

result = md.convert("complex_report.pdf")
```

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```

## Output Quality Tips

**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality

## Performance Considerations

**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources

**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents

# File Format Support

This document provides detailed information about each file format supported by MarkItDown.

## Document Formats

### PDF (.pdf)

**Capabilities**:
- Text extraction
- Table detection
- Metadata extraction
- OCR for scanned documents (with dependencies)

**Dependencies**:
```bash
pip install 'markitdown[pdf]'
```

**Best For**:
- Scientific papers
- Reports
- Books
- Forms

**Limitations**:
- Complex layouts may not preserve perfect formatting
- Scanned PDFs require OCR setup
- Some PDF features (annotations, forms) may not convert

**Example**:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
```

**Enhanced with Azure Document Intelligence**:
```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
```

---

### Microsoft Word (.docx)

**Capabilities**:
- Text extraction
- Table conversion
- Heading hierarchy
- List formatting
- Basic text formatting (bold, italic)

**Dependencies**:
```bash
pip install 'markitdown[docx]'
```

**Best For**:
- Research papers
- Reports
- Documentation
- Manuscripts

**Preserved Elements**:
- Headings (converted to Markdown headers)
- Tables (converted to Markdown tables)
- Lists (bulleted and numbered)
- Basic formatting (bold, italic)
- Paragraphs

**Example**:
```python
result = md.convert("manuscript.docx")
```

---

### PowerPoint (.pptx)

**Capabilities**:
- Slide content extraction
- Speaker notes
- Table extraction
- Image descriptions (with AI)

**Dependencies**:
```bash
pip install 'markitdown[pptx]'
```

**Best For**:
- Presentations
- Lecture slides
- Conference talks

**Output Format**:
```markdown
# Slide 1: Title

Content from slide 1...

**Notes**: Speaker notes appear here

---

# Slide 2: Next Topic

...
```

**With AI Image Descriptions**:
```python
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
```

---

### Excel (.xlsx, .xls)

**Capabilities**:
- Sheet extraction
- Table formatting
- Data preservation
- Formula values (calculated)

**Dependencies**:
```bash
pip install 'markitdown[xlsx]'  # Modern Excel
pip install 'markitdown[xls]'   # Legacy Excel
```

**Best For**:
- Data tables
- Research data
- Statistical results
- Experimental data

**Output Format**:
```markdown
# Sheet: Results

| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1      | 10.2    | 12.5      | 0.023   |
| 2      | 9.8     | 11.9      | 0.031   |
```

**Example**:
```python
result = md.convert("experimental_data.xlsx")
```

---

## Image Formats

### Images (.jpg, .jpeg, .png, .gif, .webp)

**Capabilities**:
- EXIF metadata extraction
- OCR text extraction
- AI-powered image descriptions

**Dependencies**:
```bash
pip install 'markitdown[all]'  # Includes image support
```

**Best For**:
- Scanned documents
- Charts and graphs
- Scientific diagrams
- Photographs with text

**Output Without AI**:
```markdown


**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
```

**Output With AI**:
```python
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
```

**OCR for Text Extraction**:
Requires Tesseract OCR:
```bash
# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr
```

---

## Audio Formats

### Audio (.wav, .mp3)

**Capabilities**:
- Metadata extraction
- Speech-to-text transcription
- Duration and technical info

**Dependencies**:
```bash
pip install 'markitdown[audio-transcription]'
```

**Best For**:
- Lecture recordings
- Interviews
- Podcasts
- Meeting recordings

**Output Format**:
```markdown
# Audio: interview.mp3

**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz

**Transcription**:
[Transcribed text appears here...]
```

**Example**:
```python
result = md.convert("lecture.mp3")
```

---

## Web Formats

### HTML (.html, .htm)

**Capabilities**:
- Clean HTML to Markdown conversion
- Link preservation
- Table conversion
- List formatting

**Best For**:
- Web pages
- Documentation
- Blog posts
- Online articles

**Output Format**: Clean Markdown with preserved links and structure

**Example**:
```python
result = md.convert("webpage.html")
```

---

### YouTube URLs

**Capabilities**:
- Fetch video transcriptions
- Extract video metadata
- Caption download

**Dependencies**:
```bash
pip install 'markitdown[youtube-transcription]'
```

**Best For**:
- Educational videos
- Lectures
- Talks
- Tutorials

**Example**:
```python
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
```

---

## Data Formats

### CSV (.csv)

**Capabilities**:
- Automatic table conversion
- Delimiter detection
- Header preservation

**Output Format**: Markdown tables

**Example**:
```python
result = md.convert("data.csv")
```

**Output**:
```markdown
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
```

---

### JSON (.json)

**Capabilities**:
- Structured representation
- Pretty formatting
- Nested data visualization

**Best For**:
- API responses
- Configuration files
- Data exports

**Example**:
```python
result = md.convert("data.json")
```

---

### XML (.xml)

**Capabilities**:
- Structure preservation
- Attribute extraction
- Formatted output

**Best For**:
- Configuration files
- Data interchange
- Structured documents

**Example**:
```python
result = md.convert("config.xml")
```

---

## Archive Formats

### ZIP (.zip)

**Capabilities**:
- Iterates through archive contents
- Converts each file individually
- Maintains directory structure in output

**Best For**:
- Document collections
- Project archives
- Batch conversions

**Output Format**:
```markdown
# Archive: documents.zip

## File: document1.pdf
[Content from document1.pdf...]

---

## File: document2.docx
[Content from document2.docx...]
```

**Example**:
```python
result = md.convert("archive.zip")
```

---

## E-book Formats

### EPUB (.epub)

**Capabilities**:
- Full text extraction
- Chapter structure
- Metadata extraction

**Best For**:
- E-books
- Digital publications
- Long-form content

**Output Format**: Markdown with preserved chapter structure

**Example**:
```python
result = md.convert("book.epub")
```

---

## Other Formats

### Outlook Messages (.msg)

**Capabilities**:
- Email content extraction
- Attachment listing
- Metadata (from, to, subject, date)

**Dependencies**:
```bash
pip install 'markitdown[outlook]'
```

**Best For**:
- Email archives
- Communication records

**Example**:
```python
result = md.convert("message.msg")
```

---

## Format-Specific Tips

### PDF Best Practices

1. **Use Azure Document Intelligence for complex layouts**:
```python
md = MarkItDown(docintel_endpoint="endpoint_url")
```

2. **For scanned PDFs, ensure OCR is set up**:
```bash
brew install tesseract  # macOS
```

3. **Split very large PDFs before conversion** for better performance

### PowerPoint Best Practices

1. **Use AI for visual content**:
```python
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

2. **Check speaker notes** - they're included in output

3. **Complex animations won't be captured** - static content only

### Excel Best Practices

1. **Large spreadsheets** may take time to convert

2. **Formulas are converted to their calculated values**

3. **Multiple sheets** are all included in output

4. **Charts become text descriptions** (use AI for better descriptions)

### Image Best Practices

1. **Use AI for meaningful descriptions**:
```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)
```

2. **For text-heavy images, ensure OCR dependencies** are installed

3. **High-resolution images** may take longer to process

### Audio Best Practices

1. **Clear audio** produces better transcriptions

2. **Long recordings** may take significant time

3. **Consider splitting long audio files** for faster processing (see the sketch below)

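A minimal splitting sketch using the third-party pydub library (an assumption, not a MarkItDown dependency): cut the recording into 10-minute chunks, convert each, and join the transcripts.

```python
from pydub import AudioSegment  # pip install pydub (mp3 support needs ffmpeg)
from markitdown import MarkItDown

md = MarkItDown()
audio = AudioSegment.from_file("lecture.mp3")

chunk_ms = 10 * 60 * 1000  # 10-minute chunks
transcripts = []
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk_path = f"chunk_{i}.wav"
    audio[start:start + chunk_ms].export(chunk_path, format="wav")
    transcripts.append(md.convert(chunk_path).text_content)

print("\n\n".join(transcripts))
```
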
---

## Unsupported Formats

If you need to convert an unsupported format:

1. **Create a custom converter** (see `api_reference.md`)
2. **Look for plugins** on GitHub (#markitdown-plugin)
3. **Pre-convert to a supported format** (e.g., convert .rtf to .docx)

---

## Format Detection

MarkItDown automatically detects the format from:

1. **File extension** (primary method)
2. **MIME type** (fallback)
3. **File signature** (magic bytes, fallback)

**Override detection**:
```python
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")

# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```

# Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

## Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

### Basic Image Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```

### Image Processing Features

**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)

### EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata that's automatically extracted:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```

**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation

### OCR (Optical Character Recognition)

Extract text from images containing text (screenshots, scanned documents, photos of text).

**Requirements:**
- Install the tesseract OCR engine:
```bash
# macOS
brew install tesseract

# Ubuntu/Debian
apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```

**Usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text
```

**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images

### LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```

**Custom prompts for specific needs:**

```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```

### Supported Image Formats

MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)

## Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

### Basic Audio Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech
```

### Audio Transcription Setup

**Installation:**
```bash
pip install 'markitdown[audio]'
```

This installs the `speech_recognition` library and dependencies.

### Supported Audio Formats

- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition

### Audio Transcription Engines

MarkItDown uses the `speech_recognition` library, which supports multiple backends.

**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")
```

**Note:** The default Google Speech Recognition backend requires an internet connection.

### Audio Quality Considerations

For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible

### Audio Preprocessing Tips

For better results, consider preprocessing audio:

```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("normalized.wav")
```

## Combined Media Workflows

### Processing Multiple Images in Batch

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Process all images in a directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")

        # Save markdown with the same name (assumes output/ already exists)
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)
```

### Screenshot Analysis Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing
```

### Document Images with OCR

For scanned documents or photos of documents:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into a single document
document = "\n\n---\n\n".join(full_text)
print(document)
```

### Presentation Slide Images

When you have presentation slides as images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```

## Error Handling

### Image Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```

### Audio Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```

## Performance Optimization

### Image Processing

- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel

### Audio Processing

- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: WAV/FLAC are faster than MP3/OGG
- **Network dependency**: Default transcription requires internet

## Use Cases

### Document Digitization
Convert scanned documents or photos of documents to searchable text.

### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.

### Presentation Analysis
Extract content from presentation slide images.

### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.

### Image Archiving
Extract metadata and content from photo collections.

### Accessibility
Generate alt-text descriptions for images using LLM integration (see the sketch below).

### Data Extraction
OCR text from images containing tables, forms, or structured data.

@@ -1,575 +0,0 @@

# Structured Data Handling Reference

This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.

## CSV Files

Convert CSV (Comma-Separated Values) files to Markdown tables.

### Basic CSV Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)
```

### CSV to Markdown Table

CSV files are automatically converted to Markdown table format:

**Input CSV (`data.csv`):**
```csv
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```

**Output Markdown:**
```markdown
| Name    | Age | City        |
|---------|-----|-------------|
| Alice   | 30  | New York    |
| Bob     | 25  | Los Angeles |
| Charlie | 35  | Chicago     |
```

### CSV Conversion Features

**What's preserved:**
- All column headers
- All data rows
- Cell values (text and numbers)
- Column structure

**Formatting:**
- Headers are bolded (Markdown table format)
- Columns are aligned
- Empty cells are preserved
- Special characters are escaped

### Large CSV Files

For large CSV files:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert large CSV
result = md.convert("large_dataset.csv")

# Save to file instead of printing
with open("output.md", "w") as f:
    f.write(result.text_content)
```

**Performance considerations:**
- Very large files may take time to process
- Consider previewing the first few rows for testing (see the sketch below)
- Memory usage scales with file size
- Very wide tables may not display well in all Markdown viewers
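
One way to preview, sketched with pandas (an assumption, not a MarkItDown dependency): write the first rows to a small temporary CSV and convert that instead.

```python
import pandas as pd
from markitdown import MarkItDown

md = MarkItDown()

# Convert only the first 20 rows as a quick preview
pd.read_csv("large_dataset.csv", nrows=20).to_csv("preview.csv", index=False)
result = md.convert("preview.csv")
print(result.text_content)
```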

### CSV with Special Characters

CSV files containing special characters are handled automatically:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles UTF-8, special characters, quotes, etc.
result = md.convert("international_data.csv")
```

### CSV Delimiters

Standard CSV delimiters are supported (a detection sketch follows the list):
- Comma (`,`) - standard
- Semicolon (`;`) - common in European formats
- Tab (`\t`) - TSV files
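
If the delimiter is unknown, the standard library's `csv.Sniffer` can detect it before conversion (`unknown_format.csv` is a placeholder):

```python
import csv

# Inspect a sample of the file to guess the delimiter
with open("unknown_format.csv", newline="") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")

print(f"Detected delimiter: {dialect.delimiter!r}")
```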

### Command-Line CSV Conversion

```bash
# Basic conversion
markitdown data.csv -o data.md

# Multiple CSV files
for file in *.csv; do
    markitdown "$file" -o "${file%.csv}.md"
done
```

## JSON Files

Convert JSON data to readable Markdown format.

### Basic JSON Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.json")
print(result.text_content)
```

### JSON Formatting

JSON is converted to a readable, structured Markdown format:

**Input JSON (`config.json`):**
```json
{
  "name": "MyApp",
  "version": "1.0.0",
  "dependencies": {
    "library1": "^2.0.0",
    "library2": "^3.1.0"
  },
  "features": ["auth", "api", "database"]
}
```

**Output Markdown:**
```markdown
## Configuration

**name:** MyApp
**version:** 1.0.0

### dependencies
- **library1:** ^2.0.0
- **library2:** ^3.1.0

### features
- auth
- api
- database
```

### JSON Array Handling

JSON arrays are converted to lists or tables:

**Array of objects:**
```json
[
  {"id": 1, "name": "Alice", "active": true},
  {"id": 2, "name": "Bob", "active": false}
]
```

**Converted to table:**
```markdown
| id | name  | active |
|----|-------|--------|
| 1  | Alice | true   |
| 2  | Bob   | false  |
```

### Nested JSON Structures

Nested JSON is converted with appropriate indentation and hierarchy:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles deeply nested structures
result = md.convert("complex_config.json")
print(result.text_content)
```

### JSON Lines (JSONL)

For JSON Lines format (one JSON object per line):

```python
from markitdown import MarkItDown
import json

md = MarkItDown()

# Read the JSONL file and convert each record
with open("data.jsonl", "r") as f:
    for line in f:
        obj = json.loads(line)

        # Write the object to a temporary JSON file
        with open("temp.json", "w") as temp:
            json.dump(obj, temp)

        result = md.convert("temp.json")
        print(result.text_content)
        print("\n---\n")
```

### Large JSON Files

For large JSON files:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert large JSON
result = md.convert("large_data.json")

# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)
```

## XML Files

Convert XML documents to structured Markdown.

### Basic XML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xml")
print(result.text_content)
```

### XML Structure Preservation

XML is converted to Markdown maintaining hierarchical structure:

**Input XML (`book.xml`):**
```xml
<?xml version="1.0"?>
<book>
  <title>Example Book</title>
  <author>John Doe</author>
  <chapters>
    <chapter id="1">
      <title>Introduction</title>
      <content>Chapter 1 content...</content>
    </chapter>
    <chapter id="2">
      <title>Background</title>
      <content>Chapter 2 content...</content>
    </chapter>
  </chapters>
</book>
```

**Output Markdown:**
```markdown
# book

## title
Example Book

## author
John Doe

## chapters

### chapter (id: 1)
#### title
Introduction

#### content
Chapter 1 content...

### chapter (id: 2)
#### title
Background

#### content
Chapter 2 content...
```

### XML Attributes

XML attributes are preserved in the conversion:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xml")
# Attributes are shown as (attr: value) in headings
```

### XML Namespaces

XML namespaces are handled:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles xmlns and namespaced elements
result = md.convert("namespaced.xml")
```

### XML Use Cases

**Configuration files:**
- Convert XML configs to readable format
- Document system configurations
- Compare configuration files

**Data interchange:**
- Convert XML API responses
- Process XML data feeds
- Transform between formats

**Document processing:**
- Convert DocBook to Markdown
- Process SVG descriptions
- Extract structured data

## Structured Data Workflows

### CSV Data Analysis Pipeline

```python
from markitdown import MarkItDown
import pandas as pd

md = MarkItDown()

# Read CSV for analysis
df = pd.read_csv("data.csv")

# Do analysis
summary = df.describe()

# Convert the original data to Markdown
original = md.convert("data.csv")

# Save the summary as CSV, then convert it too
summary.to_csv("summary.csv")
summary_md = md.convert("summary.csv")

print("## Original Data\n")
print(original.text_content)
print("\n## Statistical Summary\n")
print(summary_md.text_content)
```

### JSON API Documentation

```python
from markitdown import MarkItDown
import requests
import json

md = MarkItDown()

# Fetch JSON from API
response = requests.get("https://api.example.com/data")
data = response.json()

# Save as JSON
with open("api_response.json", "w") as f:
    json.dump(data, f, indent=2)

# Convert to Markdown
result = md.convert("api_response.json")

# Create documentation
doc = f"""# API Response Documentation

## Endpoint
GET https://api.example.com/data

## Response
{result.text_content}
"""

with open("api_docs.md", "w") as f:
    f.write(doc)
```

### XML to Markdown Documentation

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

# Convert XML documentation
xml_files = ["config.xml", "schema.xml", "data.xml"]
os.makedirs("docs", exist_ok=True)  # ensure the output directory exists

for xml_file in xml_files:
    result = md.convert(xml_file)

    output_name = xml_file.replace('.xml', '.md')
    with open(f"docs/{output_name}", "w") as f:
        f.write(result.text_content)
```

### Multi-Format Data Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def convert_structured_data(directory):
    """Convert all structured data files in directory."""
    extensions = {'.csv', '.json', '.xml'}
    os.makedirs("markdown", exist_ok=True)  # ensure the output directory exists

    for filename in os.listdir(directory):
        ext = os.path.splitext(filename)[1]

        if ext in extensions:
            input_path = os.path.join(directory, filename)
            result = md.convert(input_path)

            # Save Markdown
            output_name = filename.replace(ext, '.md')
            output_path = os.path.join("markdown", output_name)

            with open(output_path, 'w') as f:
                f.write(result.text_content)

            print(f"Converted: {filename} → {output_name}")

# Process all structured data
convert_structured_data("data")
```

### CSV to JSON to Markdown

```python
import pandas as pd
from markitdown import MarkItDown
import json

md = MarkItDown()

# Read CSV
df = pd.read_csv("data.csv")

# Convert to JSON
json_data = df.to_dict(orient='records')
with open("temp.json", "w") as f:
    json.dump(json_data, f, indent=2)

# Convert JSON to Markdown
result = md.convert("temp.json")
print(result.text_content)
```

### Database Export to Markdown

```python
from markitdown import MarkItDown
import sqlite3
import csv

md = MarkItDown()

# Export a database query to CSV
conn = sqlite3.connect("database.db")
cursor = conn.execute("SELECT * FROM users")

with open("users.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor.fetchall())

conn.close()

# Convert to Markdown
result = md.convert("users.csv")
print(result.text_content)
```

## Error Handling

### CSV Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("data.csv")
    print(result.text_content)
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"CSV conversion error: {e}")
    # Common issues: encoding problems, malformed CSV, delimiter issues
```

### JSON Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("data.json")
    print(result.text_content)
except Exception as e:
    print(f"JSON conversion error: {e}")
    # Common issues: invalid JSON syntax, encoding issues
```

### XML Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("data.xml")
    print(result.text_content)
except Exception as e:
    print(f"XML conversion error: {e}")
    # Common issues: malformed XML, encoding problems, namespace issues
```

## Best Practices

### CSV Processing
- Check the delimiter before conversion
- Verify encoding (UTF-8 recommended)
- Handle large files with streaming if needed
- Preview output for very wide tables

### JSON Processing
- Validate JSON before conversion (see the sketch below)
- Consider pretty-printing complex structures
- Handle circular references appropriately
- Be aware of large array performance
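
A minimal validation sketch using the standard library: parse the file with `json.load` first, so syntax errors surface before conversion.

```python
import json
from markitdown import MarkItDown

def convert_json_if_valid(path):
    with open(path, encoding="utf-8") as f:
        json.load(f)  # raises json.JSONDecodeError on invalid input
    return MarkItDown().convert(path).text_content

try:
    print(convert_json_if_valid("data.json"))
except json.JSONDecodeError as e:
    print(f"Invalid JSON: {e}")
```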

### XML Processing
- Validate XML structure first (see the sketch below)
- Handle namespaces consistently
- Consider XPath for selective extraction
- Be mindful of very deep nesting
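
The same idea for XML, sketched with the standard library's `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET
from markitdown import MarkItDown

def convert_xml_if_well_formed(path):
    ET.parse(path)  # raises ET.ParseError if the XML is malformed
    return MarkItDown().convert(path).text_content

try:
    print(convert_xml_if_well_formed("data.xml"))
except ET.ParseError as e:
    print(f"Malformed XML: {e}")
```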

### Data Quality
- Clean data before conversion when possible
- Handle missing values appropriately
- Verify special character handling
- Test with representative samples

### Performance
- Process large files in batches
- Use streaming for very large datasets
- Monitor memory usage
- Cache converted results when appropriate, as in the sketch below
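
A simple caching sketch (paths and layout are assumptions): skip the conversion when the Markdown output is already newer than the source file.

```python
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()

def convert_cached(src: Path, out_dir: Path) -> str:
    out = out_dir / f"{src.stem}.md"
    # Reuse the cached Markdown if it is newer than the source
    if out.exists() and out.stat().st_mtime >= src.stat().st_mtime:
        return out.read_text(encoding="utf-8")
    out_dir.mkdir(parents=True, exist_ok=True)
    text = md.convert(str(src)).text_content
    out.write_text(text, encoding="utf-8")
    return text
```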
@@ -1,478 +0,0 @@

# Web Content Extraction Reference

This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.

## HTML Conversion

Convert HTML files and web pages to clean Markdown format.

### Basic HTML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```

### HTML Processing Features

**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)

**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments

### HTML from URLs

Convert web pages directly from URLs:

```python
from markitdown import MarkItDown
import requests

md = MarkItDown()

# Fetch and convert a web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)

result = md.convert("temp.html")
print(result.text_content)
```

### Clean Web Article Extraction

For extracting main content from web articles:

```python
from markitdown import MarkItDown
import requests
from readability import Document  # pip install readability-lxml

md = MarkItDown()

# Fetch page
url = "https://example.com/article"
response = requests.get(url)

# Extract main content
doc = Document(response.content)
html_content = doc.summary()

# Save and convert
with open("article.html", "w") as f:
    f.write(html_content)

result = md.convert("article.html")
print(result.text_content)
```

### HTML with Images

HTML files containing images can be enhanced with LLM descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```

## YouTube Transcripts

Extract video transcripts from YouTube videos.

### Basic YouTube Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

### YouTube Installation

```bash
pip install 'markitdown[youtube]'
```

This installs the `youtube-transcript-api` dependency.

### YouTube URL Formats

MarkItDown supports various YouTube URL formats:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`

### YouTube Transcript Features

**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages

**Transcript languages:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# Get the transcript in a specific language (if available)
# Language codes: 'en', 'es', 'fr', 'de', etc.
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
```

### YouTube Playlist Processing

Process multiple videos from a playlist:

```python
from markitdown import MarkItDown

md = MarkItDown()

video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]

transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```

### YouTube Use Cases

**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases

**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content

**Accessibility:**
- Generate text versions of video content
- Create searchable video archives

### YouTube Limitations

- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing; a retry sketch follows
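
For bulk runs, a retry-with-backoff wrapper is one way to absorb transient failures (a sketch; the backoff schedule is arbitrary):

```python
import time
from markitdown import MarkItDown

md = MarkItDown()

def convert_with_retry(url, attempts=3):
    for attempt in range(attempts):
        try:
            return md.convert(url).text_content
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1s, 2s, ...
```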

## EPUB Books

Convert EPUB e-books to Markdown format.

### Basic EPUB Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```

### EPUB Processing Features

**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Tables of contents
- Footnotes and references

**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables

### EPUB with Images

EPUB files often contain images (covers, diagrams, illustrations):

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```

### EPUB Use Cases

**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries

**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts

**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech

## RSS Feeds

Process RSS feeds to extract article content.

### Basic RSS Processing

```python
from markitdown import MarkItDown
import feedparser

md = MarkItDown()

# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")

# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w") as f:
        f.write(entry.summary)

    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```

## Combined Web Content Workflows

### Web Scraping Pipeline

```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract main content
    main_content = soup.find('article') or soup.find('main')

    if main_content:
        # Save HTML
        with open("temp.html", "w") as f:
            f.write(str(main_content))

        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content

    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```

### YouTube Learning Content Extraction

```python
from markitdown import MarkItDown

md = MarkItDown()

# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]

course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into a course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w") as f:
    f.write(full_course)
```

### Documentation Scraping

```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []

    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)

        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)

        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })

    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```

### EPUB Library Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists

    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)

            try:
                result = md.convert(epub_path)

                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)

                with open(output_full, 'w') as f:
                    f.write(result.text_content)

                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```

## Error Handling

### HTML Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```

### YouTube Transcript Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: no transcript available, video unavailable, network error
```

### EPUB Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: corrupted file, unsupported DRM, invalid format
```

## Best Practices

### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately
- Remove unnecessary markup

### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching, as in the caching sketch below
- Respect YouTube's terms of service
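
A transcript cache sketch (the directory name and layout are assumptions): store each transcript under its video ID and check the cache before fetching.

```python
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
cache_dir = Path("transcripts")
cache_dir.mkdir(exist_ok=True)

def get_transcript(video_id: str) -> str:
    cached = cache_dir / f"{video_id}.md"
    if cached.exists():
        return cached.read_text(encoding="utf-8")  # skip the network entirely
    text = md.convert(f"https://youtube.com/watch?v={video_id}").text_content
    cached.write_text(text, encoding="utf-8")
    return text
```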

### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first

### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in the User-Agent header
- Cache results to minimize requests
- Follow website terms of service (a polite-scraper sketch follows)
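
A polite-fetch sketch combining several of these points (the User-Agent string and URLs are placeholders; `robotparser` is in the standard library):

```python
import time
from urllib import robotparser

import requests

USER_AGENT = "my-docs-scraper/0.1 (contact@example.com)"  # identify yourself

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

def polite_get(url):
    if not rp.can_fetch(USER_AGENT, url):
        return None  # robots.txt disallows this path
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(1.0)  # delay between requests
    return response
```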
@@ -1,317 +1,228 @@
#!/usr/bin/env python3
"""
Batch convert multiple files to Markdown using MarkItDown.

This script demonstrates how to efficiently convert multiple files
in a directory to Markdown format.
"""

import argparse
from pathlib import Path
from typing import List, Optional
from markitdown import MarkItDown
from concurrent.futures import ThreadPoolExecutor, as_completed
import sys


def convert_file(md: MarkItDown, file_path: Path, output_dir: Path, verbose: bool = False) -> tuple[bool, str, str]:
    """
    Convert a single file to Markdown.

    Args:
        md: MarkItDown instance
        file_path: Path to input file
        output_dir: Directory for output files
        verbose: Print detailed messages

    Returns:
        Tuple of (success, input_path, message)
    """
    try:
        if verbose:
            print(f"Converting: {file_path}")

        # Convert file
        result = md.convert(str(file_path))

        # Create output path
        output_file = output_dir / f"{file_path.stem}.md"

        # Write content with metadata header
        content = f"# {result.title or file_path.stem}\n\n"
        content += f"**Source**: {file_path.name}\n"
        content += f"**Format**: {file_path.suffix}\n\n"
        content += "---\n\n"
        content += result.text_content

        output_file.write_text(content, encoding='utf-8')

        return True, str(file_path), f"✓ Converted to {output_file.name}"

    except Exception as e:
        return False, str(file_path), f"✗ Error: {str(e)}"


def batch_convert(
    input_dir: Path,
    output_dir: Path,
    extensions: Optional[List[str]] = None,
    recursive: bool = False,
    workers: int = 4,
    verbose: bool = False,
    enable_plugins: bool = False
) -> dict:
    """
    Batch convert files in a directory.

    Args:
        input_dir: Input directory
        output_dir: Output directory
        extensions: List of file extensions to convert (e.g., ['.pdf', '.docx'])
        recursive: Search subdirectories
        workers: Number of parallel workers
        verbose: Print detailed messages
        enable_plugins: Enable MarkItDown plugins

    Returns:
        Dictionary with conversion statistics
    """
    # Create output directory
    output_dir.mkdir(parents=True, exist_ok=True)

    # Default extensions if not specified
    if extensions is None:
        extensions = ['.pdf', '.docx', '.pptx', '.xlsx', '.html', '.jpg', '.png']

    # Find files
    files = []
    if recursive:
        for ext in extensions:
            files.extend(input_dir.rglob(f"*{ext}"))
    else:
        for ext in extensions:
            files.extend(input_dir.glob(f"*{ext}"))

    if not files:
        print(f"No files found with extensions: {', '.join(extensions)}")
        return {'total': 0, 'success': 0, 'failed': 0}

    print(f"Found {len(files)} file(s) to convert")

    # Create MarkItDown instance
    md = MarkItDown(enable_plugins=enable_plugins)

    # Convert files in parallel
    results = {
        'total': len(files),
        'success': 0,
        'failed': 0,
        'details': []
    }

    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {
            executor.submit(convert_file, md, file_path, output_dir, verbose): file_path
            for file_path in files
        }

        for future in as_completed(futures):
            success, path, message = future.result()

            if success:
                results['success'] += 1
            else:
                results['failed'] += 1

            results['details'].append({
                'file': path,
                'success': success,
                'message': message
            })

            print(message)

    return results


def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(
        description="Batch convert files to Markdown using MarkItDown",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Basic usage
  python batch_convert.py documents/ output/

  # Convert all PDFs in a directory
  python batch_convert.py papers/ output/ --extensions .pdf

  # Convert multiple formats recursively
  python batch_convert.py documents/ markdown/ --extensions .pdf .docx .pptx -r

  # Use 8 parallel workers
  python batch_convert.py input/ output/ --workers 8

  # Enable plugins
  python batch_convert.py input/ output/ --plugins
"""
    )

    parser.add_argument('input_dir', type=Path, help='Input directory')
    parser.add_argument('output_dir', type=Path, help='Output directory')
    parser.add_argument(
        '--extensions', '-e',
        nargs='+',
        help='File extensions to convert (e.g., .pdf .docx)'
    )
    parser.add_argument(
        '--recursive', '-r',
        action='store_true',
        help='Search subdirectories recursively'
    )
    parser.add_argument(
        '--workers', '-w',
        type=int,
        default=4,
        help='Number of parallel workers (default: 4)'
    )
    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
        help='Verbose output'
    )
    parser.add_argument(
        '--plugins', '-p',
        action='store_true',
        help='Enable MarkItDown plugins'
    )

    args = parser.parse_args()

    # Validate input directory
    if not args.input_dir.exists():
        print(f"Error: Input directory '{args.input_dir}' does not exist")
        sys.exit(1)

    if not args.input_dir.is_dir():
        print(f"Error: '{args.input_dir}' is not a directory")
        sys.exit(1)

    # Run batch conversion
    results = batch_convert(
        input_dir=args.input_dir,
        output_dir=args.output_dir,
        extensions=args.extensions,
        recursive=args.recursive,
        workers=args.workers,
        verbose=args.verbose,
        enable_plugins=args.plugins
    )

    # Print summary
    print("\n" + "="*50)
    print("CONVERSION SUMMARY")
    print("="*50)
    print(f"Total files: {results['total']}")
    print(f"Successful: {results['success']}")
    print(f"Failed: {results['failed']}")
    print(f"Success rate: {results['success']/results['total']*100:.1f}%" if results['total'] > 0 else "N/A")

    # Show failed files if any
    if results['failed'] > 0:
        print("\nFailed conversions:")
        for detail in results['details']:
            if not detail['success']:
                print(f"  - {detail['file']}: {detail['message']}")

    sys.exit(0 if results['failed'] == 0 else 1)


if __name__ == '__main__':
    main()

scientific-skills/markitdown/scripts/convert_literature.py (new executable file, 283 lines)
@@ -0,0 +1,283 @@
#!/usr/bin/env python3
"""
Convert scientific literature PDFs to Markdown for analysis and review.

This script is specifically designed for converting academic papers,
organizing them, and preparing them for literature review workflows.
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import List, Dict, Optional
from markitdown import MarkItDown
from datetime import datetime


def extract_metadata_from_filename(filename: str) -> Dict[str, str]:
    """
    Try to extract metadata from a filename.
    Supports patterns like: Author_Year_Title.pdf
    """
    metadata = {}

    # Remove extension
    name = Path(filename).stem

    # Try to extract year
    year_match = re.search(r'\b(19|20)\d{2}\b', name)
    if year_match:
        metadata['year'] = year_match.group()

    # Split by underscores or dashes
    parts = re.split(r'[_\-]', name)
    if len(parts) >= 2:
        metadata['author'] = parts[0].replace('_', ' ')
        metadata['title'] = ' '.join(parts[1:]).replace('_', ' ')
    else:
        metadata['title'] = name.replace('_', ' ')

    return metadata


def convert_paper(
    md: MarkItDown,
    input_file: Path,
    output_dir: Path,
    organize_by_year: bool = False
) -> tuple[bool, Dict]:
    """
    Convert a single paper to Markdown with metadata extraction.

    Args:
        md: MarkItDown instance
        input_file: Path to PDF file
        output_dir: Output directory
        organize_by_year: Organize into year subdirectories

    Returns:
        Tuple of (success, metadata_dict)
    """
    try:
        print(f"Converting: {input_file.name}")

        # Convert to Markdown
        result = md.convert(str(input_file))

        # Extract metadata from filename
        metadata = extract_metadata_from_filename(input_file.name)
        metadata['source_file'] = input_file.name
        metadata['converted_date'] = datetime.now().isoformat()

        # Try to extract title from content if not in filename
        if 'title' not in metadata and result.title:
            metadata['title'] = result.title

        # Create output path
        if organize_by_year and 'year' in metadata:
            output_subdir = output_dir / metadata['year']
            output_subdir.mkdir(parents=True, exist_ok=True)
        else:
            output_subdir = output_dir
            output_subdir.mkdir(parents=True, exist_ok=True)

        output_file = output_subdir / f"{input_file.stem}.md"

        # Create formatted Markdown with front matter
        content = "---\n"
        content += f"title: \"{metadata.get('title', input_file.stem)}\"\n"
        if 'author' in metadata:
            content += f"author: \"{metadata['author']}\"\n"
        if 'year' in metadata:
            content += f"year: {metadata['year']}\n"
        content += f"source: \"{metadata['source_file']}\"\n"
        content += f"converted: \"{metadata['converted_date']}\"\n"
        content += "---\n\n"

        # Add title
        content += f"# {metadata.get('title', input_file.stem)}\n\n"

        # Add metadata section
        content += "## Document Information\n\n"
        if 'author' in metadata:
            content += f"**Author**: {metadata['author']}\n"
        if 'year' in metadata:
            content += f"**Year**: {metadata['year']}\n"
        content += f"**Source File**: {metadata['source_file']}\n"
        content += f"**Converted**: {metadata['converted_date']}\n\n"
        content += "---\n\n"

        # Add content
        content += result.text_content

        # Write to file
        output_file.write_text(content, encoding='utf-8')

        print(f"✓ Saved to: {output_file}")

        return True, metadata

    except Exception as e:
        print(f"✗ Error converting {input_file.name}: {str(e)}")
        return False, {'source_file': input_file.name, 'error': str(e)}


def create_index(papers: List[Dict], output_dir: Path):
    """Create an index/catalog of all converted papers."""

    # Sort by year (if available) and title
    papers_sorted = sorted(
        papers,
        key=lambda x: (x.get('year', '9999'), x.get('title', ''))
    )

    # Create Markdown index
    index_content = "# Literature Review Index\n\n"
    index_content += f"**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
    index_content += f"**Total Papers**: {len(papers)}\n\n"
    index_content += "---\n\n"

    # Group by year
    by_year = {}
    for paper in papers_sorted:
        year = paper.get('year', 'Unknown')
        if year not in by_year:
            by_year[year] = []
        by_year[year].append(paper)

    # Write by year
    for year in sorted(by_year.keys()):
        index_content += f"## {year}\n\n"
        for paper in by_year[year]:
            title = paper.get('title', paper.get('source_file', 'Unknown'))
            author = paper.get('author', 'Unknown Author')
            source = paper.get('source_file', '')

            # Create link to markdown file
            md_file = Path(source).stem + ".md"
            if 'year' in paper and paper['year'] != 'Unknown':
                md_file = f"{paper['year']}/{md_file}"

            index_content += f"- **{title}**\n"
            index_content += f"  - Author: {author}\n"
            index_content += f"  - Source: {source}\n"
            index_content += f"  - [Read Markdown]({md_file})\n\n"

    # Write index
    index_file = output_dir / "INDEX.md"
    index_file.write_text(index_content, encoding='utf-8')
    print(f"\n✓ Created index: {index_file}")

    # Also create JSON catalog
    catalog_file = output_dir / "catalog.json"
    with open(catalog_file, 'w', encoding='utf-8') as f:
        json.dump(papers_sorted, f, indent=2, ensure_ascii=False)
    print(f"✓ Created catalog: {catalog_file}")


def main():
    parser = argparse.ArgumentParser(
        description="Convert scientific literature PDFs to Markdown",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Convert all PDFs in a directory
  python convert_literature.py papers/ output/

  # Organize by year
  python convert_literature.py papers/ output/ --organize-by-year

  # Create index of all papers
  python convert_literature.py papers/ output/ --create-index

Filename Conventions:
  For best results, name your PDFs using this pattern:
    Author_Year_Title.pdf

  Examples:
    Smith_2023_Machine_Learning_Applications.pdf
    Jones_2022_Climate_Change_Analysis.pdf
"""
    )

    parser.add_argument('input_dir', type=Path, help='Directory with PDF files')
    parser.add_argument('output_dir', type=Path, help='Output directory for Markdown files')
    parser.add_argument(
        '--organize-by-year', '-y',
        action='store_true',
        help='Organize output into year subdirectories'
    )
    parser.add_argument(
        '--create-index', '-i',
        action='store_true',
        help='Create an index/catalog of all papers'
    )
    parser.add_argument(
        '--recursive', '-r',
        action='store_true',
        help='Search subdirectories recursively'
    )

    args = parser.parse_args()

    # Validate input
    if not args.input_dir.exists():
        print(f"Error: Input directory '{args.input_dir}' does not exist")
        sys.exit(1)

    if not args.input_dir.is_dir():
        print(f"Error: '{args.input_dir}' is not a directory")
        sys.exit(1)

    # Find PDF files
    if args.recursive:
        pdf_files = list(args.input_dir.rglob("*.pdf"))
    else:
        pdf_files = list(args.input_dir.glob("*.pdf"))

    if not pdf_files:
        print("No PDF files found")
        sys.exit(1)

    print(f"Found {len(pdf_files)} PDF file(s)")

    # Create MarkItDown instance
    md = MarkItDown()

    # Convert all papers
    results = []
    success_count = 0

    for pdf_file in pdf_files:
        success, metadata = convert_paper(
            md,
            pdf_file,
            args.output_dir,
            args.organize_by_year
        )

        if success:
            success_count += 1
            results.append(metadata)

    # Create index if requested
    if args.create_index and results:
        create_index(results, args.output_dir)

    # Print summary
    print("\n" + "="*50)
    print("CONVERSION SUMMARY")
    print("="*50)
    print(f"Total papers: {len(pdf_files)}")
    print(f"Successful: {success_count}")
    print(f"Failed: {len(pdf_files) - success_count}")
    print(f"Success rate: {success_count/len(pdf_files)*100:.1f}%")

    sys.exit(0 if success_count == len(pdf_files) else 1)


if __name__ == '__main__':
    main()

scientific-skills/markitdown/scripts/convert_with_ai.py (new executable file, 243 lines)
@@ -0,0 +1,243 @@
#!/usr/bin/env python3
"""
Convert documents to Markdown with AI-enhanced image descriptions.

This script demonstrates how to use MarkItDown with OpenRouter to generate
detailed descriptions of images in documents (PowerPoint, PDFs with images, etc.)
"""

import argparse
import os
import sys
from pathlib import Path
from markitdown import MarkItDown
from openai import OpenAI


# Predefined prompts for different use cases
PROMPTS = {
    'scientific': """
Analyze this scientific image or diagram. Provide:
1. Type of visualization (graph, chart, microscopy, diagram, etc.)
2. Key data points, trends, or patterns
3. Axes labels, legends, and scales
4. Notable features or findings
5. Scientific context and significance
Be precise, technical, and detailed.
""".strip(),

    'presentation': """
Describe this presentation slide image. Include:
1. Main visual elements and their arrangement
2. Key points or messages conveyed
3. Data or information presented
4. Visual hierarchy and emphasis
Keep the description clear and informative.
""".strip(),

    'general': """
Describe this image in detail. Include:
1. Main subjects and objects
2. Visual composition and layout
3. Text content (if any)
4. Notable details
5. Overall context and purpose
Be comprehensive and accurate.
""".strip(),

    'data_viz': """
Analyze this data visualization. Provide:
1. Type of chart/graph (bar, line, scatter, pie, etc.)
2. Variables and axes
3. Data ranges and scales
4. Key patterns, trends, or outliers
5. Statistical insights
Focus on quantitative accuracy.
""".strip(),

    'medical': """
Describe this medical image. Include:
1. Type of medical imaging (X-ray, MRI, CT, microscopy, etc.)
2. Anatomical structures visible
3. Notable findings or abnormalities
4. Image quality and contrast
5. Clinical relevance
Be professional and precise.
""".strip()
}


def convert_with_ai(
    input_file: Path,
    output_file: Path,
    api_key: str,
    model: str = "anthropic/claude-sonnet-4.5",
    prompt_type: str = "general",
    custom_prompt: str = None
) -> bool:
    """
    Convert a file to Markdown with AI image descriptions.

    Args:
        input_file: Path to input file
        output_file: Path to output Markdown file
        api_key: OpenRouter API key
        model: Model name (default: anthropic/claude-sonnet-4.5)
        prompt_type: Type of prompt to use
        custom_prompt: Custom prompt (overrides prompt_type)

    Returns:
        True if successful, False otherwise
    """
    try:
        # Initialize OpenRouter client (OpenAI-compatible)
        client = OpenAI(
            api_key=api_key,
            base_url="https://openrouter.ai/api/v1"
        )

        # Select prompt
        if custom_prompt:
            prompt = custom_prompt
        else:
            prompt = PROMPTS.get(prompt_type, PROMPTS['general'])

        print(f"Using model: {model}")
        print(f"Prompt type: {prompt_type if not custom_prompt else 'custom'}")
        print(f"Converting: {input_file}")

        # Create MarkItDown with AI support
        md = MarkItDown(
            llm_client=client,
            llm_model=model,
            llm_prompt=prompt
        )

        # Convert file
        result = md.convert(str(input_file))

        # Create output with metadata
        content = f"# {result.title or input_file.stem}\n\n"
        content += f"**Source**: {input_file.name}\n"
        content += f"**Format**: {input_file.suffix}\n"
        content += f"**AI Model**: {model}\n"
        content += f"**Prompt Type**: {prompt_type if not custom_prompt else 'custom'}\n\n"
        content += "---\n\n"
        content += result.text_content

        # Write output
        output_file.parent.mkdir(parents=True, exist_ok=True)
        output_file.write_text(content, encoding='utf-8')

        print(f"✓ Successfully converted to: {output_file}")
        return True

    except Exception as e:
        print(f"✗ Error: {str(e)}", file=sys.stderr)
        return False


def main():
    parser = argparse.ArgumentParser(
        description="Convert documents to Markdown with AI-enhanced image descriptions",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Available prompt types:
  scientific   - For scientific diagrams, graphs, and charts
  presentation - For presentation slides
  general      - General-purpose image description
  data_viz     - For data visualizations and charts
  medical      - For medical imaging

Examples:
  # Convert a scientific paper
  python convert_with_ai.py paper.pdf output.md --prompt-type scientific

  # Convert a presentation with a specific model
  python convert_with_ai.py slides.pptx slides.md --model anthropic/claude-sonnet-4.5 --prompt-type presentation

  # Use a custom prompt with an advanced vision model
  python convert_with_ai.py diagram.png diagram.md --model anthropic/claude-sonnet-4.5 --custom-prompt "Describe this technical diagram"

  # Set API key via environment variable
  export OPENROUTER_API_KEY="sk-or-v1-..."
  python convert_with_ai.py image.jpg image.md

Environment Variables:
  OPENROUTER_API_KEY    OpenRouter API key (required if not passed via --api-key)

Popular Models (use with --model):
  anthropic/claude-sonnet-4.5 - Recommended for scientific vision
  anthropic/claude-opus-4.5   - Advanced vision model
  openai/gpt-4o               - GPT-4 Omni (vision support)
  openai/gpt-4-vision         - GPT-4 Vision
  google/gemini-pro-vision    - Gemini Pro Vision
"""
    )

    parser.add_argument('input', type=Path, help='Input file')
    parser.add_argument('output', type=Path, help='Output Markdown file')
    parser.add_argument(
        '--api-key', '-k',
        help='OpenRouter API key (or set OPENROUTER_API_KEY env var)'
    )
    parser.add_argument(
        '--model', '-m',
        default='anthropic/claude-sonnet-4.5',
        help='Model to use via OpenRouter (default: anthropic/claude-sonnet-4.5)'
    )
    parser.add_argument(
        '--prompt-type', '-t',
        choices=list(PROMPTS.keys()),
        default='general',
        help='Type of prompt to use (default: general)'
    )
    parser.add_argument(
        '--custom-prompt', '-p',
        help='Custom prompt (overrides --prompt-type)'
    )
    parser.add_argument(
        '--list-prompts', '-l',
        action='store_true',
        help='List available prompt types and exit'
    )

    args = parser.parse_args()

    # List prompts and exit
    if args.list_prompts:
        print("Available prompt types:\n")
        for name, prompt in PROMPTS.items():
            print(f"[{name}]")
            print(prompt)
            print("\n" + "="*60 + "\n")
        sys.exit(0)

    # Get API key
    api_key = args.api_key or os.environ.get('OPENROUTER_API_KEY')
    if not api_key:
        print("Error: OpenRouter API key required. Set OPENROUTER_API_KEY environment variable or use --api-key")
        print("Get your API key at: https://openrouter.ai/keys")
        sys.exit(1)

    # Validate input file
    if not args.input.exists():
        print(f"Error: Input file '{args.input}' does not exist")
        sys.exit(1)

    # Convert file
    success = convert_with_ai(
        input_file=args.input,
        output_file=args.output,
        api_key=api_key,
        model=args.model,
        prompt_type=args.prompt_type,
|
||||
custom_prompt=args.custom_prompt
|
||||
)
|
||||
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
||||
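
One behavior of `main()` worth noting: `input` and `output` are required positional arguments, so `--list-prompts` still needs placeholder paths; because the listing branch runs before the API-key lookup, no key is required for it. A minimal invocation might look like this:

```bash
# Inspect the built-in prompts without an API key; the positional
# arguments are placeholders since --list-prompts exits before conversion.
python convert_with_ai.py placeholder.pdf placeholder.md --list-prompts
```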
@@ -1,8 +1,3 @@
---
name: scholar-evaluation
description: Systematic framework for evaluating scholarly and research work based on the ScholarEval methodology. This skill should be used when assessing research papers, evaluating literature reviews, scoring research methodologies, analyzing scientific writing quality, or applying structured evaluation criteria to academic work. Provides comprehensive assessment across multiple dimensions including problem formulation, literature review, methodology, data collection, analysis, results interpretation, and scholarly writing quality.
---

# Scholar Evaluation

## Overview

@@ -19,6 +14,43 @@ Use this skill when:
- Evaluating scholarly writing and presentation
- Providing structured feedback on academic work
- Benchmarking research quality against established criteria
- Assessing publication readiness for target venues
- Providing quantitative evaluation to complement qualitative peer review

## Visual Enhancement with Scientific Schematics

**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic

**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Evaluation framework diagrams
- Quality assessment criteria decision trees
- Scholarly workflow visualizations
- Assessment methodology flowcharts
- Scoring rubric visualizations
- Evaluation process diagrams
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Evaluation Workflow

@@ -187,7 +219,7 @@ Search patterns for quick access:
Python script for calculating aggregate evaluation scores from dimension-level ratings. Supports weighted averaging, threshold analysis, and score visualization.

Usage:
```python
```bash
python scripts/calculate_scores.py --scores <dimension_scores.json> --output <report.txt>
```
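
A minimal end-to-end sketch is shown below. The JSON layout (dimension names mapped to numeric scores) is an assumption for illustration only, not taken from the script; check the script's `--help` output for the schema it actually expects.

```bash
# Hypothetical input file: this schema is assumed, not confirmed by the script
cat > dimension_scores.json << 'EOF'
{
  "problem_formulation": 4.0,
  "literature_review": 3.5,
  "methodology": 4.5,
  "analysis": 3.0
}
EOF

# Aggregate the dimension-level ratings into a report
python scripts/calculate_scores.py --scores dimension_scores.json --output report.txt
```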

@@ -221,12 +253,32 @@ python scripts/calculate_scores.py --scores <dimension_scores.json> --output <report.txt>
- Writing could improve clarity in results section
6. Provide prioritized recommendations with specific suggestions

## Integration with Scientific Writer

This skill integrates seamlessly with the scientific writer workflow:

**After Paper Generation:**
- Use Scholar Evaluation as an alternative or complement to peer review
- Generate `SCHOLAR_EVALUATION.md` alongside `PEER_REVIEW.md`
- Provide quantitative scores to track improvement across revisions

**During Revision:**
- Re-evaluate specific dimensions after addressing feedback
- Track score improvements over multiple versions
- Identify persistent weaknesses requiring attention

**Publication Preparation:**
- Assess readiness for target journal/conference
- Identify gaps before submission
- Benchmark against publication standards

## Notes

- Evaluation rigor should match the work's purpose and stage
- Some dimensions may not apply to all work types (e.g., data collection for purely theoretical papers)
- Cultural and disciplinary differences in scholarly norms should be considered
- This framework complements, not replaces, domain-specific expertise
- Use in combination with peer-review skill for comprehensive assessment

## Citation

scientific-skills/scholar-evaluation/scripts/calculate_scores.py (0 changes; Executable file → Normal file)
@@ -1,6 +1,7 @@
---
name: scientific-critical-thinking
description: "Evaluate research rigor. Assess methodology, experimental design, statistical validity, biases, confounding, and evidence quality (GRADE, Cochrane ROB) for critical analysis of scientific claims."
allowed-tools: [Read, Write, Edit, Bash]
---

# Scientific Critical Thinking

@@ -20,6 +21,41 @@ This skill should be used when:
- Applying GRADE or Cochrane risk of bias assessments
- Providing critical analysis of research papers

## Visual Enhancement with Scientific Schematics

**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic

**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Critical thinking framework diagrams
- Bias identification decision trees
- Evidence quality assessment flowcharts
- GRADE assessment methodology diagrams
- Risk of bias evaluation frameworks
- Validity assessment visualizations
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Core Capabilities

### 1. Methodology Critique