mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Remove obsolete DOCX documentation and related files, including tutorials, license, and schema definitions, to streamline the project structure.
@@ -1,231 +0,0 @@
---
name: docx
description: Document toolkit (.docx). Create/edit documents, tracked changes, comments, formatting preservation, text extraction, for professional document processing.
license: Proprietary. LICENSE.txt has complete terms
---

# DOCX creation, editing, and analysis

## Overview

A .docx file is a ZIP archive containing XML files and resources. Create, edit, or analyze Word documents using text extraction, raw XML access, or redlining workflows. Apply this skill for professional document processing, tracked changes, and content manipulation.
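
To make the "ZIP archive of XML parts" idea concrete, here is a minimal sketch using only the Python standard library. It builds a tiny stand-in archive in memory rather than reading a real document, so the part contents and the helper names `build_sample_archive` and `list_parts` are illustrative assumptions, not part of this skill's scripts.

```python
import io
import zipfile

# Hypothetical stand-in for a .docx: a real file contains many more parts.
SAMPLE_PARTS = {
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<w:document/>",
    "word/comments.xml": "<w:comments/>",
}


def build_sample_archive(parts):
    """Write the given parts into an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in parts.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def list_parts(docx_bytes):
    """Return the sorted names of all parts stored in a .docx-style ZIP."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


parts = list_parts(build_sample_archive(SAMPLE_PARTS))
print(parts)
```

For a real file, the same `list_parts` call should work on the bytes returned by `open(path, "rb").read()`.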

## Visual Enhancement with Scientific Schematics

**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic

**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Document workflow diagrams
- Process flowcharts
- System architecture illustrations
- Data flow diagrams
- Organizational structure diagrams
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Workflow Decision Tree

### Reading/Analyzing Content
Use "Text extraction" or "Raw XML access" sections below

### Creating New Document
Use "Creating a new Word document" workflow

### Editing Existing Document
- **Your own document + simple changes**
  Use "Basic OOXML editing" workflow

- **Someone else's document**
  Use **"Redlining workflow"** (recommended default)

- **Legal, academic, business, or government docs**
  Use **"Redlining workflow"** (required)

## Reading and analyzing content

### Text extraction
To read the text contents of a document, convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:

```bash
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
```

### Raw XML access
Raw XML access is required for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, unpack a document and read its raw XML contents.

#### Unpacking a file
`python ooxml/scripts/unpack.py <office_file> <output_directory>`

#### Key file structures
* `word/document.xml` - Main document contents
* `word/comments.xml` - Comments referenced in document.xml
* `word/media/` - Embedded images and media files
* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags
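
As a rough illustration of how the `<w:ins>` and `<w:del>` tags can be read back out of `word/document.xml`, the sketch below parses a small hand-written XML fragment. The sample XML, the "Reviewer" author, and the function name `summarize_tracked_changes` are assumptions made for the example. It uses the standard library parser to stay dependency-free; for untrusted documents, the `defusedxml` package listed under Dependencies is the safer choice.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment: one deletion ("30") and one insertion ("60").
SAMPLE_XML = f"""<w:document xmlns:w="{W_NS}"><w:body><w:p>
<w:r><w:t xml:space="preserve">The term is </w:t></w:r>
<w:del w:id="1" w:author="Reviewer"><w:r><w:delText>30</w:delText></w:r></w:del>
<w:ins w:id="2" w:author="Reviewer"><w:r><w:t>60</w:t></w:r></w:ins>
<w:r><w:t xml:space="preserve"> days.</w:t></w:r>
</w:p></w:body></w:document>"""


def _joined_text(element, tag):
    """Concatenate the text of every descendant with the given w: tag."""
    return "".join(node.text or "" for node in element.iter(f"{{{W_NS}}}{tag}"))


def summarize_tracked_changes(document_xml):
    """Collect inserted and deleted text from a document.xml string."""
    root = ET.fromstring(document_xml)
    inserted = [_joined_text(ins, "t") for ins in root.iter(f"{{{W_NS}}}ins")]
    deleted = [_joined_text(dele, "delText") for dele in root.iter(f"{{{W_NS}}}del")]
    return {"inserted": inserted, "deleted": deleted}


summary = summarize_tracked_changes(SAMPLE_XML)
print(summary)
```

Note that deleted text lives in `<w:delText>` rather than `<w:t>`, which is why the two lookups use different tags.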

## Creating a new Word document

When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.

### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
3. Export as .docx using Packer.toBuffer()

## Editing an existing Word document

When editing an existing Word document, use the **Document library** (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides methods for document manipulation. For complex scenarios, you can access the underlying DOM directly through the library.

### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files.
2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
3. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
4. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`

The Document library provides both high-level methods for common operations and direct DOM access for complex scenarios.

## Redlining workflow for document review

This workflow allows planning comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, implement ALL changes systematically.

**Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.

**Principle: Minimal, Precise Edits**
When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.

Example - Changing "30 days" to "60 days" in a sentence:
```python
# BAD - Replaces entire sentence
'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'

# GOOD - Only marks what changed, preserves original <w:r> for unchanged text
'<w:r w:rsidR="00AB12CD"><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r w:rsidR="00AB12CD"><w:t> days.</w:t></w:r>'
```
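
One way to find the [unchanged text] + [deletion] + [insertion] + [unchanged text] split is to compare the old and new sentences token by token and trim the shared prefix and suffix. The sketch below is an assumption about how that could be done, not the Document library's own method; `split_minimal_edit` is a hypothetical helper, and it only handles a single contiguous change.

```python
import re


def split_minimal_edit(old, new):
    """Split a replacement into (unchanged prefix, deleted, inserted, unchanged suffix).

    Works on word-level tokens so that "30" -> "60" is marked as a whole,
    rather than only the differing character.
    """
    def tokenize(text):
        # Runs of whitespace, runs of word characters, or single punctuation marks.
        return re.findall(r"\s+|\w+|[^\w\s]", text)

    old_tokens, new_tokens = tokenize(old), tokenize(new)
    limit = min(len(old_tokens), len(new_tokens))

    # Length of the shared leading run of tokens.
    prefix = 0
    while prefix < limit and old_tokens[prefix] == new_tokens[prefix]:
        prefix += 1

    # Length of the shared trailing run, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and old_tokens[-1 - suffix] == new_tokens[-1 - suffix]:
        suffix += 1

    old_end = len(old_tokens) - suffix
    new_end = len(new_tokens) - suffix
    return (
        "".join(old_tokens[:prefix]),
        "".join(old_tokens[prefix:old_end]),
        "".join(new_tokens[prefix:new_end]),
        "".join(old_tokens[old_end:]),
    )


result = split_minimal_edit("The term is 30 days.", "The term is 60 days.")
print(result)
```

The four parts map directly onto the GOOD pattern above: the first and last go into reused `<w:r>` runs, the second into `<w:del>`, and the third into `<w:ins>`.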

### Tracked changes workflow

1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:
   ```bash
   pandoc --track-changes=all path-to-file.docx -o current.md
   ```

2. **Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches:

   **Location methods** (for finding changes in XML):
   - Section/heading numbers (e.g., "Section 3.2", "Article IV")
   - Paragraph identifiers if numbered
   - Grep patterns with unique surrounding text
   - Document structure (e.g., "first paragraph", "signature block")
   - **DO NOT use markdown line numbers** - they don't map to XML structure

   **Batch organization** (group 3-10 related changes per batch):
   - By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
   - By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
   - By complexity: Start with simple text replacements, then tackle complex structural changes
   - Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"

3. **Read documentation and unpack**:
   - **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
   - **Unpack the document**: `python ooxml/scripts/unpack.py <file.docx> <dir>`
   - **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.

4. **Implement changes in batches**: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach:
   - Makes debugging easier (smaller batch = easier to isolate errors)
   - Allows incremental progress
   - Maintains efficiency (batch size of 3-10 changes works well)

   **Suggested batch groupings:**
   - By document section (e.g., "Section 3 changes", "Definitions", "Termination clause")
   - By change type (e.g., "Date changes", "Party name updates", "Legal term replacements")
   - By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document")

   For each batch of related changes:

   **a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `<w:r>` elements.

   **b. Create and run script**: Use `get_node` to find nodes, implement changes, then `doc.save()`. See **"Document Library"** section in ooxml.md for patterns.

   **Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.

5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx:
   ```bash
   python ooxml/scripts/pack.py unpacked reviewed-document.docx
   ```

6. **Final verification**: Do a comprehensive check of the complete document:
   - Convert final document to markdown:
     ```bash
     pandoc --track-changes=all reviewed-document.docx -o verification.md
     ```
   - Verify ALL changes were applied correctly:
     ```bash
     grep "original phrase" verification.md  # Should NOT find it
     grep "replacement phrase" verification.md  # Should find it
     ```
   - Check that no unintended changes were introduced


## Converting Documents to Images

To visually analyze Word documents, convert them to images using a two-step process:

1. **Convert DOCX to PDF**:
   ```bash
   soffice --headless --convert-to pdf document.docx
   ```

2. **Convert PDF pages to JPEG images**:
   ```bash
   pdftoppm -jpeg -r 150 document.pdf page
   ```
   This creates files like `page-1.jpg`, `page-2.jpg`, etc.

Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `page`: Prefix for output files

Example for specific range:
```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page  # Converts only pages 2-5
```
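
When the two conversion steps are driven from a script, it can help to assemble the argument lists in one place. The following is a small sketch that only builds the commands shown above; `build_conversion_commands` is a hypothetical helper, and it assumes LibreOffice writes the PDF into the current working directory under the input file's base name, which is worth confirming on your system.

```python
from pathlib import PurePath


def build_conversion_commands(docx_path, dpi=150, first_page=None, last_page=None, prefix="page"):
    """Return the soffice and pdftoppm argument lists for a DOCX -> JPEG conversion."""
    # Assumption: soffice places the PDF in the current directory, named after the input.
    pdf_name = PurePath(docx_path).with_suffix(".pdf").name

    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", str(docx_path)]

    to_images = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first_page is not None:
        to_images += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        to_images += ["-l", str(last_page)]   # last page to convert
    to_images += [pdf_name, prefix]

    return [to_pdf, to_images]


commands = build_conversion_commands("document.docx", first_page=2, last_page=5)
for command in commands:
    print(" ".join(command))
```

Each list can then be passed to `subprocess.run(command, check=True)` in order; passing lists instead of a shell string avoids quoting problems with file names that contain spaces.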

## Code Style Guidelines
**IMPORTANT**: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements

## Dependencies

Required dependencies (install if not available):

- **pandoc**: `sudo apt-get install pandoc` (for text extraction)
- **docx**: `npm install -g docx` (for creating new documents)
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)
- **defusedxml**: `pip install defusedxml` (for secure XML parsing)
@@ -1,350 +0,0 @@
# DOCX Library Tutorial

Generate .docx files with JavaScript/TypeScript.

**Important: Read this entire document before starting.** Critical formatting rules and common pitfalls are covered throughout - skipping sections may result in corrupted files or rendering issues.

## Setup
Assumes docx is already installed globally
If not installed: `npm install -g docx`

```javascript
const fs = require('fs'); // Needed for fs.writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Media,
        Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
        InternalHyperlink, TableOfContents, HeadingLevel, BorderStyle, WidthType, TabStopType,
        TabStopPosition, UnderlineType, ShadingType, VerticalAlign, SymbolRun, PageNumber,
        FootnoteReferenceRun, Footnote, PageBreak } = require('docx');

// Create & Save
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer)); // Node.js
Packer.toBlob(doc).then(blob => { /* download logic */ }); // Browser
```

## Text & Formatting
```javascript
// IMPORTANT: Never use \n for line breaks - always use separate Paragraph elements
// ❌ WRONG: new TextRun("Line 1\nLine 2")
// ✅ CORRECT: new Paragraph({ children: [new TextRun("Line 1")] }), new Paragraph({ children: [new TextRun("Line 2")] })

// Basic text with all formatting options
new Paragraph({
  alignment: AlignmentType.CENTER,
  spacing: { before: 200, after: 200 },
  indent: { left: 720, right: 720 },
  children: [
    new TextRun({ text: "Bold", bold: true }),
    new TextRun({ text: "Italic", italics: true }),
    new TextRun({ text: "Underlined", underline: { type: UnderlineType.DOUBLE, color: "FF0000" } }),
    new TextRun({ text: "Colored", color: "FF0000", size: 28, font: "Arial" }), // Arial default
    new TextRun({ text: "Highlighted", highlight: "yellow" }),
    new TextRun({ text: "Strikethrough", strike: true }),
    new TextRun({ text: "x2", superScript: true }),
    new TextRun({ text: "H2O", subScript: true }),
    new TextRun({ text: "SMALL CAPS", smallCaps: true }),
    new SymbolRun({ char: "2022", font: "Symbol" }), // Bullet •
    new SymbolRun({ char: "00A9", font: "Arial" })   // Copyright © - Arial for symbols
  ]
})
```

## Styles & Professional Formatting

```javascript
const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
    paragraphStyles: [
      // Document title style - override built-in Title style
      { id: "Title", name: "Title", basedOn: "Normal",
        run: { size: 56, bold: true, color: "000000", font: "Arial" },
        paragraph: { spacing: { before: 240, after: 120 }, alignment: AlignmentType.CENTER } },
      // IMPORTANT: Override built-in heading styles by using their exact IDs
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 32, bold: true, color: "000000", font: "Arial" }, // 16pt
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // Required for TOC
      { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 28, bold: true, color: "000000", font: "Arial" }, // 14pt
        paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
      // Custom styles use your own IDs
      { id: "myStyle", name: "My Style", basedOn: "Normal",
        run: { size: 28, bold: true, color: "000000" },
        paragraph: { spacing: { after: 120 }, alignment: AlignmentType.CENTER } }
    ],
    characterStyles: [{ id: "myCharStyle", name: "My Char Style",
      run: { color: "FF0000", bold: true, underline: { type: UnderlineType.SINGLE } } }]
  },
  sections: [{
    properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } },
    children: [
      new Paragraph({ heading: HeadingLevel.TITLE, children: [new TextRun("Document Title")] }), // Uses overridden Title style
      new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Heading 1")] }), // Uses overridden Heading1 style
      new Paragraph({ style: "myStyle", children: [new TextRun("Custom paragraph style")] }),
      new Paragraph({ children: [
        new TextRun("Normal with "),
        new TextRun({ text: "custom char style", style: "myCharStyle" })
      ]})
    ]
  }]
});
```

**Professional Font Combinations:**
- **Arial (Headers) + Arial (Body)** - Most universally supported, clean and professional
- **Times New Roman (Headers) + Arial (Body)** - Classic serif headers with modern sans-serif body
- **Georgia (Headers) + Verdana (Body)** - Optimized for screen reading, elegant contrast

**Key Styling Principles:**
- **Override built-in styles**: Use exact IDs like "Heading1", "Heading2", "Heading3" to override Word's built-in heading styles
- **HeadingLevel constants**: `HeadingLevel.HEADING_1` uses "Heading1" style, `HeadingLevel.HEADING_2` uses "Heading2" style, etc.
- **Include outlineLevel**: Set `outlineLevel: 0` for H1, `outlineLevel: 1` for H2, etc. to ensure TOC works correctly
- **Use custom styles** instead of inline formatting for consistency
- **Set a default font** using `styles.default.document.run.font` - Arial is universally supported
- **Establish visual hierarchy** with different font sizes (titles > headers > body)
- **Add proper spacing** with `before` and `after` paragraph spacing
- **Use colors sparingly**: Default to black (000000) and shades of gray for titles and headings (heading 1, heading 2, etc.)
- **Set consistent margins** (1440 = 1 inch is standard)


## Lists (ALWAYS USE PROPER LISTS - NEVER USE UNICODE BULLETS)
```javascript
// Bullets - ALWAYS use the numbering config, NOT unicode symbols
// CRITICAL: Use LevelFormat.BULLET constant, NOT the string "bullet"
const doc = new Document({
  numbering: {
    config: [
      { reference: "bullet-list",
        levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "first-numbered-list",
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "second-numbered-list", // Different reference = restarts at 1
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }
    ]
  },
  sections: [{
    children: [
      // Bullet list items
      new Paragraph({ numbering: { reference: "bullet-list", level: 0 },
        children: [new TextRun("First bullet point")] }),
      new Paragraph({ numbering: { reference: "bullet-list", level: 0 },
        children: [new TextRun("Second bullet point")] }),
      // Numbered list items
      new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 },
        children: [new TextRun("First numbered item")] }),
      new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 },
        children: [new TextRun("Second numbered item")] }),
      // ⚠️ CRITICAL: Different reference = INDEPENDENT list that restarts at 1
      // Same reference = CONTINUES previous numbering
      new Paragraph({ numbering: { reference: "second-numbered-list", level: 0 },
        children: [new TextRun("Starts at 1 again (because different reference)")] })
    ]
  }]
});

// ⚠️ CRITICAL NUMBERING RULE: Each reference creates an INDEPENDENT numbered list
// - Same reference = continues numbering (1, 2, 3... then 4, 5, 6...)
// - Different reference = restarts at 1 (1, 2, 3... then 1, 2, 3...)
// Use unique reference names for each separate numbered section!

// ⚠️ CRITICAL: NEVER use unicode bullets - they create fake lists that don't work properly
// new TextRun("• Item")                    // WRONG
// new SymbolRun({ char: "2022" })          // WRONG
// ✅ ALWAYS use numbering config with LevelFormat.BULLET for real Word lists
```

## Tables
```javascript
// Complete table with margins, borders, headers, and bullet points
const tableBorder = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const cellBorders = { top: tableBorder, bottom: tableBorder, left: tableBorder, right: tableBorder };

new Table({
  columnWidths: [4680, 4680], // ⚠️ CRITICAL: Set column widths at table level - values in DXA (twentieths of a point)
  margins: { top: 100, bottom: 100, left: 180, right: 180 }, // Set once for all cells
  rows: [
    new TableRow({
      tableHeader: true,
      children: [
        new TableCell({
          borders: cellBorders,
          width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
          // ⚠️ CRITICAL: Always use ShadingType.CLEAR to prevent black backgrounds in Word.
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR },
          verticalAlign: VerticalAlign.CENTER,
          children: [new Paragraph({
            alignment: AlignmentType.CENTER,
            children: [new TextRun({ text: "Header", bold: true, size: 22 })]
          })]
        }),
        new TableCell({
          borders: cellBorders,
          width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR },
          children: [new Paragraph({
            alignment: AlignmentType.CENTER,
            children: [new TextRun({ text: "Bullet Points", bold: true, size: 22 })]
          })]
        })
      ]
    }),
    new TableRow({
      children: [
        new TableCell({
          borders: cellBorders,
          width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
          children: [new Paragraph({ children: [new TextRun("Regular data")] })]
        }),
        new TableCell({
          borders: cellBorders,
          width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
          children: [
            new Paragraph({
              numbering: { reference: "bullet-list", level: 0 },
              children: [new TextRun("First bullet point")]
            }),
            new Paragraph({
              numbering: { reference: "bullet-list", level: 0 },
              children: [new TextRun("Second bullet point")]
            })
          ]
        })
      ]
    })
  ]
})
```

**IMPORTANT: Table Width & Borders**
- Use BOTH `columnWidths: [width1, width2, ...]` array AND `width: { size: X, type: WidthType.DXA }` on each cell
- Values in DXA (twentieths of a point): 1440 = 1 inch, Letter usable width = 9360 DXA (with 1" margins)
- Apply borders to individual `TableCell` elements, NOT the `Table` itself

**Precomputed Column Widths (Letter size with 1" margins = 9360 DXA total):**
- **2 columns:** `columnWidths: [4680, 4680]` (equal width)
- **3 columns:** `columnWidths: [3120, 3120, 3120]` (equal width)
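
The precomputed widths follow from simple arithmetic: Letter paper is 8.5 inches wide, so with 1-inch margins the usable width is 6.5 × 1440 = 9360 DXA, which is then divided by the column count. As a sketch for other page sizes or column counts, the hypothetical helpers below (written in Python only to show the calculation) reproduce those numbers; how leftover DXA are distributed when the total does not divide evenly is an assumption of this example.

```python
DXA_PER_INCH = 1440  # 1 inch = 72 points = 1440 twentieths of a point


def usable_width_dxa(page_width_in=8.5, margin_in=1.0):
    """Usable text width in DXA for a page with equal left and right margins."""
    return round((page_width_in - 2 * margin_in) * DXA_PER_INCH)


def equal_column_widths(total_dxa, columns):
    """Split a total width into integer column widths that sum exactly to the total."""
    base, remainder = divmod(total_dxa, columns)
    # Assumption: spread any leftover DXA over the first columns, one unit each.
    return [base + (1 if index < remainder else 0) for index in range(columns)]


total = usable_width_dxa()
print(total)
print(equal_column_widths(total, 2))
print(equal_column_widths(total, 3))
```

The resulting list is what goes into `columnWidths`, and each entry is also the `size` for the matching cell's `width`.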

## Links & Navigation
```javascript
// TOC (requires headings) - CRITICAL: Use HeadingLevel only, NOT custom styles
// ❌ WRONG: new Paragraph({ heading: HeadingLevel.HEADING_1, style: "customHeader", children: [new TextRun("Title")] })
// ✅ CORRECT: new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] })
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }),

// External link
new Paragraph({
  children: [new ExternalHyperlink({
    children: [new TextRun({ text: "Google", style: "Hyperlink" })],
    link: "https://www.google.com"
  })]
}),

// Internal link & bookmark
new Paragraph({
  children: [new InternalHyperlink({
    children: [new TextRun({ text: "Go to Section", style: "Hyperlink" })],
    anchor: "section1"
  })]
}),
new Paragraph({
  children: [new TextRun("Section Content")],
  bookmark: { id: "section1", name: "section1" }
}),
```

## Images & Media
```javascript
// Basic image with sizing & positioning
// CRITICAL: Always specify 'type' parameter - it's REQUIRED for ImageRun
new Paragraph({
  alignment: AlignmentType.CENTER,
  children: [new ImageRun({
    type: "png", // NEW REQUIREMENT: Must specify image type (png, jpg, jpeg, gif, bmp, svg)
    data: fs.readFileSync("image.png"),
    transformation: { width: 200, height: 150, rotation: 0 }, // rotation in degrees
    altText: { title: "Logo", description: "Company logo", name: "Name" } // IMPORTANT: All three fields are required
  })]
})
```

## Page Breaks
```javascript
// Manual page break
new Paragraph({ children: [new PageBreak()] }),

// Page break before paragraph
new Paragraph({
  pageBreakBefore: true,
  children: [new TextRun("This starts on a new page")]
})

// ⚠️ CRITICAL: NEVER use PageBreak standalone - it will create invalid XML that Word cannot open
// ❌ WRONG: new PageBreak()
// ✅ CORRECT: new Paragraph({ children: [new PageBreak()] })
```

## Headers/Footers & Page Setup
```javascript
const doc = new Document({
  sections: [{
    properties: {
      page: {
        margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }, // 1440 = 1 inch
        size: { orientation: PageOrientation.LANDSCAPE },
        pageNumbers: { start: 1, formatType: "decimal" } // "upperRoman", "lowerRoman", "upperLetter", "lowerLetter"
      }
    },
    headers: {
      default: new Header({ children: [new Paragraph({
        alignment: AlignmentType.RIGHT,
        children: [new TextRun("Header Text")]
      })] })
    },
    footers: {
      default: new Footer({ children: [new Paragraph({
        alignment: AlignmentType.CENTER,
        children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] }), new TextRun(" of "), new TextRun({ children: [PageNumber.TOTAL_PAGES] })]
      })] })
    },
    children: [/* content */]
  }]
});
```

## Tabs
```javascript
new Paragraph({
  tabStops: [
    { type: TabStopType.LEFT, position: TabStopPosition.MAX / 4 },
    { type: TabStopType.CENTER, position: TabStopPosition.MAX / 2 },
    { type: TabStopType.RIGHT, position: TabStopPosition.MAX * 3 / 4 }
  ],
  children: [new TextRun("Left\tCenter\tRight")]
})
```

## Constants & Quick Reference
- **Underlines:** `SINGLE`, `DOUBLE`, `WAVY`, `DASH`
- **Borders:** `SINGLE`, `DOUBLE`, `DASHED`, `DOTTED`
- **Numbering:** `DECIMAL` (1,2,3), `UPPER_ROMAN` (I,II,III), `LOWER_LETTER` (a,b,c)
- **Tabs:** `LEFT`, `CENTER`, `RIGHT`, `DECIMAL`
- **Symbols:** `"2022"` (•), `"00A9"` (©), `"00AE"` (®), `"2122"` (™), `"00B0"` (°), `"F070"` (✓), `"F0FC"` (✗)

## Critical Issues & Common Mistakes
- **CRITICAL: PageBreak must ALWAYS be inside a Paragraph** - standalone PageBreak creates invalid XML that Word cannot open
- **ALWAYS use ShadingType.CLEAR for table cell shading** - Never use ShadingType.SOLID (causes black background).
- Measurements in DXA (1440 = 1 inch) | Each table cell needs ≥1 Paragraph | TOC requires HeadingLevel styles only
- **ALWAYS use custom styles** with Arial font for professional appearance and proper visual hierarchy
- **ALWAYS set a default font** using `styles.default.document.run.font` - Arial recommended
- **ALWAYS use columnWidths array for tables** + individual cell widths for compatibility
- **NEVER use unicode symbols for bullets** - always use proper numbering configuration with `LevelFormat.BULLET` constant (NOT the string "bullet")
- **NEVER use \n for line breaks anywhere** - always use separate Paragraph elements for each line
- **ALWAYS use TextRun objects within Paragraph children** - never use text property directly on Paragraph
- **CRITICAL for images**: ImageRun REQUIRES `type` parameter - always specify "png", "jpg", "jpeg", "gif", "bmp", or "svg"
- **CRITICAL for bullets**: Must use `LevelFormat.BULLET` constant, not string "bullet", and include `text: "•"` for the bullet character
- **CRITICAL for numbering**: Each numbering reference creates an INDEPENDENT list. Same reference = continues numbering (1,2,3 then 4,5,6). Different reference = restarts at 1 (1,2,3 then 1,2,3). Use unique reference names for each separate numbered section!
- **CRITICAL for TOC**: When using TableOfContents, headings must use HeadingLevel ONLY - do NOT add custom styles to heading paragraphs or TOC will break
- **Tables**: Set `columnWidths` array + individual cell widths, apply borders to cells not table
- **Set table margins at TABLE level** for consistent cell padding (avoids repetition per cell)
@@ -1,610 +0,0 @@
|
||||
# Office Open XML Technical Reference
|
||||
|
||||
**Important: Read this entire document before starting.** This document covers:
|
||||
- [Technical Guidelines](#technical-guidelines) - Schema compliance rules and validation requirements
|
||||
- [Document Content Patterns](#document-content-patterns) - XML patterns for headings, lists, tables, formatting, etc.
|
||||
- [Document Library (Python)](#document-library-python) - Recommended approach for OOXML manipulation with automatic infrastructure setup
|
||||
- [Tracked Changes (Redlining)](#tracked-changes-redlining) - XML patterns for implementing tracked changes
|
||||
|
||||
## Technical Guidelines
|
||||
|
||||
### Schema Compliance
|
||||
- **Element ordering in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`
|
||||
- **Whitespace**: Add `xml:space='preserve'` to `<w:t>` elements with leading/trailing spaces
|
||||
- **Unicode**: Escape characters in ASCII content: `"` becomes `“`
|
||||
- **Character encoding reference**: Curly quotes `""` become `“”`, apostrophe `'` becomes `’`, em-dash `—` becomes `—`
|
||||
- **Tracked changes**: Use `<w:del>` and `<w:ins>` tags with `w:author="Scientific-Writer"` outside `<w:r>` elements
|
||||
- **Critical**: `<w:ins>` closes with `</w:ins>`, `<w:del>` closes with `</w:del>` - never mix
|
||||
- **RSIDs must be 8-digit hex**: Use values like `00AB1234` (only 0-9, A-F characters)
|
||||
- **trackRevisions placement**: Add `<w:trackRevisions/>` after `<w:proofState>` in settings.xml
|
||||
- **Images**: Add to `word/media/`, reference in `document.xml`, set dimensions to prevent overflow
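
The RSID rule in the list above can be checked mechanically. A minimal sketch, assuming the hypothetical helper names `new_rsid` and `is_valid_rsid` (neither is part of the skill's scripts); it accepts uppercase hex only, mirroring the rule as stated:

```python
import random
import re
from typing import Optional

# Hypothetical helpers for illustration; not part of the skill's scripts.
_RSID_PATTERN = re.compile(r"[0-9A-F]{8}")


def new_rsid(rng: Optional[random.Random] = None) -> str:
    """Return a random 8-digit uppercase hex string usable as an RSID."""
    rng = rng or random.Random()
    return "".join(rng.choices("0123456789ABCDEF", k=8))


def is_valid_rsid(value: str) -> bool:
    """True only for exactly eight characters drawn from 0-9 and A-F."""
    return _RSID_PATTERN.fullmatch(value) is not None
```

Running `is_valid_rsid` over every `w:rsid*` attribute you write by hand is a cheap way to catch values that contain letters outside A-F.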
|
||||
|
||||
## Document Content Patterns
|
||||
|
||||
### Basic Structure
|
||||
```xml
|
||||
<w:p>
|
||||
<w:r><w:t>Text content</w:t></w:r>
|
||||
</w:p>
|
||||
```
|
||||
|
||||
### Headings and Styles
|
||||
```xml
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:pStyle w:val="Title"/>
|
||||
<w:jc w:val="center"/>
|
||||
</w:pPr>
|
||||
<w:r><w:t>Document Title</w:t></w:r>
|
||||
</w:p>
|
||||
|
||||
<w:p>
|
||||
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
|
||||
<w:r><w:t>Section Heading</w:t></w:r>
|
||||
</w:p>
|
||||
```
|
||||
|
||||
### Text Formatting
|
||||
```xml
|
||||
<!-- Bold -->
|
||||
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Bold</w:t></w:r>
|
||||
<!-- Italic -->
|
||||
<w:r><w:rPr><w:i/><w:iCs/></w:rPr><w:t>Italic</w:t></w:r>
|
||||
<!-- Underline -->
|
||||
<w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>Underlined</w:t></w:r>
|
||||
<!-- Highlight -->
|
||||
<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>Highlighted</w:t></w:r>
|
||||
```
|
||||
|
||||
### Lists
|
||||
```xml
|
||||
<!-- Numbered list -->
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:pStyle w:val="ListParagraph"/>
|
||||
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
|
||||
<w:spacing w:before="240"/>
|
||||
</w:pPr>
|
||||
<w:r><w:t>First item</w:t></w:r>
|
||||
</w:p>
|
||||
|
||||
<!-- Restart numbered list at 1 - use different numId -->
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:pStyle w:val="ListParagraph"/>
|
||||
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
|
||||
<w:spacing w:before="240"/>
|
||||
</w:pPr>
|
||||
<w:r><w:t>New list item 1</w:t></w:r>
|
||||
</w:p>
|
||||
|
||||
<!-- Bullet list (level 2) -->
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:pStyle w:val="ListParagraph"/>
|
||||
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="1"/></w:numPr>
|
||||
<w:spacing w:before="240"/>
|
||||
<w:ind w:left="900"/>
|
||||
</w:pPr>
|
||||
<w:r><w:t>Bullet item</w:t></w:r>
|
||||
</w:p>
|
||||
```
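
When list items are generated from data, the pattern above can be produced by a small builder. This is a sketch only: `list_paragraph` is a hypothetical helper, and it assumes every `num_id` you pass is already defined in `word/numbering.xml`. It keeps `<w:pStyle>` before `<w:numPr>` as required by the element ordering rule.

```python
from xml.sax.saxutils import escape


def list_paragraph(text: str, num_id: int, level: int = 0) -> str:
    """Build a <w:p> list item with pPr children in pStyle, numPr order."""
    # escape() converts &, < and > so arbitrary text stays well-formed XML
    return (
        "<w:p><w:pPr>"
        '<w:pStyle w:val="ListParagraph"/>'
        f'<w:numPr><w:ilvl w:val="{level}"/><w:numId w:val="{num_id}"/></w:numPr>'
        "</w:pPr>"
        f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>'
    )


# Items that share num_id=1 belong to one list; following the pattern above,
# a different num_id is used for the list that should start again at 1.
first_list = [list_paragraph(t, num_id=1) for t in ("Alpha", "Beta")]
second_list = [list_paragraph(t, num_id=2) for t in ("Gamma",)]
```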
|
||||
|
||||
### Tables
|
||||
```xml
|
||||
<w:tbl>
|
||||
<w:tblPr>
|
||||
<w:tblStyle w:val="TableGrid"/>
|
||||
<w:tblW w:w="0" w:type="auto"/>
|
||||
</w:tblPr>
|
||||
<w:tblGrid>
|
||||
<w:gridCol w:w="4675"/><w:gridCol w:w="4675"/>
|
||||
</w:tblGrid>
|
||||
<w:tr>
|
||||
<w:tc>
|
||||
<w:tcPr><w:tcW w:w="4675" w:type="dxa"/></w:tcPr>
|
||||
<w:p><w:r><w:t>Cell 1</w:t></w:r></w:p>
|
||||
</w:tc>
|
||||
<w:tc>
|
||||
<w:tcPr><w:tcW w:w="4675" w:type="dxa"/></w:tcPr>
|
||||
<w:p><w:r><w:t>Cell 2</w:t></w:r></w:p>
|
||||
</w:tc>
|
||||
</w:tr>
|
||||
</w:tbl>
|
||||
```
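
The widths in the table pattern are in dxa (twentieths of a point, 1440 per inch), and the `<w:gridCol>` values should match the `<w:tcW>` values of the cells. A minimal sketch for splitting a total width evenly; `equal_column_widths` and `table_grid` are hypothetical names used only for illustration:

```python
DXA_PER_INCH = 1440  # 1 dxa = 1/20 pt, 72 pt per inch


def equal_column_widths(total_dxa: int, columns: int) -> list[int]:
    """Split total_dxa into integer widths that sum exactly to total_dxa."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    base, remainder = divmod(total_dxa, columns)
    # Hand the leftover dxa units to the first columns, one each
    return [base + (1 if i < remainder else 0) for i in range(columns)]


def table_grid(widths: list[int]) -> str:
    """Render the <w:tblGrid> element for the given column widths."""
    cols = "".join(f'<w:gridCol w:w="{w}"/>' for w in widths)
    return f"<w:tblGrid>{cols}</w:tblGrid>"
```

For the two-column example above, `equal_column_widths(9350, 2)` yields the same 4675/4675 split; reuse the returned values for each cell's `<w:tcW w:w="..." w:type="dxa"/>`.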
|
||||
|
||||
### Layout
|
||||
```xml
|
||||
<!-- Page break before new section (common pattern) -->
|
||||
<w:p>
|
||||
<w:r>
|
||||
<w:br w:type="page"/>
|
||||
</w:r>
|
||||
</w:p>
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:pStyle w:val="Heading1"/>
|
||||
</w:pPr>
|
||||
<w:r>
|
||||
<w:t>New Section Title</w:t>
|
||||
</w:r>
|
||||
</w:p>
|
||||
|
||||
<!-- Centered paragraph -->
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:spacing w:before="240" w:after="0"/>
|
||||
<w:jc w:val="center"/>
|
||||
</w:pPr>
|
||||
<w:r><w:t>Centered text</w:t></w:r>
|
||||
</w:p>
|
||||
|
||||
<!-- Font change - paragraph level (applies to all runs) -->
|
||||
<w:p>
|
||||
<w:pPr>
|
||||
<w:rPr><w:rFonts w:ascii="Courier New" w:hAnsi="Courier New"/></w:rPr>
|
||||
</w:pPr>
|
||||
<w:r><w:t>Monospace text</w:t></w:r>
|
||||
</w:p>
|
||||
|
||||
<!-- Font change - run level (specific to this text) -->
|
||||
<w:p>
|
||||
<w:r>
|
||||
<w:rPr><w:rFonts w:ascii="Courier New" w:hAnsi="Courier New"/></w:rPr>
|
||||
<w:t>This text is Courier New</w:t>
|
||||
</w:r>
|
||||
<w:r><w:t> and this text uses default font</w:t></w:r>
|
||||
</w:p>
|
||||
```
|
||||
|
||||
## File Updates
|
||||
|
||||
When adding content, update these files:
|
||||
|
||||
**`word/_rels/document.xml.rels`:**
|
||||
```xml
|
||||
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
|
||||
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
|
||||
```
|
||||
|
||||
**`[Content_Types].xml`:**
|
||||
```xml
|
||||
<Default Extension="png" ContentType="image/png"/>
|
||||
<Override PartName="/word/numbering.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
|
||||
```
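
Each new relationship needs an `Id` that is not already in use. The sketch below shows one way to pick the next free `rIdN` from the contents of a `.rels` file. It is a standalone illustration with a hypothetical function name, not the library's own `get_next_rid()`, and it uses the standard-library parser for self-containment (the skill's scripts use `defusedxml`, which is the safer choice for untrusted files).

```python
import re
import xml.dom.minidom


def next_relationship_id(rels_xml: str) -> str:
    """Return 'rIdN' where N is one more than the highest numeric rId in use."""
    dom = xml.dom.minidom.parseString(rels_xml)
    used = []
    for rel in dom.getElementsByTagName("Relationship"):
        found = re.fullmatch(r"rId(\d+)", rel.getAttribute("Id"))
        if found:
            used.append(int(found.group(1)))
    # default=0 covers a rels file with no numeric ids yet
    return f"rId{max(used, default=0) + 1}"
```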
|
||||
|
||||
### Images
|
||||
**CRITICAL**: Calculate dimensions to prevent page overflow and maintain aspect ratio.
|
||||
|
||||
```xml
|
||||
<!-- Minimal required structure -->
|
||||
<w:p>
|
||||
<w:r>
|
||||
<w:drawing>
|
||||
<wp:inline>
|
||||
<wp:extent cx="2743200" cy="1828800"/>
|
||||
<wp:docPr id="1" name="Picture 1"/>
|
||||
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
|
||||
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
|
||||
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
|
||||
<pic:nvPicPr>
|
||||
<pic:cNvPr id="0" name="image1.png"/>
|
||||
<pic:cNvPicPr/>
|
||||
</pic:nvPicPr>
|
||||
<pic:blipFill>
|
||||
<a:blip r:embed="rId5"/>
|
||||
<!-- Add for stretch fill with aspect ratio preservation -->
|
||||
<a:stretch>
|
||||
<a:fillRect/>
|
||||
</a:stretch>
|
||||
</pic:blipFill>
|
||||
<pic:spPr>
|
||||
<a:xfrm>
|
||||
<a:ext cx="2743200" cy="1828800"/>
|
||||
</a:xfrm>
|
||||
<a:prstGeom prst="rect"/>
|
||||
</pic:spPr>
|
||||
</pic:pic>
|
||||
</a:graphicData>
|
||||
</a:graphic>
|
||||
</wp:inline>
|
||||
</w:drawing>
|
||||
</w:r>
|
||||
</w:p>
|
||||
```
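
The `cx`/`cy` values in the pattern are EMUs at 914400 per inch, so 2743200 x 1828800 is a 3 in x 2 in image. A minimal sketch of the dimension calculation, assuming a hypothetical helper `fit_to_width` and a 6.5 in usable page width by default:

```python
EMUS_PER_INCH = 914400


def fit_to_width(pixel_width: int, pixel_height: int, max_width_in: float = 6.5) -> tuple[int, int]:
    """Return (cx, cy) in EMUs, scaled to max_width_in with aspect ratio kept."""
    if pixel_width <= 0 or pixel_height <= 0:
        raise ValueError("image dimensions must be positive")
    cx = int(max_width_in * EMUS_PER_INCH)
    # Height follows from the pixel aspect ratio, rounded to a whole EMU
    cy = round(cx * pixel_height / pixel_width)
    return cx, cy
```

Use the same pair for both `<wp:extent>` and `<a:ext>` so the two stay consistent.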
|
||||
|
||||
### Links (Hyperlinks)
|
||||
|
||||
**IMPORTANT**: All hyperlinks (both internal and external) require the Hyperlink style to be defined in styles.xml. Without this style, links will look like regular text instead of blue underlined clickable links.
|
||||
|
||||
**External Links:**
|
||||
```xml
|
||||
<!-- In document.xml -->
|
||||
<w:hyperlink r:id="rId5">
|
||||
<w:r>
|
||||
<w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
|
||||
<w:t>Link Text</w:t>
|
||||
</w:r>
|
||||
</w:hyperlink>
|
||||
|
||||
<!-- In word/_rels/document.xml.rels -->
|
||||
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
|
||||
Target="https://www.example.com/" TargetMode="External"/>
|
||||
```
|
||||
|
||||
**Internal Links:**
|
||||
|
||||
```xml
|
||||
<!-- Link to bookmark -->
|
||||
<w:hyperlink w:anchor="myBookmark">
|
||||
<w:r>
|
||||
<w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
|
||||
<w:t>Link Text</w:t>
|
||||
</w:r>
|
||||
</w:hyperlink>
|
||||
|
||||
<!-- Bookmark target -->
|
||||
<w:bookmarkStart w:id="0" w:name="myBookmark"/>
|
||||
<w:r><w:t>Target content</w:t></w:r>
|
||||
<w:bookmarkEnd w:id="0"/>
|
||||
```
|
||||
|
||||
**Hyperlink Style (required in styles.xml):**
|
||||
```xml
|
||||
<w:style w:type="character" w:styleId="Hyperlink">
|
||||
<w:name w:val="Hyperlink"/>
|
||||
<w:basedOn w:val="DefaultParagraphFont"/>
|
||||
<w:uiPriority w:val="99"/>
|
||||
<w:unhideWhenUsed/>
|
||||
<w:rPr>
|
||||
<w:color w:val="467886" w:themeColor="hyperlink"/>
|
||||
<w:u w:val="single"/>
|
||||
</w:rPr>
|
||||
</w:style>
|
||||
```
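
Before adding hyperlinks it can help to check whether `styles.xml` already defines this style. A minimal sketch, assuming the file uses the usual `w:` prefix and that you pass its contents as a string; `has_hyperlink_style` is a hypothetical helper name:

```python
import xml.dom.minidom


def has_hyperlink_style(styles_xml: str) -> bool:
    """True if styles.xml defines a character style with styleId 'Hyperlink'."""
    dom = xml.dom.minidom.parseString(styles_xml)
    return any(
        style.getAttribute("w:styleId") == "Hyperlink"
        and style.getAttribute("w:type") == "character"
        for style in dom.getElementsByTagName("w:style")
    )
```

If it returns `False`, append the style definition shown above to `<w:styles>` before inserting any `<w:hyperlink>` elements.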
|
||||
|
||||
## Document Library (Python)
|
||||
|
||||
Use the Document class from `scripts/document.py` for all tracked changes and comments. It automatically handles infrastructure setup (people.xml, RSIDs, settings.xml, comment files, relationships, content types). Only use direct XML manipulation for complex scenarios not supported by the library.
|
||||
|
||||
**Working with Unicode and Entities:**
|
||||
- **Searching**: Both entity notation and Unicode characters work - `contains="&#8220;Company"` and `contains="\u201cCompany"` find the same text
|
||||
- **Replacing**: Use either entities (`&#8220;`) or Unicode (`\u201c`) - both work and will be converted appropriately based on the file's encoding (ascii → entities, utf-8 → Unicode)
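
To see what the entity form of a string looks like in an ASCII-encoded file, Python's built-in `xmlcharrefreplace` error handler is enough. The helper name `to_ascii_entities` is hypothetical, and this is only an illustration of the conversion, not the library's internal code:

```python
def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # Note: this does not escape &, < or >; it only rewrites non-ASCII characters
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

For example, `to_ascii_entities("\u201cCompany\u201d")` gives the curly-quoted word with each quote written as a numeric entity, which is the form you will see when reading an ASCII-encoded `document.xml`.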
|
||||
|
||||
### Initialization
|
||||
|
||||
**Find the docx skill root** (directory containing `scripts/` and `ooxml/`):
|
||||
```bash
|
||||
# Search for document.py to locate the skill root
|
||||
# Note: /mnt/skills is used here as an example; check your context for the actual location
|
||||
find /mnt/skills -name "document.py" -path "*/docx/scripts/*" 2>/dev/null | head -1
|
||||
# Example output: /mnt/skills/docx/scripts/document.py
|
||||
# Skill root is: /mnt/skills/docx
|
||||
```
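
The same lookup can be done from Python when a shell is not convenient. A rough sketch with a hypothetical function name; unlike the `find` pattern above, it only accepts `document.py` sitting directly inside `docx/scripts/`:

```python
from pathlib import Path
from typing import Optional


def find_docx_skill_root(search_root: str) -> Optional[Path]:
    """Return the first directory containing docx/scripts/document.py, or None."""
    for candidate in sorted(Path(search_root).rglob("document.py")):
        scripts_dir = candidate.parent
        if scripts_dir.name == "scripts" and scripts_dir.parent.name == "docx":
            return scripts_dir.parent  # the skill root, e.g. .../docx
    return None
```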
|
||||
|
||||
**Run your script with PYTHONPATH** set to the docx skill root:
|
||||
```bash
|
||||
PYTHONPATH=/mnt/skills/docx python your_script.py
|
||||
```
|
||||
|
||||
**In your script**, import from the skill root:
|
||||
```python
|
||||
from scripts.document import Document, DocxXMLEditor
|
||||
|
||||
# Basic initialization (automatically creates temp copy and sets up infrastructure)
|
||||
doc = Document('unpacked')
|
||||
|
||||
# Customize author and initials
|
||||
doc = Document('unpacked', author="John Doe", initials="JD")
|
||||
|
||||
# Enable track revisions mode
|
||||
doc = Document('unpacked', track_revisions=True)
|
||||
|
||||
# Specify custom RSID (auto-generated if not provided)
|
||||
doc = Document('unpacked', rsid="07DC5ECB")
|
||||
```
|
||||
|
||||
### Creating Tracked Changes
|
||||
|
||||
**CRITICAL**: Only mark text that actually changes. Keep ALL unchanged text outside `<w:del>`/`<w:ins>` tags. Marking unchanged text makes edits unprofessional and harder to review.
|
||||
|
||||
**Attribute Handling**: The Document class auto-injects attributes (w:id, w:date, w:rsidR, w:rsidDel, w16du:dateUtc, xml:space) into new elements. When preserving unchanged text from the original document, copy the original `<w:r>` element with its existing attributes to maintain document integrity.
|
||||
|
||||
**Method Selection Guide**:
|
||||
- **Adding your own changes to regular text**: Use `replace_node()` with `<w:del>`/`<w:ins>` tags, or `suggest_deletion()` for removing entire `<w:r>` or `<w:p>` elements
|
||||
- **Partially modifying another author's tracked change**: Use `replace_node()` to nest your changes inside their `<w:ins>`/`<w:del>`
|
||||
- **Completely rejecting another author's insertion**: Use `revert_insertion()` on the `<w:ins>` element (NOT `suggest_deletion()`)
|
||||
- **Completely rejecting another author's deletion**: Use `revert_deletion()` on the `<w:del>` element to restore deleted content using tracked changes
|
||||
|
||||
```python
|
||||
# Minimal edit - change one word: "The report is monthly" → "The report is quarterly"
|
||||
# Original: <w:r w:rsidR="00AB12CD"><w:rPr><w:rFonts w:ascii="Calibri"/></w:rPr><w:t>The report is monthly</w:t></w:r>
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="The report is monthly")
|
||||
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
|
||||
replacement = f'<w:r w:rsidR="00AB12CD">{rpr}<w:t>The report is </w:t></w:r><w:del><w:r>{rpr}<w:delText>monthly</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>quarterly</w:t></w:r></w:ins>'
|
||||
doc["word/document.xml"].replace_node(node, replacement)
|
||||
|
||||
# Minimal edit - change number: "within 30 days" → "within 45 days"
|
||||
# Original: <w:r w:rsidR="00CD7890"><w:rPr><w:rFonts w:ascii="Calibri"/></w:rPr><w:t>within 30 days</w:t></w:r>
node = doc["word/document.xml"].get_node(tag="w:r", contains="within 30 days")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = f'<w:r w:rsidR="00CD7890">{rpr}<w:t>within </w:t></w:r><w:del><w:r>{rpr}<w:delText>30</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>45</w:t></w:r></w:ins><w:r w:rsidR="00CD7890">{rpr}<w:t> days</w:t></w:r>'
|
||||
doc["word/document.xml"].replace_node(node, replacement)
|
||||
|
||||
# Complete replacement - preserve formatting even when replacing all text
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="apple")
|
||||
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
|
||||
replacement = f'<w:del><w:r>{rpr}<w:delText>apple</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>banana orange</w:t></w:r></w:ins>'
|
||||
doc["word/document.xml"].replace_node(node, replacement)
|
||||
|
||||
# Insert new content (no attributes needed - auto-injected)
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="existing text")
|
||||
doc["word/document.xml"].insert_after(node, '<w:ins><w:r><w:t>new text</w:t></w:r></w:ins>')
|
||||
|
||||
# Partially delete another author's insertion
|
||||
# Original: <w:ins w:author="Jane Smith" w:date="..."><w:r><w:t>quarterly financial report</w:t></w:r></w:ins>
|
||||
# Goal: Delete only "financial" to make it "quarterly report"
|
||||
node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"})
|
||||
# IMPORTANT: Preserve w:author="Jane Smith" on the outer <w:ins> to maintain authorship
|
||||
replacement = '''<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
|
||||
<w:r><w:t>quarterly </w:t></w:r>
|
||||
<w:del><w:r><w:delText>financial </w:delText></w:r></w:del>
|
||||
<w:r><w:t>report</w:t></w:r>
|
||||
</w:ins>'''
|
||||
doc["word/document.xml"].replace_node(node, replacement)
|
||||
|
||||
# Change part of another author's insertion
|
||||
# Original: <w:ins w:author="Jane Smith"><w:r><w:t>in silence, safe and sound</w:t></w:r></w:ins>
|
||||
# Goal: Change "safe and sound" to "soft and unbound"
|
||||
node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "8"})
|
||||
replacement = '''<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
|
||||
<w:r><w:t>in silence, </w:t></w:r>
|
||||
</w:ins>
|
||||
<w:ins>
|
||||
<w:r><w:t>soft and unbound</w:t></w:r>
|
||||
</w:ins>
|
||||
<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
|
||||
<w:del><w:r><w:delText>safe and sound</w:delText></w:r></w:del>
|
||||
</w:ins>'''
|
||||
doc["word/document.xml"].replace_node(node, replacement)
|
||||
|
||||
# Delete entire run (use only when deleting all content; use replace_node for partial deletions)
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="text to delete")
|
||||
doc["word/document.xml"].suggest_deletion(node)
|
||||
|
||||
# Delete entire paragraph (in-place, handles both regular and numbered list paragraphs)
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph to delete")
|
||||
doc["word/document.xml"].suggest_deletion(para)
|
||||
|
||||
# Add new numbered list item
|
||||
target_para = doc["word/document.xml"].get_node(tag="w:p", contains="existing list item")
|
||||
pPr = tags[0].toxml() if (tags := target_para.getElementsByTagName("w:pPr")) else ""
|
||||
new_item = f'<w:p>{pPr}<w:r><w:t>New item</w:t></w:r></w:p>'
|
||||
tracked_para = DocxXMLEditor.suggest_paragraph(new_item)
|
||||
doc["word/document.xml"].insert_after(target_para, tracked_para)
|
||||
# Optional: add spacing paragraph before content for better visual separation
|
||||
# spacing = DocxXMLEditor.suggest_paragraph('<w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr></w:p>')
|
||||
# doc["word/document.xml"].insert_after(target_para, spacing + tracked_para)
|
||||
```
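
The minimal-edit examples above all follow the same shape: unchanged text, a deletion, an insertion, more unchanged text. A sketch of a string builder for that shape is below. `minimal_tracked_edit` and its parameters are hypothetical; it only produces the replacement string, and it assumes (as described above) that the Document class injects the remaining attributes when the string is passed to `replace_node()`.

```python
from xml.sax.saxutils import escape


def minimal_tracked_edit(rpr: str, before: str, old: str, new: str,
                         after: str = "", unchanged_attrs: str = "") -> str:
    """Return run/del/ins XML that marks only `old` -> `new` as changed."""

    def run(tag: str, text: str, attrs: str = "") -> str:
        return f"<w:r{attrs}>{rpr}<{tag}>{escape(text)}</{tag}></w:r>"

    parts = []
    if before:
        # Unchanged leading text keeps the original run attributes
        parts.append(run("w:t", before, unchanged_attrs))
    parts.append(f"<w:del>{run('w:delText', old)}</w:del>")
    parts.append(f"<w:ins>{run('w:t', new)}</w:ins>")
    if after:
        # Unchanged trailing text keeps the original run attributes
        parts.append(run("w:t", after, unchanged_attrs))
    return "".join(parts)
```

Pass the original run's attributes (for example `' w:rsidR="00AB12CD"'`, with the leading space) as `unchanged_attrs`, and the extracted `<w:rPr>` string as `rpr`, so formatting and provenance are preserved on the text that did not change.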
|
||||
|
||||
### Adding Comments
|
||||
|
||||
```python
|
||||
# Add comment spanning two existing tracked changes
|
||||
# Note: w:id is auto-generated. Only search by w:id if you know it from XML inspection
|
||||
start_node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"})
|
||||
end_node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "2"})
|
||||
doc.add_comment(start=start_node, end=end_node, text="Explanation of this change")
|
||||
|
||||
# Add comment on a paragraph
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
|
||||
doc.add_comment(start=para, end=para, text="Comment on this paragraph")
|
||||
|
||||
# Add comment on newly created tracked change
|
||||
# First create the tracked change
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="old")
|
||||
new_nodes = doc["word/document.xml"].replace_node(
|
||||
node,
|
||||
'<w:del><w:r><w:delText>old</w:delText></w:r></w:del><w:ins><w:r><w:t>new</w:t></w:r></w:ins>'
|
||||
)
|
||||
# Then add comment on the newly created elements
|
||||
# new_nodes[0] is the <w:del>, new_nodes[1] is the <w:ins>
|
||||
doc.add_comment(start=new_nodes[0], end=new_nodes[1], text="Changed old to new per requirements")
|
||||
|
||||
# Reply to existing comment
|
||||
doc.reply_to_comment(parent_comment_id=0, text="I agree with this change")
|
||||
```
|
||||
|
||||
### Rejecting Tracked Changes
|
||||
|
||||
**IMPORTANT**: Use `revert_insertion()` to reject insertions and `revert_deletion()` to restore deletions using tracked changes. Use `suggest_deletion()` only for regular unmarked content.
|
||||
|
||||
```python
|
||||
# Reject insertion (wraps it in deletion)
|
||||
# Use this when another author inserted text that you want to delete
|
||||
ins = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"})
|
||||
nodes = doc["word/document.xml"].revert_insertion(ins) # Returns [ins]
|
||||
|
||||
# Reject deletion (creates insertion to restore deleted content)
|
||||
# Use this when another author deleted text that you want to restore
|
||||
del_elem = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "3"})
|
||||
nodes = doc["word/document.xml"].revert_deletion(del_elem) # Returns [del_elem, new_ins]
|
||||
|
||||
# Reject all insertions in a paragraph
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
|
||||
nodes = doc["word/document.xml"].revert_insertion(para) # Returns [para]
|
||||
|
||||
# Reject all deletions in a paragraph
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
|
||||
nodes = doc["word/document.xml"].revert_deletion(para) # Returns [para]
|
||||
```
|
||||
|
||||
### Inserting Images
|
||||
|
||||
**CRITICAL**: The Document class works with a temporary copy at `doc.unpacked_path`. Always copy images to this temp directory, not the original unpacked folder.
|
||||
|
||||
```python
|
||||
from PIL import Image
|
||||
import shutil, os
|
||||
|
||||
# Initialize document first
|
||||
doc = Document('unpacked')
|
||||
|
||||
# Copy image and calculate full-width dimensions with aspect ratio
|
||||
media_dir = os.path.join(doc.unpacked_path, 'word/media')
|
||||
os.makedirs(media_dir, exist_ok=True)
|
||||
shutil.copy('image.png', os.path.join(media_dir, 'image1.png'))
|
||||
img = Image.open(os.path.join(media_dir, 'image1.png'))
|
||||
width_emus = int(6.5 * 914400) # 6.5" usable width, 914400 EMUs/inch
|
||||
height_emus = int(width_emus * img.size[1] / img.size[0])
|
||||
|
||||
# Add relationship and content type
|
||||
rels_editor = doc['word/_rels/document.xml.rels']
|
||||
next_rid = rels_editor.get_next_rid()
|
||||
rels_editor.append_to(rels_editor.dom.documentElement,
|
||||
f'<Relationship Id="{next_rid}" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>')
|
||||
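ct_editor = doc['[Content_Types].xml']
# Add the png Default only if none is declared yet; a duplicate Extension entry makes the package invalid
if not any(d.getAttribute('Extension').lower() == 'png' for d in ct_editor.dom.getElementsByTagName('Default')):
    ct_editor.append_to(ct_editor.dom.documentElement, '<Default Extension="png" ContentType="image/png"/>')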
|
||||
|
||||
# Insert image
|
||||
node = doc["word/document.xml"].get_node(tag="w:p", line_number=100)
|
||||
doc["word/document.xml"].insert_after(node, f'''<w:p>
|
||||
<w:r>
|
||||
<w:drawing>
|
||||
<wp:inline distT="0" distB="0" distL="0" distR="0">
|
||||
<wp:extent cx="{width_emus}" cy="{height_emus}"/>
|
||||
<wp:docPr id="1" name="Picture 1"/>
|
||||
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
|
||||
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
|
||||
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
|
||||
<pic:nvPicPr><pic:cNvPr id="1" name="image1.png"/><pic:cNvPicPr/></pic:nvPicPr>
|
||||
<pic:blipFill><a:blip r:embed="{next_rid}"/><a:stretch><a:fillRect/></a:stretch></pic:blipFill>
|
||||
<pic:spPr><a:xfrm><a:ext cx="{width_emus}" cy="{height_emus}"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom></pic:spPr>
|
||||
</pic:pic>
|
||||
</a:graphicData>
|
||||
</a:graphic>
|
||||
</wp:inline>
|
||||
</w:drawing>
|
||||
</w:r>
|
||||
</w:p>''')
|
||||
```
|
||||
|
||||
### Getting Nodes
|
||||
|
||||
```python
|
||||
# By text content
|
||||
node = doc["word/document.xml"].get_node(tag="w:p", contains="specific text")
|
||||
|
||||
# By line range
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", line_number=range(100, 150))
|
||||
|
||||
# By attributes
|
||||
node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"})
|
||||
|
||||
# By exact line number (must be line number where tag opens)
|
||||
para = doc["word/document.xml"].get_node(tag="w:p", line_number=42)
|
||||
|
||||
# Combine filters
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", line_number=range(40, 60), contains="text")
|
||||
|
||||
# Disambiguate when text appears multiple times - add line_number range
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", contains="Section", line_number=range(2400, 2500))
|
||||
```
|
||||
|
||||
### Saving
|
||||
|
||||
```python
|
||||
# Save with automatic validation (copies back to original directory)
|
||||
doc.save() # Validates by default, raises error if validation fails
|
||||
|
||||
# Save to different location
|
||||
doc.save('modified-unpacked')
|
||||
|
||||
# Skip validation (debugging only - needing this in production indicates XML issues)
|
||||
doc.save(validate=False)
|
||||
```
|
||||
|
||||
### Direct DOM Manipulation
|
||||
|
||||
For complex scenarios not covered by the library:
|
||||
|
||||
```python
|
||||
# Access any XML file
|
||||
editor = doc["word/document.xml"]
|
||||
editor = doc["word/comments.xml"]
|
||||
|
||||
# Direct DOM access (defusedxml.minidom.Document)
|
||||
node = doc["word/document.xml"].get_node(tag="w:p", line_number=5)
|
||||
parent = node.parentNode
|
||||
parent.removeChild(node)
|
||||
parent.appendChild(node) # Move to end
|
||||
|
||||
# General document manipulation (without tracked changes)
|
||||
old_node = doc["word/document.xml"].get_node(tag="w:p", contains="original text")
|
||||
doc["word/document.xml"].replace_node(old_node, "<w:p><w:r><w:t>replacement text</w:t></w:r></w:p>")
|
||||
|
||||
# Multiple insertions - use return value to maintain order
|
||||
node = doc["word/document.xml"].get_node(tag="w:r", line_number=100)
|
||||
nodes = doc["word/document.xml"].insert_after(node, "<w:r><w:t>A</w:t></w:r>")
|
||||
nodes = doc["word/document.xml"].insert_after(nodes[-1], "<w:r><w:t>B</w:t></w:r>")
|
||||
nodes = doc["word/document.xml"].insert_after(nodes[-1], "<w:r><w:t>C</w:t></w:r>")
|
||||
# Results in: original_node, A, B, C
|
||||
```
|
||||
|
||||
## Tracked Changes (Redlining)
|
||||
|
||||
**Use the Document class above for all tracked changes.** The patterns below are for reference when constructing replacement XML strings.
|
||||
|
||||
### Validation Rules
|
||||
The validator checks that the document text matches the original after reverting Scientific-Writer's changes. This means:
|
||||
- **NEVER modify text inside another author's `<w:ins>` or `<w:del>` tags**
|
||||
- **ALWAYS use nested deletions** to remove another author's insertions
|
||||
- **Every edit must be properly tracked** with `<w:ins>` or `<w:del>` tags
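
To make the rule concrete, the sketch below computes the text a paragraph would have if one author's tracked changes were rejected. It is an illustration of the idea only, not the skill's `RedliningValidator` (whose implementation is not shown here); the function name is hypothetical, and the fragment passed in must declare every namespace prefix it uses.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_after_reverting(fragment: str, author: str) -> str:
    """Return the visible text once `author`'s insertions and deletions are undone."""
    parts = []

    def walk(elem, restoring: bool) -> None:
        for child in elem:
            who = child.get(f"{{{W}}}author")
            if child.tag == f"{{{W}}}ins" and who == author:
                continue  # the author's insertion disappears
            if child.tag == f"{{{W}}}del":
                if who == author:
                    walk(child, True)  # the author's deletion is restored
                continue  # other authors' deletions stay deleted
            if child.tag == f"{{{W}}}t" or (child.tag == f"{{{W}}}delText" and restoring):
                parts.append(child.text or "")
            else:
                walk(child, restoring)

    walk(ET.fromstring(fragment), False)
    return "".join(parts)
```

If this text differs from the original document's text, some edit was made without being wrapped in that author's `<w:ins>` or `<w:del>`.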
|
||||
|
||||
### Tracked Change Patterns
|
||||
|
||||
**CRITICAL RULES**:
|
||||
1. Never modify the content inside another author's tracked changes. Always use nested deletions.
|
||||
2. **XML Structure**: Always place `<w:del>` and `<w:ins>` at paragraph level containing complete `<w:r>` elements. Never nest inside `<w:r>` elements - this creates invalid XML that breaks document processing.
|
||||
|
||||
**Text Insertion:**
|
||||
```xml
|
||||
<w:ins w:id="1" w:author="Scientific-Writer" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
|
||||
<w:r w:rsidR="00792858">
|
||||
<w:t>inserted text</w:t>
|
||||
</w:r>
|
||||
</w:ins>
|
||||
```
|
||||
|
||||
**Text Deletion:**
|
||||
```xml
|
||||
<w:del w:id="2" w:author="Scientific-Writer" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
|
||||
<w:r w:rsidDel="00792858">
|
||||
<w:delText>deleted text</w:delText>
|
||||
</w:r>
|
||||
</w:del>
|
||||
```
|
||||
|
||||
**Deleting Another Author's Insertion (MUST use nested structure):**
|
||||
```xml
|
||||
<!-- Nest deletion inside the original insertion -->
|
||||
<w:ins w:author="Jane Smith" w:id="16">
|
||||
<w:del w:author="Scientific-Writer" w:id="40">
|
||||
<w:r><w:delText>monthly</w:delText></w:r>
|
||||
</w:del>
|
||||
</w:ins>
|
||||
<w:ins w:author="Scientific-Writer" w:id="41">
|
||||
<w:r><w:t>weekly</w:t></w:r>
|
||||
</w:ins>
|
||||
```
|
||||
|
||||
**Restoring Another Author's Deletion:**
|
||||
```xml
|
||||
<!-- Leave their deletion unchanged, add new insertion after it -->
|
||||
<w:del w:author="Jane Smith" w:id="50">
|
||||
<w:r><w:delText>within 30 days</w:delText></w:r>
|
||||
</w:del>
|
||||
<w:ins w:author="Scientific-Writer" w:id="51">
|
||||
<w:r><w:t>within 30 days</w:t></w:r>
|
||||
</w:ins>
|
||||
```
|
||||
@@ -1,159 +0,0 @@
|
||||
#!/usr/bin/env python3
"""
Tool to pack a directory into a .docx, .pptx, or .xlsx file with XML formatting undone.

Example usage:
    python pack.py <input_directory> <office_file> [--force]
"""

import argparse
import shutil
import subprocess
import sys
import tempfile
import defusedxml.minidom
import zipfile
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description="Pack a directory into an Office file")
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument("--force", action="store_true", help="Skip validation")
    args = parser.parse_args()

    try:
        success = pack_document(
            args.input_directory, args.output_file, validate=not args.force
        )

        # Show warning if validation was skipped
        if args.force:
            print("Warning: Skipped validation, file may be corrupt", file=sys.stderr)
        # Exit with error if validation failed
        elif not success:
            print("Contents would produce a corrupt file.", file=sys.stderr)
            print("Please validate XML before repacking.", file=sys.stderr)
            print("Use --force to skip validation and pack anyway.", file=sys.stderr)
            sys.exit(1)

    except ValueError as e:
        sys.exit(f"Error: {e}")


def pack_document(input_dir, output_file, validate=False):
    """Pack a directory into an Office file (.docx/.pptx/.xlsx).

    Args:
        input_dir: Path to unpacked Office document directory
        output_file: Path to output Office file
        validate: If True, validates with soffice (default: False)

    Returns:
        bool: True if successful, False if validation failed
    """
    input_dir = Path(input_dir)
    output_file = Path(output_file)

    if not input_dir.is_dir():
        raise ValueError(f"{input_dir} is not a directory")
    if output_file.suffix.lower() not in {".docx", ".pptx", ".xlsx"}:
        raise ValueError(f"{output_file} must be a .docx, .pptx, or .xlsx file")

    # Work in temporary directory to avoid modifying original
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        # Process XML files to remove pretty-printing whitespace
        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                condense_xml(xml_file)

        # Create final Office file as zip archive
        output_file.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

        # Validate if requested
        if validate:
            if not validate_document(output_file):
                output_file.unlink()  # Delete the corrupt file
                return False

    return True


def validate_document(doc_path):
    """Validate document by converting to HTML with soffice."""
    # Determine the correct filter based on file extension
    match doc_path.suffix.lower():
        case ".docx":
            filter_name = "html:HTML"
        case ".pptx":
            filter_name = "html:impress_html_Export"
        case ".xlsx":
            filter_name = "html:HTML (StarCalc)"

    with tempfile.TemporaryDirectory() as temp_dir:
        try:
            result = subprocess.run(
                [
                    "soffice",
                    "--headless",
                    "--convert-to",
                    filter_name,
                    "--outdir",
                    temp_dir,
                    str(doc_path),
                ],
                capture_output=True,
                timeout=10,
                text=True,
            )
            if not (Path(temp_dir) / f"{doc_path.stem}.html").exists():
                error_msg = result.stderr.strip() or "Document validation failed"
                print(f"Validation error: {error_msg}", file=sys.stderr)
                return False
            return True
        except FileNotFoundError:
            print("Warning: soffice not found. Skipping validation.", file=sys.stderr)
            return True
        except subprocess.TimeoutExpired:
            print("Validation error: Timeout during conversion", file=sys.stderr)
            return False
        except Exception as e:
            print(f"Validation error: {e}", file=sys.stderr)
            return False


def condense_xml(xml_file):
    """Strip unnecessary whitespace and remove comments."""
    with open(xml_file, "r", encoding="utf-8") as f:
        dom = defusedxml.minidom.parse(f)

    # Process each element to remove whitespace and comments
    for element in dom.getElementsByTagName("*"):
        # Skip w:t elements and their processing
        if element.tagName.endswith(":t"):
            continue

        # Remove whitespace-only text nodes and comment nodes
        for child in list(element.childNodes):
            if (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue
                and child.nodeValue.strip() == ""
            ) or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)

    # Write back the condensed XML
    with open(xml_file, "wb") as f:
        f.write(dom.toxml(encoding="UTF-8"))


if __name__ == "__main__":
    main()
|
||||
@@ -1,29 +0,0 @@
|
||||
#!/usr/bin/env python3
"""Unpack and format XML contents of Office files (.docx, .pptx, .xlsx)"""

import random
import sys
import defusedxml.minidom
import zipfile
from pathlib import Path

# Get command line arguments
assert len(sys.argv) == 3, "Usage: python unpack.py <office_file> <output_dir>"
input_file, output_dir = sys.argv[1], sys.argv[2]

# Extract and format
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
zipfile.ZipFile(input_file).extractall(output_path)

# Pretty print all XML files
xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
for xml_file in xml_files:
    content = xml_file.read_text(encoding="utf-8")
    dom = defusedxml.minidom.parseString(content)
    xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="ascii"))

# For .docx files, suggest an RSID for tracked changes
if input_file.endswith(".docx"):
    suggested_rsid = "".join(random.choices("0123456789ABCDEF", k=8))
    print(f"Suggested RSID for edit session: {suggested_rsid}")
|
||||
@@ -1,69 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Command line tool to validate Office document XML files against XSD schemas and tracked changes.
|
||||
|
||||
Usage:
|
||||
python validate.py <dir> --original <original_file>
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
from validation import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Validate Office document XML files")
|
||||
parser.add_argument(
|
||||
"unpacked_dir",
|
||||
help="Path to unpacked Office document directory",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--original",
|
||||
required=True,
|
||||
help="Path to original file (.docx/.pptx/.xlsx)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-v",
|
||||
"--verbose",
|
||||
action="store_true",
|
||||
help="Enable verbose output",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate paths
|
||||
unpacked_dir = Path(args.unpacked_dir)
|
||||
original_file = Path(args.original)
|
||||
file_extension = original_file.suffix.lower()
|
||||
assert unpacked_dir.is_dir(), f"Error: {unpacked_dir} is not a directory"
|
||||
assert original_file.is_file(), f"Error: {original_file} is not a file"
|
||||
assert file_extension in [".docx", ".pptx", ".xlsx"], (
|
||||
f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
|
||||
)
|
||||
|
||||
# Run validations
|
||||
match file_extension:
|
||||
case ".docx":
|
||||
validators = [DOCXSchemaValidator, RedliningValidator]
|
||||
case ".pptx":
|
||||
validators = [PPTXSchemaValidator]
|
||||
case _:
|
||||
print(f"Error: Validation not supported for file type {file_extension}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Run validators
|
||||
success = True
|
||||
for V in validators:
|
||||
validator = V(unpacked_dir, original_file, verbose=args.verbose)
|
||||
if not validator.validate():
|
||||
success = False
|
||||
|
||||
if success:
|
||||
print("All validations PASSED!")
|
||||
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -1,274 +0,0 @@
|
||||
"""
|
||||
Validator for Word document XML files against XSD schemas.
|
||||
"""
|
||||
|
||||
import re
|
||||
import tempfile
|
||||
import zipfile
|
||||
|
||||
import lxml.etree
|
||||
|
||||
from .base import BaseSchemaValidator
|
||||
|
||||
|
||||
class DOCXSchemaValidator(BaseSchemaValidator):
|
||||
"""Validator for Word document XML files against XSD schemas."""
|
||||
|
||||
# Word-specific namespace
|
||||
WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
|
||||
|
||||
# Word-specific element to relationship type mappings
|
||||
# Start with empty mapping - add specific cases as we discover them
|
||||
ELEMENT_RELATIONSHIP_TYPES = {}
|
||||
|
||||
def validate(self):
|
||||
"""Run all validation checks and return True if all pass."""
|
||||
# Test 0: XML well-formedness
|
||||
if not self.validate_xml():
|
||||
return False
|
||||
|
||||
# Test 1: Namespace declarations
|
||||
all_valid = True
|
||||
if not self.validate_namespaces():
|
||||
all_valid = False
|
||||
|
||||
# Test 2: Unique IDs
|
||||
if not self.validate_unique_ids():
|
||||
all_valid = False
|
||||
|
||||
# Test 3: Relationship and file reference validation
|
||||
if not self.validate_file_references():
|
||||
all_valid = False
|
||||
|
||||
# Test 4: Content type declarations
|
||||
if not self.validate_content_types():
|
||||
all_valid = False
|
||||
|
||||
# Test 5: XSD schema validation
|
||||
if not self.validate_against_xsd():
|
||||
all_valid = False
|
||||
|
||||
# Test 6: Whitespace preservation
|
||||
if not self.validate_whitespace_preservation():
|
||||
all_valid = False
|
||||
|
||||
# Test 7: Deletion validation
|
||||
if not self.validate_deletions():
|
||||
all_valid = False
|
||||
|
||||
# Test 8: Insertion validation
|
||||
if not self.validate_insertions():
|
||||
all_valid = False
|
||||
|
||||
# Test 9: Relationship ID reference validation
|
||||
if not self.validate_all_relationship_ids():
|
||||
all_valid = False
|
||||
|
||||
# Count and compare paragraphs
|
||||
self.compare_paragraph_counts()
|
||||
|
||||
return all_valid
|
||||
|
||||
def validate_whitespace_preservation(self):
|
||||
"""
|
||||
Validate that w:t elements with whitespace have xml:space='preserve'.
|
||||
"""
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
# Only check document.xml files
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
|
||||
# Find all w:t elements
|
||||
for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"):
|
||||
if elem.text:
|
||||
text = elem.text
|
||||
# Check if text starts or ends with whitespace
|
||||
if re.match(r"\s", text) or re.search(r"\s\Z", text):
|
||||
# Check if xml:space="preserve" attribute exists
|
||||
xml_space_attr = f"{{{self.XML_NAMESPACE}}}space"
|
||||
if (
|
||||
xml_space_attr not in elem.attrib
|
||||
or elem.attrib[xml_space_attr] != "preserve"
|
||||
):
|
||||
# Show a preview of the text
|
||||
text_preview = (
|
||||
repr(text)[:50] + "..."
|
||||
if len(repr(text)) > 50
|
||||
else repr(text)
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}"
|
||||
)
|
||||
|
||||
except (lxml.etree.XMLSyntaxError, Exception) as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} whitespace preservation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - All whitespace is properly preserved")
|
||||
return True
|
||||
|
||||
def validate_deletions(self):
|
||||
"""
|
||||
Validate that w:t elements are not within w:del elements.
|
||||
For some reason, XSD validation does not catch this, so we do it manually.
|
||||
"""
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
# Only check document.xml files
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
|
||||
# Find all w:t elements that are descendants of w:del elements
|
||||
namespaces = {"w": self.WORD_2006_NAMESPACE}
|
||||
xpath_expression = ".//w:del//w:t"
|
||||
problematic_t_elements = root.xpath(
|
||||
xpath_expression, namespaces=namespaces
|
||||
)
|
||||
for t_elem in problematic_t_elements:
|
||||
if t_elem.text:
|
||||
# Show a preview of the text
|
||||
text_preview = (
|
||||
repr(t_elem.text)[:50] + "..."
|
||||
if len(repr(t_elem.text)) > 50
|
||||
else repr(t_elem.text)
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
|
||||
)
|
||||
|
||||
except (lxml.etree.XMLSyntaxError, Exception) as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} deletion validation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - No w:t elements found within w:del elements")
|
||||
return True
|
||||
|
||||
def count_paragraphs_in_unpacked(self):
|
||||
"""Count the number of paragraphs in the unpacked document."""
|
||||
count = 0
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
# Only check document.xml files
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
# Count all w:p elements
|
||||
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
|
||||
count = len(paragraphs)
|
||||
except Exception as e:
|
||||
print(f"Error counting paragraphs in unpacked document: {e}")
|
||||
|
||||
return count
|
||||
|
||||
def count_paragraphs_in_original(self):
|
||||
"""Count the number of paragraphs in the original docx file."""
|
||||
count = 0
|
||||
|
||||
try:
|
||||
# Create temporary directory to unpack original
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
# Unpack original docx
|
||||
with zipfile.ZipFile(self.original_file, "r") as zip_ref:
|
||||
zip_ref.extractall(temp_dir)
|
||||
|
||||
# Parse document.xml
|
||||
doc_xml_path = temp_dir + "/word/document.xml"
|
||||
root = lxml.etree.parse(doc_xml_path).getroot()
|
||||
|
||||
# Count all w:p elements
|
||||
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
|
||||
count = len(paragraphs)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error counting paragraphs in original document: {e}")
|
||||
|
||||
return count
|
||||
|
||||
def validate_insertions(self):
|
||||
"""
|
||||
Validate that w:delText elements are not within w:ins elements.
|
||||
w:delText is only allowed in w:ins if nested within a w:del.
|
||||
"""
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
namespaces = {"w": self.WORD_2006_NAMESPACE}
|
||||
|
||||
# Find w:delText in w:ins that are NOT within w:del
|
||||
invalid_elements = root.xpath(
|
||||
".//w:ins//w:delText[not(ancestor::w:del)]",
|
||||
namespaces=namespaces
|
||||
)
|
||||
|
||||
for elem in invalid_elements:
|
||||
text_preview = (
|
||||
repr(elem.text or "")[:50] + "..."
|
||||
if len(repr(elem.text or "")) > 50
|
||||
else repr(elem.text or "")
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
|
||||
)
|
||||
|
||||
except (lxml.etree.XMLSyntaxError, Exception) as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} insertion validation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - No w:delText elements within w:ins elements")
|
||||
return True
|
||||
|
||||
def compare_paragraph_counts(self):
|
||||
"""Compare paragraph counts between original and new document."""
|
||||
original_count = self.count_paragraphs_in_original()
|
||||
new_count = self.count_paragraphs_in_unpacked()
|
||||
|
||||
diff = new_count - original_count
|
||||
diff_str = f"+{diff}" if diff > 0 else str(diff)
|
||||
print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise RuntimeError("This module should not be run directly.")
|
||||
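The rule enforced by `validate_whitespace_preservation` above can be illustrated without `lxml`. The sketch below is an assumption-laden stand-in rather than the validator itself: it uses the standard library's `xml.etree.ElementTree`, assumes that `XML_NAMESPACE` in the base class (not shown in this diff) is the standard `http://www.w3.org/XML/1998/namespace`, and the helper name and sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # assumed value of XML_NAMESPACE


def find_unpreserved_whitespace(xml_text: str) -> list[str]:
    """Return the text of w:t elements that need xml:space='preserve' but lack it."""
    root = ET.fromstring(xml_text)
    offenders = []
    for elem in root.iter(f"{{{W_NS}}}t"):
        text = elem.text or ""
        # Only leading or trailing whitespace is at risk of being dropped.
        if text and (text[0].isspace() or text[-1].isspace()):
            if elem.get(f"{{{XML_NS}}}space") != "preserve":
                offenders.append(text)
    return offenders


SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t xml:space="preserve">ok </w:t></w:r>'
    "<w:r><w:t> missing attribute</w:t></w:r>"
    "<w:r><w:t>no edge whitespace</w:t></w:r>"
    "</w:p>"
)
offenders = find_unpreserved_whitespace(SAMPLE)
print(offenders)
```

Only the second run is reported: the first declares `xml:space="preserve"`, and the third has no leading or trailing whitespace to lose.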
@@ -1 +0,0 @@
|
||||
# Make scripts directory a package for relative imports in tests
|
||||
File diff suppressed because it is too large
@@ -1,374 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Utilities for editing OOXML documents.
|
||||
|
||||
This module provides XMLEditor, a tool for manipulating XML files with support for
|
||||
line-number-based node finding and DOM manipulation. Each element is automatically
|
||||
annotated with its original line and column position during parsing.
|
||||
|
||||
Example usage:
|
||||
editor = XMLEditor("document.xml")
|
||||
|
||||
# Find node by line number or range
|
||||
elem = editor.get_node(tag="w:r", line_number=519)
|
||||
elem = editor.get_node(tag="w:p", line_number=range(100, 200))
|
||||
|
||||
# Find node by text content
|
||||
elem = editor.get_node(tag="w:p", contains="specific text")
|
||||
|
||||
# Find node by attributes
|
||||
elem = editor.get_node(tag="w:r", attrs={"w:id": "target"})
|
||||
|
||||
# Combine filters
|
||||
elem = editor.get_node(tag="w:p", line_number=range(1, 50), contains="text")
|
||||
|
||||
# Replace, insert, or manipulate
|
||||
new_nodes = editor.replace_node(elem, "<w:r><w:t>new text</w:t></w:r>")
|
||||
editor.insert_after(new_nodes[-1], "<w:r><w:t>more</w:t></w:r>")
|
||||
|
||||
# Save changes
|
||||
editor.save()
|
||||
"""
|
||||
|
||||
import html
|
||||
from pathlib import Path
|
||||
from typing import Optional, Union
|
||||
|
||||
import defusedxml.minidom
|
||||
import defusedxml.sax
|
||||
|
||||
|
||||
class XMLEditor:
|
||||
"""
|
||||
Editor for manipulating OOXML XML files with line-number-based node finding.
|
||||
|
||||
This class parses XML files and tracks the original line and column position
|
||||
of each element. This enables finding nodes by their line number in the original
|
||||
file, which is useful when working with Read tool output.
|
||||
|
||||
Attributes:
|
||||
xml_path: Path to the XML file being edited
|
||||
encoding: Detected encoding of the XML file ('ascii' or 'utf-8')
|
||||
dom: Parsed DOM tree with parse_position attributes on elements
|
||||
"""
|
||||
|
||||
def __init__(self, xml_path):
|
||||
"""
|
||||
Initialize with path to XML file and parse with line number tracking.
|
||||
|
||||
Args:
|
||||
xml_path: Path to XML file to edit (str or Path)
|
||||
|
||||
Raises:
|
||||
ValueError: If the XML file does not exist
|
||||
"""
|
||||
self.xml_path = Path(xml_path)
|
||||
if not self.xml_path.exists():
|
||||
raise ValueError(f"XML file not found: {xml_path}")
|
||||
|
||||
with open(self.xml_path, "rb") as f:
|
||||
header = f.read(200).decode("utf-8", errors="ignore")
|
||||
self.encoding = "ascii" if 'encoding="ascii"' in header else "utf-8"
|
||||
|
||||
parser = _create_line_tracking_parser()
|
||||
self.dom = defusedxml.minidom.parse(str(self.xml_path), parser)
|
||||
|
||||
def get_node(
|
||||
self,
|
||||
tag: str,
|
||||
attrs: Optional[dict[str, str]] = None,
|
||||
line_number: Optional[Union[int, range]] = None,
|
||||
contains: Optional[str] = None,
|
||||
):
|
||||
"""
|
||||
Get a DOM element by tag and identifier.
|
||||
|
||||
Finds an element by its line number in the original file, by matching
|
||||
attribute values, by contained text, or by any combination of these filters. Exactly one match must be found.
|
||||
|
||||
Args:
|
||||
tag: The XML tag name (e.g., "w:del", "w:ins", "w:r")
|
||||
attrs: Dictionary of attribute name-value pairs to match (e.g., {"w:id": "1"})
|
||||
line_number: Line number (int) or line range (range) in original XML file (1-indexed)
|
||||
contains: Text string that must appear in any text node within the element.
|
||||
Supports both entity notation (&#8220;) and Unicode characters (\u201c).
|
||||
|
||||
Returns:
|
||||
defusedxml.minidom.Element: The matching DOM element
|
||||
|
||||
Raises:
|
||||
ValueError: If node not found or multiple matches found
|
||||
|
||||
Example:
|
||||
elem = editor.get_node(tag="w:r", line_number=519)
|
||||
elem = editor.get_node(tag="w:r", line_number=range(100, 200))
|
||||
elem = editor.get_node(tag="w:del", attrs={"w:id": "1"})
|
||||
elem = editor.get_node(tag="w:p", attrs={"w14:paraId": "12345678"})
|
||||
elem = editor.get_node(tag="w:commentRangeStart", attrs={"w:id": "0"})
|
||||
elem = editor.get_node(tag="w:p", contains="specific text")
|
||||
elem = editor.get_node(tag="w:t", contains="&#8220;Agreement") # Entity notation
|
||||
elem = editor.get_node(tag="w:t", contains="\u201cAgreement") # Unicode character
|
||||
"""
|
||||
matches = []
|
||||
for elem in self.dom.getElementsByTagName(tag):
|
||||
# Check line_number filter
|
||||
if line_number is not None:
|
||||
parse_pos = getattr(elem, "parse_position", (None,))
|
||||
elem_line = parse_pos[0]
|
||||
|
||||
# Handle both single line number and range
|
||||
if isinstance(line_number, range):
|
||||
if elem_line not in line_number:
|
||||
continue
|
||||
else:
|
||||
if elem_line != line_number:
|
||||
continue
|
||||
|
||||
# Check attrs filter
|
||||
if attrs is not None:
|
||||
if not all(
|
||||
elem.getAttribute(attr_name) == attr_value
|
||||
for attr_name, attr_value in attrs.items()
|
||||
):
|
||||
continue
|
||||
|
||||
# Check contains filter
|
||||
if contains is not None:
|
||||
elem_text = self._get_element_text(elem)
|
||||
# Normalize the search string: convert HTML entities to Unicode characters
|
||||
# This allows searching for both "&#8220;Rowan" and "“Rowan"
|
||||
normalized_contains = html.unescape(contains)
|
||||
if normalized_contains not in elem_text:
|
||||
continue
|
||||
|
||||
# If all applicable filters passed, this is a match
|
||||
matches.append(elem)
|
||||
|
||||
if not matches:
|
||||
# Build descriptive error message
|
||||
filters = []
|
||||
if line_number is not None:
|
||||
line_str = (
|
||||
f"lines {line_number.start}-{line_number.stop - 1}"
|
||||
if isinstance(line_number, range)
|
||||
else f"line {line_number}"
|
||||
)
|
||||
filters.append(f"at {line_str}")
|
||||
if attrs is not None:
|
||||
filters.append(f"with attributes {attrs}")
|
||||
if contains is not None:
|
||||
filters.append(f"containing '{contains}'")
|
||||
|
||||
filter_desc = " ".join(filters) if filters else ""
|
||||
base_msg = f"Node not found: <{tag}> {filter_desc}".strip()
|
||||
|
||||
# Add helpful hint based on filters used
|
||||
if contains:
|
||||
hint = "Text may be split across elements or use different wording."
|
||||
elif line_number:
|
||||
hint = "Line numbers may have changed if document was modified."
|
||||
elif attrs:
|
||||
hint = "Verify attribute values are correct."
|
||||
else:
|
||||
hint = "Try adding filters (attrs, line_number, or contains)."
|
||||
|
||||
raise ValueError(f"{base_msg}. {hint}")
|
||||
if len(matches) > 1:
|
||||
raise ValueError(
|
||||
f"Multiple nodes found: <{tag}>. "
|
||||
f"Add more filters (attrs, line_number, or contains) to narrow the search."
|
||||
)
|
||||
return matches[0]
|
||||
|
||||
def _get_element_text(self, elem):
|
||||
"""
|
||||
Recursively extract all text content from an element.
|
||||
|
||||
Skips text nodes that contain only whitespace (spaces, tabs, newlines),
|
||||
which typically represent XML formatting rather than document content.
|
||||
|
||||
Args:
|
||||
elem: defusedxml.minidom.Element to extract text from
|
||||
|
||||
Returns:
|
||||
str: Concatenated text from all non-whitespace text nodes within the element
|
||||
"""
|
||||
text_parts = []
|
||||
for node in elem.childNodes:
|
||||
if node.nodeType == node.TEXT_NODE:
|
||||
# Skip whitespace-only text nodes (XML formatting)
|
||||
if node.data.strip():
|
||||
text_parts.append(node.data)
|
||||
elif node.nodeType == node.ELEMENT_NODE:
|
||||
text_parts.append(self._get_element_text(node))
|
||||
return "".join(text_parts)
|
||||
|
||||
def replace_node(self, elem, new_content):
|
||||
"""
|
||||
Replace a DOM element with new XML content.
|
||||
|
||||
Args:
|
||||
elem: defusedxml.minidom.Element to replace
|
||||
new_content: String containing XML to replace the node with
|
||||
|
||||
Returns:
|
||||
List[defusedxml.minidom.Node]: All inserted nodes
|
||||
|
||||
Example:
|
||||
new_nodes = editor.replace_node(old_elem, "<w:r><w:t>text</w:t></w:r>")
|
||||
"""
|
||||
parent = elem.parentNode
|
||||
nodes = self._parse_fragment(new_content)
|
||||
for node in nodes:
|
||||
parent.insertBefore(node, elem)
|
||||
parent.removeChild(elem)
|
||||
return nodes
|
||||
|
||||
def insert_after(self, elem, xml_content):
|
||||
"""
|
||||
Insert XML content after a DOM element.
|
||||
|
||||
Args:
|
||||
elem: defusedxml.minidom.Element to insert after
|
||||
xml_content: String containing XML to insert
|
||||
|
||||
Returns:
|
||||
List[defusedxml.minidom.Node]: All inserted nodes
|
||||
|
||||
Example:
|
||||
new_nodes = editor.insert_after(elem, "<w:r><w:t>text</w:t></w:r>")
|
||||
"""
|
||||
parent = elem.parentNode
|
||||
next_sibling = elem.nextSibling
|
||||
nodes = self._parse_fragment(xml_content)
|
||||
for node in nodes:
|
||||
if next_sibling:
|
||||
parent.insertBefore(node, next_sibling)
|
||||
else:
|
||||
parent.appendChild(node)
|
||||
return nodes
|
||||
|
||||
def insert_before(self, elem, xml_content):
|
||||
"""
|
||||
Insert XML content before a DOM element.
|
||||
|
||||
Args:
|
||||
elem: defusedxml.minidom.Element to insert before
|
||||
xml_content: String containing XML to insert
|
||||
|
||||
Returns:
|
||||
List[defusedxml.minidom.Node]: All inserted nodes
|
||||
|
||||
Example:
|
||||
new_nodes = editor.insert_before(elem, "<w:r><w:t>text</w:t></w:r>")
|
||||
"""
|
||||
parent = elem.parentNode
|
||||
nodes = self._parse_fragment(xml_content)
|
||||
for node in nodes:
|
||||
parent.insertBefore(node, elem)
|
||||
return nodes
|
||||
|
||||
def append_to(self, elem, xml_content):
|
||||
"""
|
||||
Append XML content as a child of a DOM element.
|
||||
|
||||
Args:
|
||||
elem: defusedxml.minidom.Element to append to
|
||||
xml_content: String containing XML to append
|
||||
|
||||
Returns:
|
||||
List[defusedxml.minidom.Node]: All inserted nodes
|
||||
|
||||
Example:
|
||||
new_nodes = editor.append_to(elem, "<w:r><w:t>text</w:t></w:r>")
|
||||
"""
|
||||
nodes = self._parse_fragment(xml_content)
|
||||
for node in nodes:
|
||||
elem.appendChild(node)
|
||||
return nodes
|
||||
|
||||
def get_next_rid(self):
|
||||
"""Get the next available rId for relationships files."""
|
||||
max_id = 0
|
||||
for rel_elem in self.dom.getElementsByTagName("Relationship"):
|
||||
rel_id = rel_elem.getAttribute("Id")
|
||||
if rel_id.startswith("rId"):
|
||||
try:
|
||||
max_id = max(max_id, int(rel_id[3:]))
|
||||
except ValueError:
|
||||
pass
|
||||
return f"rId{max_id + 1}"
|
||||
|
||||
def save(self):
|
||||
"""
|
||||
Save the edited XML back to the file.
|
||||
|
||||
Serializes the DOM tree and writes it back to the original file path,
|
||||
preserving the original encoding (ascii or utf-8).
|
||||
"""
|
||||
content = self.dom.toxml(encoding=self.encoding)
|
||||
self.xml_path.write_bytes(content)
|
||||
|
||||
def _parse_fragment(self, xml_content):
|
||||
"""
|
||||
Parse XML fragment and return list of imported nodes.
|
||||
|
||||
Args:
|
||||
xml_content: String containing XML fragment
|
||||
|
||||
Returns:
|
||||
List of defusedxml.minidom.Node objects imported into this document
|
||||
|
||||
Raises:
|
||||
AssertionError: If fragment contains no element nodes
|
||||
"""
|
||||
# Extract namespace declarations from the root document element
|
||||
root_elem = self.dom.documentElement
|
||||
namespaces = []
|
||||
if root_elem and root_elem.attributes:
|
||||
for i in range(root_elem.attributes.length):
|
||||
attr = root_elem.attributes.item(i)
|
||||
if attr.name.startswith("xmlns"): # type: ignore
|
||||
namespaces.append(f'{attr.name}="{attr.value}"') # type: ignore
|
||||
|
||||
ns_decl = " ".join(namespaces)
|
||||
wrapper = f"<root {ns_decl}>{xml_content}</root>"
|
||||
fragment_doc = defusedxml.minidom.parseString(wrapper)
|
||||
nodes = [
|
||||
self.dom.importNode(child, deep=True)
|
||||
for child in fragment_doc.documentElement.childNodes # type: ignore
|
||||
]
|
||||
elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE]
|
||||
assert elements, "Fragment must contain at least one element"
|
||||
return nodes
|
||||
|
||||
|
||||
def _create_line_tracking_parser():
|
||||
"""
|
||||
Create a SAX parser that tracks line and column numbers for each element.
|
||||
|
||||
Monkey patches the SAX content handler to store the current line and column
|
||||
position from the underlying expat parser onto each element as a parse_position
|
||||
attribute (line, column) tuple.
|
||||
|
||||
Returns:
|
||||
defusedxml.sax.xmlreader.XMLReader: Configured SAX parser
|
||||
"""
|
||||
|
||||
def set_content_handler(dom_handler):
|
||||
def startElementNS(name, tagName, attrs):
|
||||
orig_start_cb(name, tagName, attrs)
|
||||
cur_elem = dom_handler.elementStack[-1]
|
||||
cur_elem.parse_position = (
|
||||
parser._parser.CurrentLineNumber, # type: ignore
|
||||
parser._parser.CurrentColumnNumber, # type: ignore
|
||||
)
|
||||
|
||||
orig_start_cb = dom_handler.startElementNS
|
||||
dom_handler.startElementNS = startElementNS
|
||||
orig_set_content_handler(dom_handler)
|
||||
|
||||
parser = defusedxml.sax.make_parser()
|
||||
orig_set_content_handler = parser.setContentHandler
|
||||
parser.setContentHandler = set_content_handler # type: ignore
|
||||
return parser
|
||||
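The `contains` filter in `get_node` above normalizes the search string with `html.unescape` before the substring test. A minimal sketch of that one step, using a hypothetical `text_matches` helper and sample text:

```python
import html


def text_matches(contains: str, elem_text: str) -> bool:
    """Decode character references in the search string, then do a substring test."""
    return html.unescape(contains) in elem_text


# Sample element text that uses curly quotes, as Word documents often do.
elem_text = "\u201cAgreement\u201d means this contract."

print(text_matches("&#8220;Agreement", elem_text))   # numeric character reference
print(text_matches("\u201cAgreement", elem_text))    # literal Unicode character
print(text_matches('"Agreement', elem_text))         # straight quote is a different character
```

Because only the search string is unescaped, a straight ASCII quote still will not match a curly quote in the document; the caller has to supply the right character in one of the two notations.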
@@ -1,205 +0,0 @@
|
||||
**CRITICAL: You MUST complete these steps in order. Do not skip ahead to writing code.**
|
||||
|
||||
If you need to fill out a PDF form, first check to see if the PDF has fillable form fields. Run this script from this file's directory:
|
||||
`python scripts/check_fillable_fields.py <file.pdf>`. Depending on the result, go to either the "Fillable fields" or the "Non-fillable fields" section and follow its instructions.
|
||||
|
||||
# Fillable fields
|
||||
If the PDF has fillable form fields:
|
||||
- Run this script from this file's directory: `python scripts/extract_form_field_info.py <input.pdf> <field_info.json>`. It will create a JSON file with a list of fields in this format:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"rect": ([left, bottom, right, top] bounding box in PDF coordinates, y=0 is the bottom of the page),
|
||||
"type": ("text", "checkbox", "radio_group", or "choice"),
|
||||
},
|
||||
// Checkboxes have "checked_value" and "unchecked_value" properties:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "checkbox",
|
||||
"checked_value": (Set the field to this value to check the checkbox),
|
||||
"unchecked_value": (Set the field to this value to uncheck the checkbox),
|
||||
},
|
||||
// Radio groups have a "radio_options" list with the possible choices.
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "radio_group",
|
||||
"radio_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this radio option),
|
||||
"rect": (bounding box for the radio button for this option)
|
||||
},
|
||||
// Other radio options
|
||||
]
|
||||
},
|
||||
// Multiple choice fields have a "choice_options" list with the possible choices:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "choice",
|
||||
"choice_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this option),
|
||||
"text": (display text of the option)
|
||||
},
|
||||
// Other choice options
|
||||
],
|
||||
}
|
||||
]
|
||||
```
|
||||
- Convert the PDF to PNGs (one image for each page) with this script (run from this file's directory):
|
||||
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
|
||||
Then analyze the images to determine the purpose of each form field (make sure to convert the bounding box PDF coordinates to image coordinates).
|
||||
- Create a `field_values.json` file in this format with the values to be entered for each field:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": "last_name", // Must match the field_id from `extract_form_field_info.py`
|
||||
"description": "The user's last name",
|
||||
"page": 1, // Must match the "page" value in field_info.json
|
||||
"value": "Simpson"
|
||||
},
|
||||
{
|
||||
"field_id": "Checkbox12",
|
||||
"description": "Checkbox to be checked if the user is 18 or over",
|
||||
"page": 1,
|
||||
"value": "/On" // If this is a checkbox, use its "checked_value" value to check it. If it's a radio button group, use one of the "value" values in "radio_options".
|
||||
},
|
||||
// more fields
|
||||
]
|
||||
```
|
||||
- Run the `fill_fillable_fields.py` script from this file's directory to create a filled-in PDF:
|
||||
`python scripts/fill_fillable_fields.py <input pdf> <field_values.json> <output pdf>`
|
||||
This script will verify that the field IDs and values you provide are valid; if it prints error messages, correct the appropriate fields and try again.
|
||||
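The coordinate conversion mentioned in the steps above (PDF `rect` values have y=0 at the bottom of the page, while image pixels have y=0 at the top) can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the page size in PDF points has to be obtained separately because it is not part of the field info format shown here, and the page is assumed to be unrotated with its origin at (0, 0).

```python
def pdf_rect_to_image_box(rect, page_width_pt, page_height_pt, image_width, image_height):
    """Convert [left, bottom, right, top] in PDF points to (left, top, right, bottom) in pixels."""
    left, bottom, right, top = rect
    scale_x = image_width / page_width_pt
    scale_y = image_height / page_height_pt
    # Flip the y axis: PDF measures up from the bottom edge, images measure down from the top.
    return (
        left * scale_x,
        (page_height_pt - top) * scale_y,
        right * scale_x,
        (page_height_pt - bottom) * scale_y,
    )


# Assumed example: a US Letter page (612 x 792 points) rendered at 1224 x 1584 pixels.
box = pdf_rect_to_image_box([72, 700, 272, 720], 612, 792, 1224, 1584)
print(box)
```

Note that the top of the PDF rectangle becomes the smaller y value in image space, so the output order is left, top, right, bottom.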
|
||||
# Non-fillable fields
|
||||
If the PDF doesn't have fillable form fields, you'll need to visually determine where the data should be added and create text annotations. Follow the steps below *exactly*. You MUST perform all of these steps to ensure that the form is accurately completed. Details for each step are below.
|
||||
- Convert the PDF to PNG images and determine field bounding boxes.
|
||||
- Create a JSON file with field information and validation images showing the bounding boxes.
|
||||
- Validate the bounding boxes.
|
||||
- Use the bounding boxes to fill in the form.
|
||||
|
||||
### Step 1: Visual Analysis (REQUIRED)
|
||||
- Convert the PDF to PNG images. Run this script from this file's directory:
|
||||
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
|
||||
The script will create a PNG image for each page in the PDF.
|
||||
- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter text, determine bounding boxes for both the form field label and the area where the user should enter text. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.
|
||||
|
||||
These are some examples of form structures that you might see:
|
||||
|
||||
*Label inside box*
|
||||
```
|
||||
┌────────────────────────┐
|
||||
│ Name: │
|
||||
└────────────────────────┘
|
||||
```
|
||||
The input area should be to the right of the "Name" label and extend to the edge of the box.
|
||||
|
||||
*Label before line*
|
||||
```
|
||||
Email: _______________________
|
||||
```
|
||||
The input area should be above the line and include its entire width.
|
||||
|
||||
*Label under line*
|
||||
```
|
||||
_________________________
|
||||
Name
|
||||
```
|
||||
The input area should be above the line and include the entire width of the line. This is common for signature and date fields.
|
||||
|
||||
*Label above line*
|
||||
```
|
||||
Please enter any special requests:
|
||||
________________________________________________
|
||||
```
|
||||
The input area should extend from the bottom of the label to the line, and should include the entire width of the line.
|
||||
|
||||
*Checkboxes*
|
||||
```
|
||||
Are you a US citizen? Yes □ No □
|
||||
```
|
||||
For checkboxes:
|
||||
- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
|
||||
- Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
|
||||
- The entry bounding box should cover ONLY the small square, not the text label.
|
||||
|
||||
### Step 2: Create fields.json and validation images (REQUIRED)
|
||||
- Create a file named `fields.json` with information for the form fields and bounding boxes in this format:
|
||||
```
|
||||
{
|
||||
"pages": [
|
||||
{
|
||||
"page_number": 1,
|
||||
"image_width": (first page image width in pixels),
|
||||
"image_height": (first page image height in pixels),
|
||||
},
|
||||
{
|
||||
"page_number": 2,
|
||||
"image_width": (second page image width in pixels),
|
||||
"image_height": (second page image height in pixels),
|
||||
}
|
||||
// additional pages
|
||||
],
|
||||
"form_fields": [
|
||||
// Example for a text field.
|
||||
{
|
||||
"page_number": 1,
|
||||
"description": "The user's last name should be entered here",
|
||||
// Bounding boxes are [left, top, right, bottom]. The bounding boxes for the label and text entry should not overlap.
|
||||
"field_label": "Last name",
|
||||
"label_bounding_box": [30, 125, 95, 142],
|
||||
"entry_bounding_box": [100, 125, 280, 142],
|
||||
"entry_text": {
|
||||
"text": "Johnson", // This text will be added as an annotation at the entry_bounding_box location
|
||||
"font_size": 14, // optional, defaults to 14
|
||||
"font_color": "000000", // optional, RRGGBB format, defaults to 000000 (black)
|
||||
}
|
||||
},
|
||||
// Example for a checkbox. TARGET THE SQUARE for the entry bounding box, NOT THE TEXT
|
||||
{
|
||||
"page_number": 2,
|
||||
"description": "Checkbox that should be checked if the user is over 18",
|
||||
"entry_bounding_box": [140, 525, 155, 540], // Small box over checkbox square
|
||||
"field_label": "Yes",
|
||||
"label_bounding_box": [100, 525, 132, 540], // Box containing "Yes" text
|
||||
// Use "X" to check a checkbox.
|
||||
"entry_text": {
|
||||
"text": "X",
|
||||
}
|
||||
}
|
||||
// additional form field entries
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Create validation images by running this script from this file's directory for each page:
|
||||
`python scripts/create_validation_image.py <page_number> <path_to_fields.json> <input_image_path> <output_image_path>`
|
||||
|
||||
The validation images will have red rectangles where text should be entered, and blue rectangles covering label text.
|
||||
|
||||
### Step 3: Validate Bounding Boxes (REQUIRED)
|
||||
#### Automated intersection check
|
||||
- Verify that none of the bounding boxes intersect and that the entry bounding boxes are tall enough by checking the fields.json file with the `check_bounding_boxes.py` script (run from this file's directory):
|
||||
`python scripts/check_bounding_boxes.py <JSON file>`
|
||||
|
||||
If there are errors, reanalyze the relevant fields, adjust the bounding boxes, and iterate until there are no remaining errors. Remember: label (blue) bounding boxes should contain text labels, entry (red) boxes should not.
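
The two checks described above can be sketched roughly as follows. This is not the code in `check_bounding_boxes.py`: the helper names `boxes_intersect` and `entry_tall_enough` are hypothetical, boxes that only touch at an edge are treated as non-intersecting, and the rule "box height must be at least the font size" (with a default of 14) is an assumption made for illustration.

```python
def boxes_intersect(a, b):
    """Return True if two [left, top, right, bottom] boxes overlap.

    Boxes that only share an edge are treated as non-overlapping.
    """
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    return (
        a_left < b_right
        and b_left < a_right
        and a_top < b_bottom
        and b_top < a_bottom
    )


def entry_tall_enough(entry_box, font_size=14):
    """Assumed rule: the entry box height must be at least the font size."""
    _, top, _, bottom = entry_box
    return (bottom - top) >= font_size


# Boxes taken from the "Last name" example in fields.json.
label = [30, 125, 95, 142]
entry = [100, 125, 280, 142]
print(boxes_intersect(label, entry))  # False
print(entry_tall_enough(entry))       # True (height 17, font size 14)
```

A sketch like this can help when reasoning about why the script reports a failure, but the script's own output remains the source of truth.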
|
||||
|
||||
#### Manual image inspection
|
||||
**CRITICAL: Do not proceed without visually inspecting validation images**
|
||||
- Red rectangles must ONLY cover input areas
|
||||
- Red rectangles MUST NOT contain any text
|
||||
- Blue rectangles should contain label text
|
||||
- For checkboxes:
|
||||
- Red rectangle MUST be centered on the checkbox square
|
||||
- Blue rectangle should cover the text label for the checkbox
|
||||
|
||||
- If any rectangles look wrong, fix fields.json, regenerate the validation images, and verify again. Repeat this process until the bounding boxes are fully accurate.
|
||||
|
||||
|
||||
### Step 4: Add annotations to the PDF
|
||||
Run this script from this file's directory to create a filled-out PDF using the information in fields.json:
|
||||
`python scripts/fill_pdf_form_with_annotations.py <input_pdf_path> <path_to_fields.json> <output_pdf_path>`
|
||||
@@ -1,226 +0,0 @@
|
||||
import unittest
|
||||
import json
|
||||
import io
|
||||
from check_bounding_boxes import get_bounding_box_messages
|
||||
|
||||
|
||||
# Currently this is not run automatically in CI; it's just for documentation and manual checking.
|
||||
class TestGetBoundingBoxMessages(unittest.TestCase):
|
||||
|
||||
def create_json_stream(self, data):
|
||||
"""Helper to create a JSON stream from data"""
|
||||
return io.StringIO(json.dumps(data))
|
||||
|
||||
def test_no_intersections(self):
|
||||
"""Test case with no bounding box intersections"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 40, 50, 60],
|
||||
"entry_bounding_box": [60, 40, 150, 60]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_label_entry_intersection_same_field(self):
|
||||
"""Test intersection between label and entry of the same field"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 60, 30],
|
||||
"entry_bounding_box": [50, 10, 150, 30] # Overlaps with label
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_intersection_between_different_fields(self):
|
||||
"""Test intersection between bounding boxes of different fields"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [40, 20, 80, 40], # Overlaps with Name's boxes
|
||||
"entry_bounding_box": [160, 10, 250, 30]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_different_pages_no_intersection(self):
|
||||
"""Test that boxes on different pages don't count as intersecting"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 2,
|
||||
"label_bounding_box": [10, 10, 50, 30], # Same coordinates but different page
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_entry_height_too_small(self):
|
||||
"""Test that entry box height is checked against font size"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
|
||||
"entry_text": {
|
||||
"font_size": 14 # Font size larger than height
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_entry_height_adequate(self):
|
||||
"""Test that adequate entry box height passes"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30], # Height is 20
|
||||
"entry_text": {
|
||||
"font_size": 14 # Font size smaller than height
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_default_font_size(self):
|
||||
"""Test that default font size is used when not specified"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
|
||||
"entry_text": {} # No font_size specified, should use default 14
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_no_entry_text(self):
|
||||
"""Test that missing entry_text doesn't cause height check"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20] # Small height but no entry_text
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_multiple_errors_limit(self):
|
||||
"""Test that error messages are limited to prevent excessive output"""
|
||||
fields = []
|
||||
# Create many overlapping fields
|
||||
for i in range(25):
|
||||
fields.append({
|
||||
"description": f"Field{i}",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30], # All overlap
|
||||
"entry_bounding_box": [20, 15, 60, 35] # All overlap
|
||||
})
|
||||
|
||||
data = {"form_fields": fields}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
# Should abort after ~20 messages
|
||||
self.assertTrue(any("Aborting" in msg for msg in messages))
|
||||
# Should have some FAILURE messages but not hundreds
|
||||
failure_count = sum(1 for msg in messages if "FAILURE" in msg)
|
||||
self.assertGreater(failure_count, 0)
|
||||
self.assertLess(len(messages), 30) # Should be limited
|
||||
|
||||
def test_edge_touching_boxes(self):
|
||||
"""Test that boxes touching at edges don't count as intersecting"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [50, 10, 150, 30] # Touches at x=50
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
@@ -1,518 +0,0 @@
|
||||
---
|
||||
name: pptx
|
||||
description: Presentation toolkit (.pptx). Create/edit slides, layouts, content, speaker notes, comments, for programmatic presentation creation and modification.
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
---
|
||||
|
||||
# PPTX creation, editing, and analysis
|
||||
|
||||
## Overview
|
||||
|
||||
A .pptx file is a ZIP archive containing XML files and resources. Create, edit, or analyze PowerPoint presentations using text extraction, raw XML access, or html2pptx workflows. Apply this skill for programmatic presentation creation and modification.
|
||||
|
||||
## Visual Enhancement with Scientific Schematics
|
||||
|
||||
**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**
|
||||
|
||||
If your document does not already contain schematics or diagrams:
|
||||
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
|
||||
- Simply describe your desired diagram in natural language
|
||||
- Nano Banana Pro will automatically generate, review, and refine the schematic
|
||||
|
||||
**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.
|
||||
|
||||
**How to generate schematics:**
|
||||
```bash
|
||||
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
|
||||
```
|
||||
|
||||
The AI will automatically:
|
||||
- Create publication-quality images with proper formatting
|
||||
- Review and refine through multiple iterations
|
||||
- Ensure accessibility (colorblind-friendly, high contrast)
|
||||
- Save outputs in the figures/ directory
|
||||
|
||||
**When to add schematics:**
|
||||
- Presentation workflow diagrams for slides
|
||||
- Slide design process flowcharts
|
||||
- Content organization diagrams
|
||||
- System architecture illustrations
|
||||
- Process flow visualizations
|
||||
- Any complex concept that benefits from visualization
|
||||
|
||||
For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
|
||||
|
||||
---
|
||||
|
||||
## Reading and analyzing content
|
||||
|
||||
### Text extraction
|
||||
To read the text contents of a presentation, convert the document to markdown:
|
||||
|
||||
```bash
|
||||
# Convert document to markdown
|
||||
python -m markitdown path-to-file.pptx
|
||||
```
|
||||
|
||||
### Raw XML access
|
||||
Raw XML access is required for: comments, speaker notes, slide layouts, animations, design elements, and complex formatting. For any of these features, unpack a presentation and read its raw XML contents.
|
||||
|
||||
#### Unpacking a file
|
||||
`python ooxml/scripts/unpack.py <office_file> <output_dir>`
|
||||
|
||||
**Note**: The unpack.py script is located at `skills/pptx/ooxml/scripts/unpack.py` relative to the project root. If the script doesn't exist at this path, use `find . -name "unpack.py"` to locate it.
|
||||
|
||||
#### Key file structures
|
||||
* `ppt/presentation.xml` - Main presentation metadata and slide references
|
||||
* `ppt/slides/slide{N}.xml` - Individual slide contents (slide1.xml, slide2.xml, etc.)
|
||||
* `ppt/notesSlides/notesSlide{N}.xml` - Speaker notes for each slide
|
||||
* `ppt/comments/modernComment_*.xml` - Comments for specific slides
|
||||
* `ppt/slideLayouts/` - Layout templates for slides
|
||||
* `ppt/slideMasters/` - Master slide templates
|
||||
* `ppt/theme/` - Theme and styling information
|
||||
* `ppt/media/` - Images and other media files
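
Because a .pptx file is a ZIP archive, the slide parts listed above can also be enumerated without unpacking. The following is a minimal sketch using only the standard library; `list_slide_parts` is a hypothetical helper, not part of this skill's scripts, and the in-memory archive is a stand-in for a real presentation.

```python
import io
import re
import zipfile


def list_slide_parts(pptx_file):
    """Return slide part names (ppt/slides/slideN.xml) sorted by N."""
    pattern = re.compile(r"^ppt/slides/slide(\d+)\.xml$")
    with zipfile.ZipFile(pptx_file) as archive:
        names = archive.namelist()
    slides = []
    for name in names:
        match = pattern.match(name)
        if match:
            slides.append((int(match.group(1)), name))
    return [name for _, name in sorted(slides)]


# Build a tiny stand-in archive in memory so the sketch runs without a real deck.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in (
        "ppt/presentation.xml",
        "ppt/slides/slide10.xml",
        "ppt/slides/slide2.xml",
        "ppt/media/image1.png",
    ):
        archive.writestr(name, "")
buffer.seek(0)

print(list_slide_parts(buffer))
# ['ppt/slides/slide2.xml', 'ppt/slides/slide10.xml']
```

Note that file numbering does not necessarily match the order in which slides are shown; the display order is defined in `ppt/presentation.xml`.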
|
||||
|
||||
#### Typography and color extraction
|
||||
**When given an example design to emulate**: Always analyze the presentation's typography and colors first using the methods below:
|
||||
1. **Read theme file**: Check `ppt/theme/theme1.xml` for colors (`<a:clrScheme>`) and fonts (`<a:fontScheme>`)
|
||||
2. **Sample slide content**: Examine `ppt/slides/slide1.xml` for actual font usage (`<a:rPr>`) and colors
|
||||
3. **Search for patterns**: Use grep to find color (`<a:solidFill>`, `<a:srgbClr>`) and font references across all XML files
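
As an alternative to grep, step 1 can be sketched with the standard library's XML parser. This is an illustrative example only: `theme_colors_and_fonts` is a hypothetical helper, `SAMPLE_THEME` is a made-up fragment rather than a real theme file, and only color slots that use `<a:srgbClr>` are captured (real themes may also use other color elements such as `<a:sysClr>`).

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace used by the a: prefix in theme files.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Made-up theme fragment for demonstration purposes.
SAMPLE_THEME = f"""
<a:theme xmlns:a="{A_NS}" name="Sample">
  <a:themeElements>
    <a:clrScheme name="Sample">
      <a:dk2><a:srgbClr val="1C2833"/></a:dk2>
      <a:accent1><a:srgbClr val="FE4447"/></a:accent1>
    </a:clrScheme>
    <a:fontScheme name="Sample">
      <a:majorFont><a:latin typeface="Georgia"/></a:majorFont>
      <a:minorFont><a:latin typeface="Arial"/></a:minorFont>
    </a:fontScheme>
  </a:themeElements>
</a:theme>
""".strip()


def theme_colors_and_fonts(theme_xml):
    """Extract sRGB color slots and Latin typefaces from theme XML text."""
    ns = {"a": A_NS}
    root = ET.fromstring(theme_xml)

    colors = {}
    scheme = root.find(".//a:clrScheme", ns)
    if scheme is not None:
        for slot in scheme:
            rgb = slot.find("a:srgbClr", ns)
            if rgb is not None:
                # slot.tag looks like "{namespace}accent1"; keep the local name.
                colors[slot.tag.split("}")[1]] = rgb.get("val")

    fonts = {}
    for kind in ("majorFont", "minorFont"):
        latin = root.find(f".//a:fontScheme/a:{kind}/a:latin", ns)
        if latin is not None:
            fonts[kind] = latin.get("typeface")
    return colors, fonts


colors, fonts = theme_colors_and_fonts(SAMPLE_THEME)
print(colors)  # {'dk2': '1C2833', 'accent1': 'FE4447'}
print(fonts)   # {'majorFont': 'Georgia', 'minorFont': 'Arial'}
```

For a real presentation, the text of `ppt/theme/theme1.xml` from the unpacked directory would be passed in place of `SAMPLE_THEME`.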
|
||||
|
||||
## Creating a new PowerPoint presentation **without a template**
|
||||
|
||||
When creating a new PowerPoint presentation from scratch, use the **html2pptx** workflow to convert HTML slides to PowerPoint with accurate positioning.
|
||||
|
||||
### Design Principles
|
||||
|
||||
**CRITICAL**: Before creating any presentation, analyze the content and choose appropriate design elements:
|
||||
1. **Consider the subject matter**: What is this presentation about? What tone, industry, or mood does it suggest?
|
||||
2. **Check for branding**: If the user mentions a company/organization, consider their brand colors and identity
|
||||
3. **Match palette to content**: Select colors that reflect the subject
|
||||
4. **State your approach**: Explain your design choices before writing code
|
||||
|
||||
**Requirements**:
|
||||
- ✅ State your content-informed design approach BEFORE writing code
|
||||
- ✅ Use web-safe fonts only: Arial, Helvetica, Times New Roman, Georgia, Courier New, Verdana, Tahoma, Trebuchet MS, Impact
|
||||
- ✅ Create clear visual hierarchy through size, weight, and color
|
||||
- ✅ Ensure readability: strong contrast, appropriately sized text, clean alignment
|
||||
- ✅ Be consistent: repeat patterns, spacing, and visual language across slides
|
||||
|
||||
#### Color Palette Selection
|
||||
|
||||
**Choosing colors creatively**:
|
||||
- **Think beyond defaults**: What colors genuinely match this specific topic? Avoid autopilot choices.
|
||||
- **Consider multiple angles**: Topic, industry, mood, energy level, target audience, brand identity (if mentioned)
|
||||
- **Be adventurous**: Try unexpected combinations - a healthcare presentation doesn't have to be green, finance doesn't have to be navy
|
||||
- **Build your palette**: Pick 3-5 colors that work together (dominant colors + supporting tones + accent)
|
||||
- **Ensure contrast**: Text must be clearly readable on backgrounds
|
||||
|
||||
**Example color palettes** (use these to spark creativity - choose one, adapt it, or create your own):
|
||||
|
||||
1. **Classic Blue**: Deep navy (#1C2833), slate gray (#2E4053), silver (#AAB7B8), off-white (#F4F6F6)
|
||||
2. **Teal & Coral**: Teal (#5EA8A7), deep teal (#277884), coral (#FE4447), white (#FFFFFF)
|
||||
3. **Bold Red**: Red (#C0392B), bright red (#E74C3C), orange (#F39C12), yellow (#F1C40F), green (#2ECC71)
|
||||
4. **Warm Blush**: Mauve (#A49393), blush (#EED6D3), rose (#E8B4B8), cream (#FAF7F2)
|
||||
5. **Burgundy Luxury**: Burgundy (#5D1D2E), crimson (#951233), rust (#C15937), gold (#997929)
|
||||
6. **Deep Purple & Emerald**: Purple (#B165FB), dark blue (#181B24), emerald (#40695B), white (#FFFFFF)
|
||||
7. **Cream & Forest Green**: Cream (#FFE1C7), forest green (#40695B), white (#FCFCFC)
|
||||
8. **Pink & Purple**: Pink (#F8275B), coral (#FF574A), rose (#FF737D), purple (#3D2F68)
|
||||
9. **Lime & Plum**: Lime (#C5DE82), plum (#7C3A5F), coral (#FD8C6E), blue-gray (#98ACB5)
|
||||
10. **Black & Gold**: Gold (#BF9A4A), black (#000000), cream (#F4F6F6)
|
||||
11. **Sage & Terracotta**: Sage (#87A96B), terracotta (#E07A5F), cream (#F4F1DE), charcoal (#2C2C2C)
|
||||
12. **Charcoal & Red**: Charcoal (#292929), red (#E33737), light gray (#CCCBCB)
|
||||
13. **Vibrant Orange**: Orange (#F96D00), light gray (#F2F2F2), charcoal (#222831)
|
||||
14. **Forest Green**: Black (#191A19), green (#4E9F3D), dark green (#1E5128), white (#FFFFFF)
|
||||
15. **Retro Rainbow**: Purple (#722880), pink (#D72D51), orange (#EB5C18), amber (#F08800), gold (#DEB600)
|
||||
16. **Vintage Earthy**: Mustard (#E3B448), sage (#CBD18F), forest green (#3A6B35), cream (#F4F1DE)
|
||||
17. **Coastal Rose**: Old rose (#AD7670), beaver (#B49886), eggshell (#F3ECDC), ash gray (#BFD5BE)
|
||||
18. **Orange & Turquoise**: Light orange (#FC993E), grayish turquoise (#667C6F), white (#FCFCFC)
|
||||
|
||||
#### Visual Details Options
|
||||
|
||||
**Geometric Patterns**:
|
||||
- Diagonal section dividers instead of horizontal
|
||||
- Asymmetric column widths (30/70, 40/60, 25/75)
|
||||
- Rotated text headers at 90° or 270°
|
||||
- Circular/hexagonal frames for images
|
||||
- Triangular accent shapes in corners
|
||||
- Overlapping shapes for depth
|
||||
|
||||
**Border & Frame Treatments**:
|
||||
- Thick single-color borders (10-20pt) on one side only
|
||||
- Double-line borders with contrasting colors
|
||||
- Corner brackets instead of full frames
|
||||
- L-shaped borders (top+left or bottom+right)
|
||||
- Underline accents beneath headers (3-5pt thick)
|
||||
|
||||
**Typography Treatments**:
|
||||
- Extreme size contrast (72pt headlines vs 11pt body)
|
||||
- All-caps headers with wide letter spacing
|
||||
- Numbered sections in oversized display type
|
||||
- Monospace (Courier New) for data/stats/technical content
|
||||
- Condensed fonts (Arial Narrow) for dense information
|
||||
- Outlined text for emphasis
|
||||
|
||||
**Chart & Data Styling**:
|
||||
- Monochrome charts with single accent color for key data
|
||||
- Horizontal bar charts instead of vertical
|
||||
- Dot plots instead of bar charts
|
||||
- Minimal gridlines or none at all
|
||||
- Data labels directly on elements (no legends)
|
||||
- Oversized numbers for key metrics
|
||||
|
||||
**Layout Innovations**:
|
||||
- Full-bleed images with text overlays
|
||||
- Sidebar column (20-30% width) for navigation/context
|
||||
- Modular grid systems (3×3, 4×4 blocks)
|
||||
- Z-pattern or F-pattern content flow
|
||||
- Floating text boxes over colored shapes
|
||||
- Magazine-style multi-column layouts
|
||||
|
||||
**Background Treatments**:
|
||||
- Solid color blocks occupying 40-60% of slide
|
||||
- Gradient fills (vertical or diagonal only)
|
||||
- Split backgrounds (two colors, diagonal or vertical)
|
||||
- Edge-to-edge color bands
|
||||
- Negative space as a design element
|
||||
|
||||
### Layout Tips
|
||||
**For slides with charts or tables:**
|
||||
- **Two-column layout (PREFERRED)**: Use a header spanning the full width, then two columns below - text/bullets in one column and the featured content in the other. This provides better balance and makes charts/tables more readable. Use flexbox with unequal column widths (e.g., 40%/60% split) to optimize space for each content type.
|
||||
- **Full-slide layout**: Let the featured content (chart/table) take up the entire slide for maximum impact and readability
|
||||
- **NEVER vertically stack**: Do not place charts/tables below text in a single column - this causes poor readability and layout issues
|
||||
|
||||
### Workflow
|
||||
1. **MANDATORY - READ ENTIRE FILE**: Read [`html2pptx.md`](html2pptx.md) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with presentation creation.
|
||||
2. Create an HTML file for each slide with proper dimensions (e.g., 720pt × 405pt for 16:9)
|
||||
- Use `<p>`, `<h1>`-`<h6>`, `<ul>`, `<ol>` for all text content
|
||||
- Use `class="placeholder"` for areas where charts/tables will be added (render with gray background for visibility)
|
||||
- **CRITICAL**: Rasterize gradients and icons as PNG images FIRST using Sharp, then reference in HTML
|
||||
- **LAYOUT**: For slides with charts/tables/images, use either full-slide layout or two-column layout for better readability
|
||||
3. Create and run a JavaScript file using the [`html2pptx.js`](scripts/html2pptx.js) library to convert HTML slides to PowerPoint and save the presentation
|
||||
- Use the `html2pptx()` function to process each HTML file
|
||||
- Add charts and tables to placeholder areas using PptxGenJS API
|
||||
- Save the presentation using `pptx.writeFile()`
|
||||
4. **Visual validation**: Generate thumbnails and inspect for layout issues
|
||||
- Create thumbnail grid: `python scripts/thumbnail.py output.pptx workspace/thumbnails --cols 4`
|
||||
- Read and carefully examine the thumbnail image for:
|
||||
- **Text cutoff**: Text being cut off by header bars, shapes, or slide edges
|
||||
- **Text overlap**: Text overlapping with other text or shapes
|
||||
- **Positioning issues**: Content too close to slide boundaries or other elements
|
||||
- **Contrast issues**: Insufficient contrast between text and backgrounds
|
||||
- If issues are found, adjust HTML margins/spacing/colors and regenerate the presentation
|
||||
- Repeat until all slides are visually correct
|
||||
|
||||
## Editing an existing PowerPoint presentation
|
||||
|
||||
To edit slides in an existing PowerPoint presentation, work with the raw Office Open XML (OOXML) format. This involves unpacking the .pptx file, editing the XML content, and repacking it.
|
||||
|
||||
### Workflow
|
||||
1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed guidance on OOXML structure and editing workflows before any presentation editing.
|
||||
2. Unpack the presentation: `python ooxml/scripts/unpack.py <office_file> <output_dir>`
|
||||
3. Edit the XML files (primarily `ppt/slides/slide{N}.xml` and related files)
|
||||
4. **CRITICAL**: Validate immediately after each edit and fix any validation errors before proceeding: `python ooxml/scripts/validate.py <dir> --original <file>`
|
||||
5. Pack the final presentation: `python ooxml/scripts/pack.py <input_directory> <office_file>`
|
||||
|
||||
## Creating a new PowerPoint presentation **using a template**
|
||||
|
||||
To create a presentation that follows an existing template's design, duplicate and re-arrange template slides before replacing placeholder content.
|
||||
|
||||
### Workflow
|
||||
1. **Extract template text AND create visual thumbnail grid**:
|
||||
* Extract text: `python -m markitdown template.pptx > template-content.md`
|
||||
* Read `template-content.md`: Read the entire file to understand the contents of the template presentation. **NEVER set any range limits when reading this file.**
|
||||
* Create thumbnail grids: `python scripts/thumbnail.py template.pptx`
|
||||
* See [Creating Thumbnail Grids](#creating-thumbnail-grids) section for more details
|
||||
|
||||
2. **Analyze template and save inventory to a file**:
|
||||
* **Visual Analysis**: Review thumbnail grid(s) to understand slide layouts, design patterns, and visual structure
|
||||
* Create and save a template inventory file at `template-inventory.md` containing:
|
||||
```markdown
|
||||
# Template Inventory Analysis
|
||||
**Total Slides: [count]**
|
||||
**IMPORTANT: Slides are 0-indexed (first slide = 0, last slide = count-1)**
|
||||
|
||||
## [Category Name]
|
||||
- Slide 0: [Layout code if available] - Description/purpose
|
||||
- Slide 1: [Layout code] - Description/purpose
|
||||
- Slide 2: [Layout code] - Description/purpose
|
||||
[... EVERY slide must be listed individually with its index ...]
|
||||
```
|
||||
* **Using the thumbnail grid**: Reference the visual thumbnails to identify:
|
||||
- Layout patterns (title slides, content layouts, section dividers)
|
||||
- Image placeholder locations and counts
|
||||
- Design consistency across slide groups
|
||||
- Visual hierarchy and structure
|
||||
* This inventory file is REQUIRED for selecting appropriate templates in the next step
|
||||
|
||||
3. **Create presentation outline based on template inventory**:
|
||||
* Review available templates from step 2.
|
||||
* Choose an intro or title template for the first slide. This should be one of the first templates.
|
||||
* Choose safe, text-based layouts for the other slides.
|
||||
* **CRITICAL: Match layout structure to actual content**:
|
||||
- Single-column layouts: Use for unified narrative or single topic
|
||||
- Two-column layouts: Use ONLY when there are exactly 2 distinct items/concepts
|
||||
- Three-column layouts: Use ONLY when there are exactly 3 distinct items/concepts
|
||||
- Image + text layouts: Use ONLY when actual images are available to insert
|
||||
- Quote layouts: Use ONLY for actual quotes from people (with attribution), never for emphasis
|
||||
- Never use layouts with more placeholders than available content
|
||||
- If there are 2 items, don't force them into a 3-column layout
|
||||
- If there are 4+ items, consider breaking into multiple slides or using a list format
|
||||
* Count actual content pieces BEFORE selecting the layout
|
||||
* Verify each placeholder in the chosen layout will be filled with meaningful content
|
||||
* Select one option representing the **best** layout for each content section.
|
||||
* Save `outline.md` with content AND template mapping that leverages available designs
|
||||
* Example template mapping:
|
||||
```
|
||||
# Template slides to use (0-based indexing)
|
||||
# WARNING: Verify indices are within range! Template with 73 slides has indices 0-72
|
||||
# Mapping: slide numbers from outline -> template slide indices
|
||||
template_mapping = [
|
||||
0, # Use slide 0 (Title/Cover)
|
||||
34, # Use slide 34 (B1: Title and body)
|
||||
34, # Use slide 34 again (duplicate for second B1)
|
||||
50, # Use slide 50 (E1: Quote)
|
||||
54, # Use slide 54 (F2: Closing + Text)
|
||||
]
|
||||
```
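
The range warning in the mapping above can be checked mechanically before any slides are rearranged. The sketch below is a hypothetical helper (`build_rearrange_argument` is not part of this skill's scripts) that validates 0-based indices against the template's slide count and produces a comma-separated string of indices.

```python
def build_rearrange_argument(template_mapping, total_slides):
    """Validate 0-based slide indices and join them with commas."""
    out_of_range = [i for i in template_mapping if not 0 <= i < total_slides]
    if out_of_range:
        raise ValueError(
            f"Indices {out_of_range} are outside the range 0-{total_slides - 1}"
        )
    return ",".join(str(i) for i in template_mapping)


print(build_rearrange_argument([0, 34, 34, 50, 54], total_slides=73))
# 0,34,34,50,54
```

Raising an error early keeps an off-by-one mistake (for example, using 73 with a 73-slide template) from surfacing later in the workflow.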
|
||||
|
||||
4. **Duplicate, reorder, and delete slides using `rearrange.py`**:
|
||||
* Use the `scripts/rearrange.py` script to create a new presentation with slides in the desired order:
|
||||
```bash
|
||||
python scripts/rearrange.py template.pptx working.pptx 0,34,34,50,54
|
||||
```
|
||||
* The script handles duplicating repeated slides, deleting unused slides, and reordering automatically
|
||||
* Slide indices are 0-based (first slide is 0, second is 1, etc.)
|
||||
* The same slide index can appear multiple times to duplicate that slide
|
||||
|
||||
5. **Extract ALL text using the `inventory.py` script**:
|
||||
* **Run inventory extraction**:
|
||||
```bash
|
||||
python scripts/inventory.py working.pptx text-inventory.json
|
||||
```
|
||||
* **Read text-inventory.json**: Read the entire text-inventory.json file to understand all shapes and their properties. **NEVER set any range limits when reading this file.**
|
||||
|
||||
* The inventory JSON structure:
|
||||
```json
|
||||
{
|
||||
"slide-0": {
|
||||
"shape-0": {
|
||||
"placeholder_type": "TITLE", // or null for non-placeholders
|
||||
"left": 1.5, // position in inches
|
||||
"top": 2.0,
|
||||
"width": 7.5,
|
||||
"height": 1.2,
|
||||
"paragraphs": [
|
||||
{
|
||||
"text": "Paragraph text",
|
||||
// Optional properties (only included when non-default):
|
||||
"bullet": true, // explicit bullet detected
|
||||
"level": 0, // only included when bullet is true
|
||||
"alignment": "CENTER", // CENTER, RIGHT (not LEFT)
|
||||
"space_before": 10.0, // space before paragraph in points
|
||||
"space_after": 6.0, // space after paragraph in points
|
||||
"line_spacing": 22.4, // line spacing in points
|
||||
"font_name": "Arial", // from first run
|
||||
"font_size": 14.0, // in points
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"color": "FF0000" // RGB color
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
* Key features:
|
||||
- **Slides**: Named as "slide-0", "slide-1", etc.
|
||||
- **Shapes**: Ordered by visual position (top-to-bottom, left-to-right) as "shape-0", "shape-1", etc.
|
||||
- **Placeholder types**: TITLE, CENTER_TITLE, SUBTITLE, BODY, OBJECT, or null
|
||||
- **Default font size**: `default_font_size` in points extracted from layout placeholders (when available)
|
||||
- **Slide numbers are filtered**: Shapes with SLIDE_NUMBER placeholder type are automatically excluded from inventory
|
||||
- **Bullets**: When `bullet: true`, `level` is always included (even if 0)
|
||||
- **Spacing**: `space_before`, `space_after`, and `line_spacing` in points (only included when set)
|
||||
- **Colors**: `color` for RGB (e.g., "FF0000"), `theme_color` for theme colors (e.g., "DARK_1")
|
||||
- **Properties**: Only non-default values are included in the output
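
To get a quick overview of an inventory with the structure described above, something like the following sketch may help. The `summarize_inventory` helper and the sample `inventory` dictionary are hypothetical; a real inventory would be loaded from `text-inventory.json` with `json.load`.

```python
# Hypothetical sample shaped like the inventory structure described above.
inventory = {
    "slide-0": {
        "shape-0": {
            "placeholder_type": "TITLE",
            "paragraphs": [{"text": "Title", "bold": True}],
        },
        "shape-1": {
            "placeholder_type": None,
            "paragraphs": [{"text": "Item", "bullet": True, "level": 0}],
        },
    }
}


def summarize_inventory(inventory):
    """Return (slide, shape, placeholder_type, first paragraph text) rows."""
    rows = []
    for slide_key, shapes in inventory.items():
        for shape_key, shape in shapes.items():
            paragraphs = shape.get("paragraphs") or []
            first_text = paragraphs[0]["text"] if paragraphs else ""
            rows.append(
                (slide_key, shape_key, shape.get("placeholder_type"), first_text)
            )
    return rows


for row in summarize_inventory(inventory):
    print(row)
# ('slide-0', 'shape-0', 'TITLE', 'Title')
# ('slide-0', 'shape-1', None, 'Item')
```

This summary is only a reading aid; the full inventory file should still be read in its entirety as instructed.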
|
||||
|
||||
6. **Generate replacement text and save the data to a JSON file**
|
||||
Based on the text inventory from the previous step:
|
||||
- **CRITICAL**: First verify which shapes exist in the inventory - only reference shapes that are actually present
|
||||
- **VALIDATION**: The replace.py script will validate that all shapes in the replacement JSON exist in the inventory
|
||||
- If a non-existent shape is referenced, an error will show available shapes
|
||||
- If a non-existent slide is referenced, an error will indicate the slide doesn't exist
|
||||
- All validation errors are shown at once before the script exits
|
||||
- **IMPORTANT**: The replace.py script uses inventory.py internally to identify ALL text shapes
|
||||
- **AUTOMATIC CLEARING**: ALL text shapes from the inventory will be cleared unless you provide "paragraphs" for them
|
||||
- Add a "paragraphs" field to shapes that need content (not "replacement_paragraphs")
|
||||
- Shapes without "paragraphs" in the replacement JSON will have their text cleared automatically
|
||||
- Paragraphs with bullets are left-aligned automatically. Don't set the `alignment` property when `"bullet": true`
|
||||
- Generate appropriate replacement content for placeholder text
|
||||
- Use shape size to determine appropriate content length
|
||||
- **CRITICAL**: Include paragraph properties from the original inventory - don't just provide text
|
||||
- **IMPORTANT**: When bullet: true, do NOT include bullet symbols (•, -, *) in text - they are added automatically
|
||||
- **ESSENTIAL FORMATTING RULES**:
|
||||
- Headers/titles should typically have `"bold": true`
|
||||
- List items should have `"bullet": true, "level": 0` (level is required when bullet is true)
|
||||
- Preserve any alignment properties (e.g., `"alignment": "CENTER"` for centered text)
|
||||
- Include font properties when different from default (e.g., `"font_size": 14.0`, `"font_name": "Lora"`)
|
||||
- Colors: Use `"color": "FF0000"` for RGB or `"theme_color": "DARK_1"` for theme colors
|
||||
- The replacement script expects **properly formatted paragraphs**, not just text strings
|
||||
- **Overlapping shapes**: Prefer shapes with larger default_font_size or more appropriate placeholder_type
|
||||
- Save the updated inventory with replacements to `replacement-text.json`
|
||||
- **WARNING**: Different template layouts have different shape counts - always check the actual inventory before creating replacements
|
||||
|
||||
Example paragraphs field showing proper formatting:
|
||||
```json
|
||||
"paragraphs": [
|
||||
{
|
||||
"text": "New presentation title text",
|
||||
"alignment": "CENTER",
|
||||
"bold": true
|
||||
},
|
||||
{
|
||||
"text": "Section Header",
|
||||
"bold": true
|
||||
},
|
||||
{
|
||||
"text": "First bullet point without bullet symbol",
|
||||
"bullet": true,
|
||||
"level": 0
|
||||
},
|
||||
{
|
||||
"text": "Red colored text",
|
||||
"color": "FF0000"
|
||||
},
|
||||
{
|
||||
"text": "Theme colored text",
|
||||
"theme_color": "DARK_1"
|
||||
},
|
||||
{
|
||||
"text": "Regular paragraph text without special formatting"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
**Shapes not listed in the replacement JSON are automatically cleared**:
|
||||
```json
|
||||
{
|
||||
"slide-0": {
|
||||
"shape-0": {
|
||||
"paragraphs": [...] // This shape gets new text
|
||||
}
|
||||
// shape-1 and shape-2 from inventory will be cleared automatically
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Common formatting patterns for presentations**:
|
||||
- Title slides: Bold text, sometimes centered
|
||||
- Section headers within slides: Bold text
|
||||
- Bullet lists: Each item needs `"bullet": true, "level": 0`
|
||||
- Body text: Usually no special properties needed
|
||||
- Quotes: May have special alignment or font properties
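
The rules in this step can be sketched in Python. This is only an illustration and is not part of `replace.py`: the helper name `build_replacement` and the `new_text` mapping are hypothetical, and the inventory layout (slide, then shape, then `paragraphs`) is assumed from the example output shown earlier.

```python
import json
import re

# Leading bullet symbols that must not appear in text when "bullet" is true.
BULLET_PREFIX = re.compile(r"^\s*[•\-*]\s+")


def build_replacement(inventory, new_text):
    """Return a replacement dict for the shapes listed in new_text.

    inventory: dict shaped like the text inventory (slide -> shape -> data).
    new_text:  hypothetical mapping {(slide_id, shape_id): [str, ...]},
               one string per paragraph.
    Shapes missing from new_text are left out, so they would be cleared.
    """
    replacement = {}
    for (slide_id, shape_id), texts in new_text.items():
        original = inventory[slide_id][shape_id].get("paragraphs", [])
        paragraphs = []
        for i, text in enumerate(texts):
            # Reuse the original paragraph's properties; fall back to the
            # last original paragraph when there are more new paragraphs.
            props = dict(original[min(i, len(original) - 1)]) if original else {}
            if props.get("bullet"):
                props.setdefault("level", 0)        # level is required with bullets
                props.pop("alignment", None)        # bullets are left-aligned automatically
                text = BULLET_PREFIX.sub("", text)  # bullet symbols are added automatically
            props["text"] = text
            paragraphs.append(props)
        replacement.setdefault(slide_id, {})[shape_id] = {"paragraphs": paragraphs}
    return replacement


inventory = {
    "slide-0": {
        "shape-0": {"paragraphs": [{"text": "Old title", "alignment": "CENTER", "bold": True}]},
        "shape-1": {"paragraphs": [{"text": "Old item", "bullet": True, "level": 0}]},
        "shape-2": {"paragraphs": [{"text": "Footer"}]},
    }
}
replacement = build_replacement(
    inventory,
    {
        ("slide-0", "shape-0"): ["New title"],
        ("slide-0", "shape-1"): ["• First point", "Second point"],
    },
)
print(json.dumps(replacement, indent=2))
```

Because `shape-2` is absent from the result, it would be cleared when the replacement is applied, which matches the automatic-clearing behavior described in this step.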
|
||||
|
||||
7. **Apply replacements using the `replace.py` script**
|
||||
```bash
|
||||
python scripts/replace.py working.pptx replacement-text.json output.pptx
|
||||
```
|
||||
|
||||
The script will:
|
||||
- First extract the inventory of ALL text shapes using functions from inventory.py
|
||||
- Validate that all shapes in the replacement JSON exist in the inventory
|
||||
- Clear text from ALL shapes identified in the inventory
|
||||
- Apply new text only to shapes with "paragraphs" defined in the replacement JSON
|
||||
- Preserve formatting by applying paragraph properties from the JSON
|
||||
- Handle bullets, alignment, font properties, and colors automatically
|
||||
- Save the updated presentation
|
||||
|
||||
Example validation errors:
|
||||
```
|
||||
ERROR: Invalid shapes in replacement JSON:
|
||||
- Shape 'shape-99' not found on 'slide-0'. Available shapes: shape-0, shape-1, shape-4
|
||||
- Slide 'slide-999' not found in inventory
|
||||
```
|
||||
|
||||
```
|
||||
ERROR: Replacement text made overflow worse in these shapes:
|
||||
- slide-0/shape-2: overflow worsened by 1.25" (was 0.00", now 1.25")
|
||||
```
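
To check a replacement file before running the script, the shape and slide validation described above could be approximated as follows. This is a sketch of the idea, not the code in `replace.py`; the function name `validate_replacement` is hypothetical, and the message wording simply mirrors the example errors.

```python
def validate_replacement(inventory, replacement):
    """Collect every invalid slide/shape reference instead of stopping at the first."""
    errors = []
    for slide_id, shapes in replacement.items():
        if slide_id not in inventory:
            errors.append(f"Slide '{slide_id}' not found in inventory")
            continue
        available = ", ".join(sorted(inventory[slide_id]))
        for shape_id in shapes:
            if shape_id not in inventory[slide_id]:
                errors.append(
                    f"Shape '{shape_id}' not found on '{slide_id}'. "
                    f"Available shapes: {available}"
                )
    return errors


inventory = {"slide-0": {"shape-0": {}, "shape-1": {}, "shape-4": {}}}
replacement = {
    "slide-0": {"shape-99": {"paragraphs": []}},
    "slide-999": {"shape-0": {"paragraphs": []}},
}
errors = validate_replacement(inventory, replacement)
for message in errors:
    print("-", message)
```

Returning a list rather than raising on the first problem lets all errors be reported together.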
|
||||
|
||||
## Creating Thumbnail Grids
|
||||
|
||||
To create visual thumbnail grids of PowerPoint slides for quick analysis and reference:
|
||||
|
||||
```bash
|
||||
python scripts/thumbnail.py template.pptx [output_prefix]
|
||||
```
|
||||
|
||||
**Features**:
|
||||
- Creates: `thumbnails.jpg` (or `thumbnails-1.jpg`, `thumbnails-2.jpg`, etc. for large decks)
|
||||
- Default: 5 columns, max 30 slides per grid (5×6)
|
||||
- Custom prefix: `python scripts/thumbnail.py template.pptx my-grid`
|
||||
- Note: The output prefix should include the path if you want output in a specific directory (e.g., `workspace/my-grid`)
|
||||
- Adjust columns: `--cols 4` (range: 3-6, affects slides per grid)
|
||||
- Grid limits: 3 cols = 12 slides/grid, 4 cols = 20, 5 cols = 30, 6 cols = 42
|
||||
- Slides are zero-indexed (Slide 0, Slide 1, etc.)
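
The grid limits above follow a simple pattern that can be used to estimate how many grid images a deck will produce. The sketch below assumes each grid has `cols + 1` rows; that rule is inferred from the listed limits and is not taken from `thumbnail.py` itself, and `grid_plan` is a hypothetical helper.

```python
import math


def grid_plan(slide_count, cols=5):
    """Return (slides_per_grid, grid_count) for a deck.

    Assumes each grid has cols + 1 rows, which reproduces the documented
    limits (3 -> 12, 4 -> 20, 5 -> 30, 6 -> 42).
    """
    if not 3 <= cols <= 6:
        raise ValueError("cols must be between 3 and 6")
    per_grid = cols * (cols + 1)
    return per_grid, math.ceil(slide_count / per_grid)


per_grid, grids = grid_plan(45)          # default 5 columns
print(per_grid, grids)
```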
|
||||
|
||||
**Use cases**:
|
||||
- Template analysis: Quickly understand slide layouts and design patterns
|
||||
- Content review: Visual overview of entire presentation
|
||||
- Navigation reference: Find specific slides by their visual appearance
|
||||
- Quality check: Verify all slides are properly formatted
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Basic usage
|
||||
python scripts/thumbnail.py presentation.pptx
|
||||
|
||||
# Combine options: custom name, columns
|
||||
python scripts/thumbnail.py template.pptx analysis --cols 4
|
||||
```
|
||||
|
||||
## Converting Slides to Images
|
||||
|
||||
To visually analyze PowerPoint slides, convert them to images using a two-step process:
|
||||
|
||||
1. **Convert PPTX to PDF**:
|
||||
```bash
|
||||
soffice --headless --convert-to pdf template.pptx
|
||||
```
|
||||
|
||||
2. **Convert PDF pages to JPEG images**:
|
||||
```bash
|
||||
pdftoppm -jpeg -r 150 template.pdf slide
|
||||
```
|
||||
This creates files like `slide-1.jpg`, `slide-2.jpg`, etc.
|
||||
|
||||
Options:
|
||||
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
|
||||
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
|
||||
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
|
||||
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
|
||||
- `slide`: Prefix for output files
|
||||
|
||||
Example for specific range:
|
||||
```bash
|
||||
pdftoppm -jpeg -r 150 -f 2 -l 5 template.pdf slide # Converts only pages 2-5
|
||||
```
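
When the two conversion steps need to run from a script, they can be wrapped with `subprocess`. The following is a minimal sketch; `conversion_commands` and `convert` are hypothetical helper names, and the only flag not shown above is `--outdir`, which tells `soffice` where to write the PDF.

```python
import subprocess
from pathlib import Path


def conversion_commands(pptx_path, prefix="slide", dpi=150, first=None, last=None):
    """Build the soffice and pdftoppm command lines for one deck."""
    pptx = Path(pptx_path)
    # soffice writes the PDF into --outdir using the input's base name.
    pdf = pptx.with_suffix(".pdf")
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(pptx.parent), str(pptx)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first is not None:
        to_jpeg += ["-f", str(first)]
    if last is not None:
        to_jpeg += ["-l", str(last)]
    to_jpeg += [str(pdf), str(pptx.parent / prefix)]
    return to_pdf, to_jpeg


def convert(pptx_path, **options):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in conversion_commands(pptx_path, **options):
        subprocess.run(command, check=True)


to_pdf, to_jpeg = conversion_commands("template.pptx", first=2, last=5)
print(" ".join(to_jpeg))
```

Building the argument lists separately from running them makes the commands easy to inspect or log before LibreOffice and Poppler are invoked.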
|
||||
|
||||
## Code Style Guidelines
|
||||
**IMPORTANT**: When generating code for PPTX operations:
|
||||
- Write concise code
|
||||
- Avoid verbose variable names and redundant operations
|
||||
- Avoid unnecessary print statements
|
||||
|
||||
## Dependencies
|
||||
|
||||
Required dependencies (should already be installed):
|
||||
|
||||
- **markitdown**: `pip install "markitdown[pptx]"` (for text extraction from presentations)
|
||||
- **pptxgenjs**: `npm install -g pptxgenjs` (for creating presentations via html2pptx)
|
||||
- **playwright**: `npm install -g playwright` (for HTML rendering in html2pptx)
|
||||
- **react-icons**: `npm install -g react-icons react react-dom` (for icons)
|
||||
- **sharp**: `npm install -g sharp` (for SVG rasterization and image processing)
|
||||
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
|
||||
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)
|
||||
- **defusedxml**: `pip install defusedxml` (for secure XML parsing)
|
||||
@@ -1,625 +0,0 @@
|
||||
# HTML to PowerPoint Guide
|
||||
|
||||
Convert HTML slides to PowerPoint presentations with accurate positioning using the `html2pptx.js` library.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Creating HTML Slides](#creating-html-slides)
|
||||
2. [Using the html2pptx Library](#using-the-html2pptx-library)
|
||||
3. [Using PptxGenJS](#using-pptxgenjs)
|
||||
|
||||
---
|
||||
|
||||
## Creating HTML Slides
|
||||
|
||||
Every HTML slide must include proper body dimensions:
|
||||
|
||||
### Layout Dimensions
|
||||
|
||||
- **16:9** (default): `width: 720pt; height: 405pt`
|
||||
- **4:3**: `width: 720pt; height: 540pt`
|
||||
- **16:10**: `width: 720pt; height: 450pt`
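
Since the body size must match the presentation layout, it can help to see how the point values map to slide sizes in inches (72 points per inch). The sketch below assumes the PptxGenJS layout names `LAYOUT_16x9`, `LAYOUT_16x10`, and `LAYOUT_4x3` with the sizes shown; verify these against the PptxGenJS version in use. `layout_for_body` is a hypothetical helper.

```python
POINTS_PER_INCH = 72

# Assumed PptxGenJS layout sizes in inches (width, height).
LAYOUTS = {
    "LAYOUT_16x9": (10.0, 5.625),
    "LAYOUT_16x10": (10.0, 6.25),
    "LAYOUT_4x3": (10.0, 7.5),
}


def layout_for_body(width_pt, height_pt, tolerance=0.01):
    """Return the layout name matching a body size in points, or None."""
    size = (width_pt / POINTS_PER_INCH, height_pt / POINTS_PER_INCH)
    for name, dims in LAYOUTS.items():
        if all(abs(a - b) < tolerance for a, b in zip(size, dims)):
            return name
    return None


print(layout_for_body(720, 405))
```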
|
||||
|
||||
### Supported Elements
|
||||
|
||||
- `<p>`, `<h1>`-`<h6>` - Text with styling
|
||||
- `<ul>`, `<ol>` - Lists (never use manual bullets •, -, *)
|
||||
- `<b>`, `<strong>` - Bold text (inline formatting)
|
||||
- `<i>`, `<em>` - Italic text (inline formatting)
|
||||
- `<u>` - Underlined text (inline formatting)
|
||||
- `<span>` - Inline formatting with CSS styles (bold, italic, underline, color)
|
||||
- `<br>` - Line breaks
|
||||
- `<div>` with bg/border - Becomes shape
|
||||
- `<img>` - Images
|
||||
- `class="placeholder"` - Reserved space for charts (returns `{ id, x, y, w, h }`)
|
||||
|
||||
### Critical Text Rules
|
||||
|
||||
**ALL text MUST be inside `<p>`, `<h1>`-`<h6>`, `<ul>`, or `<ol>` tags:**
|
||||
- ✅ Correct: `<div><p>Text here</p></div>`
|
||||
- ❌ Wrong: `<div>Text here</div>` - **Text will NOT appear in PowerPoint**
|
||||
- ❌ Wrong: `<span>Text</span>` - **Text will NOT appear in PowerPoint**
|
||||
- Text in `<div>` or `<span>` without a text tag will be silently ignored
|
||||
|
||||
**NEVER use manual bullet symbols (•, -, *, etc.)** - Use `<ul>` or `<ol>` lists instead
|
||||
|
||||
**ONLY use web-safe fonts that are universally available:**
|
||||
- ✅ Web-safe fonts: `Arial`, `Helvetica`, `Times New Roman`, `Georgia`, `Courier New`, `Verdana`, `Tahoma`, `Trebuchet MS`, `Impact`, `Comic Sans MS`
|
||||
- ❌ Wrong: `'Segoe UI'`, `'SF Pro'`, `'Roboto'`, custom fonts - **Might cause rendering issues**
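
Because unwrapped text is dropped silently, a quick pre-check of the HTML can catch it before conversion. The sketch below uses Python's standard `html.parser` and is independent of `html2pptx.js`; the class and function names are hypothetical, and it only approximates the rule that text needs a `<p>`, `<h1>`-`<h6>`, `<ul>`, or `<ol>` ancestor.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"style", "script", "title", "head"}


class UnwrappedTextFinder(HTMLParser):
    """Collect text nodes that have no text-tag ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.problems = []   # (parent tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or SKIP_TAGS & set(self.stack):
            return
        if not TEXT_TAGS & set(self.stack):
            parent = self.stack[-1] if self.stack else "(root)"
            self.problems.append((parent, text))


def find_unwrapped_text(html):
    finder = UnwrappedTextFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


print(find_unwrapped_text("<div>Text here</div>"))
```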
|
||||
|
||||
### Styling
|
||||
|
||||
- Use `display: flex` on body to prevent margin collapse from breaking overflow validation
|
||||
- Use `margin` for spacing between elements (`padding` counts toward the element's size)
|
||||
- Inline formatting: Use `<b>`, `<i>`, `<u>` tags OR `<span>` with CSS styles
|
||||
- `<span>` supports: `font-weight: bold`, `font-style: italic`, `text-decoration: underline`, `color: #rrggbb`
|
||||
- `<span>` does NOT support: `margin`, `padding` (not supported in PowerPoint text runs)
|
||||
- Example: `<span style="font-weight: bold; color: #667eea;">Bold blue text</span>`
|
||||
- Flexbox works - positions calculated from rendered layout
|
||||
- Use hex colors with `#` prefix in CSS
|
||||
- **Text alignment**: Set CSS `text-align` (`center`, `right`, etc.) where needed; it is passed to PptxGenJS as a formatting hint so text stays aligned even if rendered text lengths are slightly off
|
||||
|
||||
### Shape Styling (DIV elements only)
|
||||
|
||||
**IMPORTANT: Backgrounds, borders, and shadows only work on `<div>` elements, NOT on text elements (`<p>`, `<h1>`-`<h6>`, `<ul>`, `<ol>`)**
|
||||
|
||||
- **Backgrounds**: CSS `background` or `background-color` on `<div>` elements only
|
||||
- Example: `<div style="background: #f0f0f0;">` - Creates a shape with background
|
||||
- **Borders**: CSS `border` on `<div>` elements converts to PowerPoint shape borders
|
||||
- Supports uniform borders: `border: 2px solid #333333`
|
||||
- Supports partial borders: `border-left`, `border-right`, `border-top`, `border-bottom` (rendered as line shapes)
|
||||
- Example: `<div style="border-left: 8pt solid #E76F51;">`
|
||||
- **Border radius**: CSS `border-radius` on `<div>` elements for rounded corners
|
||||
- `border-radius: 50%` or higher creates circular shape
|
||||
- Percentages <50% calculated relative to shape's smaller dimension
|
||||
- Supports px and pt units (e.g., `border-radius: 8pt;`, `border-radius: 12px;`)
|
||||
- Example: `<div style="border-radius: 25%;">` on 100x200px box = 25% of 100px = 25px radius
|
||||
- **Box shadows**: CSS `box-shadow` on `<div>` elements converts to PowerPoint shadows
|
||||
- Supports outer shadows only (inset shadows are ignored to prevent corruption)
|
||||
- Example: `<div style="box-shadow: 2px 2px 8px rgba(0, 0, 0, 0.3);">`
|
||||
- Note: Inset/inner shadows are not supported by PowerPoint and will be skipped
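
The border-radius rules above amount to a small calculation. The sketch below restates them in Python for illustration; `corner_radius` is a hypothetical helper and does not reproduce the converter's actual code.

```python
def corner_radius(width, height, radius):
    """Resolve a CSS-style border-radius to (radius, is_circular).

    radius: a number in the same unit as width/height, or a percentage
            string such as "25%". Percentages are taken relative to the
            smaller dimension; 50% or more is treated as fully rounded.
    """
    smaller = min(width, height)
    if isinstance(radius, str) and radius.endswith("%"):
        percent = float(radius[:-1])
        if percent >= 50:
            return smaller / 2, True
        return smaller * percent / 100, False
    return float(radius), False


print(corner_radius(100, 200, "25%"))
```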
|
||||
|
||||
### Icons & Gradients
|
||||
|
||||
- **CRITICAL: Never use CSS gradients (`linear-gradient`, `radial-gradient`)** - They don't convert to PowerPoint
|
||||
- **ALWAYS create gradient/icon PNGs FIRST using Sharp, then reference in HTML**
|
||||
- For gradients: Rasterize SVG to PNG background images
|
||||
- For icons: Rasterize react-icons SVG to PNG images
|
||||
- All visual effects must be pre-rendered as raster images before HTML rendering
|
||||
|
||||
**Rasterizing Icons with Sharp:**
|
||||
|
||||
```javascript
|
||||
const React = require('react');
|
||||
const ReactDOMServer = require('react-dom/server');
|
||||
const sharp = require('sharp');
|
||||
const { FaHome } = require('react-icons/fa');
|
||||
|
||||
async function rasterizeIconPng(IconComponent, color, size = "256", filename) {
|
||||
const svgString = ReactDOMServer.renderToStaticMarkup(
|
||||
React.createElement(IconComponent, { color: `#${color}`, size: size })
|
||||
);
|
||||
|
||||
// Convert SVG to PNG using Sharp
|
||||
await sharp(Buffer.from(svgString))
|
||||
.png()
|
||||
.toFile(filename);
|
||||
|
||||
return filename;
|
||||
}
|
||||
|
||||
// Usage (call from inside an async function): rasterize the icon before using it in HTML
|
||||
const iconPath = await rasterizeIconPng(FaHome, "4472c4", "256", "home-icon.png");
|
||||
// Then reference in HTML: <img src="home-icon.png" style="width: 40pt; height: 40pt;">
|
||||
```
|
||||
|
||||
**Rasterizing Gradients with Sharp:**
|
||||
|
||||
```javascript
|
||||
const sharp = require('sharp');
|
||||
|
||||
async function createGradientBackground(filename) {
|
||||
const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="1000" height="562.5">
|
||||
<defs>
|
||||
<linearGradient id="g" x1="0%" y1="0%" x2="100%" y2="100%">
|
||||
<stop offset="0%" style="stop-color:#COLOR1"/>
|
||||
<stop offset="100%" style="stop-color:#COLOR2"/>
|
||||
</linearGradient>
|
||||
</defs>
|
||||
<rect width="100%" height="100%" fill="url(#g)"/>
|
||||
</svg>`;
|
||||
|
||||
await sharp(Buffer.from(svg))
|
||||
.png()
|
||||
.toFile(filename);
|
||||
|
||||
return filename;
|
||||
}
|
||||
|
||||
// Usage (call from inside an async function): create the gradient background before writing the HTML
|
||||
const bgPath = await createGradientBackground("gradient-bg.png");
|
||||
// Then in HTML: <body style="background-image: url('gradient-bg.png');">
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
```html
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<style>
|
||||
html { background: #ffffff; }
|
||||
body {
|
||||
width: 720pt; height: 405pt; margin: 0; padding: 0;
|
||||
background: #f5f5f5; font-family: Arial, sans-serif;
|
||||
display: flex;
|
||||
}
|
||||
.content { margin: 30pt; padding: 40pt; background: #ffffff; border-radius: 8pt; }
|
||||
h1 { color: #2d3748; font-size: 32pt; }
|
||||
.box {
|
||||
background: #70ad47; padding: 20pt; border: 3px solid #5a8f37;
|
||||
border-radius: 12pt; box-shadow: 3px 3px 10px rgba(0, 0, 0, 0.25);
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="content">
|
||||
<h1>Recipe Title</h1>
|
||||
<ul>
|
||||
<li><b>Item:</b> Description</li>
|
||||
</ul>
|
||||
<p>Text with <b>bold</b>, <i>italic</i>, <u>underline</u>.</p>
|
||||
<div id="chart" class="placeholder" style="width: 350pt; height: 200pt;"></div>
|
||||
|
||||
<!-- Text MUST be in <p> tags -->
|
||||
<div class="box">
|
||||
<p>5</p>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
```
|
||||
|
||||
## Using the html2pptx Library
|
||||
|
||||
### Dependencies
|
||||
|
||||
These libraries have been globally installed and are available to use:
|
||||
- `pptxgenjs`
|
||||
- `playwright`
|
||||
- `sharp`
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```javascript
|
||||
const pptxgen = require('pptxgenjs');
|
||||
const html2pptx = require('./html2pptx');
|
||||
|
||||
const pptx = new pptxgen();
|
||||
pptx.layout = 'LAYOUT_16x9'; // Must match HTML body dimensions
|
||||
|
||||
const { slide, placeholders } = await html2pptx('slide1.html', pptx);
|
||||
|
||||
// Add chart to placeholder area
|
||||
if (placeholders.length > 0) {
|
||||
slide.addChart(pptx.charts.LINE, chartData, placeholders[0]);
|
||||
}
|
||||
|
||||
await pptx.writeFile({ fileName: 'output.pptx' });
|
||||
```
|
||||
|
||||
### API Reference
|
||||
|
||||
#### Function Signature
|
||||
```javascript
|
||||
await html2pptx(htmlFile, pres, options)
|
||||
```
|
||||
|
||||
#### Parameters
|
||||
- `htmlFile` (string): Path to HTML file (absolute or relative)
|
||||
- `pres` (pptxgen): PptxGenJS presentation instance with layout already set
|
||||
- `options` (object, optional):
|
||||
- `tmpDir` (string): Temporary directory for generated files (default: `process.env.TMPDIR || '/tmp'`)
|
||||
- `slide` (object): Existing slide to reuse (default: creates new slide)
|
||||
|
||||
#### Returns
|
||||
```javascript
|
||||
{
|
||||
slide: pptxgenSlide, // The created/updated slide
|
||||
placeholders: [ // Array of placeholder positions
|
||||
{ id: string, x: number, y: number, w: number, h: number },
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Validation
|
||||
|
||||
The library automatically validates and collects all errors before throwing:
|
||||
|
||||
1. **HTML dimensions must match presentation layout** - Reports dimension mismatches
|
||||
2. **Content must not overflow body** - Reports overflow with exact measurements
|
||||
3. **CSS gradients** - Reports unsupported gradient usage
|
||||
4. **Text element styling** - Reports backgrounds/borders/shadows on text elements (only allowed on divs)
|
||||
|
||||
**All validation errors are collected and reported together** in a single error message, allowing you to fix all issues at once instead of one at a time.
|
||||
|
||||
### Working with Placeholders
|
||||
|
||||
```javascript
|
||||
const { slide, placeholders } = await html2pptx('slide.html', pptx);
|
||||
|
||||
// Use first placeholder
|
||||
slide.addChart(pptx.charts.BAR, data, placeholders[0]);
|
||||
|
||||
// Find by ID
|
||||
const chartArea = placeholders.find(p => p.id === 'chart-area');
|
||||
slide.addChart(pptx.charts.LINE, data, chartArea);
|
||||
```
|
||||
|
||||
### Complete Example
|
||||
|
||||
```javascript
|
||||
const pptxgen = require('pptxgenjs');
|
||||
const html2pptx = require('./html2pptx');
|
||||
|
||||
async function createPresentation() {
|
||||
const pptx = new pptxgen();
|
||||
pptx.layout = 'LAYOUT_16x9';
|
||||
pptx.author = 'Your Name';
|
||||
pptx.title = 'My Presentation';
|
||||
|
||||
// Slide 1: Title
|
||||
const { slide: slide1 } = await html2pptx('slides/title.html', pptx);
|
||||
|
||||
// Slide 2: Content with chart
|
||||
const { slide: slide2, placeholders } = await html2pptx('slides/data.html', pptx);
|
||||
|
||||
const chartData = [{
|
||||
name: 'Sales',
|
||||
labels: ['Q1', 'Q2', 'Q3', 'Q4'],
|
||||
values: [4500, 5500, 6200, 7100]
|
||||
}];
|
||||
|
||||
slide2.addChart(pptx.charts.BAR, chartData, {
|
||||
...placeholders[0],
|
||||
showTitle: true,
|
||||
title: 'Quarterly Sales',
|
||||
showCatAxisTitle: true,
|
||||
catAxisTitle: 'Quarter',
|
||||
showValAxisTitle: true,
|
||||
valAxisTitle: 'Sales ($000s)'
|
||||
});
|
||||
|
||||
// Save
|
||||
await pptx.writeFile({ fileName: 'presentation.pptx' });
|
||||
console.log('Presentation created successfully!');
|
||||
}
|
||||
|
||||
createPresentation().catch(console.error);
|
||||
```
|
||||
|
||||
## Using PptxGenJS
|
||||
|
||||
After converting HTML to slides with `html2pptx`, you'll use PptxGenJS to add dynamic content like charts, images, and additional elements.
|
||||
|
||||
### ⚠️ Critical Rules
|
||||
|
||||
#### Colors
|
||||
- **NEVER use `#` prefix** with hex colors in PptxGenJS - causes file corruption
|
||||
- ✅ Correct: `color: "FF0000"`, `fill: { color: "0066CC" }`
|
||||
- ❌ Wrong: `color: "#FF0000"` (breaks document)
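
Since CSS colors carry a `#` prefix and PptxGenJS colors must not, values copied from a stylesheet need converting first. A minimal sketch of that conversion follows; `pptx_color` is a hypothetical helper, and expanding three-digit shorthand is an added convenience rather than something the rule above requires.

```python
import re

HEX_COLOR = re.compile(r"[0-9A-Fa-f]{6}")


def pptx_color(css_color):
    """Convert a CSS hex color ("#ff0000" or "#f00") to the bare form "FF0000"."""
    value = css_color.strip().lstrip("#")
    if len(value) == 3:
        # Expand shorthand: "f00" -> "ff0000".
        value = "".join(ch * 2 for ch in value)
    if not HEX_COLOR.fullmatch(value):
        raise ValueError(f"not a 6-digit hex color: {css_color!r}")
    return value.upper()


print(pptx_color("#667eea"))
```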
|
||||
|
||||
### Adding Images
|
||||
|
||||
Always calculate aspect ratios from actual image dimensions:
|
||||
|
||||
```javascript
|
||||
// Get image dimensions (ImageMagick): identify -format '%w x %h' image.png
|
||||
const imgWidth = 1860, imgHeight = 1519; // From actual file
|
||||
const aspectRatio = imgWidth / imgHeight;
|
||||
|
||||
const h = 3; // Max height
|
||||
const w = h * aspectRatio;
|
||||
const x = (10 - w) / 2; // Center on 16:9 slide
|
||||
|
||||
slide.addImage({ path: "chart.png", x, y: 1.5, w, h });
|
||||
```
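
The dimension lookup and placement math above can also be done without external tools. The sketch below reads a PNG's pixel size from its header using only the Python standard library and then computes a centered box; the helper names are hypothetical and the 10-inch slide width assumes the 16:9 layout.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Read (width, height) in pixels from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


def centered_box(img_w, img_h, max_h=3.0, slide_w=10.0, y=1.5):
    """Scale to max_h inches, keep the aspect ratio, center horizontally."""
    w = max_h * img_w / img_h
    return {"x": (slide_w - w) / 2, "y": y, "w": w, "h": max_h}


# In practice the first 24 bytes of the file are enough:
#     with open("chart.png", "rb") as f:
#         box = centered_box(*png_size(f.read(24)))
header = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 1860, 1519)
box = centered_box(*png_size(header))
print(box)
```

The resulting `x`, `y`, `w`, and `h` values can be passed to `slide.addImage` as in the JavaScript example.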
|
||||
|
||||
### Adding Text
|
||||
|
||||
```javascript
|
||||
// Rich text with formatting
|
||||
slide.addText([
|
||||
{ text: "Bold ", options: { bold: true } },
|
||||
{ text: "Italic ", options: { italic: true } },
|
||||
{ text: "Normal" }
|
||||
], {
|
||||
x: 1, y: 2, w: 8, h: 1
|
||||
});
|
||||
```
|
||||
|
||||
### Adding Shapes
|
||||
|
||||
```javascript
|
||||
// Rectangle
|
||||
slide.addShape(pptx.shapes.RECTANGLE, {
|
||||
x: 1, y: 1, w: 3, h: 2,
|
||||
fill: { color: "4472C4" },
|
||||
line: { color: "000000", width: 2 }
|
||||
});
|
||||
|
||||
// Circle
|
||||
slide.addShape(pptx.shapes.OVAL, {
|
||||
x: 5, y: 1, w: 2, h: 2,
|
||||
fill: { color: "ED7D31" }
|
||||
});
|
||||
|
||||
// Rounded rectangle
|
||||
slide.addShape(pptx.shapes.ROUNDED_RECTANGLE, {
|
||||
x: 1, y: 4, w: 3, h: 1.5,
|
||||
fill: { color: "70AD47" },
|
||||
rectRadius: 0.2
|
||||
});
|
||||
```
|
||||
|
||||
### Adding Charts
|
||||
|
||||
**Required for most charts:** Axis labels using `catAxisTitle` (category) and `valAxisTitle` (value).
|
||||
|
||||
**Chart Data Format:**
|
||||
- Use **single series with all labels** for simple bar/line charts
|
||||
- Each series creates a separate legend entry
|
||||
- Labels array defines X-axis values
|
||||
|
||||
**Time Series Data - Choose Correct Granularity:**
|
||||
- **< 30 days**: Use daily grouping (e.g., "10-01", "10-02") - avoid monthly aggregation that creates single-point charts
|
||||
- **30-365 days**: Use monthly grouping (e.g., "2024-01", "2024-02")
|
||||
- **> 365 days**: Use yearly grouping (e.g., "2023", "2024")
|
||||
- **Validate**: Charts with only 1 data point likely indicate incorrect aggregation for the time period
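
The granularity rules above can be applied when preparing chart data. The sketch below picks a grouping from the date span and produces `labels` and `values` lists; the helper names are hypothetical, and summing values within each group is an assumption (averaging may suit other data).

```python
from datetime import date

LABEL_FORMATS = {"daily": "%m-%d", "monthly": "%Y-%m", "yearly": "%Y"}


def choose_granularity(start, end):
    """Pick a grouping from the number of days the data spans."""
    span_days = (end - start).days
    if span_days < 30:
        return "daily"
    if span_days <= 365:
        return "monthly"
    return "yearly"


def group_values(points):
    """points: list of (date, number). Returns (granularity, labels, values)."""
    dates = [day for day, _ in points]
    granularity = choose_granularity(min(dates), max(dates))
    totals = {}
    for day, value in sorted(points):
        label = day.strftime(LABEL_FORMATS[granularity])
        totals[label] = totals.get(label, 0) + value  # assumption: sum per group
    return granularity, list(totals), list(totals.values())


points = [(date(2024, 10, 1), 5), (date(2024, 10, 2), 7), (date(2024, 10, 2), 1)]
print(group_values(points))
```

If the returned `labels` list has a single entry, the grouping is probably too coarse for the period, as noted in the validation rule.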
|
||||
|
||||
```javascript
|
||||
const { slide, placeholders } = await html2pptx('slide.html', pptx);
|
||||
|
||||
// CORRECT: Single series with all labels
|
||||
slide.addChart(pptx.charts.BAR, [{
|
||||
name: "Sales 2024",
|
||||
labels: ["Q1", "Q2", "Q3", "Q4"],
|
||||
values: [4500, 5500, 6200, 7100]
|
||||
}], {
|
||||
...placeholders[0], // Use placeholder position
|
||||
barDir: 'col', // 'col' = vertical bars, 'bar' = horizontal
|
||||
showTitle: true,
|
||||
title: 'Quarterly Sales',
|
||||
showLegend: false, // No legend needed for single series
|
||||
// Required axis labels
|
||||
showCatAxisTitle: true,
|
||||
catAxisTitle: 'Quarter',
|
||||
showValAxisTitle: true,
|
||||
valAxisTitle: 'Sales ($000s)',
|
||||
// Optional: Control scaling (adjust min based on data range for better visualization)
|
||||
valAxisMaxVal: 8000,
|
||||
valAxisMinVal: 0, // Use 0 for counts/amounts; for clustered data (e.g., 4500-7100), consider starting closer to min value
|
||||
valAxisMajorUnit: 2000, // Control y-axis label spacing to prevent crowding
|
||||
catAxisLabelRotate: 45, // Rotate labels if crowded
|
||||
dataLabelPosition: 'outEnd',
|
||||
dataLabelColor: '000000',
|
||||
// Use single color for single-series charts
|
||||
chartColors: ["4472C4"] // All bars same color
|
||||
});
|
||||
```
|
||||
|
||||
#### Scatter Chart
|
||||
|
||||
**IMPORTANT**: Scatter chart data format is unusual - first series contains X-axis values, subsequent series contain Y-values:
|
||||
|
||||
```javascript
|
||||
// Prepare data
|
||||
const data1 = [{ x: 10, y: 20 }, { x: 15, y: 25 }, { x: 20, y: 30 }];
|
||||
const data2 = [{ x: 12, y: 18 }, { x: 18, y: 22 }];
|
||||
|
||||
const allXValues = [...data1.map(d => d.x), ...data2.map(d => d.x)];
|
||||
|
||||
slide.addChart(pptx.charts.SCATTER, [
|
||||
{ name: 'X-Axis', values: allXValues }, // First series = X values
|
||||
{ name: 'Series 1', values: data1.map(d => d.y) }, // Y values only
|
||||
{ name: 'Series 2', values: data2.map(d => d.y) } // Y values only
|
||||
], {
|
||||
x: 1, y: 1, w: 8, h: 4,
|
||||
lineSize: 0, // 0 = no connecting lines
|
||||
lineDataSymbol: 'circle',
|
||||
lineDataSymbolSize: 6,
|
||||
showCatAxisTitle: true,
|
||||
catAxisTitle: 'X Axis',
|
||||
showValAxisTitle: true,
|
||||
valAxisTitle: 'Y Axis',
|
||||
chartColors: ["4472C4", "ED7D31"]
|
||||
});
|
||||
```
|
||||
|
||||
#### Line Chart
|
||||
|
||||
```javascript
|
||||
slide.addChart(pptx.charts.LINE, [{
|
||||
name: "Temperature",
|
||||
labels: ["Jan", "Feb", "Mar", "Apr"],
|
||||
values: [32, 35, 42, 55]
|
||||
}], {
|
||||
x: 1, y: 1, w: 8, h: 4,
|
||||
lineSize: 4,
|
||||
lineSmooth: true,
|
||||
// Required axis labels
|
||||
showCatAxisTitle: true,
|
||||
catAxisTitle: 'Month',
|
||||
showValAxisTitle: true,
|
||||
valAxisTitle: 'Temperature (°F)',
|
||||
// Optional: Y-axis range (set min based on data range for better visualization)
|
||||
valAxisMinVal: 0, // For ranges starting at 0 (counts, percentages, etc.)
|
||||
valAxisMaxVal: 60,
|
||||
valAxisMajorUnit: 20, // Control y-axis label spacing to prevent crowding (e.g., 10, 20, 25)
|
||||
// valAxisMinVal: 30, // PREFERRED: For data clustered in a range (e.g., 32-55 or ratings 3-5), start axis closer to min value to show variation
|
||||
// Optional: Chart colors
|
||||
chartColors: ["4472C4", "ED7D31", "A5A5A5"]
|
||||
});
|
||||
```
|
||||
|
||||
#### Pie Chart (No Axis Labels Required)
|
||||
|
||||
**CRITICAL**: Pie charts require a **single data series** with all categories in the `labels` array and corresponding values in the `values` array.
|
||||
|
||||
```javascript
|
||||
slide.addChart(pptx.charts.PIE, [{
|
||||
name: "Market Share",
|
||||
labels: ["Product A", "Product B", "Other"], // All categories in one array
|
||||
values: [35, 45, 20] // All values in one array
|
||||
}], {
|
||||
x: 2, y: 1, w: 6, h: 4,
|
||||
showPercent: true,
|
||||
showLegend: true,
|
||||
legendPos: 'r', // right
|
||||
chartColors: ["4472C4", "ED7D31", "A5A5A5"]
|
||||
});
|
||||
```
|
||||
|
||||
#### Multiple Data Series
|
||||
|
||||
```javascript
|
||||
slide.addChart(pptx.charts.LINE, [
|
||||
{
|
||||
name: "Product A",
|
||||
labels: ["Q1", "Q2", "Q3", "Q4"],
|
||||
values: [10, 20, 30, 40]
|
||||
},
|
||||
{
|
||||
name: "Product B",
|
||||
labels: ["Q1", "Q2", "Q3", "Q4"],
|
||||
values: [15, 25, 20, 35]
|
||||
}
|
||||
], {
|
||||
x: 1, y: 1, w: 8, h: 4,
|
||||
showCatAxisTitle: true,
|
||||
catAxisTitle: 'Quarter',
|
||||
showValAxisTitle: true,
|
||||
valAxisTitle: 'Revenue ($M)'
|
||||
});
|
||||
```
|
||||
|
||||
### Chart Colors
|
||||
|
||||
**CRITICAL**: Use hex colors **without** the `#` prefix - including `#` causes file corruption.
|
||||
|
||||
**Align chart colors with your chosen design palette**, ensuring sufficient contrast and distinctiveness for data visualization. Adjust colors for:
|
||||
- Strong contrast between adjacent series
|
||||
- Readability against slide backgrounds
|
||||
- Accessibility (avoid red-green only combinations)
|
||||
|
||||
```javascript
|
||||
// Example: Ocean palette-inspired chart colors (adjusted for contrast)
|
||||
const chartColors = ["16A085", "FF6B9D", "2C3E50", "F39C12", "9B59B6"];
|
||||
|
||||
// Single-series chart: Use one color for all bars/points
|
||||
slide.addChart(pptx.charts.BAR, [{
|
||||
name: "Sales",
|
||||
labels: ["Q1", "Q2", "Q3", "Q4"],
|
||||
values: [4500, 5500, 6200, 7100]
|
||||
}], {
|
||||
...placeholders[0],
|
||||
chartColors: ["16A085"], // All bars same color
|
||||
showLegend: false
|
||||
});
|
||||
|
||||
// Multi-series chart: Each series gets a different color
|
||||
slide.addChart(pptx.charts.LINE, [
|
||||
{ name: "Product A", labels: ["Q1", "Q2", "Q3"], values: [10, 20, 30] },
|
||||
{ name: "Product B", labels: ["Q1", "Q2", "Q3"], values: [15, 25, 20] }
|
||||
], {
|
||||
...placeholders[0],
|
||||
chartColors: ["16A085", "FF6B9D"] // One color per series
|
||||
});
|
||||
```
|
||||
|
||||
### Adding Tables
|
||||
|
||||
Tables can be added with basic or advanced formatting:
|
||||
|
||||
#### Basic Table
|
||||
|
||||
```javascript
|
||||
slide.addTable([
|
||||
["Header 1", "Header 2", "Header 3"],
|
||||
["Row 1, Col 1", "Row 1, Col 2", "Row 1, Col 3"],
|
||||
["Row 2, Col 1", "Row 2, Col 2", "Row 2, Col 3"]
|
||||
], {
|
||||
x: 0.5,
|
||||
y: 1,
|
||||
w: 9,
|
||||
h: 3,
|
||||
border: { pt: 1, color: "999999" },
|
||||
fill: { color: "F1F1F1" }
|
||||
});
|
||||
```
|
||||
|
||||
#### Table with Custom Formatting
|
||||
|
||||
```javascript
|
||||
const tableData = [
|
||||
// Header row with custom styling
|
||||
[
|
||||
{ text: "Product", options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true } },
|
||||
{ text: "Revenue", options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true } },
|
||||
{ text: "Growth", options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true } }
|
||||
],
|
||||
// Data rows
|
||||
["Product A", "$50M", "+15%"],
|
||||
["Product B", "$35M", "+22%"],
|
||||
["Product C", "$28M", "+8%"]
|
||||
];
|
||||
|
||||
slide.addTable(tableData, {
|
||||
x: 1,
|
||||
y: 1.5,
|
||||
w: 8,
|
||||
h: 3,
|
||||
colW: [3, 2.5, 2.5], // Column widths
|
||||
rowH: [0.5, 0.6, 0.6, 0.6], // Row heights
|
||||
border: { pt: 1, color: "CCCCCC" },
|
||||
align: "center",
|
||||
valign: "middle",
|
||||
fontSize: 14
|
||||
});
|
||||
```
|
||||
|
||||
#### Table with Merged Cells
|
||||
|
||||
```javascript
|
||||
const mergedTableData = [
|
||||
[
|
||||
{ text: "Q1 Results", options: { colspan: 3, fill: { color: "4472C4" }, color: "FFFFFF", bold: true } }
|
||||
],
|
||||
["Product", "Sales", "Market Share"],
|
||||
["Product A", "$25M", "35%"],
|
||||
["Product B", "$18M", "25%"]
|
||||
];
|
||||
|
||||
slide.addTable(mergedTableData, {
|
||||
x: 1,
|
||||
y: 1,
|
||||
w: 8,
|
||||
h: 2.5,
|
||||
colW: [3, 2.5, 2.5],
|
||||
border: { pt: 1, color: "DDDDDD" }
|
||||
});
|
||||
```
|
||||
|
||||
### Table Options
|
||||
|
||||
Common table options:
|
||||
- `x, y, w, h` - Position and size
|
||||
- `colW` - Array of column widths (in inches)
|
||||
- `rowH` - Array of row heights (in inches)
|
||||
- `border` - Border style: `{ pt: 1, color: "999999" }`
|
||||
- `fill` - Background color (no # prefix)
|
||||
- `align` - Text alignment: "left", "center", "right"
|
||||
- `valign` - Vertical alignment: "top", "middle", "bottom"
|
||||
- `fontSize` - Text size
|
||||
- `autoPage` - Auto-create new slides if content overflows
|
||||
@@ -1,427 +0,0 @@
|
||||
# Office Open XML Technical Reference for PowerPoint
|
||||
|
||||
**Important: Read this entire document before starting.** Critical XML schema rules and formatting requirements are covered throughout. Incorrect implementation can create invalid PPTX files that PowerPoint cannot open.
|
||||
|
||||
## Technical Guidelines
|
||||
|
||||
### Schema Compliance
|
||||
- **Element ordering in `<p:txBody>`**: `<a:bodyPr>`, `<a:lstStyle>`, `<a:p>`
|
||||
- **Whitespace**: Add `xml:space='preserve'` to `<a:t>` elements with leading/trailing spaces
|
||||
- **Unicode**: Escape non-ASCII characters in ASCII content as numeric character references: `“` becomes `&#8220;`
|
||||
- **Images**: Add to `ppt/media/`, reference in slide XML, set dimensions to fit slide bounds
|
||||
- **Relationships**: Update `ppt/slides/_rels/slideN.xml.rels` for each slide's resources
|
||||
- **Dirty attribute**: Add `dirty="0"` to `<a:rPr>` and `<a:endParaRPr>` elements to indicate clean state
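
The whitespace and Unicode rules above can be combined when generating `<a:t>` elements. This sketch uses only the Python standard library; `text_run_element` is a hypothetical helper that returns the element as a string, with non-ASCII characters written as numeric character references (for example U+201C as `&#8220;`).

```python
from xml.sax.saxutils import escape


def text_run_element(text):
    """Return an ASCII-safe <a:t> element string for the given text."""
    # Escape &, <, > first, then turn non-ASCII characters into numeric references.
    body = escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Leading or trailing whitespace is only kept when xml:space is set.
    attr = ' xml:space="preserve"' if text != text.strip() else ""
    return f"<a:t{attr}>{body}</a:t>"


print(text_run_element("\u201cQuoted\u201d "))
```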
|
||||
|
||||
## Presentation Structure
|
||||
|
||||
### Basic Slide Structure
|
||||
```xml
|
||||
<!-- ppt/slides/slide1.xml -->
|
||||
<p:sld>
|
||||
<p:cSld>
|
||||
<p:spTree>
|
||||
<p:nvGrpSpPr>...</p:nvGrpSpPr>
|
||||
<p:grpSpPr>...</p:grpSpPr>
|
||||
<!-- Shapes go here -->
|
||||
</p:spTree>
|
||||
</p:cSld>
|
||||
</p:sld>
|
||||
```
|
||||
|
||||
### Text Box / Shape with Text
|
||||
```xml
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:cNvPr id="2" name="Title"/>
|
||||
<p:cNvSpPr>
|
||||
<a:spLocks noGrp="1"/>
|
||||
</p:cNvSpPr>
|
||||
<p:nvPr>
|
||||
<p:ph type="ctrTitle"/>
|
||||
</p:nvPr>
|
||||
</p:nvSpPr>
|
||||
<p:spPr>
|
||||
<a:xfrm>
|
||||
<a:off x="838200" y="365125"/>
|
||||
<a:ext cx="7772400" cy="1470025"/>
|
||||
</a:xfrm>
|
||||
</p:spPr>
|
||||
<p:txBody>
|
||||
<a:bodyPr/>
|
||||
<a:lstStyle/>
|
||||
<a:p>
|
||||
<a:r>
|
||||
<a:t>Slide Title</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
</p:txBody>
|
||||
</p:sp>
|
||||
```
|
||||
|
||||
### Text Formatting
|
||||
```xml
|
||||
<!-- Bold -->
|
||||
<a:r>
|
||||
<a:rPr b="1"/>
|
||||
<a:t>Bold Text</a:t>
|
||||
</a:r>
|
||||
|
||||
<!-- Italic -->
|
||||
<a:r>
|
||||
<a:rPr i="1"/>
|
||||
<a:t>Italic Text</a:t>
|
||||
</a:r>
|
||||
|
||||
<!-- Underline -->
|
||||
<a:r>
|
||||
<a:rPr u="sng"/>
|
||||
<a:t>Underlined</a:t>
|
||||
</a:r>
|
||||
|
||||
<!-- Highlight -->
|
||||
<a:r>
|
||||
<a:rPr>
|
||||
<a:highlight>
|
||||
<a:srgbClr val="FFFF00"/>
|
||||
</a:highlight>
|
||||
</a:rPr>
|
||||
<a:t>Highlighted Text</a:t>
|
||||
</a:r>
|
||||
|
||||
<!-- Font and Size (the typeface is set on <a:latin>, not as an attribute of <a:rPr>) -->
<a:r>
  <a:rPr sz="2400">
    <a:solidFill>
      <a:srgbClr val="FF0000"/>
    </a:solidFill>
    <a:latin typeface="Arial"/>
  </a:rPr>
  <a:t>Colored Arial 24pt</a:t>
</a:r>
|
||||
|
||||
<!-- Complete formatting example -->
|
||||
<a:r>
|
||||
<a:rPr lang="en-US" sz="1400" b="1" dirty="0">
|
||||
<a:solidFill>
|
||||
<a:srgbClr val="FAFAFA"/>
|
||||
</a:solidFill>
|
||||
</a:rPr>
|
||||
<a:t>Formatted text</a:t>
|
||||
</a:r>
|
||||
```
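
The numeric attributes in these examples use two different units: `sz` is expressed in hundredths of a point (so `sz="2400"` is 24pt), while offsets, extents, and line widths are expressed in EMU (914400 per inch, which works out to 12700 per point). A small conversion sketch follows; the function names are hypothetical helpers, not part of any library.

```python
EMU_PER_INCH = 914400
EMU_PER_POINT = 12700  # 914400 / 72


def inches_to_emu(inches: float) -> int:
    """Convert inches to EMU, e.g. for <a:off> and <a:ext> values."""
    return round(inches * EMU_PER_INCH)


def points_to_emu(points: float) -> int:
    """Convert points to EMU, e.g. for the w attribute of <a:ln>."""
    return round(points * EMU_PER_POINT)


def points_to_sz(points: float) -> int:
    """Convert a font size in points to the sz attribute (hundredths of a point)."""
    return round(points * 100)
```

With these helpers, a 2pt outline becomes `w="25400"` and a 14pt run becomes `sz="1400"`.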
|
||||
|
||||
### Lists
|
||||
```xml
|
||||
<!-- Bullet list -->
|
||||
<a:p>
|
||||
<a:pPr lvl="0">
|
||||
<a:buChar char="•"/>
|
||||
</a:pPr>
|
||||
<a:r>
|
||||
<a:t>First bullet point</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
|
||||
<!-- Numbered list -->
|
||||
<a:p>
|
||||
<a:pPr lvl="0">
|
||||
<a:buAutoNum type="arabicPeriod"/>
|
||||
</a:pPr>
|
||||
<a:r>
|
||||
<a:t>First numbered item</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
|
||||
<!-- Second level indent -->
|
||||
<a:p>
|
||||
<a:pPr lvl="1">
|
||||
<a:buChar char="•"/>
|
||||
</a:pPr>
|
||||
<a:r>
|
||||
<a:t>Indented bullet</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
```
|
||||
|
||||
### Shapes
|
||||
```xml
|
||||
<!-- Rectangle -->
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:cNvPr id="3" name="Rectangle"/>
|
||||
<p:cNvSpPr/>
|
||||
<p:nvPr/>
|
||||
</p:nvSpPr>
|
||||
<p:spPr>
|
||||
<a:xfrm>
|
||||
<a:off x="1000000" y="1000000"/>
|
||||
<a:ext cx="3000000" cy="2000000"/>
|
||||
</a:xfrm>
|
||||
<a:prstGeom prst="rect">
|
||||
<a:avLst/>
|
||||
</a:prstGeom>
|
||||
<a:solidFill>
|
||||
<a:srgbClr val="FF0000"/>
|
||||
</a:solidFill>
|
||||
<a:ln w="25400">
|
||||
<a:solidFill>
|
||||
<a:srgbClr val="000000"/>
|
||||
</a:solidFill>
|
||||
</a:ln>
|
||||
</p:spPr>
|
||||
</p:sp>
|
||||
|
||||
<!-- Rounded Rectangle -->
|
||||
<p:sp>
|
||||
<p:spPr>
|
||||
<a:prstGeom prst="roundRect">
|
||||
<a:avLst/>
|
||||
</a:prstGeom>
|
||||
</p:spPr>
|
||||
</p:sp>
|
||||
|
||||
<!-- Circle/Ellipse -->
|
||||
<p:sp>
|
||||
<p:spPr>
|
||||
<a:prstGeom prst="ellipse">
|
||||
<a:avLst/>
|
||||
</a:prstGeom>
|
||||
</p:spPr>
|
||||
</p:sp>
|
||||
```
|
||||
|
||||
### Images
|
||||
```xml
|
||||
<p:pic>
|
||||
<p:nvPicPr>
|
||||
<p:cNvPr id="4" name="Picture">
|
||||
<a:hlinkClick r:id="" action="ppaction://media"/>
|
||||
</p:cNvPr>
|
||||
<p:cNvPicPr>
|
||||
<a:picLocks noChangeAspect="1"/>
|
||||
</p:cNvPicPr>
|
||||
<p:nvPr/>
|
||||
</p:nvPicPr>
|
||||
<p:blipFill>
|
||||
<a:blip r:embed="rId2"/>
|
||||
<a:stretch>
|
||||
<a:fillRect/>
|
||||
</a:stretch>
|
||||
</p:blipFill>
|
||||
<p:spPr>
|
||||
<a:xfrm>
|
||||
<a:off x="1000000" y="1000000"/>
|
||||
<a:ext cx="3000000" cy="2000000"/>
|
||||
</a:xfrm>
|
||||
<a:prstGeom prst="rect">
|
||||
<a:avLst/>
|
||||
</a:prstGeom>
|
||||
</p:spPr>
|
||||
</p:pic>
|
||||
```
|
||||
|
||||
### Tables
|
||||
```xml
|
||||
<p:graphicFrame>
|
||||
<p:nvGraphicFramePr>
|
||||
<p:cNvPr id="5" name="Table"/>
|
||||
<p:cNvGraphicFramePr>
|
||||
<a:graphicFrameLocks noGrp="1"/>
|
||||
</p:cNvGraphicFramePr>
|
||||
<p:nvPr/>
|
||||
</p:nvGraphicFramePr>
|
||||
<p:xfrm>
|
||||
<a:off x="1000000" y="1000000"/>
|
||||
<a:ext cx="6000000" cy="2000000"/>
|
||||
</p:xfrm>
|
||||
<a:graphic>
|
||||
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/table">
|
||||
<a:tbl>
|
||||
<a:tblGrid>
|
||||
<a:gridCol w="3000000"/>
|
||||
<a:gridCol w="3000000"/>
|
||||
</a:tblGrid>
|
||||
<a:tr h="500000">
|
||||
<a:tc>
|
||||
<a:txBody>
|
||||
<a:bodyPr/>
|
||||
<a:lstStyle/>
|
||||
<a:p>
|
||||
<a:r>
|
||||
<a:t>Cell 1</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
</a:txBody>
|
||||
</a:tc>
|
||||
<a:tc>
|
||||
<a:txBody>
|
||||
<a:bodyPr/>
|
||||
<a:lstStyle/>
|
||||
<a:p>
|
||||
<a:r>
|
||||
<a:t>Cell 2</a:t>
|
||||
</a:r>
|
||||
</a:p>
|
||||
</a:txBody>
|
||||
</a:tc>
|
||||
</a:tr>
|
||||
</a:tbl>
|
||||
</a:graphicData>
|
||||
</a:graphic>
|
||||
</p:graphicFrame>
|
||||
```
|
||||
|
||||
### Slide Layouts
|
||||
|
||||
```xml
|
||||
<!-- Title Slide Layout -->
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:nvPr>
|
||||
<p:ph type="ctrTitle"/>
|
||||
</p:nvPr>
|
||||
</p:nvSpPr>
|
||||
<!-- Title content -->
|
||||
</p:sp>
|
||||
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:nvPr>
|
||||
<p:ph type="subTitle" idx="1"/>
|
||||
</p:nvPr>
|
||||
</p:nvSpPr>
|
||||
<!-- Subtitle content -->
|
||||
</p:sp>
|
||||
|
||||
<!-- Content Slide Layout -->
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:nvPr>
|
||||
<p:ph type="title"/>
|
||||
</p:nvPr>
|
||||
</p:nvSpPr>
|
||||
<!-- Slide title -->
|
||||
</p:sp>
|
||||
|
||||
<p:sp>
|
||||
<p:nvSpPr>
|
||||
<p:nvPr>
|
||||
<p:ph type="body" idx="1"/>
|
||||
</p:nvPr>
|
||||
</p:nvSpPr>
|
||||
<!-- Content body -->
|
||||
</p:sp>
|
||||
```
|
||||
|
||||
## File Updates
|
||||
|
||||
When adding content, update these files:
|
||||
|
||||
**`ppt/_rels/presentation.xml.rels`:**
|
||||
```xml
|
||||
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide1.xml"/>
|
||||
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideMaster" Target="slideMasters/slideMaster1.xml"/>
|
||||
```
|
||||
|
||||
**`ppt/slides/_rels/slide1.xml.rels`:**
|
||||
```xml
|
||||
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout1.xml"/>
|
||||
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.png"/>
|
||||
```
|
||||
|
||||
**`[Content_Types].xml`:**
|
||||
```xml
|
||||
<Default Extension="png" ContentType="image/png"/>
|
||||
<Default Extension="jpg" ContentType="image/jpeg"/>
|
||||
<Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
|
||||
```
|
||||
|
||||
**`ppt/presentation.xml`:**
|
||||
```xml
|
||||
<p:sldIdLst>
|
||||
<p:sldId id="256" r:id="rId1"/>
|
||||
<p:sldId id="257" r:id="rId2"/>
|
||||
</p:sldIdLst>
|
||||
```
|
||||
|
||||
**`docProps/app.xml`:** Update slide count and statistics
|
||||
```xml
|
||||
<Slides>2</Slides>
|
||||
<Paragraphs>10</Paragraphs>
|
||||
<Words>50</Words>
|
||||
```
|
||||
|
||||
## Slide Operations
|
||||
|
||||
### Adding a New Slide
|
||||
When adding a slide to the end of the presentation:
|
||||
|
||||
1. **Create the slide file** (`ppt/slides/slideN.xml`)
|
||||
2. **Update `[Content_Types].xml`**: Add Override for the new slide
|
||||
3. **Update `ppt/_rels/presentation.xml.rels`**: Add relationship for the new slide
|
||||
4. **Update `ppt/presentation.xml`**: Add slide ID to `<p:sldIdLst>`
|
||||
5. **Create slide relationships** (`ppt/slides/_rels/slideN.xml.rels`) if needed
|
||||
6. **Update `docProps/app.xml`**: Increment slide count and update statistics (if present)
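
Steps 3 and 4 both need identifiers that do not collide with existing ones. As a rough sketch (the function names are hypothetical, and the XML is assumed to be passed in as a string), the next free slide ID and relationship ID could be derived like this. Slide IDs start at 256, which is why an empty list yields 256.

```python
import re
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def next_slide_id(presentation_xml: str) -> int:
    """Return one more than the largest <p:sldId id="..."> (minimum 256)."""
    root = ET.fromstring(presentation_xml)
    ids = [int(el.get("id")) for el in root.iter(f"{{{P_NS}}}sldId")]
    return max(ids, default=255) + 1


def next_relationship_id(rels_xml: str) -> str:
    """Return the next unused rIdN in a .rels file."""
    root = ET.fromstring(rels_xml)
    numbers = []
    for rel in root.iter(f"{{{REL_NS}}}Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            numbers.append(int(match.group(1)))
    return f"rId{max(numbers, default=0) + 1}"
```

Taking the maximum rather than the count keeps the result unique even when earlier slides have been deleted and their IDs left unused.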
|
||||
|
||||
### Duplicating a Slide
|
||||
1. Copy the source slide XML file with a new name
|
||||
2. Update all IDs in the new slide to be unique
|
||||
3. Follow the "Adding a New Slide" steps above
|
||||
4. **CRITICAL**: Remove or update any notes slide references in `_rels` files
|
||||
5. Remove references to unused media files
|
||||
|
||||
### Reordering Slides
|
||||
1. **Update `ppt/presentation.xml`**: Reorder `<p:sldId>` elements in `<p:sldIdLst>`
|
||||
2. The order of `<p:sldId>` elements determines slide order
|
||||
3. Keep slide IDs and relationship IDs unchanged
|
||||
|
||||
Example:
|
||||
```xml
|
||||
<!-- Original order -->
|
||||
<p:sldIdLst>
|
||||
<p:sldId id="256" r:id="rId2"/>
|
||||
<p:sldId id="257" r:id="rId3"/>
|
||||
<p:sldId id="258" r:id="rId4"/>
|
||||
</p:sldIdLst>
|
||||
|
||||
<!-- After moving slide 3 to position 2 -->
|
||||
<p:sldIdLst>
|
||||
<p:sldId id="256" r:id="rId2"/>
|
||||
<p:sldId id="258" r:id="rId4"/>
|
||||
<p:sldId id="257" r:id="rId3"/>
|
||||
</p:sldIdLst>
|
||||
```
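
The same reordering can be scripted. This is a minimal sketch using the standard library's `xml.etree.ElementTree`; `move_slide` and `slide_order` are hypothetical names. Note that ElementTree re-serializes namespace declarations, so for real packages a parser that preserves the original prefixes (such as the `minidom` and `lxml` parsers used by the scripts in this repository) may be the safer choice.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Keep the conventional prefixes when serializing.
ET.register_namespace("p", P_NS)
ET.register_namespace("r", R_NS)


def move_slide(presentation_xml: str, from_index: int, to_index: int) -> str:
    """Move one <p:sldId> entry to a new zero-based position.

    Only the element order changes; id and r:id values are left untouched.
    """
    root = ET.fromstring(presentation_xml)
    id_list = root.find(f"{{{P_NS}}}sldIdLst")
    if id_list is None:
        raise ValueError("presentation.xml has no <p:sldIdLst>")
    entry = list(id_list)[from_index]
    id_list.remove(entry)
    id_list.insert(to_index, entry)
    return ET.tostring(root, encoding="unicode")


def slide_order(presentation_xml: str) -> list:
    """Return the slide IDs in presentation order."""
    root = ET.fromstring(presentation_xml)
    return [el.get("id") for el in root.iter(f"{{{P_NS}}}sldId")]
```

Calling `move_slide(xml, 2, 1)` on the original order above reproduces the second listing: slide 3 moves to position 2.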
|
||||
|
||||
### Deleting a Slide
|
||||
1. **Remove from `ppt/presentation.xml`**: Delete the `<p:sldId>` entry
|
||||
2. **Remove from `ppt/_rels/presentation.xml.rels`**: Delete the relationship
|
||||
3. **Remove from `[Content_Types].xml`**: Delete the Override entry
|
||||
4. **Delete files**: Remove `ppt/slides/slideN.xml` and `ppt/slides/_rels/slideN.xml.rels`
|
||||
5. **Update `docProps/app.xml`**: Decrement slide count and update statistics
|
||||
6. **Clean up unused media**: Remove orphaned images from `ppt/media/`
|
||||
|
||||
Note: Don't renumber remaining slides - keep their original IDs and filenames.
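
Step 6 (cleaning up unused media) is easy to get wrong by hand. The sketch below lists files in `ppt/media/` that no `.rels` file in the unpacked package points to. It is an illustration only: `referenced_parts` and `find_orphaned_media` are hypothetical names, and the code assumes the directory layout produced by unpacking a `.pptx`.

```python
import posixpath
import xml.etree.ElementTree as ET
from pathlib import Path

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def referenced_parts(unpacked_dir: Path) -> set:
    """Collect package-relative paths of all internal relationship targets."""
    targets = set()
    for rels_file in unpacked_dir.rglob("*.rels"):
        # foo/_rels/bar.xml.rels describes foo/bar.xml, so relative targets
        # resolve against foo/.
        base = rels_file.parent.parent.relative_to(unpacked_dir).as_posix()
        for rel in ET.parse(rels_file).getroot().iter(f"{{{REL_NS}}}Relationship"):
            if rel.get("TargetMode") == "External":
                continue
            target = rel.get("Target", "")
            if target.startswith("/"):
                targets.add(target.lstrip("/"))
            else:
                targets.add(posixpath.normpath(posixpath.join(base, target)))
    return targets


def find_orphaned_media(unpacked_dir: Path) -> list:
    """Return media files that no relationship file references."""
    media_dir = unpacked_dir / "ppt" / "media"
    if not media_dir.is_dir():
        return []
    used = referenced_parts(unpacked_dir)
    return sorted(
        f.relative_to(unpacked_dir).as_posix()
        for f in media_dir.iterdir()
        if f.is_file() and f.relative_to(unpacked_dir).as_posix() not in used
    )
```

Run it after deleting the slide's own `.rels` file, so that references held only by the deleted slide no longer count as uses.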
|
||||
|
||||
|
||||
## Common Errors to Avoid
|
||||
|
||||
- **Encodings**: Escape non-ASCII characters in ASCII content as numeric character references: `“` becomes `&#8220;`
|
||||
- **Images**: Add to `ppt/media/` and update relationship files
|
||||
- **Lists**: Omit bullets from list headers
|
||||
- **IDs**: Use valid hexadecimal values for UUIDs
|
||||
- **Themes**: Check all themes in `theme` directory for colors
|
||||
|
||||
## Validation Checklist for Template-Based Presentations
|
||||
|
||||
### Before Packing, Always:
|
||||
- **Clean unused resources**: Remove unreferenced media, fonts, and notes directories
|
||||
- **Fix Content_Types.xml**: Declare ALL slides, layouts, and themes present in the package
|
||||
- **Fix relationship IDs**:
  - Remove font embed references if not using embedded fonts
|
||||
- **Remove broken references**: Check all `_rels` files for references to deleted resources
|
||||
|
||||
### Common Template Duplication Pitfalls:
|
||||
- Multiple slides referencing the same notes slide after duplication
|
||||
- Image/media references from template slides that no longer exist
|
||||
- Font embedding references when fonts aren't included
|
||||
- Missing slideLayout declarations for layouts 12-25
|
||||
- The `docProps` directory may be missing after unpacking; it is optional
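
The missing-declaration pitfall can be caught before packing. As a sketch under stated assumptions (the function name `undeclared_parts` is hypothetical, and only the `slides`, `slideLayouts`, and `theme` folders are checked), the following compares the parts on disk with the `Override` entries in `[Content_Types].xml`:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def undeclared_parts(unpacked_dir: Path) -> list:
    """List slide, layout, and theme parts with no Override declaration."""
    root = ET.parse(unpacked_dir / "[Content_Types].xml").getroot()
    declared = {o.get("PartName") for o in root.iter(f"{{{CT_NS}}}Override")}
    missing = []
    for folder in ("slides", "slideLayouts", "theme"):
        folder_path = unpacked_dir / "ppt" / folder
        if not folder_path.is_dir():
            continue
        for part in sorted(folder_path.glob("*.xml")):
            part_name = "/" + part.relative_to(unpacked_dir).as_posix()
            if part_name not in declared:
                missing.append(part_name)
    return missing
```

An empty result means every checked part is declared; it does not verify that the declared content types themselves are correct.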
|
||||
@@ -1,159 +0,0 @@
|
||||
#!/usr/bin/env python3
"""
Tool to pack a directory into a .docx, .pptx, or .xlsx file with XML formatting undone.

Example usage:
    python pack.py <input_directory> <office_file> [--force]
"""

import argparse
import shutil
import subprocess
import sys
import tempfile
import defusedxml.minidom
import zipfile
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description="Pack a directory into an Office file")
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument("--force", action="store_true", help="Skip validation")
    args = parser.parse_args()

    try:
        success = pack_document(
            args.input_directory, args.output_file, validate=not args.force
        )

        # Show warning if validation was skipped
        if args.force:
            print("Warning: Skipped validation, file may be corrupt", file=sys.stderr)
        # Exit with error if validation failed
        elif not success:
            print("Contents would produce a corrupt file.", file=sys.stderr)
            print("Please validate XML before repacking.", file=sys.stderr)
            print("Use --force to skip validation and pack anyway.", file=sys.stderr)
            sys.exit(1)

    except ValueError as e:
        sys.exit(f"Error: {e}")


def pack_document(input_dir, output_file, validate=False):
    """Pack a directory into an Office file (.docx/.pptx/.xlsx).

    Args:
        input_dir: Path to unpacked Office document directory
        output_file: Path to output Office file
        validate: If True, validates with soffice (default: False)

    Returns:
        bool: True if successful, False if validation failed
    """
    input_dir = Path(input_dir)
    output_file = Path(output_file)

    if not input_dir.is_dir():
        raise ValueError(f"{input_dir} is not a directory")
    if output_file.suffix.lower() not in {".docx", ".pptx", ".xlsx"}:
        raise ValueError(f"{output_file} must be a .docx, .pptx, or .xlsx file")

    # Work in temporary directory to avoid modifying original
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        # Process XML files to remove pretty-printing whitespace
        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                condense_xml(xml_file)

        # Create final Office file as zip archive
        output_file.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

        # Validate if requested
        if validate:
            if not validate_document(output_file):
                output_file.unlink()  # Delete the corrupt file
                return False

    return True


def validate_document(doc_path):
    """Validate document by converting to HTML with soffice."""
    # Determine the correct filter based on file extension
    match doc_path.suffix.lower():
        case ".docx":
            filter_name = "html:HTML"
        case ".pptx":
            filter_name = "html:impress_html_Export"
        case ".xlsx":
            filter_name = "html:HTML (StarCalc)"

    with tempfile.TemporaryDirectory() as temp_dir:
        try:
            result = subprocess.run(
                [
                    "soffice",
                    "--headless",
                    "--convert-to",
                    filter_name,
                    "--outdir",
                    temp_dir,
                    str(doc_path),
                ],
                capture_output=True,
                timeout=10,
                text=True,
            )
            if not (Path(temp_dir) / f"{doc_path.stem}.html").exists():
                error_msg = result.stderr.strip() or "Document validation failed"
                print(f"Validation error: {error_msg}", file=sys.stderr)
                return False
            return True
        except FileNotFoundError:
            print("Warning: soffice not found. Skipping validation.", file=sys.stderr)
            return True
        except subprocess.TimeoutExpired:
            print("Validation error: Timeout during conversion", file=sys.stderr)
            return False
        except Exception as e:
            print(f"Validation error: {e}", file=sys.stderr)
            return False


def condense_xml(xml_file):
    """Strip unnecessary whitespace and remove comments."""
    with open(xml_file, "r", encoding="utf-8") as f:
        dom = defusedxml.minidom.parse(f)

    # Process each element to remove whitespace and comments
    for element in dom.getElementsByTagName("*"):
        # Skip w:t elements and their processing
        if element.tagName.endswith(":t"):
            continue

        # Remove whitespace-only text nodes and comment nodes
        for child in list(element.childNodes):
            if (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue
                and child.nodeValue.strip() == ""
            ) or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)

    # Write back the condensed XML
    with open(xml_file, "wb") as f:
        f.write(dom.toxml(encoding="UTF-8"))


if __name__ == "__main__":
    main()
|
||||
@@ -1,29 +0,0 @@
|
||||
#!/usr/bin/env python3
"""Unpack and format XML contents of Office files (.docx, .pptx, .xlsx)"""

import random
import sys
import defusedxml.minidom
import zipfile
from pathlib import Path

# Get command line arguments
assert len(sys.argv) == 3, "Usage: python unpack.py <office_file> <output_dir>"
input_file, output_dir = sys.argv[1], sys.argv[2]

# Extract and format
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
zipfile.ZipFile(input_file).extractall(output_path)

# Pretty print all XML files
xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
for xml_file in xml_files:
    content = xml_file.read_text(encoding="utf-8")
    dom = defusedxml.minidom.parseString(content)
    xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="ascii"))

# For .docx files, suggest an RSID for tracked changes
if input_file.endswith(".docx"):
    suggested_rsid = "".join(random.choices("0123456789ABCDEF", k=8))
    print(f"Suggested RSID for edit session: {suggested_rsid}")
|
||||
@@ -1,69 +0,0 @@
|
||||
#!/usr/bin/env python3
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <dir> --original <original_file>
"""

import argparse
import sys
from pathlib import Path

from validation import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "unpacked_dir",
        help="Path to unpacked Office document directory",
    )
    parser.add_argument(
        "--original",
        required=True,
        help="Path to original file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    args = parser.parse_args()

    # Validate paths
    unpacked_dir = Path(args.unpacked_dir)
    original_file = Path(args.original)
    file_extension = original_file.suffix.lower()
    assert unpacked_dir.is_dir(), f"Error: {unpacked_dir} is not a directory"
    assert original_file.is_file(), f"Error: {original_file} is not a file"
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
    )

    # Run validations
    match file_extension:
        case ".docx":
            validators = [DOCXSchemaValidator, RedliningValidator]
        case ".pptx":
            validators = [PPTXSchemaValidator]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    # Run validators
    success = True
    for V in validators:
        validator = V(unpacked_dir, original_file, verbose=args.verbose)
        if not validator.validate():
            success = False

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
|
||||
@@ -1,274 +0,0 @@
|
||||
"""
|
||||
Validator for Word document XML files against XSD schemas.
"""

import re
import tempfile
import zipfile

import lxml.etree

from .base import BaseSchemaValidator


class DOCXSchemaValidator(BaseSchemaValidator):
    """Validator for Word document XML files against XSD schemas."""

    # Word-specific namespace
    WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    # Word-specific element to relationship type mappings
    # Start with empty mapping - add specific cases as we discover them
    ELEMENT_RELATIONSHIP_TYPES = {}

    def validate(self):
        """Run all validation checks and return True if all pass."""
        # Test 0: XML well-formedness
        if not self.validate_xml():
            return False

        # Test 1: Namespace declarations
        all_valid = True
        if not self.validate_namespaces():
            all_valid = False

        # Test 2: Unique IDs
        if not self.validate_unique_ids():
            all_valid = False

        # Test 3: Relationship and file reference validation
        if not self.validate_file_references():
            all_valid = False

        # Test 4: Content type declarations
        if not self.validate_content_types():
            all_valid = False

        # Test 5: XSD schema validation
        if not self.validate_against_xsd():
            all_valid = False

        # Test 6: Whitespace preservation
        if not self.validate_whitespace_preservation():
            all_valid = False

        # Test 7: Deletion validation
        if not self.validate_deletions():
            all_valid = False

        # Test 8: Insertion validation
        if not self.validate_insertions():
            all_valid = False

        # Test 9: Relationship ID reference validation
        if not self.validate_all_relationship_ids():
            all_valid = False

        # Count and compare paragraphs
        self.compare_paragraph_counts()

        return all_valid

    def validate_whitespace_preservation(self):
        """
        Validate that w:t elements with whitespace have xml:space='preserve'.
        """
        errors = []

        for xml_file in self.xml_files:
            # Only check document.xml files
            if xml_file.name != "document.xml":
                continue

            try:
                root = lxml.etree.parse(str(xml_file)).getroot()

                # Find all w:t elements
                for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"):
                    if elem.text:
                        text = elem.text
                        # Check if text starts or ends with whitespace
                        if re.match(r"^\s.*", text) or re.match(r".*\s$", text):
                            # Check if xml:space="preserve" attribute exists
                            xml_space_attr = f"{{{self.XML_NAMESPACE}}}space"
                            if (
                                xml_space_attr not in elem.attrib
                                or elem.attrib[xml_space_attr] != "preserve"
                            ):
                                # Show a preview of the text
                                text_preview = (
                                    repr(text)[:50] + "..."
                                    if len(repr(text)) > 50
                                    else repr(text)
                                )
                                errors.append(
                                    f" {xml_file.relative_to(self.unpacked_dir)}: "
                                    f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}"
                                )

            except (lxml.etree.XMLSyntaxError, Exception) as e:
                errors.append(
                    f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
                )

        if errors:
            print(f"FAILED - Found {len(errors)} whitespace preservation violations:")
            for error in errors:
                print(error)
            return False
        else:
            if self.verbose:
                print("PASSED - All whitespace is properly preserved")
            return True

    def validate_deletions(self):
        """
        Validate that w:t elements are not within w:del elements.
        For some reason, XSD validation does not catch this, so we do it manually.
        """
        errors = []

        for xml_file in self.xml_files:
            # Only check document.xml files
            if xml_file.name != "document.xml":
                continue

            try:
                root = lxml.etree.parse(str(xml_file)).getroot()

                # Find all w:t elements that are descendants of w:del elements
                namespaces = {"w": self.WORD_2006_NAMESPACE}
                xpath_expression = ".//w:del//w:t"
                problematic_t_elements = root.xpath(
                    xpath_expression, namespaces=namespaces
                )
                for t_elem in problematic_t_elements:
                    if t_elem.text:
                        # Show a preview of the text
                        text_preview = (
                            repr(t_elem.text)[:50] + "..."
                            if len(repr(t_elem.text)) > 50
                            else repr(t_elem.text)
                        )
                        errors.append(
                            f" {xml_file.relative_to(self.unpacked_dir)}: "
                            f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
                        )

            except (lxml.etree.XMLSyntaxError, Exception) as e:
                errors.append(
                    f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
                )

        if errors:
            print(f"FAILED - Found {len(errors)} deletion validation violations:")
            for error in errors:
                print(error)
            return False
        else:
            if self.verbose:
                print("PASSED - No w:t elements found within w:del elements")
            return True

    def count_paragraphs_in_unpacked(self):
        """Count the number of paragraphs in the unpacked document."""
        count = 0

        for xml_file in self.xml_files:
            # Only check document.xml files
            if xml_file.name != "document.xml":
                continue

            try:
                root = lxml.etree.parse(str(xml_file)).getroot()
                # Count all w:p elements
                paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
                count = len(paragraphs)
            except Exception as e:
                print(f"Error counting paragraphs in unpacked document: {e}")

        return count

    def count_paragraphs_in_original(self):
        """Count the number of paragraphs in the original docx file."""
        count = 0

        try:
            # Create temporary directory to unpack original
            with tempfile.TemporaryDirectory() as temp_dir:
                # Unpack original docx
                with zipfile.ZipFile(self.original_file, "r") as zip_ref:
                    zip_ref.extractall(temp_dir)

                # Parse document.xml
                doc_xml_path = temp_dir + "/word/document.xml"
                root = lxml.etree.parse(doc_xml_path).getroot()

                # Count all w:p elements
                paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
                count = len(paragraphs)

        except Exception as e:
            print(f"Error counting paragraphs in original document: {e}")

        return count

    def validate_insertions(self):
        """
        Validate that w:delText elements are not within w:ins elements.
        w:delText is only allowed in w:ins if nested within a w:del.
        """
        errors = []

        for xml_file in self.xml_files:
            if xml_file.name != "document.xml":
                continue

            try:
                root = lxml.etree.parse(str(xml_file)).getroot()
                namespaces = {"w": self.WORD_2006_NAMESPACE}

                # Find w:delText in w:ins that are NOT within w:del
                invalid_elements = root.xpath(
                    ".//w:ins//w:delText[not(ancestor::w:del)]",
                    namespaces=namespaces
                )

                for elem in invalid_elements:
                    text_preview = (
                        repr(elem.text or "")[:50] + "..."
                        if len(repr(elem.text or "")) > 50
                        else repr(elem.text or "")
                    )
                    errors.append(
                        f" {xml_file.relative_to(self.unpacked_dir)}: "
                        f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
                    )

            except (lxml.etree.XMLSyntaxError, Exception) as e:
                errors.append(
                    f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
                )

        if errors:
            print(f"FAILED - Found {len(errors)} insertion validation violations:")
            for error in errors:
                print(error)
            return False
        else:
            if self.verbose:
                print("PASSED - No w:delText elements within w:ins elements")
            return True

    def compare_paragraph_counts(self):
        """Compare paragraph counts between original and new document."""
        original_count = self.count_paragraphs_in_original()
        new_count = self.count_paragraphs_in_unpacked()

        diff = new_count - original_count
        diff_str = f"+{diff}" if diff > 0 else str(diff)
        print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")


if __name__ == "__main__":
    raise RuntimeError("This module should not be run directly.")
|
||||
@@ -1,979 +0,0 @@
|
||||
/**
|
||||
* html2pptx - Convert HTML slide to pptxgenjs slide with positioned elements
|
||||
*
|
||||
* USAGE:
|
||||
* const pptx = new pptxgen();
|
||||
* pptx.layout = 'LAYOUT_16x9'; // Must match HTML body dimensions
|
||||
*
|
||||
* const { slide, placeholders } = await html2pptx('slide.html', pptx);
|
||||
* slide.addChart(pptx.charts.LINE, data, placeholders[0]);
|
||||
*
|
||||
* await pptx.writeFile('output.pptx');
|
||||
*
|
||||
* FEATURES:
|
||||
* - Converts HTML to PowerPoint with accurate positioning
|
||||
* - Supports text, images, shapes, and bullet lists
|
||||
* - Extracts placeholder elements (class="placeholder") with positions
|
||||
* - Handles CSS gradients, borders, and margins
|
||||
*
|
||||
* VALIDATION:
|
||||
* - Uses body width/height from HTML for viewport sizing
|
||||
* - Throws error if HTML dimensions don't match presentation layout
|
||||
* - Throws error if content overflows body (with overflow details)
|
||||
*
|
||||
* RETURNS:
|
||||
* { slide, placeholders } where placeholders is an array of { id, x, y, w, h }
|
||||
*/
|
||||
|
||||
const { chromium } = require('playwright');
|
||||
const path = require('path');
|
||||
const sharp = require('sharp');
|
||||
|
||||
const PT_PER_PX = 0.75;
|
||||
const PX_PER_IN = 96;
|
||||
const EMU_PER_IN = 914400;
|
||||
|
||||
// Helper: Get body dimensions and check for overflow
|
||||
async function getBodyDimensions(page) {
|
||||
const bodyDimensions = await page.evaluate(() => {
|
||||
const body = document.body;
|
||||
const style = window.getComputedStyle(body);
|
||||
|
||||
return {
|
||||
width: parseFloat(style.width),
|
||||
height: parseFloat(style.height),
|
||||
scrollWidth: body.scrollWidth,
|
||||
scrollHeight: body.scrollHeight
|
||||
};
|
||||
});
|
||||
|
||||
const errors = [];
|
||||
const widthOverflowPx = Math.max(0, bodyDimensions.scrollWidth - bodyDimensions.width - 1);
|
||||
const heightOverflowPx = Math.max(0, bodyDimensions.scrollHeight - bodyDimensions.height - 1);
|
||||
|
||||
const widthOverflowPt = widthOverflowPx * PT_PER_PX;
|
||||
const heightOverflowPt = heightOverflowPx * PT_PER_PX;
|
||||
|
||||
if (widthOverflowPt > 0 || heightOverflowPt > 0) {
|
||||
const directions = [];
|
||||
if (widthOverflowPt > 0) directions.push(`${widthOverflowPt.toFixed(1)}pt horizontally`);
|
||||
if (heightOverflowPt > 0) directions.push(`${heightOverflowPt.toFixed(1)}pt vertically`);
|
||||
const reminder = heightOverflowPt > 0 ? ' (Remember: leave 0.5" margin at bottom of slide)' : '';
|
||||
errors.push(`HTML content overflows body by ${directions.join(' and ')}${reminder}`);
|
||||
}
|
||||
|
||||
return { ...bodyDimensions, errors };
|
||||
}
|
||||
|
||||
// Helper: Validate dimensions match presentation layout
|
||||
function validateDimensions(bodyDimensions, pres) {
|
||||
const errors = [];
|
||||
const widthInches = bodyDimensions.width / PX_PER_IN;
|
||||
const heightInches = bodyDimensions.height / PX_PER_IN;
|
||||
|
||||
if (pres.presLayout) {
|
||||
const layoutWidth = pres.presLayout.width / EMU_PER_IN;
|
||||
const layoutHeight = pres.presLayout.height / EMU_PER_IN;
|
||||
|
||||
if (Math.abs(layoutWidth - widthInches) > 0.1 || Math.abs(layoutHeight - heightInches) > 0.1) {
|
||||
errors.push(
|
||||
`HTML dimensions (${widthInches.toFixed(1)}" × ${heightInches.toFixed(1)}") ` +
|
||||
`don't match presentation layout (${layoutWidth.toFixed(1)}" × ${layoutHeight.toFixed(1)}")`
|
||||
);
|
||||
}
|
||||
}
|
||||
return errors;
|
||||
}
|
||||
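// Illustrative sketch (not part of the original module): the unit chain that
// validateDimensions relies on, CSS px -> inches -> EMU. The helper name
// examplePxToEmu is hypothetical; it assumes 96 px per inch and 914400 EMU
// per inch, the same values as PX_PER_IN and EMU_PER_IN above.
function examplePxToEmu(px) {
  const pxPerInch = 96;      // CSS reference pixels per inch
  const emuPerInch = 914400; // English Metric Units per inch
  return (px / pxPerInch) * emuPerInch;
}
// Under these assumptions a 960px x 540px body is 10" x 5.625",
// i.e. 9144000 x 5143500 EMU, so it would pass the 0.1" tolerance
// check against a layout of that size.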
|
||||
function validateTextBoxPosition(slideData, bodyDimensions) {
|
||||
const errors = [];
|
||||
const slideHeightInches = bodyDimensions.height / PX_PER_IN;
|
||||
const minBottomMargin = 0.5; // 0.5 inches from bottom
|
||||
|
||||
for (const el of slideData.elements) {
|
||||
// Check text elements (p, h1-h6, list)
|
||||
if (['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'list'].includes(el.type)) {
|
||||
const fontSize = el.style?.fontSize || 0;
|
||||
const bottomEdge = el.position.y + el.position.h;
|
||||
const distanceFromBottom = slideHeightInches - bottomEdge;
|
||||
|
||||
if (fontSize > 12 && distanceFromBottom < minBottomMargin) {
|
||||
const getText = () => {
|
||||
if (typeof el.text === 'string') return el.text;
|
||||
if (Array.isArray(el.text)) return el.text.find(t => t.text)?.text || '';
|
||||
if (Array.isArray(el.items)) return el.items.find(item => item.text)?.text || '';
|
||||
return '';
|
||||
};
|
||||
const fullText = getText();
const textPrefix = fullText.substring(0, 50) + (fullText.length > 50 ? '...' : '');
|
||||
|
||||
errors.push(
|
||||
`Text box "${textPrefix}" ends too close to bottom edge ` +
|
||||
`(${distanceFromBottom.toFixed(2)}" from bottom, minimum ${minBottomMargin}" required)`
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return errors;
|
||||
}
|
||||
|
||||
// Helper: Add background to slide
|
||||
async function addBackground(slideData, targetSlide, tmpDir) {
|
||||
if (slideData.background.type === 'image' && slideData.background.path) {
|
||||
let imagePath = slideData.background.path.startsWith('file://')
|
||||
? slideData.background.path.replace('file://', '')
|
||||
: slideData.background.path;
|
||||
targetSlide.background = { path: imagePath };
|
||||
} else if (slideData.background.type === 'color' && slideData.background.value) {
|
||||
targetSlide.background = { color: slideData.background.value };
|
||||
}
|
||||
}
|
||||
|
||||
// Helper: Add elements to slide
|
||||
function addElements(slideData, targetSlide, pres) {
|
||||
for (const el of slideData.elements) {
|
||||
if (el.type === 'image') {
|
||||
let imagePath = el.src.startsWith('file://') ? el.src.replace('file://', '') : el.src;
|
||||
targetSlide.addImage({
|
||||
path: imagePath,
|
||||
x: el.position.x,
|
||||
y: el.position.y,
|
||||
w: el.position.w,
|
||||
h: el.position.h
|
||||
});
|
||||
} else if (el.type === 'line') {
|
||||
targetSlide.addShape(pres.ShapeType.line, {
|
||||
x: el.x1,
|
||||
y: el.y1,
|
||||
w: el.x2 - el.x1,
|
||||
h: el.y2 - el.y1,
|
||||
line: { color: el.color, width: el.width }
|
||||
});
|
||||
} else if (el.type === 'shape') {
|
||||
const shapeOptions = {
|
||||
x: el.position.x,
|
||||
y: el.position.y,
|
||||
w: el.position.w,
|
||||
h: el.position.h,
|
||||
shape: el.shape.rectRadius > 0 ? pres.ShapeType.roundRect : pres.ShapeType.rect
|
||||
};
|
||||
|
||||
if (el.shape.fill) {
|
||||
shapeOptions.fill = { color: el.shape.fill };
|
||||
if (el.shape.transparency != null) shapeOptions.fill.transparency = el.shape.transparency;
|
||||
}
|
||||
if (el.shape.line) shapeOptions.line = el.shape.line;
|
||||
if (el.shape.rectRadius > 0) shapeOptions.rectRadius = el.shape.rectRadius;
|
||||
if (el.shape.shadow) shapeOptions.shadow = el.shape.shadow;
|
||||
|
||||
targetSlide.addText(el.text || '', shapeOptions);
|
||||
} else if (el.type === 'list') {
|
||||
const listOptions = {
|
||||
x: el.position.x,
|
||||
y: el.position.y,
|
||||
w: el.position.w,
|
||||
h: el.position.h,
|
||||
fontSize: el.style.fontSize,
|
||||
fontFace: el.style.fontFace,
|
||||
color: el.style.color,
|
||||
align: el.style.align,
|
||||
valign: 'top',
|
||||
lineSpacing: el.style.lineSpacing,
|
||||
paraSpaceBefore: el.style.paraSpaceBefore,
|
||||
paraSpaceAfter: el.style.paraSpaceAfter,
|
||||
margin: el.style.margin
|
||||
};
|
||||
if (el.style.margin) listOptions.margin = el.style.margin;
|
||||
targetSlide.addText(el.items, listOptions);
|
||||
} else {
|
||||
// Check if text is single-line (height suggests one line)
|
||||
const lineHeight = el.style.lineSpacing || el.style.fontSize * 1.2;
|
||||
const isSingleLine = el.position.h <= (lineHeight / 72) * 1.5; // lineHeight is in points, position.h is in inches
|
||||
|
||||
let adjustedX = el.position.x;
|
||||
let adjustedW = el.position.w;
|
||||
|
||||
// Widen single-line text by 2% to compensate for the browser-measured width being an underestimate
|
||||
if (isSingleLine) {
|
||||
const widthIncrease = el.position.w * 0.02;
|
||||
const align = el.style.align;
|
||||
|
||||
if (align === 'center') {
|
||||
// Center: expand both sides
|
||||
adjustedX = el.position.x - (widthIncrease / 2);
|
||||
adjustedW = el.position.w + widthIncrease;
|
||||
} else if (align === 'right') {
|
||||
// Right: expand to the left
|
||||
adjustedX = el.position.x - widthIncrease;
|
||||
adjustedW = el.position.w + widthIncrease;
|
||||
} else {
|
||||
// Left (default): expand to the right
|
||||
adjustedW = el.position.w + widthIncrease;
|
||||
}
|
||||
}
|
||||
|
||||
const textOptions = {
|
||||
x: adjustedX,
|
||||
y: el.position.y,
|
||||
w: adjustedW,
|
||||
h: el.position.h,
|
||||
fontSize: el.style.fontSize,
|
||||
fontFace: el.style.fontFace,
|
||||
color: el.style.color,
|
||||
bold: el.style.bold,
|
||||
italic: el.style.italic,
|
||||
underline: el.style.underline,
|
||||
valign: 'top',
|
||||
lineSpacing: el.style.lineSpacing,
|
||||
paraSpaceBefore: el.style.paraSpaceBefore,
|
||||
paraSpaceAfter: el.style.paraSpaceAfter,
|
||||
inset: 0 // Remove default PowerPoint internal padding
|
||||
};
|
||||
|
||||
if (el.style.align) textOptions.align = el.style.align;
|
||||
if (el.style.margin) textOptions.margin = el.style.margin;
|
||||
if (el.style.rotate !== undefined) textOptions.rotate = el.style.rotate;
|
||||
if (el.style.transparency !== null && el.style.transparency !== undefined) textOptions.transparency = el.style.transparency;
|
||||
|
||||
targetSlide.addText(el.text, textOptions);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Helper: Extract slide data from HTML page
|
||||
async function extractSlideData(page) {
|
||||
return await page.evaluate(() => {
|
||||
const PT_PER_PX = 0.75;
|
||||
const PX_PER_IN = 96;
|
||||
|
||||
// Fonts that are single-weight and should not have bold applied
|
||||
// (applying bold causes PowerPoint to use faux bold which makes text wider)
|
||||
const SINGLE_WEIGHT_FONTS = ['impact'];
|
||||
|
||||
// Helper: Check if a font should skip bold formatting
|
||||
const shouldSkipBold = (fontFamily) => {
|
||||
if (!fontFamily) return false;
|
||||
const normalizedFont = fontFamily.toLowerCase().replace(/['"]/g, '').split(',')[0].trim();
|
||||
return SINGLE_WEIGHT_FONTS.includes(normalizedFont);
|
||||
};
|
||||
|
||||
// Unit conversion helpers
|
||||
const pxToInch = (px) => px / PX_PER_IN;
|
||||
const pxToPoints = (pxStr) => parseFloat(pxStr) * PT_PER_PX;
|
||||
const rgbToHex = (rgbStr) => {
|
||||
// Handle transparent backgrounds by defaulting to white
|
||||
if (rgbStr === 'rgba(0, 0, 0, 0)' || rgbStr === 'transparent') return 'FFFFFF';
|
||||
|
||||
const match = rgbStr.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/);
|
||||
if (!match) return 'FFFFFF';
|
||||
return match.slice(1).map(n => parseInt(n).toString(16).padStart(2, '0')).join('');
|
||||
};
|
||||
|
||||
const extractAlpha = (rgbStr) => {
|
||||
const match = rgbStr.match(/rgba\((\d+),\s*(\d+),\s*(\d+),\s*([\d.]+)\)/);
|
||||
if (!match || !match[4]) return null;
|
||||
const alpha = parseFloat(match[4]);
|
||||
return Math.round((1 - alpha) * 100);
|
||||
};
|
||||
|
||||
const applyTextTransform = (text, textTransform) => {
|
||||
if (textTransform === 'uppercase') return text.toUpperCase();
|
||||
if (textTransform === 'lowercase') return text.toLowerCase();
|
||||
if (textTransform === 'capitalize') {
|
||||
return text.replace(/\b\w/g, c => c.toUpperCase());
|
||||
}
|
||||
return text;
|
||||
};
|
||||
|
||||
// Extract rotation angle from CSS transform and writing-mode
|
||||
const getRotation = (transform, writingMode) => {
|
||||
let angle = 0;
|
||||
|
||||
// Handle writing-mode first
|
||||
// PowerPoint: 90° = text rotated 90° clockwise (reads top to bottom, letters upright)
|
||||
// PowerPoint: 270° = text rotated 270° clockwise (reads bottom to top, letters upright)
|
||||
if (writingMode === 'vertical-rl') {
|
||||
// vertical-rl alone = text reads top to bottom = 90° in PowerPoint
|
||||
angle = 90;
|
||||
} else if (writingMode === 'vertical-lr') {
|
||||
// vertical-lr alone = text reads bottom to top = 270° in PowerPoint
|
||||
angle = 270;
|
||||
}
|
||||
|
||||
// Then add any transform rotation
|
||||
if (transform && transform !== 'none') {
|
||||
// Try to match rotate() function
|
||||
const rotateMatch = transform.match(/rotate\((-?\d+(?:\.\d+)?)deg\)/);
|
||||
if (rotateMatch) {
|
||||
angle += parseFloat(rotateMatch[1]);
|
||||
} else {
|
||||
// Browser may compute as matrix - extract rotation from matrix
|
||||
const matrixMatch = transform.match(/matrix\(([^)]+)\)/);
|
||||
if (matrixMatch) {
|
||||
const values = matrixMatch[1].split(',').map(parseFloat);
|
||||
// matrix(a, b, c, d, e, f) where rotation = atan2(b, a)
|
||||
const matrixAngle = Math.atan2(values[1], values[0]) * (180 / Math.PI);
|
||||
angle += Math.round(matrixAngle);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Normalize to 0-359 range
|
||||
angle = angle % 360;
|
||||
if (angle < 0) angle += 360;
|
||||
|
||||
return angle === 0 ? null : angle;
|
||||
};
|
||||
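// Illustrative sketch (not part of the original module): how a computed
// matrix() transform maps to degrees, mirroring the atan2(b, a) step in
// getRotation. The helper name exampleMatrixToDegrees is hypothetical.
function exampleMatrixToDegrees(transformStr) {
  const m = transformStr.match(/matrix\(([^)]+)\)/);
  if (!m) return null; // not a 2D matrix() value
  const [a, b] = m[1].split(',').map(parseFloat);
  let deg = Math.round(Math.atan2(b, a) * (180 / Math.PI));
  if (deg < 0) deg += 360; // normalize to the 0-359 range
  return deg;
}
// A 90 degree clockwise rotation corresponds to matrix(0, 1, -1, 0, 0, 0):
// a = cos(90deg) = 0 and b = sin(90deg) = 1, so atan2(1, 0) yields 90.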
|
||||
// Get position/dimensions accounting for rotation
|
||||
const getPositionAndSize = (el, rect, rotation) => {
|
||||
if (rotation === null) {
|
||||
return { x: rect.left, y: rect.top, w: rect.width, h: rect.height };
|
||||
}
|
||||
|
||||
// For 90° or 270° rotations, swap width and height
|
||||
// because PowerPoint applies rotation to the original (unrotated) box
|
||||
const isVertical = rotation === 90 || rotation === 270;
|
||||
|
||||
if (isVertical) {
|
||||
// The browser shows us the rotated dimensions (tall box for vertical text)
|
||||
// But PowerPoint needs the pre-rotation dimensions (wide box that will be rotated)
|
||||
// So we swap: browser's height becomes PPT's width, browser's width becomes PPT's height
|
||||
const centerX = rect.left + rect.width / 2;
|
||||
const centerY = rect.top + rect.height / 2;
|
||||
|
||||
return {
|
||||
x: centerX - rect.height / 2,
|
||||
y: centerY - rect.width / 2,
|
||||
w: rect.height,
|
||||
h: rect.width
|
||||
};
|
||||
}
|
||||
|
||||
// For other rotations, use element's offset dimensions
|
||||
const centerX = rect.left + rect.width / 2;
|
||||
const centerY = rect.top + rect.height / 2;
|
||||
return {
|
||||
x: centerX - el.offsetWidth / 2,
|
||||
y: centerY - el.offsetHeight / 2,
|
||||
w: el.offsetWidth,
|
||||
h: el.offsetHeight
|
||||
};
|
||||
};
|
||||
|
||||
// Parse CSS box-shadow into PptxGenJS shadow properties
|
||||
const parseBoxShadow = (boxShadow) => {
|
||||
if (!boxShadow || boxShadow === 'none') return null;
|
||||
|
||||
// Browser computed style format: "rgba(0, 0, 0, 0.3) 2px 2px 8px 0px [inset]"
|
||||
// CSS format: "[inset] 2px 2px 8px 0px rgba(0, 0, 0, 0.3)"
|
||||
|
||||
const insetMatch = boxShadow.match(/inset/);
|
||||
|
||||
// IMPORTANT: PptxGenJS/PowerPoint doesn't properly support inset shadows
|
||||
// Only process outer shadows to avoid file corruption
|
||||
if (insetMatch) return null;
|
||||
|
||||
// Extract color first (rgba or rgb at start)
|
||||
const colorMatch = boxShadow.match(/rgba?\([^)]+\)/);
|
||||
|
||||
// Extract numeric values (handles both px and pt units)
|
||||
const parts = boxShadow.match(/([-\d.]+)(px|pt)/g);
|
||||
|
||||
if (!parts || parts.length < 2) return null;
|
||||
|
||||
const offsetX = parseFloat(parts[0]);
|
||||
const offsetY = parseFloat(parts[1]);
|
||||
const blur = parts.length > 2 ? parseFloat(parts[2]) : 0;
|
||||
|
||||
// Calculate angle from offsets (in degrees, 0 = right, 90 = down)
|
||||
let angle = 0;
|
||||
if (offsetX !== 0 || offsetY !== 0) {
|
||||
angle = Math.atan2(offsetY, offsetX) * (180 / Math.PI);
|
||||
if (angle < 0) angle += 360;
|
||||
}
|
||||
|
||||
// Calculate offset distance (hypotenuse)
|
||||
const offset = Math.sqrt(offsetX * offsetX + offsetY * offsetY) * PT_PER_PX;
|
||||
|
||||
// Extract opacity from rgba
|
||||
let opacity = 0.5;
|
||||
if (colorMatch) {
|
||||
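const opacityMatch = colorMatch[0].startsWith('rgba') ? colorMatch[0].match(/[\d.]+\)$/) : null; // rgb() has no alpha; don't read the blue channel as opacity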
|
||||
if (opacityMatch) {
|
||||
opacity = parseFloat(opacityMatch[0].replace(')', ''));
|
||||
}
|
||||
}
|
||||
|
||||
return {
|
||||
type: 'outer',
|
||||
angle: Math.round(angle),
|
||||
blur: blur * PT_PER_PX, // Convert px to points
|
||||
color: colorMatch ? rgbToHex(colorMatch[0]) : '000000',
|
||||
offset: offset,
|
||||
opacity
|
||||
};
|
||||
};
|
||||
|
||||
// Parse inline formatting tags (<b>, <i>, <u>, <strong>, <em>, <span>) into text runs
|
||||
const parseInlineFormatting = (element, baseOptions = {}, runs = [], baseTextTransform = (x) => x) => {
|
||||
let prevNodeIsText = false;
|
||||
|
||||
element.childNodes.forEach((node) => {
|
||||
let textTransform = baseTextTransform;
|
||||
|
||||
const isText = node.nodeType === Node.TEXT_NODE || node.tagName === 'BR';
|
||||
if (isText) {
|
||||
const text = node.tagName === 'BR' ? '\n' : textTransform(node.textContent.replace(/\s+/g, ' '));
|
||||
const prevRun = runs[runs.length - 1];
|
||||
if (prevNodeIsText && prevRun) {
|
||||
prevRun.text += text;
|
||||
} else {
|
||||
runs.push({ text, options: { ...baseOptions } });
|
||||
}
|
||||
|
||||
} else if (node.nodeType === Node.ELEMENT_NODE && node.textContent.trim()) {
|
||||
const options = { ...baseOptions };
|
||||
const computed = window.getComputedStyle(node);
|
||||
|
||||
// Handle inline elements with computed styles
|
||||
if (node.tagName === 'SPAN' || node.tagName === 'B' || node.tagName === 'STRONG' || node.tagName === 'I' || node.tagName === 'EM' || node.tagName === 'U') {
|
||||
const isBold = computed.fontWeight === 'bold' || parseInt(computed.fontWeight) >= 600;
|
||||
if (isBold && !shouldSkipBold(computed.fontFamily)) options.bold = true;
|
||||
if (computed.fontStyle === 'italic') options.italic = true;
|
||||
if (computed.textDecoration && computed.textDecoration.includes('underline')) options.underline = true;
|
||||
if (computed.color && computed.color !== 'rgb(0, 0, 0)') {
|
||||
options.color = rgbToHex(computed.color);
|
||||
const transparency = extractAlpha(computed.color);
|
||||
if (transparency !== null) options.transparency = transparency;
|
||||
}
|
||||
if (computed.fontSize) options.fontSize = pxToPoints(computed.fontSize);
|
||||
|
||||
// Apply text-transform on the span element itself
|
||||
if (computed.textTransform && computed.textTransform !== 'none') {
|
||||
const transformStr = computed.textTransform;
|
||||
textTransform = (text) => applyTextTransform(text, transformStr);
|
||||
}
|
||||
|
||||
// Validate: Check for margins on inline elements
|
||||
if (computed.marginLeft && parseFloat(computed.marginLeft) > 0) {
|
||||
errors.push(`Inline element <${node.tagName.toLowerCase()}> has margin-left which is not supported in PowerPoint. Remove margin from inline elements.`);
|
||||
}
|
||||
if (computed.marginRight && parseFloat(computed.marginRight) > 0) {
|
||||
errors.push(`Inline element <${node.tagName.toLowerCase()}> has margin-right which is not supported in PowerPoint. Remove margin from inline elements.`);
|
||||
}
|
||||
if (computed.marginTop && parseFloat(computed.marginTop) > 0) {
|
||||
errors.push(`Inline element <${node.tagName.toLowerCase()}> has margin-top which is not supported in PowerPoint. Remove margin from inline elements.`);
|
||||
}
|
||||
if (computed.marginBottom && parseFloat(computed.marginBottom) > 0) {
|
||||
errors.push(`Inline element <${node.tagName.toLowerCase()}> has margin-bottom which is not supported in PowerPoint. Remove margin from inline elements.`);
|
||||
}
|
||||
|
||||
// Recursively process the child node. This will flatten nested spans into multiple runs.
|
||||
parseInlineFormatting(node, options, runs, textTransform);
|
||||
}
|
||||
}
|
||||
|
||||
prevNodeIsText = isText;
|
||||
});
|
||||
|
||||
// Trim leading space from first run and trailing space from last run
|
||||
if (runs.length > 0) {
|
||||
runs[0].text = runs[0].text.replace(/^\s+/, '');
|
||||
runs[runs.length - 1].text = runs[runs.length - 1].text.replace(/\s+$/, '');
|
||||
}
|
||||
|
||||
return runs.filter(r => r.text.length > 0);
|
||||
};
|
||||
|
||||
// Extract background from body (image or color)
|
||||
const body = document.body;
|
||||
const bodyStyle = window.getComputedStyle(body);
|
||||
const bgImage = bodyStyle.backgroundImage;
|
||||
const bgColor = bodyStyle.backgroundColor;
|
||||
|
||||
// Collect validation errors
|
||||
const errors = [];
|
||||
|
||||
// Validate: Check for CSS gradients
|
||||
if (bgImage && (bgImage.includes('linear-gradient') || bgImage.includes('radial-gradient'))) {
|
||||
errors.push(
|
||||
'CSS gradients are not supported. Use Sharp to rasterize gradients as PNG images first, ' +
|
||||
'then reference with background-image: url(\'gradient.png\')'
|
||||
);
|
||||
}
|
||||
|
||||
let background;
|
||||
if (bgImage && bgImage !== 'none') {
|
||||
// Extract URL from url("...") or url(...)
|
||||
const urlMatch = bgImage.match(/url\(["']?([^"')]+)["']?\)/);
|
||||
if (urlMatch) {
|
||||
background = {
|
||||
type: 'image',
|
||||
path: urlMatch[1]
|
||||
};
|
||||
} else {
|
||||
background = {
|
||||
type: 'color',
|
||||
value: rgbToHex(bgColor)
|
||||
};
|
||||
}
|
||||
} else {
|
||||
background = {
|
||||
type: 'color',
|
||||
value: rgbToHex(bgColor)
|
||||
};
|
||||
}
|
||||
|
||||
// Process all elements
|
||||
const elements = [];
|
||||
const placeholders = [];
|
||||
const textTags = ['P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'UL', 'OL', 'LI'];
|
||||
const processed = new Set();
|
||||
|
||||
document.querySelectorAll('*').forEach((el) => {
|
||||
if (processed.has(el)) return;
|
||||
|
||||
// Validate text elements don't have backgrounds, borders, or shadows
|
||||
if (textTags.includes(el.tagName)) {
|
||||
const computed = window.getComputedStyle(el);
|
||||
const hasBg = computed.backgroundColor && computed.backgroundColor !== 'rgba(0, 0, 0, 0)';
|
||||
const hasBorder = (computed.borderWidth && parseFloat(computed.borderWidth) > 0) ||
|
||||
(computed.borderTopWidth && parseFloat(computed.borderTopWidth) > 0) ||
|
||||
(computed.borderRightWidth && parseFloat(computed.borderRightWidth) > 0) ||
|
||||
(computed.borderBottomWidth && parseFloat(computed.borderBottomWidth) > 0) ||
|
||||
(computed.borderLeftWidth && parseFloat(computed.borderLeftWidth) > 0);
|
||||
const hasShadow = computed.boxShadow && computed.boxShadow !== 'none';
|
||||
|
||||
if (hasBg || hasBorder || hasShadow) {
|
||||
errors.push(
|
||||
`Text element <${el.tagName.toLowerCase()}> has ${hasBg ? 'background' : hasBorder ? 'border' : 'shadow'}. ` +
|
||||
'Backgrounds, borders, and shadows are only supported on <div> elements, not text elements.'
|
||||
);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
// Extract placeholder elements (for charts, etc.)
|
||||
if (el.className && el.className.includes('placeholder')) {
|
||||
const rect = el.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) {
|
||||
errors.push(
|
||||
`Placeholder "${el.id || 'unnamed'}" has ${rect.width === 0 ? 'width: 0' : 'height: 0'}. Check the layout CSS.`
|
||||
);
|
||||
} else {
|
||||
placeholders.push({
|
||||
id: el.id || `placeholder-${placeholders.length}`,
|
||||
x: pxToInch(rect.left),
|
||||
y: pxToInch(rect.top),
|
||||
w: pxToInch(rect.width),
|
||||
h: pxToInch(rect.height)
|
||||
});
|
||||
}
|
||||
processed.add(el);
|
||||
return;
|
||||
}
|
||||
|
||||
// Extract images
|
||||
if (el.tagName === 'IMG') {
|
||||
const rect = el.getBoundingClientRect();
|
||||
if (rect.width > 0 && rect.height > 0) {
|
||||
elements.push({
|
||||
type: 'image',
|
||||
src: el.src,
|
||||
position: {
|
||||
x: pxToInch(rect.left),
|
||||
y: pxToInch(rect.top),
|
||||
w: pxToInch(rect.width),
|
||||
h: pxToInch(rect.height)
|
||||
}
|
||||
});
|
||||
processed.add(el);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
// Extract DIVs with backgrounds/borders as shapes
|
||||
const isContainer = el.tagName === 'DIV'; // textTags never contains 'DIV', so no extra check is needed
|
||||
if (isContainer) {
|
||||
const computed = window.getComputedStyle(el);
|
||||
const hasBg = computed.backgroundColor && computed.backgroundColor !== 'rgba(0, 0, 0, 0)';
|
||||
|
||||
// Validate: Check for unwrapped text content in DIV
|
||||
for (const node of el.childNodes) {
|
||||
if (node.nodeType === Node.TEXT_NODE) {
|
||||
const text = node.textContent.trim();
|
||||
if (text) {
|
||||
errors.push(
|
||||
`DIV element contains unwrapped text "${text.substring(0, 50)}${text.length > 50 ? '...' : ''}". ` +
|
||||
'All text must be wrapped in <p>, <h1>-<h6>, <ul>, or <ol> tags to appear in PowerPoint.'
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Check for background images on shapes
|
||||
const bgImage = computed.backgroundImage;
|
||||
if (bgImage && bgImage !== 'none') {
|
||||
errors.push(
|
||||
'Background images on DIV elements are not supported. ' +
|
||||
'Use solid colors or borders for shapes, or use slide.addImage() in PptxGenJS to layer images.'
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
// Check for borders - both uniform and partial
|
||||
const borderTop = computed.borderTopWidth;
|
||||
const borderRight = computed.borderRightWidth;
|
||||
const borderBottom = computed.borderBottomWidth;
|
||||
const borderLeft = computed.borderLeftWidth;
|
||||
const borders = [borderTop, borderRight, borderBottom, borderLeft].map(b => parseFloat(b) || 0);
|
||||
const hasBorder = borders.some(b => b > 0);
|
||||
const hasUniformBorder = hasBorder && borders.every(b => b === borders[0]);
|
||||
const borderLines = [];
|
||||
|
||||
if (hasBorder && !hasUniformBorder) {
|
||||
const rect = el.getBoundingClientRect();
|
||||
const x = pxToInch(rect.left);
|
||||
const y = pxToInch(rect.top);
|
||||
const w = pxToInch(rect.width);
|
||||
const h = pxToInch(rect.height);
|
||||
|
||||
// Collect lines to add after shape (inset by half the line width to center on edge)
|
||||
if (parseFloat(borderTop) > 0) {
|
||||
const widthPt = pxToPoints(borderTop);
|
||||
const inset = (widthPt / 72) / 2; // Convert points to inches, then half
|
||||
borderLines.push({
|
||||
type: 'line',
|
||||
x1: x, y1: y + inset, x2: x + w, y2: y + inset,
|
||||
width: widthPt,
|
||||
color: rgbToHex(computed.borderTopColor)
|
||||
});
|
||||
}
|
||||
if (parseFloat(borderRight) > 0) {
|
||||
const widthPt = pxToPoints(borderRight);
|
||||
const inset = (widthPt / 72) / 2;
|
||||
borderLines.push({
|
||||
type: 'line',
|
||||
x1: x + w - inset, y1: y, x2: x + w - inset, y2: y + h,
|
||||
width: widthPt,
|
||||
color: rgbToHex(computed.borderRightColor)
|
||||
});
|
||||
}
|
||||
if (parseFloat(borderBottom) > 0) {
|
||||
const widthPt = pxToPoints(borderBottom);
|
||||
const inset = (widthPt / 72) / 2;
|
||||
borderLines.push({
|
||||
type: 'line',
|
||||
x1: x, y1: y + h - inset, x2: x + w, y2: y + h - inset,
|
||||
width: widthPt,
|
||||
color: rgbToHex(computed.borderBottomColor)
|
||||
});
|
||||
}
|
||||
if (parseFloat(borderLeft) > 0) {
|
||||
const widthPt = pxToPoints(borderLeft);
|
||||
const inset = (widthPt / 72) / 2;
|
||||
borderLines.push({
|
||||
type: 'line',
|
||||
x1: x + inset, y1: y, x2: x + inset, y2: y + h,
|
||||
width: widthPt,
|
||||
color: rgbToHex(computed.borderLeftColor)
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
if (hasBg || hasBorder) {
|
||||
const rect = el.getBoundingClientRect();
|
||||
if (rect.width > 0 && rect.height > 0) {
|
||||
const shadow = parseBoxShadow(computed.boxShadow);
|
||||
|
||||
// Only add shape if there's background or uniform border
|
||||
if (hasBg || hasUniformBorder) {
|
||||
elements.push({
|
||||
type: 'shape',
|
||||
text: '', // Shape only - child text elements render on top
|
||||
position: {
|
||||
x: pxToInch(rect.left),
|
||||
y: pxToInch(rect.top),
|
||||
w: pxToInch(rect.width),
|
||||
h: pxToInch(rect.height)
|
||||
},
|
||||
shape: {
|
||||
fill: hasBg ? rgbToHex(computed.backgroundColor) : null,
|
||||
transparency: hasBg ? extractAlpha(computed.backgroundColor) : null,
|
||||
line: hasUniformBorder ? {
|
||||
color: rgbToHex(computed.borderColor),
|
||||
width: pxToPoints(computed.borderWidth)
|
||||
} : null,
|
||||
// Convert border-radius to rectRadius (in inches)
|
||||
// % values: 50%+ = circle (1), <50% = percentage of min dimension
|
||||
// pt values: divide by 72 (72pt = 1 inch)
|
||||
// px values: divide by 96 (96px = 1 inch)
|
||||
rectRadius: (() => {
|
||||
const radius = computed.borderRadius;
|
||||
const radiusValue = parseFloat(radius);
|
||||
if (radiusValue === 0) return 0;
|
||||
|
||||
if (radius.includes('%')) {
|
||||
if (radiusValue >= 50) return 1;
|
||||
// Calculate percentage of smaller dimension
|
||||
const minDim = Math.min(rect.width, rect.height);
|
||||
return (radiusValue / 100) * pxToInch(minDim);
|
||||
}
|
||||
|
||||
if (radius.includes('pt')) return radiusValue / 72;
|
||||
return radiusValue / PX_PER_IN;
|
||||
})(),
|
||||
shadow: shadow
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
// Add partial border lines
|
||||
elements.push(...borderLines);
|
||||
|
||||
processed.add(el);
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Extract bullet lists as single text block
|
||||
if (el.tagName === 'UL' || el.tagName === 'OL') {
|
||||
const rect = el.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return;
|
||||
|
||||
const liElements = Array.from(el.querySelectorAll('li'));
|
||||
const items = [];
|
||||
const ulComputed = window.getComputedStyle(el);
|
||||
const ulPaddingLeftPt = pxToPoints(ulComputed.paddingLeft);
|
||||
|
||||
// Split: margin-left for bullet position, indent for text position
|
||||
// margin-left + indent = ul padding-left
|
||||
const marginLeft = ulPaddingLeftPt * 0.5;
|
||||
const textIndent = ulPaddingLeftPt * 0.5;
|
||||
|
||||
liElements.forEach((li, idx) => {
|
||||
const isLast = idx === liElements.length - 1;
|
||||
const runs = parseInlineFormatting(li, { breakLine: false });
|
||||
// Clean manual bullets from first run
|
||||
if (runs.length > 0) {
|
||||
runs[0].text = runs[0].text.replace(/^[•\-\*▪▸]\s*/, '');
|
||||
runs[0].options.bullet = { indent: textIndent };
|
||||
}
|
||||
// Set breakLine on the item's last run (every item except the final one)
|
||||
if (runs.length > 0 && !isLast) {
|
||||
runs[runs.length - 1].options.breakLine = true;
|
||||
}
|
||||
items.push(...runs);
|
||||
});
|
||||
|
||||
const computed = window.getComputedStyle(liElements[0] || el);
|
||||
|
||||
elements.push({
|
||||
type: 'list',
|
||||
items: items,
|
||||
position: {
|
||||
x: pxToInch(rect.left),
|
||||
y: pxToInch(rect.top),
|
||||
w: pxToInch(rect.width),
|
||||
h: pxToInch(rect.height)
|
||||
},
|
||||
style: {
|
||||
fontSize: pxToPoints(computed.fontSize),
|
||||
fontFace: computed.fontFamily.split(',')[0].replace(/['"]/g, '').trim(),
|
||||
color: rgbToHex(computed.color),
|
||||
transparency: extractAlpha(computed.color),
|
||||
align: computed.textAlign === 'start' ? 'left' : computed.textAlign,
|
||||
lineSpacing: computed.lineHeight && computed.lineHeight !== 'normal' ? pxToPoints(computed.lineHeight) : null,
|
||||
paraSpaceBefore: 0,
|
||||
paraSpaceAfter: pxToPoints(computed.marginBottom),
|
||||
// PptxGenJS margin array is [left, right, bottom, top]
|
||||
margin: [marginLeft, 0, 0, 0]
|
||||
}
|
||||
});
|
||||
|
||||
liElements.forEach(li => processed.add(li));
|
||||
processed.add(el);
|
||||
return;
|
||||
}
|
||||
|
||||
// Extract text elements (P, H1, H2, etc.)
|
||||
if (!textTags.includes(el.tagName)) return;
|
||||
|
||||
const rect = el.getBoundingClientRect();
|
||||
const text = el.textContent.trim();
|
||||
if (rect.width === 0 || rect.height === 0 || !text) return;
|
||||
|
||||
// Validate: Check for manual bullet symbols in text elements (not in lists)
|
||||
if (el.tagName !== 'LI' && /^[•\-\*▪▸○●◆◇■□]\s/.test(text.trimStart())) {
|
||||
errors.push(
|
||||
`Text element <${el.tagName.toLowerCase()}> starts with bullet symbol "${text.substring(0, 20)}...". ` +
|
||||
'Use <ul> or <ol> lists instead of manual bullet symbols.'
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
const computed = window.getComputedStyle(el);
|
||||
const rotation = getRotation(computed.transform, computed.writingMode);
|
||||
const { x, y, w, h } = getPositionAndSize(el, rect, rotation);
|
||||
|
||||
const baseStyle = {
|
||||
fontSize: pxToPoints(computed.fontSize),
|
||||
fontFace: computed.fontFamily.split(',')[0].replace(/['"]/g, '').trim(),
|
||||
color: rgbToHex(computed.color),
|
||||
align: computed.textAlign === 'start' ? 'left' : computed.textAlign,
|
||||
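lineSpacing: computed.lineHeight && computed.lineHeight !== 'normal' ? pxToPoints(computed.lineHeight) : null, // 'normal' would otherwise parse to NaN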
|
||||
paraSpaceBefore: pxToPoints(computed.marginTop),
|
||||
paraSpaceAfter: pxToPoints(computed.marginBottom),
|
||||
// PptxGenJS margin array is [left, right, bottom, top] (not [top, right, bottom, left] as documented)
|
||||
margin: [
|
||||
pxToPoints(computed.paddingLeft),
|
||||
pxToPoints(computed.paddingRight),
|
||||
          pxToPoints(computed.paddingBottom),
          pxToPoints(computed.paddingTop)
        ]
      };

      const transparency = extractAlpha(computed.color);
      if (transparency !== null) baseStyle.transparency = transparency;

      if (rotation !== null) baseStyle.rotate = rotation;

      const hasFormatting = el.querySelector('b, i, u, strong, em, span, br');

      if (hasFormatting) {
        // Text with inline formatting
        const transformStr = computed.textTransform;
        const runs = parseInlineFormatting(el, {}, [], (str) => applyTextTransform(str, transformStr));

        // Adjust lineSpacing based on largest fontSize in runs
        const adjustedStyle = { ...baseStyle };
        if (adjustedStyle.lineSpacing) {
          const maxFontSize = Math.max(
            adjustedStyle.fontSize,
            ...runs.map(r => r.options?.fontSize || 0)
          );
          if (maxFontSize > adjustedStyle.fontSize) {
            const lineHeightMultiplier = adjustedStyle.lineSpacing / adjustedStyle.fontSize;
            adjustedStyle.lineSpacing = maxFontSize * lineHeightMultiplier;
          }
        }

        elements.push({
          type: el.tagName.toLowerCase(),
          text: runs,
          position: { x: pxToInch(x), y: pxToInch(y), w: pxToInch(w), h: pxToInch(h) },
          style: adjustedStyle
        });
      } else {
        // Plain text - inherit CSS formatting
        const textTransform = computed.textTransform;
        const transformedText = applyTextTransform(text, textTransform);

        const isBold = computed.fontWeight === 'bold' || parseInt(computed.fontWeight) >= 600;

        elements.push({
          type: el.tagName.toLowerCase(),
          text: transformedText,
          position: { x: pxToInch(x), y: pxToInch(y), w: pxToInch(w), h: pxToInch(h) },
          style: {
            ...baseStyle,
            bold: isBold && !shouldSkipBold(computed.fontFamily),
            italic: computed.fontStyle === 'italic',
            underline: computed.textDecoration.includes('underline')
          }
        });
      }

      processed.add(el);
    });

    return { background, elements, placeholders, errors };
  });
}
|
||||
|
||||
async function html2pptx(htmlFile, pres, options = {}) {
  const {
    tmpDir = process.env.TMPDIR || '/tmp',
    slide = null
  } = options;

  try {
    // Use installed Chrome on macOS; use the default bundled Chromium elsewhere
    const launchOptions = { env: { TMPDIR: tmpDir } };
    if (process.platform === 'darwin') {
      launchOptions.channel = 'chrome';
    }

    const browser = await chromium.launch(launchOptions);

    let bodyDimensions;
    let slideData;

    const filePath = path.isAbsolute(htmlFile) ? htmlFile : path.join(process.cwd(), htmlFile);
    const validationErrors = [];

    try {
      const page = await browser.newPage();
      page.on('console', (msg) => {
        // Forward messages from the browser console to the Node.js console
        console.log(`Browser console: ${msg.text()}`);
      });

      await page.goto(`file://${filePath}`);

      bodyDimensions = await getBodyDimensions(page);

      await page.setViewportSize({
        width: Math.round(bodyDimensions.width),
        height: Math.round(bodyDimensions.height)
      });

      slideData = await extractSlideData(page);
    } finally {
      // Always close the browser, even if extraction fails
      await browser.close();
    }

    // Collect all validation errors
    if (bodyDimensions.errors && bodyDimensions.errors.length > 0) {
      validationErrors.push(...bodyDimensions.errors);
    }

    const dimensionErrors = validateDimensions(bodyDimensions, pres);
    if (dimensionErrors.length > 0) {
      validationErrors.push(...dimensionErrors);
    }

    const textBoxPositionErrors = validateTextBoxPosition(slideData, bodyDimensions);
    if (textBoxPositionErrors.length > 0) {
      validationErrors.push(...textBoxPositionErrors);
    }

    if (slideData.errors && slideData.errors.length > 0) {
      validationErrors.push(...slideData.errors);
    }

    // Throw all errors at once if any exist
    if (validationErrors.length > 0) {
      const errorMessage = validationErrors.length === 1
        ? validationErrors[0]
        : `Multiple validation errors found:\n${validationErrors.map((e, i) => ` ${i + 1}. ${e}`).join('\n')}`;
      throw new Error(errorMessage);
    }

    const targetSlide = slide || pres.addSlide();

    await addBackground(slideData, targetSlide, tmpDir);
    addElements(slideData, targetSlide, pres);

    return { slide: targetSlide, placeholders: slideData.placeholders };
  } catch (error) {
    // Prefix the error with the HTML file name unless it is already present
    if (!error.message.startsWith(htmlFile)) {
      throw new Error(`${htmlFile}: ${error.message}`);
    }
    throw error;
  }
}

module.exports = html2pptx;
|
||||
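The collect-then-throw pattern at the end of `html2pptx` (gather every validation error, then raise once with a numbered list) can be sketched in Python as follows. This is only an illustration under stated assumptions: `format_validation_errors` is a hypothetical helper name that does not appear in the original script.

```python
from typing import List, Optional


def format_validation_errors(errors: List[str]) -> Optional[str]:
    """Combine collected errors into one message (hypothetical helper).

    Returns None when there is nothing to report, the bare message when
    there is exactly one error, and a numbered list otherwise.
    """
    if not errors:
        return None
    if len(errors) == 1:
        return errors[0]
    numbered = "\n".join(f"  {i}. {e}" for i, e in enumerate(errors, start=1))
    return f"Multiple validation errors found:\n{numbered}"
```

Reporting every problem in a single exception lets the caller fix all of them in one pass instead of rerunning the conversion once per error.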
File diff suppressed because it is too large
@@ -1,231 +0,0 @@
|
||||
#!/usr/bin/env python3
"""
Rearrange PowerPoint slides based on a sequence of indices.

Usage:
    python rearrange.py template.pptx output.pptx 0,34,34,50,52

This will create output.pptx using slides from template.pptx in the specified order.
Slides can be repeated (e.g., 34 appears twice).
"""

import argparse
import shutil
import sys
from copy import deepcopy
from pathlib import Path

import six
from pptx import Presentation
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(
        description="Rearrange PowerPoint slides based on a sequence of indices.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python rearrange.py template.pptx output.pptx 0,34,34,50,52
    Creates output.pptx using slides 0, 34 (twice), 50, and 52 from template.pptx

  python rearrange.py template.pptx output.pptx 5,3,1,2,4
    Creates output.pptx with slides reordered as specified

Note: Slide indices are 0-based (first slide is 0, second is 1, etc.)
        """,
    )

    parser.add_argument("template", help="Path to template PPTX file")
    parser.add_argument("output", help="Path for output PPTX file")
    parser.add_argument(
        "sequence", help="Comma-separated sequence of slide indices (0-based)"
    )

    args = parser.parse_args()

    # Parse the slide sequence
    try:
        slide_sequence = [int(x.strip()) for x in args.sequence.split(",")]
    except ValueError:
        print(
            "Error: Invalid sequence format. Use comma-separated integers (e.g., 0,34,34,50,52)"
        )
        sys.exit(1)

    # Check template exists
    template_path = Path(args.template)
    if not template_path.exists():
        print(f"Error: Template file not found: {args.template}")
        sys.exit(1)

    # Create output directory if needed
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    try:
        rearrange_presentation(template_path, output_path, slide_sequence)
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"Error processing presentation: {e}")
        sys.exit(1)
|
||||
|
||||
|
||||
def duplicate_slide(pres, index):
    """Duplicate a slide in the presentation."""
    source = pres.slides[index]

    # Use source's layout to preserve formatting
    new_slide = pres.slides.add_slide(source.slide_layout)

    # Collect all image and media relationships from the source slide
    image_rels = {}
    for rel_id, rel in six.iteritems(source.part.rels):
        if "image" in rel.reltype or "media" in rel.reltype:
            image_rels[rel_id] = rel

    # CRITICAL: Clear placeholder shapes to avoid duplicates.
    # Iterate over a snapshot: removing elements while iterating the live
    # shape tree can skip shapes.
    for shape in list(new_slide.shapes):
        sp = shape.element
        sp.getparent().remove(sp)

    # Copy all shapes from source
    for shape in source.shapes:
        el = shape.element
        new_el = deepcopy(el)
        new_slide.shapes._spTree.insert_element_before(new_el, "p:extLst")

        # Handle picture shapes - need to update the blip reference
        # Look for all blip elements (they can be in pic or other contexts)
        # Using the element's own xpath method without namespaces argument
        blips = new_el.xpath(".//a:blip[@r:embed]")
        for blip in blips:
            old_rId = blip.get(
                "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
            )
            if old_rId in image_rels:
                # Create a new relationship in the destination slide for this image
                old_rel = image_rels[old_rId]
                # get_or_add returns the rId directly, or adds and returns new rId
                new_rId = new_slide.part.rels.get_or_add(
                    old_rel.reltype, old_rel._target
                )
                # Update the blip's embed reference to use the new relationship ID
                blip.set(
                    "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed",
                    new_rId,
                )

    # Copy any additional image/media relationships that might be referenced elsewhere
    for rel_id, rel in image_rels.items():
        try:
            new_slide.part.rels.get_or_add(rel.reltype, rel._target)
        except Exception:
            pass  # Relationship might already exist

    return new_slide
|
||||
|
||||
|
||||
def delete_slide(pres, index):
    """Delete a slide from the presentation."""
    rId = pres.slides._sldIdLst[index].rId
    pres.part.drop_rel(rId)
    del pres.slides._sldIdLst[index]


def reorder_slides(pres, slide_index, target_index):
    """Move a slide from one position to another."""
    slides = pres.slides._sldIdLst

    # Remove slide element from current position
    slide_element = slides[slide_index]
    slides.remove(slide_element)

    # Insert at target position
    slides.insert(target_index, slide_element)
|
||||
|
||||
|
||||
def rearrange_presentation(template_path, output_path, slide_sequence):
    """
    Create a new presentation with slides from template in specified order.

    Args:
        template_path: Path to template PPTX file
        output_path: Path for output PPTX file
        slide_sequence: List of slide indices (0-based) to include
    """
    # Copy template to preserve dimensions and theme
    if template_path != output_path:
        shutil.copy2(template_path, output_path)
        prs = Presentation(output_path)
    else:
        prs = Presentation(template_path)

    total_slides = len(prs.slides)

    # Validate indices
    for idx in slide_sequence:
        if idx < 0 or idx >= total_slides:
            raise ValueError(f"Slide index {idx} out of range (0-{total_slides - 1})")

    # Track original slides and their duplicates
    slide_map = []  # List of actual slide indices for final presentation
    duplicated = {}  # Track duplicates: original_idx -> [duplicate_indices]

    # Step 1: DUPLICATE repeated slides
    print(f"Processing {len(slide_sequence)} slides from template...")
    for i, template_idx in enumerate(slide_sequence):
        if template_idx in duplicated and duplicated[template_idx]:
            # Already duplicated this slide, use the duplicate
            slide_map.append(duplicated[template_idx].pop(0))
            print(f" [{i}] Using duplicate of slide {template_idx}")
        elif slide_sequence.count(template_idx) > 1 and template_idx not in duplicated:
            # First occurrence of a repeated slide - create duplicates
            slide_map.append(template_idx)
            duplicates = []
            count = slide_sequence.count(template_idx) - 1
            print(
                f" [{i}] Using original slide {template_idx}, creating {count} duplicate(s)"
            )
            for _ in range(count):
                duplicate_slide(prs, template_idx)
                duplicates.append(len(prs.slides) - 1)
            duplicated[template_idx] = duplicates
        else:
            # Unique slide or first occurrence already handled, use original
            slide_map.append(template_idx)
            print(f" [{i}] Using original slide {template_idx}")

    # Step 2: DELETE unwanted slides (work backwards)
    slides_to_keep = set(slide_map)
    print(f"\nDeleting {len(prs.slides) - len(slides_to_keep)} unused slides...")
    for i in range(len(prs.slides) - 1, -1, -1):
        if i not in slides_to_keep:
            delete_slide(prs, i)
            # Update slide_map indices after deletion
            slide_map = [idx - 1 if idx > i else idx for idx in slide_map]

    # Step 3: REORDER to final sequence
    print(f"Reordering {len(slide_map)} slides to final sequence...")
    for target_pos in range(len(slide_map)):
        # Find which slide should be at target_pos
        current_pos = slide_map[target_pos]
        if current_pos != target_pos:
            reorder_slides(prs, current_pos, target_pos)
            # Update slide_map: the move shifts other slides
            for i in range(len(slide_map)):
                if slide_map[i] > current_pos and slide_map[i] <= target_pos:
                    slide_map[i] -= 1
                elif slide_map[i] < current_pos and slide_map[i] >= target_pos:
                    slide_map[i] += 1
            slide_map[target_pos] = target_pos

    # Save the presentation
    prs.save(output_path)
    print(f"\nSaved rearranged presentation to: {output_path}")
    print(f"Final presentation has {len(prs.slides)} slides")


if __name__ == "__main__":
    main()
|
||||
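The duplicate, delete, and reorder steps in rearrange.py work in place on the presentation, which makes the index bookkeeping hard to follow. A small reference model of the expected result, using a plain list in place of slides, may help when checking that logic. This is a sketch under stated assumptions: `expected_order` is a hypothetical name, and the function only models the final slide order, not the python-pptx operations.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expected_order(slides: Sequence[T], sequence: Sequence[int]) -> List[T]:
    """Return the slides the output deck should contain, in order.

    Hypothetical reference model: validates the 0-based indices the same way
    the script does, then selects slides by index (repeats are allowed).
    """
    total = len(slides)
    for idx in sequence:
        if idx < 0 or idx >= total:
            raise ValueError(f"Slide index {idx} out of range (0-{total - 1})")
    return [slides[idx] for idx in sequence]
```

For example, with six slides labelled `a` to `f`, the sequence `0,3,3,5` should yield `a, d, d, f`; the in-place algorithm is expected to end in the same order.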
@@ -1,385 +0,0 @@
|
||||
#!/usr/bin/env python3
"""Apply text replacements to PowerPoint presentation.

Usage:
    python replace.py <input.pptx> <replacements.json> <output.pptx>

The replacements JSON should have the structure output by inventory.py.
ALL text shapes identified by inventory.py will have their text cleared
unless "paragraphs" is specified in the replacements for that shape.
"""

import json
import sys
from pathlib import Path
from typing import Any, Dict, List

from inventory import InventoryData, extract_text_inventory
from pptx import Presentation
from pptx.dml.color import RGBColor
from pptx.enum.dml import MSO_THEME_COLOR
from pptx.enum.text import PP_ALIGN
from pptx.oxml.xmlchemy import OxmlElement
from pptx.util import Pt
|
||||
|
||||
|
||||
def clear_paragraph_bullets(paragraph):
    """Clear bullet formatting from a paragraph."""
    pPr = paragraph._element.get_or_add_pPr()

    # Remove existing bullet elements
    for child in list(pPr):
        if (
            child.tag.endswith("buChar")
            or child.tag.endswith("buNone")
            or child.tag.endswith("buAutoNum")
            or child.tag.endswith("buFont")
        ):
            pPr.remove(child)

    return pPr
|
||||
|
||||
|
||||
def apply_paragraph_properties(paragraph, para_data: Dict[str, Any]):
    """Apply formatting properties to a paragraph."""
    # Get the text but don't set it on paragraph directly yet
    text = para_data.get("text", "")

    # Get or create paragraph properties
    pPr = clear_paragraph_bullets(paragraph)

    # Handle bullet formatting
    if para_data.get("bullet", False):
        level = para_data.get("level", 0)
        paragraph.level = level

        # Calculate font-proportional indentation
        font_size = para_data.get("font_size", 18.0)
        level_indent_emu = int((font_size * (1.6 + level * 1.6)) * 12700)
        hanging_indent_emu = int(-font_size * 0.8 * 12700)

        # Set indentation
        pPr.attrib["marL"] = str(level_indent_emu)
        pPr.attrib["indent"] = str(hanging_indent_emu)

        # Add bullet character
        buChar = OxmlElement("a:buChar")
        buChar.set("char", "•")
        pPr.append(buChar)

        # Default to left alignment for bullets if not specified
        if "alignment" not in para_data:
            paragraph.alignment = PP_ALIGN.LEFT
    else:
        # Remove indentation for non-bullet text
        pPr.attrib["marL"] = "0"
        pPr.attrib["indent"] = "0"

        # Add buNone element
        buNone = OxmlElement("a:buNone")
        pPr.insert(0, buNone)

    # Apply alignment
    if "alignment" in para_data:
        alignment_map = {
            "LEFT": PP_ALIGN.LEFT,
            "CENTER": PP_ALIGN.CENTER,
            "RIGHT": PP_ALIGN.RIGHT,
            "JUSTIFY": PP_ALIGN.JUSTIFY,
        }
        if para_data["alignment"] in alignment_map:
            paragraph.alignment = alignment_map[para_data["alignment"]]

    # Apply spacing
    if "space_before" in para_data:
        paragraph.space_before = Pt(para_data["space_before"])
    if "space_after" in para_data:
        paragraph.space_after = Pt(para_data["space_after"])
    if "line_spacing" in para_data:
        paragraph.line_spacing = Pt(para_data["line_spacing"])

    # Apply run-level formatting
    if not paragraph.runs:
        run = paragraph.add_run()
        run.text = text
    else:
        run = paragraph.runs[0]
        run.text = text

    # Apply font properties
    apply_font_properties(run, para_data)
|
||||
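The font-proportional bullet indentation in `apply_paragraph_properties` converts points to EMU with the factor 12700 (914400 EMU per inch divided by 72 points per inch). The arithmetic can be isolated as below. This is a minimal sketch: `bullet_indents` and `EMU_PER_POINT` are hypothetical names introduced here, while the 1.6 and 0.8 multipliers are taken from the code above.

```python
from typing import Tuple

EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def bullet_indents(font_size: float, level: int = 0) -> Tuple[int, int]:
    """Return (marL, indent) in EMU for a bulleted paragraph (hypothetical helper).

    marL grows by 1.6 x font size per nesting level; indent is a negative
    (hanging) offset of 0.8 x font size so the bullet sits left of the text.
    """
    mar_l = int((font_size * (1.6 + level * 1.6)) * EMU_PER_POINT)
    indent = int(-font_size * 0.8 * EMU_PER_POINT)
    return mar_l, indent
```

Scaling both values with the font size keeps the gap between bullet and text visually constant when the text is resized.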
|
||||
|
||||
def apply_font_properties(run, para_data: Dict[str, Any]):
    """Apply font properties to a text run."""
    if "bold" in para_data:
        run.font.bold = para_data["bold"]
    if "italic" in para_data:
        run.font.italic = para_data["italic"]
    if "underline" in para_data:
        run.font.underline = para_data["underline"]
    if "font_size" in para_data:
        run.font.size = Pt(para_data["font_size"])
    if "font_name" in para_data:
        run.font.name = para_data["font_name"]

    # Apply color - prefer RGB, fall back to theme_color
    if "color" in para_data:
        color_hex = para_data["color"].lstrip("#")
        if len(color_hex) == 6:
            r = int(color_hex[0:2], 16)
            g = int(color_hex[2:4], 16)
            b = int(color_hex[4:6], 16)
            run.font.color.rgb = RGBColor(r, g, b)
    elif "theme_color" in para_data:
        # Get theme color by name (e.g., "DARK_1", "ACCENT_1")
        theme_name = para_data["theme_color"]
        try:
            run.font.color.theme_color = getattr(MSO_THEME_COLOR, theme_name)
        except AttributeError:
            print(f" WARNING: Unknown theme color name '{theme_name}'")
|
||||
|
||||
|
||||
def detect_frame_overflow(inventory: InventoryData) -> Dict[str, Dict[str, float]]:
    """Detect text overflow in shapes (text exceeding shape bounds).

    Returns dict of slide_key -> shape_key -> overflow_inches.
    Only includes shapes that have text overflow.
    """
    overflow_map = {}

    for slide_key, shapes_dict in inventory.items():
        for shape_key, shape_data in shapes_dict.items():
            # Check for frame overflow (text exceeding shape bounds)
            if shape_data.frame_overflow_bottom is not None:
                if slide_key not in overflow_map:
                    overflow_map[slide_key] = {}
                overflow_map[slide_key][shape_key] = shape_data.frame_overflow_bottom

    return overflow_map
|
||||
|
||||
|
||||
def validate_replacements(inventory: InventoryData, replacements: Dict) -> List[str]:
    """Validate that all shapes in replacements exist in inventory.

    Returns list of error messages.
    """
    errors = []

    for slide_key, shapes_data in replacements.items():
        if not slide_key.startswith("slide-"):
            continue

        # Check if slide exists
        if slide_key not in inventory:
            errors.append(f"Slide '{slide_key}' not found in inventory")
            continue

        # Check each shape
        for shape_key in shapes_data.keys():
            if shape_key not in inventory[slide_key]:
                # Find shapes without replacements defined and show their content
                unused_with_content = []
                for k in inventory[slide_key].keys():
                    if k not in shapes_data:
                        shape_data = inventory[slide_key][k]
                        # Get text from paragraphs as preview
                        paragraphs = shape_data.paragraphs
                        if paragraphs and paragraphs[0].text:
                            first_text = paragraphs[0].text[:50]
                            if len(paragraphs[0].text) > 50:
                                first_text += "..."
                            unused_with_content.append(f"{k} ('{first_text}')")
                        else:
                            unused_with_content.append(k)

                errors.append(
                    f"Shape '{shape_key}' not found on '{slide_key}'. "
                    f"Shapes without replacements: {', '.join(sorted(unused_with_content)) if unused_with_content else 'none'}"
                )

    return errors
|
||||
|
||||
|
||||
def check_duplicate_keys(pairs):
    """Check for duplicate keys when loading JSON."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"Duplicate key found in JSON: '{key}'")
        result[key] = value
    return result
|
||||
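By default, Python's `json` module silently keeps the last value when an object repeats a key, so a replacement file with two entries for the same shape would lose one of them without warning. `check_duplicate_keys` prevents this through the `object_pairs_hook` parameter. A self-contained sketch of the mechanism follows; `reject_duplicate_keys` and `load_strict` are hypothetical names used only for this illustration.

```python
import json
from typing import Any, Dict, List, Tuple


def reject_duplicate_keys(pairs: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """Build a dict from decoded key/value pairs, refusing repeated keys."""
    result: Dict[str, Any] = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"Duplicate key found in JSON: '{key}'")
        result[key] = value
    return result


def load_strict(text: str) -> Any:
    """Parse JSON text, raising ValueError if any object repeats a key."""
    # The hook is called once per JSON object, including nested ones.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)
```

Because the hook runs for every object in the document, duplicates are caught at any nesting depth, for example two `"shape-0"` entries under the same slide.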
|
||||
|
||||
def apply_replacements(pptx_file: str, json_file: str, output_file: str):
    """Apply text replacements from JSON to PowerPoint presentation."""

    # Load presentation
    prs = Presentation(pptx_file)

    # Get inventory of all text shapes (returns ShapeData objects)
    # Pass prs to use same Presentation instance
    inventory = extract_text_inventory(Path(pptx_file), prs)

    # Detect text overflow in original presentation
    original_overflow = detect_frame_overflow(inventory)

    # Load replacement data with duplicate key detection
    with open(json_file, "r") as f:
        replacements = json.load(f, object_pairs_hook=check_duplicate_keys)

    # Validate replacements
    errors = validate_replacements(inventory, replacements)
    if errors:
        print("ERROR: Invalid shapes in replacement JSON:")
        for error in errors:
            print(f" - {error}")
        print("\nPlease check the inventory and update your replacement JSON.")
        print(
            "You can regenerate the inventory with: python inventory.py <input.pptx> <output.json>"
        )
        raise ValueError(f"Found {len(errors)} validation error(s)")

    # Track statistics
    shapes_processed = 0
    shapes_cleared = 0
    shapes_replaced = 0

    # Process each slide from inventory
    for slide_key, shapes_dict in inventory.items():
        if not slide_key.startswith("slide-"):
            continue

        slide_index = int(slide_key.split("-")[1])

        if slide_index >= len(prs.slides):
            print(f"Warning: Slide {slide_index} not found")
            continue

        # Process each shape from inventory
        for shape_key, shape_data in shapes_dict.items():
            shapes_processed += 1

            # Get the shape directly from ShapeData
            shape = shape_data.shape
            if not shape:
                print(f"Warning: {shape_key} has no shape reference")
                continue

            # ShapeData already validates text_frame in __init__
            text_frame = shape.text_frame  # type: ignore

            text_frame.clear()  # type: ignore
            shapes_cleared += 1

            # Check for replacement paragraphs
            replacement_shape_data = replacements.get(slide_key, {}).get(shape_key, {})
            if "paragraphs" not in replacement_shape_data:
                continue

            shapes_replaced += 1

            # Add replacement paragraphs
            for i, para_data in enumerate(replacement_shape_data["paragraphs"]):
                if i == 0:
                    p = text_frame.paragraphs[0]  # type: ignore
                else:
                    p = text_frame.add_paragraph()  # type: ignore

                apply_paragraph_properties(p, para_data)

    # Check for issues after replacements
    # Save to a temporary file and reload to avoid modifying the presentation during inventory
    # (extract_text_inventory accesses font.color which adds empty <a:solidFill/> elements)
    import tempfile

    with tempfile.NamedTemporaryFile(suffix=".pptx", delete=False) as tmp:
        tmp_path = Path(tmp.name)
        prs.save(str(tmp_path))

    try:
        updated_inventory = extract_text_inventory(tmp_path)
        updated_overflow = detect_frame_overflow(updated_inventory)
    finally:
        tmp_path.unlink()  # Clean up temp file

    # Check if any text overflow got worse
    overflow_errors = []
    for slide_key, shape_overflows in updated_overflow.items():
        for shape_key, new_overflow in shape_overflows.items():
            # Get original overflow (0 if there was no overflow before)
            original = original_overflow.get(slide_key, {}).get(shape_key, 0.0)

            # Error if overflow increased
            if new_overflow > original + 0.01:  # Small tolerance for rounding
                increase = new_overflow - original
                overflow_errors.append(
                    f'{slide_key}/{shape_key}: overflow worsened by {increase:.2f}" '
                    f'(was {original:.2f}", now {new_overflow:.2f}")'
                )

    # Collect warnings from updated shapes
    warnings = []
    for slide_key, shapes_dict in updated_inventory.items():
        for shape_key, shape_data in shapes_dict.items():
            if shape_data.warnings:
                for warning in shape_data.warnings:
                    warnings.append(f"{slide_key}/{shape_key}: {warning}")

    # Fail if there are any issues
    if overflow_errors or warnings:
        print("\nERROR: Issues detected in replacement output:")
        if overflow_errors:
            print("\nText overflow worsened:")
            for error in overflow_errors:
                print(f" - {error}")
        if warnings:
            print("\nFormatting warnings:")
            for warning in warnings:
                print(f" - {warning}")
        print("\nPlease fix these issues before saving.")
        raise ValueError(
            f"Found {len(overflow_errors)} overflow error(s) and {len(warnings)} warning(s)"
        )

    # Save the presentation
    prs.save(output_file)

    # Report results
    print(f"Saved updated presentation to: {output_file}")
    print(f"Processed {len(prs.slides)} slides")
    print(f" - Shapes processed: {shapes_processed}")
    print(f" - Shapes cleared: {shapes_cleared}")
    print(f" - Shapes replaced: {shapes_replaced}")
|
||||
|
||||
|
||||
def main():
    """Main entry point for command-line usage."""
    if len(sys.argv) != 4:
        print(__doc__)
        sys.exit(1)

    input_pptx = Path(sys.argv[1])
    replacements_json = Path(sys.argv[2])
    output_pptx = Path(sys.argv[3])

    if not input_pptx.exists():
        print(f"Error: Input file '{input_pptx}' not found")
        sys.exit(1)

    if not replacements_json.exists():
        print(f"Error: Replacements JSON file '{replacements_json}' not found")
        sys.exit(1)

    try:
        apply_replacements(str(input_pptx), str(replacements_json), str(output_pptx))
    except Exception as e:
        print(f"Error applying replacements: {e}")
        import traceback

        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
|
||||
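replace.py refuses to save when a replacement makes text overflow worse than it was in the original deck, allowing a 0.01 inch tolerance for rounding. The comparison can be sketched on plain dictionaries as below. This is an illustration under stated assumptions: `worsened_overflows` is a hypothetical name, and the maps are assumed to have the `slide_key -> shape_key -> overflow_inches` shape returned by `detect_frame_overflow`.

```python
from typing import Dict, List

OverflowMap = Dict[str, Dict[str, float]]


def worsened_overflows(
    before: OverflowMap, after: OverflowMap, tolerance: float = 0.01
) -> List[str]:
    """List "slide/shape" keys whose overflow grew by more than the tolerance.

    Shapes absent from `before` are treated as having had no overflow (0.0),
    so newly introduced overflow is reported as well.
    """
    problems = []
    for slide_key, shapes in after.items():
        for shape_key, new_value in shapes.items():
            old_value = before.get(slide_key, {}).get(shape_key, 0.0)
            if new_value > old_value + tolerance:
                problems.append(f"{slide_key}/{shape_key}")
    return problems
```

Comparing against the original values rather than against zero means a template that already overflows slightly does not block every edit; only regressions are flagged.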
@@ -1,450 +0,0 @@
|
||||
#!/usr/bin/env python3
"""
Create thumbnail grids from PowerPoint presentation slides.

Creates a grid layout of slide thumbnails with configurable columns (max 6).
Each grid contains up to cols×(cols+1) images. For presentations with more
slides, multiple numbered grid files are created automatically.

The program outputs the names of all files created.

Output:
- Single grid: {prefix}.jpg (if slides fit in one grid)
- Multiple grids: {prefix}-1.jpg, {prefix}-2.jpg, etc.

Grid limits by column count:
- 3 cols: max 12 slides per grid (3×4)
- 4 cols: max 20 slides per grid (4×5)
- 5 cols: max 30 slides per grid (5×6) [default]
- 6 cols: max 42 slides per grid (6×7)

Usage:
    python thumbnail.py input.pptx [output_prefix] [--cols N] [--outline-placeholders]

Examples:
    python thumbnail.py presentation.pptx
    # Creates: thumbnails.jpg (using default prefix)
    # Outputs:
    # Created 1 grid(s):
    # - thumbnails.jpg

    python thumbnail.py large-deck.pptx grid --cols 4
    # Creates: grid-1.jpg, grid-2.jpg, grid-3.jpg
    # Outputs:
    # Created 3 grid(s):
    # - grid-1.jpg
    # - grid-2.jpg
    # - grid-3.jpg

    python thumbnail.py template.pptx analysis --outline-placeholders
    # Creates thumbnail grids with red outlines around text placeholders
"""

import argparse
import subprocess
import sys
import tempfile
from pathlib import Path

from inventory import extract_text_inventory
from PIL import Image, ImageDraw, ImageFont
from pptx import Presentation

# Constants
THUMBNAIL_WIDTH = 300  # Fixed thumbnail width in pixels
CONVERSION_DPI = 100  # DPI for PDF to image conversion
MAX_COLS = 6  # Maximum number of columns
DEFAULT_COLS = 5  # Default number of columns
JPEG_QUALITY = 95  # JPEG compression quality

# Grid layout constants
GRID_PADDING = 20  # Padding between thumbnails
BORDER_WIDTH = 2  # Border width around thumbnails
FONT_SIZE_RATIO = 0.12  # Font size as fraction of thumbnail width
LABEL_PADDING_RATIO = 0.4  # Label padding as fraction of font size
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(
        description="Create thumbnail grids from PowerPoint slides."
    )
    parser.add_argument("input", help="Input PowerPoint file (.pptx)")
    parser.add_argument(
        "output_prefix",
        nargs="?",
        default="thumbnails",
        help="Output prefix for image files (default: thumbnails, will create prefix.jpg or prefix-N.jpg)",
    )
    parser.add_argument(
        "--cols",
        type=int,
        default=DEFAULT_COLS,
        help=f"Number of columns (default: {DEFAULT_COLS}, max: {MAX_COLS})",
    )
    parser.add_argument(
        "--outline-placeholders",
        action="store_true",
        help="Outline text placeholders with a colored border",
    )

    args = parser.parse_args()

    # Validate columns
    cols = min(args.cols, MAX_COLS)
    if args.cols > MAX_COLS:
        print(f"Warning: Columns limited to {MAX_COLS} (requested {args.cols})")

    # Validate input
    input_path = Path(args.input)
    if not input_path.exists() or input_path.suffix.lower() != ".pptx":
        print(f"Error: Invalid PowerPoint file: {args.input}")
        sys.exit(1)

    # Construct output path (always JPG)
    output_path = Path(f"{args.output_prefix}.jpg")

    print(f"Processing: {args.input}")

    try:
        with tempfile.TemporaryDirectory() as temp_dir:
            # Get placeholder regions if outlining is enabled
            placeholder_regions = None
            slide_dimensions = None
            if args.outline_placeholders:
                print("Extracting placeholder regions...")
                placeholder_regions, slide_dimensions = get_placeholder_regions(
                    input_path
                )
                if placeholder_regions:
                    print(f"Found placeholders on {len(placeholder_regions)} slides")

            # Convert slides to images
            slide_images = convert_to_images(input_path, Path(temp_dir), CONVERSION_DPI)
            if not slide_images:
                print("Error: No slides found")
                sys.exit(1)

            print(f"Found {len(slide_images)} slides")

            # Create grids (max cols×(cols+1) images per grid)
            grid_files = create_grids(
                slide_images,
                cols,
                THUMBNAIL_WIDTH,
                output_path,
                placeholder_regions,
                slide_dimensions,
            )

            # Print saved files
            print(f"Created {len(grid_files)} grid(s):")
            for grid_file in grid_files:
                print(f" - {grid_file}")

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
|
||||
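The grid limits listed in the module docstring follow from a simple rule: a grid with `cols` columns has `cols + 1` rows, so it holds `cols × (cols + 1)` thumbnails, and larger decks are split across several grids. A short sketch of that arithmetic is shown below; `grid_capacity` and `grid_count` are hypothetical helper names, not functions from thumbnail.py.

```python
import math


def grid_capacity(cols: int) -> int:
    """Maximum thumbnails per grid: cols columns by (cols + 1) rows."""
    return cols * (cols + 1)


def grid_count(num_slides: int, cols: int) -> int:
    """Number of grid images needed to show every slide."""
    return math.ceil(num_slides / grid_capacity(cols))
```

With the default of 5 columns, a deck of up to 30 slides fits in a single grid, and slide 31 forces a second one.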
|
||||
|
||||
def create_hidden_slide_placeholder(size):
    """Create placeholder image for hidden slides."""
    img = Image.new("RGB", size, color="#F0F0F0")
    draw = ImageDraw.Draw(img)
    line_width = max(5, min(size) // 100)
    draw.line([(0, 0), size], fill="#CCCCCC", width=line_width)
    draw.line([(size[0], 0), (0, size[1])], fill="#CCCCCC", width=line_width)
    return img
|
||||
|
||||
|
||||
def get_placeholder_regions(pptx_path):
    """Extract ALL text regions from the presentation.

    Returns a tuple of (placeholder_regions, slide_dimensions).
    placeholder_regions is a dict mapping slide indices to lists of text regions.
    Each region is a dict with 'left', 'top', 'width', 'height' in inches.
    slide_dimensions is a tuple of (width_inches, height_inches).
    """
    prs = Presentation(str(pptx_path))
    inventory = extract_text_inventory(pptx_path, prs)
    placeholder_regions = {}

    # Get actual slide dimensions in inches (EMU to inches conversion)
    slide_width_inches = (prs.slide_width or 9144000) / 914400.0
    slide_height_inches = (prs.slide_height or 5143500) / 914400.0

    for slide_key, shapes in inventory.items():
        # Extract slide index from "slide-N" format
        slide_idx = int(slide_key.split("-")[1])
        regions = []

        for shape_key, shape_data in shapes.items():
            # The inventory only contains shapes with text, so all shapes should be highlighted
            regions.append(
                {
                    "left": shape_data.left,
                    "top": shape_data.top,
                    "width": shape_data.width,
                    "height": shape_data.height,
                }
            )

        if regions:
            placeholder_regions[slide_idx] = regions

    return placeholder_regions, (slide_width_inches, slide_height_inches)
|
||||
|
||||
|
||||
def convert_to_images(pptx_path, temp_dir, dpi):
|
||||
"""Convert PowerPoint to images via PDF, handling hidden slides."""
|
||||
# Detect hidden slides
|
||||
print("Analyzing presentation...")
|
||||
prs = Presentation(str(pptx_path))
|
||||
total_slides = len(prs.slides)
|
||||
|
||||
# Find hidden slides (1-based indexing for display)
|
||||
hidden_slides = {
|
||||
idx + 1
|
||||
for idx, slide in enumerate(prs.slides)
|
||||
if slide.element.get("show") == "0"
|
||||
}
|
||||
|
||||
print(f"Total slides: {total_slides}")
|
||||
if hidden_slides:
|
||||
print(f"Hidden slides: {sorted(hidden_slides)}")
|
||||
|
||||
pdf_path = temp_dir / f"{pptx_path.stem}.pdf"
|
||||
|
||||
# Convert to PDF
|
||||
print("Converting to PDF...")
|
||||
result = subprocess.run(
|
||||
[
|
||||
"soffice",
|
||||
"--headless",
|
||||
"--convert-to",
|
||||
"pdf",
|
||||
"--outdir",
|
||||
str(temp_dir),
|
||||
str(pptx_path),
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
if result.returncode != 0 or not pdf_path.exists():
|
||||
raise RuntimeError("PDF conversion failed")
|
||||
|
||||
# Convert PDF to images
|
||||
print(f"Converting to images at {dpi} DPI...")
|
||||
result = subprocess.run(
|
||||
["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(temp_dir / "slide")],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
raise RuntimeError("Image conversion failed")
|
||||
|
||||
visible_images = sorted(temp_dir.glob("slide-*.jpg"))
|
||||
|
||||
# Create full list with placeholders for hidden slides
|
||||
all_images = []
|
||||
visible_idx = 0
|
||||
|
||||
# Get placeholder dimensions from first visible slide
|
||||
if visible_images:
|
||||
with Image.open(visible_images[0]) as img:
|
||||
placeholder_size = img.size
|
||||
else:
|
||||
placeholder_size = (1920, 1080)
|
||||
|
||||
for slide_num in range(1, total_slides + 1):
|
||||
if slide_num in hidden_slides:
|
||||
# Create placeholder image for hidden slide
|
||||
placeholder_path = temp_dir / f"hidden-{slide_num:03d}.jpg"
|
||||
placeholder_img = create_hidden_slide_placeholder(placeholder_size)
|
||||
placeholder_img.save(placeholder_path, "JPEG")
|
||||
all_images.append(placeholder_path)
|
||||
else:
|
||||
# Use the actual visible slide image
|
||||
if visible_idx < len(visible_images):
|
||||
all_images.append(visible_images[visible_idx])
|
||||
visible_idx += 1
|
||||
|
||||
return all_images
|
||||
|
||||
|
||||
def create_grids(
|
||||
image_paths,
|
||||
cols,
|
||||
width,
|
||||
output_path,
|
||||
placeholder_regions=None,
|
||||
slide_dimensions=None,
|
||||
):
|
||||
"""Create multiple thumbnail grids from slide images, max cols×(cols+1) images per grid."""
|
||||
# Maximum images per grid is cols × (cols + 1) for better proportions
|
||||
max_images_per_grid = cols * (cols + 1)
|
||||
grid_files = []
|
||||
|
||||
print(
|
||||
f"Creating grids with {cols} columns (max {max_images_per_grid} images per grid)"
|
||||
)
|
||||
|
||||
# Split images into chunks
|
||||
for chunk_idx, start_idx in enumerate(
|
||||
range(0, len(image_paths), max_images_per_grid)
|
||||
):
|
||||
end_idx = min(start_idx + max_images_per_grid, len(image_paths))
|
||||
chunk_images = image_paths[start_idx:end_idx]
|
||||
|
||||
# Create grid for this chunk
|
||||
grid = create_grid(
|
||||
chunk_images, cols, width, start_idx, placeholder_regions, slide_dimensions
|
||||
)
|
||||
|
||||
# Generate output filename
|
||||
if len(image_paths) <= max_images_per_grid:
|
||||
# Single grid - use base filename without suffix
|
||||
grid_filename = output_path
|
||||
else:
|
||||
# Multiple grids - insert index before extension with dash
|
||||
stem = output_path.stem
|
||||
suffix = output_path.suffix
|
||||
grid_filename = output_path.parent / f"{stem}-{chunk_idx + 1}{suffix}"
|
||||
|
||||
# Save grid
|
||||
grid_filename.parent.mkdir(parents=True, exist_ok=True)
|
||||
grid.save(str(grid_filename), quality=JPEG_QUALITY)
|
||||
grid_files.append(str(grid_filename))
|
||||
|
||||
return grid_files
|
||||
|
||||
|
||||
def create_grid(
|
||||
image_paths,
|
||||
cols,
|
||||
width,
|
||||
start_slide_num=0,
|
||||
placeholder_regions=None,
|
||||
slide_dimensions=None,
|
||||
):
|
||||
"""Create thumbnail grid from slide images with optional placeholder outlining."""
|
||||
font_size = int(width * FONT_SIZE_RATIO)
|
||||
label_padding = int(font_size * LABEL_PADDING_RATIO)
|
||||
|
||||
# Get dimensions
|
||||
with Image.open(image_paths[0]) as img:
|
||||
aspect = img.height / img.width
|
||||
height = int(width * aspect)
|
||||
|
||||
# Calculate grid size
|
||||
rows = (len(image_paths) + cols - 1) // cols
|
||||
grid_w = cols * width + (cols + 1) * GRID_PADDING
|
||||
grid_h = rows * (height + font_size + label_padding * 2) + (rows + 1) * GRID_PADDING
|
||||
|
||||
# Create grid
|
||||
grid = Image.new("RGB", (grid_w, grid_h), "white")
|
||||
draw = ImageDraw.Draw(grid)
|
||||
|
||||
# Load font with size based on thumbnail width
|
||||
try:
|
||||
# Use Pillow's default font with size
|
||||
font = ImageFont.load_default(size=font_size)
|
||||
except Exception:
|
||||
# Fall back to basic default font if size parameter not supported
|
||||
font = ImageFont.load_default()
|
||||
|
||||
# Place thumbnails
|
||||
for i, img_path in enumerate(image_paths):
|
||||
row, col = i // cols, i % cols
|
||||
x = col * width + (col + 1) * GRID_PADDING
|
||||
y_base = (
|
||||
row * (height + font_size + label_padding * 2) + (row + 1) * GRID_PADDING
|
||||
)
|
||||
|
||||
# Add label with actual slide number
|
||||
label = f"{start_slide_num + i}"
|
||||
bbox = draw.textbbox((0, 0), label, font=font)
|
||||
text_w = bbox[2] - bbox[0]
|
||||
draw.text(
|
||||
(x + (width - text_w) // 2, y_base + label_padding),
|
||||
label,
|
||||
fill="black",
|
||||
font=font,
|
||||
)
|
||||
|
||||
# Add thumbnail below label with proportional spacing
|
||||
y_thumbnail = y_base + label_padding + font_size + label_padding
|
||||
|
||||
with Image.open(img_path) as img:
|
||||
# Get original dimensions before thumbnail
|
||||
orig_w, orig_h = img.size
|
||||
|
||||
# Apply placeholder outlines if enabled
|
||||
if placeholder_regions and (start_slide_num + i) in placeholder_regions:
|
||||
# Convert to RGBA for transparency support
|
||||
if img.mode != "RGBA":
|
||||
img = img.convert("RGBA")
|
||||
|
||||
# Get the regions for this slide
|
||||
regions = placeholder_regions[start_slide_num + i]
|
||||
|
||||
# Calculate scale factors using actual slide dimensions
|
||||
if slide_dimensions:
|
||||
slide_width_inches, slide_height_inches = slide_dimensions
|
||||
else:
|
||||
# Fallback: estimate from image size at CONVERSION_DPI
|
||||
slide_width_inches = orig_w / CONVERSION_DPI
|
||||
slide_height_inches = orig_h / CONVERSION_DPI
|
||||
|
||||
x_scale = orig_w / slide_width_inches
|
||||
y_scale = orig_h / slide_height_inches
|
||||
|
||||
# Create a highlight overlay
|
||||
overlay = Image.new("RGBA", img.size, (255, 255, 255, 0))
|
||||
overlay_draw = ImageDraw.Draw(overlay)
|
||||
|
||||
# Highlight each placeholder region
|
||||
for region in regions:
|
||||
# Convert from inches to pixels in the original image
|
||||
px_left = int(region["left"] * x_scale)
|
||||
px_top = int(region["top"] * y_scale)
|
||||
px_width = int(region["width"] * x_scale)
|
||||
px_height = int(region["height"] * y_scale)
|
||||
|
||||
# Draw highlight outline with red color and thick stroke
|
||||
# Using a bright red outline instead of fill
|
||||
stroke_width = max(
|
||||
5, min(orig_w, orig_h) // 150
|
||||
) # Thicker proportional stroke width
|
||||
overlay_draw.rectangle(
|
||||
[(px_left, px_top), (px_left + px_width, px_top + px_height)],
|
||||
outline=(255, 0, 0, 255), # Bright red, fully opaque
|
||||
width=stroke_width,
|
||||
)
|
||||
|
||||
# Composite the overlay onto the image using alpha blending
|
||||
img = Image.alpha_composite(img, overlay)
|
||||
# Convert back to RGB for JPEG saving
|
||||
img = img.convert("RGB")
|
||||
|
||||
img.thumbnail((width, height), Image.Resampling.LANCZOS)
|
||||
w, h = img.size
|
||||
tx = x + (width - w) // 2
|
||||
ty = y_thumbnail + (height - h) // 2
|
||||
grid.paste(img, (tx, ty))
|
||||
|
||||
# Add border
|
||||
if BORDER_WIDTH > 0:
|
||||
draw.rectangle(
|
||||
[
|
||||
(tx - BORDER_WIDTH, ty - BORDER_WIDTH),
|
||||
(tx + w + BORDER_WIDTH - 1, ty + h + BORDER_WIDTH - 1),
|
||||
],
|
||||
outline="gray",
|
||||
width=BORDER_WIDTH,
|
||||
)
|
||||
|
||||
return grid
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -1,178 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Excel Formula Recalculation Script
|
||||
Recalculates all formulas in an Excel file using LibreOffice
|
||||
"""
|
||||
|
||||
import json
|
||||
import sys
|
||||
import subprocess
|
||||
import os
|
||||
import platform
|
||||
from pathlib import Path
|
||||
from openpyxl import load_workbook
|
||||
|
||||
|
||||
def setup_libreoffice_macro():
|
||||
"""Setup LibreOffice macro for recalculation if not already configured"""
|
||||
if platform.system() == 'Darwin':
|
||||
macro_dir = os.path.expanduser('~/Library/Application Support/LibreOffice/4/user/basic/Standard')
|
||||
else:
|
||||
macro_dir = os.path.expanduser('~/.config/libreoffice/4/user/basic/Standard')
|
||||
|
||||
macro_file = os.path.join(macro_dir, 'Module1.xba')
|
||||
|
||||
if os.path.exists(macro_file):
|
||||
with open(macro_file, 'r') as f:
|
||||
if 'RecalculateAndSave' in f.read():
|
||||
return True
|
||||
|
||||
if not os.path.exists(macro_dir):
|
||||
subprocess.run(['soffice', '--headless', '--terminate_after_init'],
|
||||
capture_output=True, timeout=10)
|
||||
os.makedirs(macro_dir, exist_ok=True)
|
||||
|
||||
macro_content = '''<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
|
||||
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
|
||||
Sub RecalculateAndSave()
|
||||
ThisComponent.calculateAll()
|
||||
ThisComponent.store()
|
||||
ThisComponent.close(True)
|
||||
End Sub
|
||||
</script:module>'''
|
||||
|
||||
try:
|
||||
with open(macro_file, 'w') as f:
|
||||
f.write(macro_content)
|
||||
return True
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def recalc(filename, timeout=30):
|
||||
"""
|
||||
Recalculate formulas in Excel file and report any errors
|
||||
|
||||
Args:
|
||||
filename: Path to Excel file
|
||||
timeout: Maximum time to wait for recalculation (seconds)
|
||||
|
||||
Returns:
|
||||
dict with error locations and counts
|
||||
"""
|
||||
if not Path(filename).exists():
|
||||
return {'error': f'File {filename} does not exist'}
|
||||
|
||||
abs_path = str(Path(filename).absolute())
|
||||
|
||||
if not setup_libreoffice_macro():
|
||||
return {'error': 'Failed to setup LibreOffice macro'}
|
||||
|
||||
cmd = [
|
||||
'soffice', '--headless', '--norestore',
|
||||
'vnd.sun.star.script:Standard.Module1.RecalculateAndSave?language=Basic&location=application',
|
||||
abs_path
|
||||
]
|
||||
|
||||
# Handle timeout command differences between Linux and macOS
|
||||
if platform.system() != 'Windows':
|
||||
timeout_cmd = 'timeout' if platform.system() == 'Linux' else None
|
||||
if platform.system() == 'Darwin':
|
||||
# Check if gtimeout is available on macOS
|
||||
try:
|
||||
subprocess.run(['gtimeout', '--version'], capture_output=True, timeout=1, check=False)
|
||||
timeout_cmd = 'gtimeout'
|
||||
except (FileNotFoundError, subprocess.TimeoutExpired):
|
||||
pass
|
||||
|
||||
if timeout_cmd:
|
||||
cmd = [timeout_cmd, str(timeout)] + cmd
|
||||
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode != 0 and result.returncode != 124: # 124 is timeout exit code
|
||||
error_msg = result.stderr or 'Unknown error during recalculation'
|
||||
if 'Module1' in error_msg or 'RecalculateAndSave' not in error_msg:
|
||||
return {'error': 'LibreOffice macro not configured properly'}
|
||||
else:
|
||||
return {'error': error_msg}
|
||||
|
||||
# Check for Excel errors in the recalculated file - scan ALL cells
|
||||
try:
|
||||
wb = load_workbook(filename, data_only=True)
|
||||
|
||||
excel_errors = ['#VALUE!', '#DIV/0!', '#REF!', '#NAME?', '#NULL!', '#NUM!', '#N/A']
|
||||
error_details = {err: [] for err in excel_errors}
|
||||
total_errors = 0
|
||||
|
||||
for sheet_name in wb.sheetnames:
|
||||
ws = wb[sheet_name]
|
||||
# Check ALL rows and columns - no limits
|
||||
for row in ws.iter_rows():
|
||||
for cell in row:
|
||||
if cell.value is not None and isinstance(cell.value, str):
|
||||
for err in excel_errors:
|
||||
if err in cell.value:
|
||||
location = f"{sheet_name}!{cell.coordinate}"
|
||||
error_details[err].append(location)
|
||||
total_errors += 1
|
||||
break
|
||||
|
||||
wb.close()
|
||||
|
||||
# Build result summary
|
||||
result = {
|
||||
'status': 'success' if total_errors == 0 else 'errors_found',
|
||||
'total_errors': total_errors,
|
||||
'error_summary': {}
|
||||
}
|
||||
|
||||
# Add non-empty error categories
|
||||
for err_type, locations in error_details.items():
|
||||
if locations:
|
||||
result['error_summary'][err_type] = {
|
||||
'count': len(locations),
|
||||
'locations': locations[:20] # Show up to 20 locations
|
||||
}
|
||||
|
||||
# Add formula count for context - also check ALL cells
|
||||
wb_formulas = load_workbook(filename, data_only=False)
|
||||
formula_count = 0
|
||||
for sheet_name in wb_formulas.sheetnames:
|
||||
ws = wb_formulas[sheet_name]
|
||||
for row in ws.iter_rows():
|
||||
for cell in row:
|
||||
if cell.value and isinstance(cell.value, str) and cell.value.startswith('='):
|
||||
formula_count += 1
|
||||
wb_formulas.close()
|
||||
|
||||
result['total_formulas'] = formula_count
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
return {'error': str(e)}
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print("Usage: python recalc.py <excel_file> [timeout_seconds]")
|
||||
print("\nRecalculates all formulas in an Excel file using LibreOffice")
|
||||
print("\nReturns JSON with error details:")
|
||||
print(" - status: 'success' or 'errors_found'")
|
||||
print(" - total_errors: Total number of Excel errors found")
|
||||
print(" - total_formulas: Number of formulas in the file")
|
||||
print(" - error_summary: Breakdown by error type with locations")
|
||||
print(" - #VALUE!, #DIV/0!, #REF!, #NAME?, #NULL!, #NUM!, #N/A")
|
||||
sys.exit(1)
|
||||
|
||||
filename = sys.argv[1]
|
||||
timeout = int(sys.argv[2]) if len(sys.argv) > 2 else 30
|
||||
|
||||
result = recalc(filename, timeout)
|
||||
print(json.dumps(result, indent=2))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
590
scientific-skills/docx/SKILL.md
Normal file
@@ -0,0 +1,590 @@
|
||||
---
|
||||
name: docx
|
||||
description: "Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of 'Word doc', 'word document', '.docx', or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a 'report', 'memo', 'letter', 'template', or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation."
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
---
|
||||
|
||||
# DOCX creation, editing, and analysis
|
||||
|
||||
## Overview
|
||||
|
||||
A .docx file is a ZIP archive containing XML files.
|
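Because the container is an ordinary ZIP archive, its parts can be listed and read with standard tooling. The following is a minimal sketch, assuming only Python's standard library; the in-memory demo archive and the helper names (`build_demo_archive`, `list_xml_parts`, `read_part`) are hypothetical and exist only for illustration. The demo archive holds a single part and is not a complete, openable .docx.

```python
import io
import zipfile

# Hypothetical minimal body part; a real document.xml carries more namespaces,
# and a real .docx also needs [Content_Types].xml and relationship parts.
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>"
)


def build_demo_archive() -> bytes:
    """Build a tiny ZIP in memory that mimics the layout of a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buf.getvalue()


def list_xml_parts(archive_bytes: bytes) -> list:
    """Return the sorted names of all .xml members in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return sorted(n for n in zf.namelist() if n.endswith(".xml"))


def read_part(archive_bytes: bytes, name: str) -> str:
    """Read one member of the archive and decode it as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(name).decode("utf-8")
```

For a file on disk, the same calls should work with `zipfile.ZipFile("document.docx")` in place of the in-memory buffer.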
||||
|
||||
## Quick Reference
|
||||
|
||||
| Task | Approach |
|
||||
|------|----------|
|
||||
| Read/analyze content | `pandoc` or unpack for raw XML |
|
||||
| Create new document | Use `docx-js` - see Creating New Documents below |
|
||||
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
|
||||
|
||||
### Converting .doc to .docx
|
||||
|
||||
Legacy `.doc` files must be converted before editing:
|
||||
|
||||
```bash
|
||||
python scripts/office/soffice.py --headless --convert-to docx document.doc
|
||||
```
|
||||
|
||||
### Reading Content
|
||||
|
||||
```bash
|
||||
# Text extraction with tracked changes
|
||||
pandoc --track-changes=all document.docx -o output.md
|
||||
|
||||
# Raw XML access
|
||||
python scripts/office/unpack.py document.docx unpacked/
|
||||
```
|
||||
|
||||
### Converting to Images
|
||||
|
||||
```bash
|
||||
python scripts/office/soffice.py --headless --convert-to pdf document.docx
|
||||
pdftoppm -jpeg -r 150 document.pdf page
|
||||
```
|
||||
|
||||
### Accepting Tracked Changes
|
||||
|
||||
To produce a clean document with all tracked changes accepted (requires LibreOffice):
|
||||
|
||||
```bash
|
||||
python scripts/accept_changes.py input.docx output.docx
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Creating New Documents
|
||||
|
||||
Generate .docx files with JavaScript, then validate. Install: `npm install -g docx`
|
||||
|
||||
### Setup
|
||||
```javascript
|
||||
const fs = require('fs');
|
||||
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
|
||||
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
|
||||
InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
|
||||
PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
|
||||
TabStopType, TabStopPosition, Column, SectionType,
|
||||
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
|
||||
VerticalAlign, PageNumber, PageBreak } = require('docx');
|
||||
|
||||
const doc = new Document({ sections: [{ children: [/* content */] }] });
|
||||
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
|
||||
```
|
||||
|
||||
### Validation
|
||||
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
|
||||
```bash
|
||||
python scripts/office/validate.py doc.docx
|
||||
```
|
||||
|
||||
### Page Size
|
||||
|
||||
```javascript
|
||||
// CRITICAL: docx-js defaults to A4, not US Letter
|
||||
// Always set page size explicitly for consistent results
|
||||
sections: [{
|
||||
properties: {
|
||||
page: {
|
||||
size: {
|
||||
width: 12240, // 8.5 inches in DXA
|
||||
height: 15840 // 11 inches in DXA
|
||||
},
|
||||
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
|
||||
}
|
||||
},
|
||||
children: [/* content */]
|
||||
}]
|
||||
```
|
||||
|
||||
**Common page sizes (DXA units, 1440 DXA = 1 inch):**
|
||||
|
||||
| Paper | Width | Height | Content Width (1" margins) |
|
||||
|-------|-------|--------|---------------------------|
|
||||
| US Letter | 12,240 | 15,840 | 9,360 |
|
||||
| A4 (default) | 11,906 | 16,838 | 9,026 |
|
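The numbers in this table follow from two pieces of arithmetic: 1440 DXA per inch, and content width equal to page width minus the left and right margins. A minimal sketch of that arithmetic, assuming 1-inch margins by default; the helper names `inches_to_dxa` and `content_width` are hypothetical and are not part of docx-js.

```python
DXA_PER_INCH = 1440  # 1440 DXA = 1 inch, as stated above


def inches_to_dxa(inches: float) -> int:
    """Convert a length in inches to DXA, rounded to the nearest unit."""
    return round(inches * DXA_PER_INCH)


def content_width(page_width: int, left_margin: int = 1440, right_margin: int = 1440) -> int:
    """Usable width in DXA: page width minus the left and right margins."""
    return page_width - left_margin - right_margin


us_letter_width = inches_to_dxa(8.5)    # 12240
us_letter_height = inches_to_dxa(11)    # 15840
```

With these helpers, `content_width(12240)` gives the US Letter figure of 9360 and `content_width(11906)` gives the A4 figure of 9026.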
||||
|
||||
**Landscape orientation:** docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
|
||||
```javascript
|
||||
size: {
|
||||
width: 12240, // Pass SHORT edge as width
|
||||
height: 15840, // Pass LONG edge as height
|
||||
orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML
|
||||
},
|
||||
// Content width = 15840 - left margin - right margin (uses the long edge)
|
||||
```
|
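The landscape rule is easy to get backwards, so here is a small sketch of the bookkeeping, assuming the behavior described above (portrait values are passed in, and the long edge becomes the effective page width). The function name `landscape_layout` and the dictionary keys are hypothetical.

```python
def landscape_layout(short_edge: int, long_edge: int,
                     left_margin: int = 1440, right_margin: int = 1440) -> dict:
    """Return the values to pass as size.width/size.height and the content width."""
    if short_edge > long_edge:
        raise ValueError("pass the short edge first, as for a portrait page")
    return {
        "pass_width": short_edge,    # goes in size.width
        "pass_height": long_edge,    # goes in size.height
        # In landscape the long edge runs horizontally, so it sets the content width.
        "content_width": long_edge - left_margin - right_margin,
    }


layout = landscape_layout(12240, 15840)  # US Letter with 1-inch margins
```

For US Letter this yields a landscape content width of 12960 DXA, which is the value a full-width table on that page would use.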
||||
|
||||
### Styles (Override Built-in Headings)
|
||||
|
||||
Use Arial as the default font (universally supported). Keep titles black for readability.
|
||||
|
||||
```javascript
|
||||
const doc = new Document({
|
||||
styles: {
|
||||
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
|
||||
paragraphStyles: [
|
||||
// IMPORTANT: Use exact IDs to override built-in styles
|
||||
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
|
||||
run: { size: 32, bold: true, font: "Arial" },
|
||||
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
|
||||
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
|
||||
run: { size: 28, bold: true, font: "Arial" },
|
||||
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
|
||||
]
|
||||
},
|
||||
sections: [{
|
||||
children: [
|
||||
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
|
||||
]
|
||||
}]
|
||||
});
|
||||
```
|
||||
|
||||
### Lists (NEVER use unicode bullets)
|
||||
|
||||
```javascript
|
||||
// ❌ WRONG - never manually insert bullet characters
|
||||
new Paragraph({ children: [new TextRun("• Item")] }) // BAD
|
||||
new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD
|
||||
|
||||
// ✅ CORRECT - use numbering config with LevelFormat.BULLET
|
||||
const doc = new Document({
|
||||
numbering: {
|
||||
config: [
|
||||
{ reference: "bullets",
|
||||
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
|
||||
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
|
||||
{ reference: "numbers",
|
||||
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
|
||||
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
|
||||
]
|
||||
},
|
||||
sections: [{
|
||||
children: [
|
||||
new Paragraph({ numbering: { reference: "bullets", level: 0 },
|
||||
children: [new TextRun("Bullet item")] }),
|
||||
new Paragraph({ numbering: { reference: "numbers", level: 0 },
|
||||
children: [new TextRun("Numbered item")] }),
|
||||
]
|
||||
}]
|
||||
});
|
||||
|
||||
// ⚠️ Each reference creates INDEPENDENT numbering
|
||||
// Same reference = continues (1,2,3 then 4,5,6)
|
||||
// Different reference = restarts (1,2,3 then 1,2,3)
|
||||
```
|
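The continue-versus-restart behavior noted at the end of the block above can be pictured as one counter per reference name. The following is a toy model in Python, not how Word or docx-js implement numbering; `render_numbers` is a hypothetical helper.

```python
from collections import defaultdict


def render_numbers(references):
    """Return the label each paragraph would show, one counter per reference."""
    counters = defaultdict(int)
    labels = []
    for ref in references:
        counters[ref] += 1              # same reference: keep counting
        labels.append(f"{counters[ref]}.")
    return labels


same = render_numbers(["numbers"] * 6)
different = render_numbers(["list-a"] * 3 + ["list-b"] * 3)
```

Here `same` counts 1 through 6, while `different` counts 1 to 3 twice, which is why a second list that should restart at 1 needs its own reference.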
||||
|
||||
### Tables
|
||||
|
||||
**CRITICAL: Tables need dual widths** - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms.
|
||||
|
||||
```javascript
|
||||
// CRITICAL: Always set table width for consistent rendering
|
||||
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
|
||||
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
|
||||
const borders = { top: border, bottom: border, left: border, right: border };
|
||||
|
||||
new Table({
|
||||
width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
|
||||
columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
|
||||
rows: [
|
||||
new TableRow({
|
||||
children: [
|
||||
new TableCell({
|
||||
borders,
|
||||
width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
|
||||
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
|
||||
margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
|
||||
children: [new Paragraph({ children: [new TextRun("Cell")] })]
|
||||
})
|
||||
]
|
||||
})
|
||||
]
|
||||
})
|
||||
```
|
||||
|
||||
**Table width calculation:**
|
||||
|
||||
Always use `WidthType.DXA` — `WidthType.PERCENTAGE` breaks in Google Docs.
|
||||
|
||||
```javascript
|
||||
// Table width = sum of columnWidths = content width
|
||||
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
|
||||
width: { size: 9360, type: WidthType.DXA },
|
||||
columnWidths: [7000, 2360] // Must sum to table width
|
||||
```
|
||||
|
||||
**Width rules:**
|
||||
- **Always use `WidthType.DXA`** — never `WidthType.PERCENTAGE` (incompatible with Google Docs)
|
||||
- Table width must equal the sum of `columnWidths`
|
||||
- Cell `width` must match corresponding `columnWidth`
|
||||
- Cell `margins` are internal padding - they reduce content area, not add to cell width
|
||||
- For full-width tables: use content width (page width minus left and right margins)
|
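These width rules can be checked mechanically before the document is generated. The following is a rough sketch, assuming widths are plain integers in DXA; `split_columns` and `check_table` are hypothetical helpers, not docx-js functions.

```python
def split_columns(total_width: int, n_cols: int) -> list:
    """Split a table width into n integer column widths that sum exactly to it."""
    base, extra = divmod(total_width, n_cols)
    # Hand the leftover DXA units to the first columns so the sum stays exact.
    return [base + (1 if i < extra else 0) for i in range(n_cols)]


def check_table(table_width: int, column_widths: list, cell_widths_by_row: list) -> list:
    """Return a list of rule violations (an empty list means the widths agree)."""
    problems = []
    if sum(column_widths) != table_width:
        problems.append(
            f"columnWidths sum to {sum(column_widths)}, table width is {table_width}"
        )
    for row_index, row in enumerate(cell_widths_by_row):
        if list(row) != list(column_widths):
            problems.append(f"row {row_index}: cell widths {list(row)} do not match columnWidths")
    return problems
```

For example, `check_table(9360, [7000, 2360], [[7000, 2360]])` returns an empty list, while changing a column to 2000 without updating the table width reports the mismatch.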
||||
|
||||
### Images
|
||||
|
||||
```javascript
|
||||
// CRITICAL: type parameter is REQUIRED
|
||||
new Paragraph({
|
||||
children: [new ImageRun({
|
||||
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
|
||||
data: fs.readFileSync("image.png"),
|
||||
transformation: { width: 200, height: 150 },
|
||||
altText: { title: "Title", description: "Desc", name: "Name" } // All three required
|
||||
})]
|
||||
})
|
||||
```
|
||||
|
||||
### Page Breaks
|
||||
|
||||
```javascript
|
||||
// CRITICAL: PageBreak must be inside a Paragraph
|
||||
new Paragraph({ children: [new PageBreak()] })
|
||||
|
||||
// Or use pageBreakBefore
|
||||
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
|
||||
```
|
||||
|
||||
### Hyperlinks
|
||||
|
||||
```javascript
|
||||
// External link
|
||||
new Paragraph({
|
||||
children: [new ExternalHyperlink({
|
||||
children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
|
||||
link: "https://example.com",
|
||||
})]
|
||||
})
|
||||
|
||||
// Internal link (bookmark + reference)
|
||||
// 1. Create bookmark at destination
|
||||
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
|
||||
new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
|
||||
]})
|
||||
// 2. Link to it
|
||||
new Paragraph({ children: [new InternalHyperlink({
|
||||
children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
|
||||
anchor: "chapter1",
|
||||
})]})
|
||||
```
|
||||
|
||||
### Footnotes
|
||||
|
||||
```javascript
|
||||
const doc = new Document({
|
||||
footnotes: {
|
||||
1: { children: [new Paragraph("Source: Annual Report 2024")] },
|
||||
2: { children: [new Paragraph("See appendix for methodology")] },
|
||||
},
|
||||
sections: [{
|
||||
children: [new Paragraph({
|
||||
children: [
|
||||
new TextRun("Revenue grew 15%"),
|
||||
new FootnoteReferenceRun(1),
|
||||
new TextRun(" using adjusted metrics"),
|
||||
new FootnoteReferenceRun(2),
|
||||
],
|
||||
})]
|
||||
}]
|
||||
});
|
||||
```
|
||||
|
||||
### Tab Stops
|
||||
|
||||
```javascript
|
||||
// Right-align text on same line (e.g., date opposite a title)
|
||||
new Paragraph({
|
||||
children: [
|
||||
new TextRun("Company Name"),
|
||||
new TextRun("\tJanuary 2025"),
|
||||
],
|
||||
tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
|
||||
})
|
||||
|
||||
// Dot leader (e.g., TOC-style)
|
||||
new Paragraph({
|
||||
children: [
|
||||
new TextRun("Introduction"),
|
||||
new TextRun({ children: [
|
||||
new PositionalTab({
|
||||
alignment: PositionalTabAlignment.RIGHT,
|
||||
relativeTo: PositionalTabRelativeTo.MARGIN,
|
||||
leader: PositionalTabLeader.DOT,
|
||||
}),
|
||||
"3",
|
||||
]}),
|
||||
],
|
||||
})
|
||||
```
|
||||
|
||||
### Multi-Column Layouts
|
||||
|
||||
```javascript
|
||||
// Equal-width columns
|
||||
sections: [{
|
||||
properties: {
|
||||
column: {
|
||||
count: 2, // number of columns
|
||||
space: 720, // gap between columns in DXA (720 = 0.5 inch)
|
||||
equalWidth: true,
|
||||
separate: true, // vertical line between columns
|
||||
},
|
||||
},
|
||||
children: [/* content flows naturally across columns */]
|
||||
}]
|
||||
|
||||
// Custom-width columns (equalWidth must be false)
|
||||
sections: [{
|
||||
properties: {
|
||||
column: {
|
||||
equalWidth: false,
|
||||
children: [
|
||||
new Column({ width: 5400, space: 720 }),
|
||||
new Column({ width: 3240 }),
|
||||
],
|
||||
},
|
||||
},
|
||||
children: [/* content */]
|
||||
}]
|
||||
```
|
||||
|
||||
Force a column break with a new section using `type: SectionType.NEXT_COLUMN`.
|
||||
|
||||
### Table of Contents
|
||||
|
||||
```javascript
|
||||
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
|
||||
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
|
||||
```
|
||||
|
||||
### Headers/Footers
|
||||
|
||||
```javascript
|
||||
sections: [{
|
||||
properties: {
|
||||
page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
|
||||
},
|
||||
headers: {
|
||||
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
|
||||
},
|
||||
footers: {
|
||||
default: new Footer({ children: [new Paragraph({
|
||||
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
|
||||
})] })
|
||||
},
|
||||
children: [/* content */]
|
||||
}]
|
||||
```
|
||||
|
||||
### Critical Rules for docx-js
|
||||
|
||||
- **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
|
||||
- **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE`
|
||||
- **Never use `\n`** - use separate Paragraph elements
|
||||
- **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config
|
||||
- **PageBreak must be in Paragraph** - standalone creates invalid XML
|
||||
- **ImageRun requires `type`** - always specify png/jpg/etc
|
||||
- **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs)
|
||||
- **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match
|
||||
- **Table width = sum of columnWidths** - for DXA, ensure they add up exactly
|
||||
- **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding
|
||||
- **Use `ShadingType.CLEAR`** - never SOLID for table shading
|
||||
- **Never use tables as dividers/rules** - cells have minimum height and render as empty boxes (including in headers/footers); use `border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }` on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables
|
||||
- **TOC requires HeadingLevel only** - no custom styles on heading paragraphs
|
||||
- **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc.
|
||||
- **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.)
|
||||
|
||||
---
|
||||
|
||||
## Editing Existing Documents
|
||||
|
||||
**Follow all 3 steps in order.**
|
||||
|
||||
### Step 1: Unpack
|
||||
```bash
|
||||
python scripts/office/unpack.py document.docx unpacked/
|
||||
```
|
||||
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`&#x201C;` etc.) so they survive editing. Use `--merge-runs false` to skip run merging.
|
||||
|
||||
### Step 2: Edit XML
|
||||
|
||||
Edit files in `unpacked/word/`. See XML Reference below for patterns.
|
||||
|
||||
**Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests use of a different name.
|
||||
|
||||
**Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
|
||||
|
||||
**CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
|
||||
```xml
|
||||
<!-- Use these entities for professional typography -->
|
||||
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
|
||||
```
|
||||
| Entity | Character |
|
||||
|--------|-----------|
|
||||
| `‘` | ‘ (left single) |
|
||||
| `’` | ’ (right single / apostrophe) |
|
||||
| `“` | “ (left double) |
|
||||
| `”` | ” (right double) |
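
To make the entity table concrete, here is a minimal Python sketch of how plain text could be turned into pre-escaped XML text. It is illustrative only and not part of the skill's scripts; the helper name `to_xml_text` is hypothetical, and hand edits should still go through the Edit tool as described above.

```python
from xml.sax.saxutils import escape

# Mapping copied from the entity table above.
SMART_QUOTE_ENTITIES = {
    "\u2018": "&#x2018;",  # left single
    "\u2019": "&#x2019;",  # right single / apostrophe
    "\u201c": "&#x201C;",  # left double
    "\u201d": "&#x201D;",  # right double
}


def to_xml_text(text: str) -> str:
    """Escape &, < and >, then encode smart quotes as numeric entities.

    Hypothetical helper for illustration; escaping runs first so the '&'
    that starts each entity is not escaped a second time.
    """
    escaped = escape(text)
    for char, entity in SMART_QUOTE_ENTITIES.items():
        escaped = escaped.replace(char, entity)
    return escaped


print(to_xml_text("Here\u2019s a quote: \u201cHello\u201d & more"))
```

The result is the kind of string that can be pasted between `<w:t>` tags or passed to `comment.py`, which expects pre-escaped XML.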

**Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML):
```bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0  # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"  # custom author name
```
Then add markers to document.xml (see Comments in XML Reference).

### Step 3: Pack
```bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
```
Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip.

**Auto-repair will fix:**
- `durableId` >= 0x7FFFFFFF (regenerates valid ID)
- Missing `xml:space="preserve"` on `<w:t>` with whitespace

**Auto-repair won't fix:**
- Malformed XML, invalid element nesting, missing relationships, schema violations
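
As a rough illustration of the whitespace repair listed above, the following sketch adds `xml:space="preserve"` to `<w:t>` elements whose text starts or ends with whitespace. It is an assumption about what such a repair could look like, not the code in `pack.py`; the function name is hypothetical, and it uses the standard-library `minidom` where the repository scripts use `defusedxml`.

```python
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def add_space_preserve(xml: str) -> str:
    """Return the XML with xml:space="preserve" on whitespace-edged <w:t>.

    Hypothetical helper: only elements literally named "w:t" are checked,
    and elements that already carry xml:space are left untouched.
    """
    dom = minidom.parseString(xml)
    for t in dom.getElementsByTagName("w:t"):
        text = "".join(
            node.data for node in t.childNodes if node.nodeType == node.TEXT_NODE
        )
        if text != text.strip() and not t.hasAttribute("xml:space"):
            t.setAttribute("xml:space", "preserve")
    return dom.documentElement.toxml()


sample = f'<w:p xmlns:w="{W_NS}"><w:r><w:t> days.</w:t></w:r></w:p>'
print(add_space_preserve(sample))
```

Without the attribute, Word may drop the leading space in ` days.` when the document is opened.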

### Common Pitfalls

- **Replace entire `<w:r>` elements**: When adding tracked changes, replace the whole `<w:r>...</w:r>` block with `<w:del>...<w:ins>...` as siblings. Don't inject tracked change tags inside a run.
- **Preserve `<w:rPr>` formatting**: Copy the original run's `<w:rPr>` block into your tracked change runs to maintain bold, font size, etc.

---

## XML Reference

### Schema Compliance

- **Element order in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`, `<w:rPr>` last
- **Whitespace**: Add `xml:space="preserve"` to `<w:t>` with leading/trailing spaces
- **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`)
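
The ordering and RSID rules above can be expressed as two small checks. This is a sketch under stated assumptions: `PPR_ORDER` holds only the six elements named in the list (the full schema defines more), and both function names are hypothetical.

```python
import re

# Partial order taken from the bullet above; other <w:pPr> children are ignored.
PPR_ORDER = ["pStyle", "numPr", "spacing", "ind", "jc", "rPr"]
RSID_RE = re.compile(r"[0-9A-Fa-f]{8}")


def ppr_order_ok(child_names: list[str]) -> bool:
    """Check that the known <w:pPr> children appear in the listed order."""
    ranks = [PPR_ORDER.index(name) for name in child_names if name in PPR_ORDER]
    return ranks == sorted(ranks)


def is_valid_rsid(value: str) -> bool:
    """Check that an RSID is exactly eight hexadecimal digits."""
    return RSID_RE.fullmatch(value) is not None


print(ppr_order_ok(["pStyle", "numPr", "jc", "rPr"]))
print(is_valid_rsid("00AB1234"))
```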

### Tracked Changes

**Insertion:**
```xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>
```

**Deletion:**
```xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
```

**Inside `<w:del>`**: Use `<w:delText>` instead of `<w:t>`, and `<w:delInstrText>` instead of `<w:instrText>`.

**Minimal edits** - only mark what changes:
```xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
```
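
The deletion/insertion pair in the minimal-edit pattern can also be described as a small string builder. The sketch below is for understanding the structure only; `tracked_replace` and its parameters are hypothetical and not part of the skill, and the recommended workflow remains direct edits with the Edit tool.

```python
from xml.sax.saxutils import escape, quoteattr


def tracked_replace(
    old: str,
    new: str,
    del_id: int,
    ins_id: int,
    author: str = "Claude",
    date: str = "2025-01-01T00:00:00Z",
    rpr: str = "",
) -> str:
    """Build sibling <w:del>/<w:ins> elements that replace `old` with `new`.

    `rpr` is the original run's <w:rPr> XML, copied into both runs so the
    formatting is preserved. Hypothetical helper for illustration.
    """

    def text_el(tag: str, text: str) -> str:
        space = ' xml:space="preserve"' if text != text.strip() else ""
        return f"<w:{tag}{space}>{escape(text)}</w:{tag}>"

    attrs = f"w:author={quoteattr(author)} w:date={quoteattr(date)}"
    return (
        f'<w:del w:id="{del_id}" {attrs}>'
        f'<w:r>{rpr}{text_el("delText", old)}</w:r></w:del>'
        f'<w:ins w:id="{ins_id}" {attrs}>'
        f'<w:r>{rpr}{text_el("t", new)}</w:r></w:ins>'
    )


print(tracked_replace("30", "60", del_id=1, ins_id=2))
```

Only the changed token goes through the builder; the unchanged runs before and after it stay as they are.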

**Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:
```xml
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
```
Without the `<w:del/>` in `<w:pPr><w:rPr>`, accepting changes leaves an empty paragraph/list item.

**Rejecting another author's insertion** - nest deletion inside their insertion:
```xml
<w:ins w:author="Jane" w:id="5">
  <w:del w:author="Claude" w:id="10">
    <w:r><w:delText>their inserted text</w:delText></w:r>
  </w:del>
</w:ins>
```

**Restoring another author's deletion** - add insertion after (don't modify their deletion):
```xml
<w:del w:author="Jane" w:id="5">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
  <w:r><w:t>deleted text</w:t></w:r>
</w:ins>
```

### Comments

After running `comment.py` (see Step 2), add markers to document.xml. For replies, use `--parent` flag and nest markers inside the parent's.

**CRITICAL: `<w:commentRangeStart>` and `<w:commentRangeEnd>` are siblings of `<w:r>`, never inside `<w:r>`.**

```xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>

<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
  <w:commentRangeStart w:id="1"/>
  <w:r><w:t>text</w:t></w:r>
  <w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
```
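
The placement rule for comment markers can be checked mechanically. The sketch below is a hypothetical checker, not one of the skill's scripts: it reports any `<w:commentRangeStart>` or `<w:commentRangeEnd>` that has a `<w:r>` ancestor, and it assumes the conventional `w:` prefix is used.

```python
from xml.dom import minidom

MARKER_TAGS = ("w:commentRangeStart", "w:commentRangeEnd")


def _inside_run(node) -> bool:
    """Return True if any ancestor element of `node` is a <w:r>."""
    parent = node.parentNode
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        if parent.nodeName == "w:r":
            return True
        parent = parent.parentNode
    return False


def misplaced_comment_markers(xml: str) -> list[str]:
    """List comment range markers that sit inside a run (hypothetical check)."""
    dom = minidom.parseString(xml)
    bad = []
    for tag in MARKER_TAGS:
        for node in dom.getElementsByTagName(tag):
            if _inside_run(node):
                bad.append(f'{tag} w:id="{node.getAttribute("w:id")}"')
    return bad


W = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
wrong = (
    f'<w:p {W}><w:r><w:commentRangeStart w:id="0"/><w:t>text</w:t></w:r>'
    f'<w:commentRangeEnd w:id="0"/></w:p>'
)
print(misplaced_comment_markers(wrong))
```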

### Images

1. Add image file to `word/media/`
2. Add relationship to `word/_rels/document.xml.rels`:
```xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
```
3. Add content type to `[Content_Types].xml`:
```xml
<Default Extension="png" ContentType="image/png"/>
```
4. Reference in document.xml:
```xml
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
```
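
The `cx` and `cy` values in `<wp:extent>` are in EMUs, with 914400 EMUs per inch as noted in the comment above. A short sketch of the arithmetic follows; the function names are hypothetical, and the 96 DPI default for pixel conversion is an assumption that depends on the image.

```python
EMU_PER_INCH = 914400


def inches_to_emu(inches: float) -> int:
    """Convert inches to EMUs."""
    return round(inches * EMU_PER_INCH)


def pixels_to_emu(pixels: int, dpi: int = 96) -> int:
    """Convert pixels to EMUs; the 96 DPI default is an assumption."""
    return round(pixels * EMU_PER_INCH / dpi)


def fit_width(px_width: int, px_height: int, width_in: float) -> tuple[int, int]:
    """Return (cx, cy) for a target width in inches, keeping the aspect ratio."""
    cx = inches_to_emu(width_in)
    cy = round(cx * px_height / px_width)
    return cx, cy


print(inches_to_emu(1))        # 914400
print(fit_width(800, 600, 4))  # (3657600, 2743200)
```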

---

## Dependencies

- **pandoc**: Text extraction
- **docx**: `npm install -g docx` (new documents)
- **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
- **Poppler**: `pdftoppm` for images
1
scientific-skills/docx/scripts/__init__.py
Executable file
@@ -0,0 +1 @@
135
scientific-skills/docx/scripts/accept_changes.py
Executable file
@@ -0,0 +1,135 @@
"""Accept all tracked changes in a DOCX file using LibreOffice.

Requires LibreOffice (soffice) to be installed.
"""

import argparse
import logging
import shutil
import subprocess
from pathlib import Path

from office.soffice import get_soffice_env

logger = logging.getLogger(__name__)

LIBREOFFICE_PROFILE = "/tmp/libreoffice_docx_profile"
MACRO_DIR = f"{LIBREOFFICE_PROFILE}/user/basic/Standard"

ACCEPT_CHANGES_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
    Sub AcceptAllTrackedChanges()
        Dim document As Object
        Dim dispatcher As Object

        document = ThisComponent.CurrentController.Frame
        dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")

        dispatcher.executeDispatch(document, ".uno:AcceptAllTrackedChanges", "", 0, Array())
        ThisComponent.store()
        ThisComponent.close(True)
    End Sub
</script:module>"""


def accept_changes(
    input_file: str,
    output_file: str,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        return None, f"Error: Input file not found: {input_file}"

    if input_path.suffix.lower() != ".docx":
        return None, f"Error: Input file is not a DOCX file: {input_file}"

    # Work on a copy so the input file is never modified in place.
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(input_path, output_path)
    except Exception as e:
        return None, f"Error: Failed to copy input file to output location: {e}"

    if not _setup_libreoffice_macro():
        return None, "Error: Failed to setup LibreOffice macro"

    cmd = [
        "soffice",
        "--headless",
        f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
        "--norestore",
        "vnd.sun.star.script:Standard.Module1.AcceptAllTrackedChanges?language=Basic&location=application",
        str(output_path.absolute()),
    ]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
            env=get_soffice_env(),
        )
    except subprocess.TimeoutExpired:
        # A timeout is reported as success; this assumes the macro saved the
        # file before soffice stalled, and the output is not verified here.
        return (
            None,
            f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
        )

    if result.returncode != 0:
        return None, f"Error: LibreOffice failed: {result.stderr}"

    return (
        None,
        f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
    )


def _setup_libreoffice_macro() -> bool:
    macro_dir = Path(MACRO_DIR)
    macro_file = macro_dir / "Module1.xba"

    if macro_file.exists() and "AcceptAllTrackedChanges" in macro_file.read_text():
        return True

    if not macro_dir.exists():
        # Start soffice once so it initializes the user profile directory.
        subprocess.run(
            [
                "soffice",
                "--headless",
                f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
                "--terminate_after_init",
            ],
            capture_output=True,
            timeout=10,
            check=False,
            env=get_soffice_env(),
        )
        macro_dir.mkdir(parents=True, exist_ok=True)

    try:
        macro_file.write_text(ACCEPT_CHANGES_MACRO)
        return True
    except Exception as e:
        logger.warning(f"Failed to setup LibreOffice macro: {e}")
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Accept all tracked changes in a DOCX file"
    )
    parser.add_argument("input_file", help="Input DOCX file with tracked changes")
    parser.add_argument(
        "output_file", help="Output DOCX file (clean, no tracked changes)"
    )
    args = parser.parse_args()

    _, message = accept_changes(args.input_file, args.output_file)
    print(message)

    if "Error" in message:
        raise SystemExit(1)
318
scientific-skills/docx/scripts/comment.py
Executable file
@@ -0,0 +1,318 @@
"""Add comments to DOCX documents.

Usage:
    python comment.py unpacked/ 0 "Comment text"
    python comment.py unpacked/ 1 "Reply text" --parent 0

Text should be pre-escaped XML (e.g., &amp; for &, &#x2019; for smart quotes).

After running, add markers to document.xml:
  <w:commentRangeStart w:id="0"/>
  ... commented content ...
  <w:commentRangeEnd w:id="0"/>
  <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
"""

import argparse
import random
import shutil
import sys
from datetime import datetime, timezone
from pathlib import Path

import defusedxml.minidom

TEMPLATE_DIR = Path(__file__).parent / "templates"
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "w14": "http://schemas.microsoft.com/office/word/2010/wordml",
    "w15": "http://schemas.microsoft.com/office/word/2012/wordml",
    "w16cid": "http://schemas.microsoft.com/office/word/2016/wordml/cid",
    "w16cex": "http://schemas.microsoft.com/office/word/2018/wordml/cex",
}

COMMENT_XML = """\
<w:comment w:id="{id}" w:author="{author}" w:date="{date}" w:initials="{initials}">
  <w:p w14:paraId="{para_id}" w14:textId="77777777">
    <w:r>
      <w:rPr><w:rStyle w:val="CommentReference"/></w:rPr>
      <w:annotationRef/>
    </w:r>
    <w:r>
      <w:rPr>
        <w:color w:val="000000"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="20"/>
      </w:rPr>
      <w:t>{text}</w:t>
    </w:r>
  </w:p>
</w:comment>"""

COMMENT_MARKER_TEMPLATE = """
Add to document.xml (markers must be direct children of w:p, never inside w:r):
  <w:commentRangeStart w:id="{cid}"/>
  <w:r>...</w:r>
  <w:commentRangeEnd w:id="{cid}"/>
  <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{cid}"/></w:r>"""

REPLY_MARKER_TEMPLATE = """
Nest markers inside parent {pid}'s markers (markers must be direct children of w:p, never inside w:r):
  <w:commentRangeStart w:id="{pid}"/><w:commentRangeStart w:id="{cid}"/>
  <w:r>...</w:r>
  <w:commentRangeEnd w:id="{cid}"/><w:commentRangeEnd w:id="{pid}"/>
  <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{pid}"/></w:r>
  <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{cid}"/></w:r>"""


def _generate_hex_id() -> str:
    # Stay below 0x7FFFFFFF, the limit that pack.py's auto-repair enforces.
    return f"{random.randint(0, 0x7FFFFFFE):08X}"


# Literal smart quotes are written back as numeric entities after serializing.
SMART_QUOTE_ENTITIES = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}


def _encode_smart_quotes(text: str) -> str:
    for char, entity in SMART_QUOTE_ENTITIES.items():
        text = text.replace(char, entity)
    return text


def _append_xml(xml_path: Path, root_tag: str, content: str) -> None:
    dom = defusedxml.minidom.parseString(xml_path.read_text(encoding="utf-8"))
    root = dom.getElementsByTagName(root_tag)[0]
    # Wrap the fragment so its namespace prefixes resolve while parsing.
    ns_attrs = " ".join(f'xmlns:{k}="{v}"' for k, v in NS.items())
    wrapper_dom = defusedxml.minidom.parseString(f"<root {ns_attrs}>{content}</root>")
    for child in wrapper_dom.documentElement.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            root.appendChild(dom.importNode(child, True))
    output = _encode_smart_quotes(dom.toxml(encoding="UTF-8").decode("utf-8"))
    xml_path.write_text(output, encoding="utf-8")


def _find_para_id(comments_path: Path, comment_id: int) -> str | None:
    dom = defusedxml.minidom.parseString(comments_path.read_text(encoding="utf-8"))
    for c in dom.getElementsByTagName("w:comment"):
        if c.getAttribute("w:id") == str(comment_id):
            for p in c.getElementsByTagName("w:p"):
                if pid := p.getAttribute("w14:paraId"):
                    return pid
    return None


def _get_next_rid(rels_path: Path) -> int:
    dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
    max_rid = 0
    for rel in dom.getElementsByTagName("Relationship"):
        rid = rel.getAttribute("Id")
        if rid and rid.startswith("rId"):
            try:
                max_rid = max(max_rid, int(rid[3:]))
            except ValueError:
                pass
    return max_rid + 1


def _has_relationship(rels_path: Path, target: str) -> bool:
    dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
    for rel in dom.getElementsByTagName("Relationship"):
        if rel.getAttribute("Target") == target:
            return True
    return False


def _has_content_type(ct_path: Path, part_name: str) -> bool:
    dom = defusedxml.minidom.parseString(ct_path.read_text(encoding="utf-8"))
    for override in dom.getElementsByTagName("Override"):
        if override.getAttribute("PartName") == part_name:
            return True
    return False


def _ensure_comment_relationships(unpacked_dir: Path) -> None:
    rels_path = unpacked_dir / "word" / "_rels" / "document.xml.rels"
    if not rels_path.exists():
        return

    if _has_relationship(rels_path, "comments.xml"):
        return

    dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
    root = dom.documentElement
    next_rid = _get_next_rid(rels_path)

    rels = [
        (
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments",
            "comments.xml",
        ),
        (
            "http://schemas.microsoft.com/office/2011/relationships/commentsExtended",
            "commentsExtended.xml",
        ),
        (
            "http://schemas.microsoft.com/office/2016/09/relationships/commentsIds",
            "commentsIds.xml",
        ),
        (
            "http://schemas.microsoft.com/office/2018/08/relationships/commentsExtensible",
            "commentsExtensible.xml",
        ),
    ]

    for rel_type, target in rels:
        rel = dom.createElement("Relationship")
        rel.setAttribute("Id", f"rId{next_rid}")
        rel.setAttribute("Type", rel_type)
        rel.setAttribute("Target", target)
        root.appendChild(rel)
        next_rid += 1

    rels_path.write_bytes(dom.toxml(encoding="UTF-8"))


def _ensure_comment_content_types(unpacked_dir: Path) -> None:
    ct_path = unpacked_dir / "[Content_Types].xml"
    if not ct_path.exists():
        return

    if _has_content_type(ct_path, "/word/comments.xml"):
        return

    dom = defusedxml.minidom.parseString(ct_path.read_text(encoding="utf-8"))
    root = dom.documentElement

    overrides = [
        (
            "/word/comments.xml",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml",
        ),
        (
            "/word/commentsExtended.xml",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtended+xml",
        ),
        (
            "/word/commentsIds.xml",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsIds+xml",
        ),
        (
            "/word/commentsExtensible.xml",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtensible+xml",
        ),
    ]

    for part_name, content_type in overrides:
        override = dom.createElement("Override")
        override.setAttribute("PartName", part_name)
        override.setAttribute("ContentType", content_type)
        root.appendChild(override)

    ct_path.write_bytes(dom.toxml(encoding="UTF-8"))


def add_comment(
    unpacked_dir: str,
    comment_id: int,
    text: str,
    author: str = "Claude",
    initials: str = "C",
    parent_id: int | None = None,
) -> tuple[str, str]:
    word = Path(unpacked_dir) / "word"
    if not word.exists():
        return "", f"Error: {word} not found"

    para_id, durable_id = _generate_hex_id(), _generate_hex_id()
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    comments = word / "comments.xml"
    first_comment = not comments.exists()
    if first_comment:
        shutil.copy(TEMPLATE_DIR / "comments.xml", comments)
        _ensure_comment_relationships(Path(unpacked_dir))
        _ensure_comment_content_types(Path(unpacked_dir))
    _append_xml(
        comments,
        "w:comments",
        COMMENT_XML.format(
            id=comment_id,
            author=author,
            date=ts,
            initials=initials,
            para_id=para_id,
            text=text,
        ),
    )

    ext = word / "commentsExtended.xml"
    if not ext.exists():
        shutil.copy(TEMPLATE_DIR / "commentsExtended.xml", ext)
    if parent_id is not None:
        # Replies are linked to their parent through the parent's paraId.
        parent_para = _find_para_id(comments, parent_id)
        if not parent_para:
            return "", f"Error: Parent comment {parent_id} not found"
        _append_xml(
            ext,
            "w15:commentsEx",
            f'<w15:commentEx w15:paraId="{para_id}" w15:paraIdParent="{parent_para}" w15:done="0"/>',
        )
    else:
        _append_xml(
            ext,
            "w15:commentsEx",
            f'<w15:commentEx w15:paraId="{para_id}" w15:done="0"/>',
        )

    ids = word / "commentsIds.xml"
    if not ids.exists():
        shutil.copy(TEMPLATE_DIR / "commentsIds.xml", ids)
    _append_xml(
        ids,
        "w16cid:commentsIds",
        f'<w16cid:commentId w16cid:paraId="{para_id}" w16cid:durableId="{durable_id}"/>',
    )

    extensible = word / "commentsExtensible.xml"
    if not extensible.exists():
        shutil.copy(TEMPLATE_DIR / "commentsExtensible.xml", extensible)
    _append_xml(
        extensible,
        "w16cex:commentsExtensible",
        f'<w16cex:commentExtensible w16cex:durableId="{durable_id}" w16cex:dateUtc="{ts}"/>',
    )

    action = "reply" if parent_id is not None else "comment"
    return para_id, f"Added {action} {comment_id} (para_id={para_id})"


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Add comments to DOCX documents")
    p.add_argument("unpacked_dir", help="Unpacked DOCX directory")
    p.add_argument("comment_id", type=int, help="Comment ID (must be unique)")
    p.add_argument("text", help="Comment text")
    p.add_argument("--author", default="Claude", help="Author name")
    p.add_argument("--initials", default="C", help="Author initials")
    p.add_argument("--parent", type=int, help="Parent comment ID (for replies)")
    args = p.parse_args()

    para_id, msg = add_comment(
        args.unpacked_dir,
        args.comment_id,
        args.text,
        args.author,
        args.initials,
        args.parent,
    )
    print(msg)
    if "Error" in msg:
        sys.exit(1)
    cid = args.comment_id
    if args.parent is not None:
        print(REPLY_MARKER_TEMPLATE.format(pid=args.parent, cid=cid))
    else:
        print(COMMENT_MARKER_TEMPLATE.format(cid=cid))
199
scientific-skills/docx/scripts/office/helpers/merge_runs.py
Normal file
@@ -0,0 +1,199 @@
"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False


def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)


def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()


def _merge_run_content(target, source):
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)
@@ -0,0 +1,197 @@
|
||||
"""Simplify tracked changes by merging adjacent w:ins or w:del elements.
|
||||
|
||||
Merges adjacent <w:ins> elements from the same author into a single element.
|
||||
Same for <w:del> elements. This makes heavily-redlined documents easier to
|
||||
work with by reducing the number of tracked change wrappers.
|
||||
|
||||
Rules:
|
||||
- Only merges w:ins with w:ins, w:del with w:del (same element type)
|
||||
- Only merges if same author (ignores timestamp differences)
|
||||
- Only merges if truly adjacent (only whitespace between them)
|
||||
"""
|
||||
|
||||
import xml.etree.ElementTree as ET
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
|
||||
import defusedxml.minidom
|
||||
|
||||
WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
|
||||
|
||||
|
||||
def simplify_redlines(input_dir: str) -> tuple[int, str]:
|
||||
doc_xml = Path(input_dir) / "word" / "document.xml"
|
||||
|
||||
if not doc_xml.exists():
|
||||
return 0, f"Error: {doc_xml} not found"
|
||||
|
||||
try:
|
||||
dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
|
||||
root = dom.documentElement
|
||||
|
||||
merge_count = 0
|
||||
|
||||
containers = _find_elements(root, "p") + _find_elements(root, "tc")
|
||||
|
||||
for container in containers:
|
||||
merge_count += _merge_tracked_changes_in(container, "ins")
|
||||
merge_count += _merge_tracked_changes_in(container, "del")
|
||||
|
||||
doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
|
||||
return merge_count, f"Simplified {merge_count} tracked changes"
|
||||
|
||||
except Exception as e:
|
||||
return 0, f"Error: {e}"
|
||||
|
||||
|
||||
def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0

    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]

    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]

        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1

    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False

    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling

    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
        for child in node.childNodes:
            traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}

    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"

    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1

    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()

                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"

                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)

    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default

    if len(new_changes) == 1:
        return next(iter(new_changes))

    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )
|
||||
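The author-inference logic above can be sketched on in-memory XML strings instead of files. This is a minimal illustration, not the script itself: the helper names `count_authors`, `newly_active_authors`, and `make_doc`, and the sample documents, are hypothetical, and the value given for `WORD_NS` is assumed to be the standard WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

# Assumed value of WORD_NS: the standard WordprocessingML main namespace.
WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_authors(xml_text: str) -> dict[str, int]:
    """Count w:ins and w:del elements per w:author (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    author_attr = f"{{{WORD_NS}}}author"
    counts: dict[str, int] = {}
    for tag in ("ins", "del"):
        for elem in root.findall(f".//w:{tag}", {"w": WORD_NS}):
            author = elem.get(author_attr)
            if author:
                counts[author] = counts.get(author, 0) + 1
    return counts


def newly_active_authors(original: dict[str, int], modified: dict[str, int]) -> dict[str, int]:
    """Return the authors whose tracked-change count grew, with the increase."""
    grown: dict[str, int] = {}
    for author, count in modified.items():
        diff = count - original.get(author, 0)
        if diff > 0:
            grown[author] = diff
    return grown


def make_doc(body: str) -> str:
    """Wrap paragraph content in a tiny document skeleton (illustrative only)."""
    return f'<w:document xmlns:w="{WORD_NS}"><w:body><w:p>{body}</w:p></w:body></w:document>'


ALICE_INS = '<w:ins w:id="1" w:author="Alice"><w:r><w:t>old edit</w:t></w:r></w:ins>'
CLAUDE_DEL = '<w:del w:id="2" w:author="Claude"><w:r><w:delText>cut</w:delText></w:r></w:del>'

ORIGINAL_XML = make_doc(ALICE_INS)
MODIFIED_XML = make_doc(ALICE_INS + CLAUDE_DEL)

if __name__ == "__main__":
    before = count_authors(ORIGINAL_XML)
    after = count_authors(MODIFIED_XML)
    print(newly_active_authors(before, after))
```

Comparing counts rather than element identity keeps the check cheap, but it also means an author who deletes one old change and adds one new change shows no net growth; that limitation applies to this sketch and may or may not matter for a given workflow.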
159
scientific-skills/docx/scripts/office/pack.py
Executable file
@@ -0,0 +1,159 @@
"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.
Validation runs only when --original names an existing file, and only for DOCX
and PPTX output; XLSX is packed without validation.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)

        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            # Text runs (e.g. w:t) keep their whitespace untouched.
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)
|
||||
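The condensing step in `_condense_xml` can be tried on a string to see what it removes. The sketch below is an illustration under assumptions: it uses the standard library's `xml.dom.minidom` in place of `defusedxml.minidom` so that it runs without extra packages, and the function name `condense_xml_text`, the sample markup, and the `urn:example:w` namespace are hypothetical.

```python
from xml.dom import minidom


def condense_xml_text(xml_text: str) -> str:
    """Drop whitespace-only text nodes and comments, except inside *:t elements."""
    dom = minidom.parseString(xml_text)
    for element in dom.getElementsByTagName("*"):
        # Text runs such as w:t carry meaningful whitespace, so skip them.
        if element.tagName.endswith(":t"):
            continue
        for child in list(element.childNodes):
            blank_text = (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue is not None
                and child.nodeValue.strip() == ""
            )
            if blank_text or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)
    return dom.documentElement.toxml()


# Pretty-printed input, as an unpack step might leave it (hypothetical sample).
PRETTY = (
    '<w:p xmlns:w="urn:example:w">\n'
    "  <!-- editor note -->\n"
    "  <w:r>\n"
    '    <w:t xml:space="preserve"> hi </w:t>\n'
    "  </w:r>\n"
    "</w:p>"
)
CONDENSED = condense_xml_text(PRETTY)

if __name__ == "__main__":
    print(CONDENSED)
```

Iterating over `list(element.childNodes)` matters here: removing children while iterating the live node list would skip siblings.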
183
scientific-skills/docx/scripts/office/soffice.py
Normal file
@@ -0,0 +1,183 @@
"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs). Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)


_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO


_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024]; /* accept() blocks reading this */
static int wake_w[1024]; /* close() writes to this */
static int listener_fd = -1; /* FD that received listen() */

__attribute__((constructor))
static void init(void) {
    real_socket = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen = dlsym(RTLD_NEXT, "listen");
    real_accept = dlsym(RTLD_NEXT, "accept");
    real_close = dlsym(RTLD_NEXT, "close");
    real_read = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i] = -1;
        wake_w[i] = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]] = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;

        if (wake_w[fd] >= 0) { /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd] = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }

        if (was_listener)
            _exit(0); /* conversion done – exit */
    }
    return real_close(fd);
}
"""


if __name__ == "__main__":
    import sys

    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)
|
||||
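The environment-building half of this helper can be exercised without LibreOffice or a C compiler. The following is a rough sketch under assumptions: `af_unix_sockets_work` and `build_headless_env` are hypothetical names, and the shim path is passed in by the caller rather than compiled on demand as in the script above.

```python
import os
import socket


def af_unix_sockets_work() -> bool:
    """Probe whether an AF_UNIX stream socket can be created in this process."""
    if not hasattr(socket, "AF_UNIX"):
        return False  # platform has no AF_UNIX constant at all
    try:
        probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    except OSError:
        return False
    probe.close()
    return True


def build_headless_env(base_env=None, shim_path=None) -> dict:
    """Copy an environment and add the settings a headless soffice run needs.

    shim_path is a hypothetical, caller-supplied path to a prebuilt shim library.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SAL_USE_VCLPLUGIN"] = "svp"  # headless rendering backend, as set above
    if shim_path is not None and not af_unix_sockets_work():
        env["LD_PRELOAD"] = shim_path
    return env


if __name__ == "__main__":
    print(build_headless_env({"PATH": "/usr/bin"}))
```

Copying the base environment instead of mutating it keeps the `LD_PRELOAD` setting scoped to the child process, which is the same reason the script calls `os.environ.copy()`.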
132
scientific-skills/docx/scripts/office/unpack.py
Executable file
@@ -0,0 +1,132 @@
"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

# Smart quotes are written as numeric character references so they stay
# unambiguous in the unpacked XML.
SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="utf-8"))
    except Exception:
        pass


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)
|
||||
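The smart-quote step can be illustrated in isolation. This sketch assumes the intent is to rewrite curly quotes as hexadecimal numeric character references; the names `SMART_QUOTE_ENTITIES` and `escape_smart_quotes` and the sample string are hypothetical. It also checks that an XML parser decodes the references back to the original characters, so the rewrite changes the bytes on disk but not the document's meaning.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: each curly quote becomes its hexadecimal character reference.
SMART_QUOTE_ENTITIES = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}


def escape_smart_quotes(text: str) -> str:
    """Replace curly quotes with numeric character references."""
    for char, entity in SMART_QUOTE_ENTITIES.items():
        text = text.replace(char, entity)
    return text


RAW = "<t>\u201cDon\u2019t\u201d</t>"
ESCAPED = escape_smart_quotes(RAW)

if __name__ == "__main__":
    print(ESCAPED)
    # Parsing the escaped markup yields the original characters again.
    print(ET.fromstring(ESCAPED).text == ET.fromstring(RAW).text)
```

One plausible reason for escaping is that ASCII-only markup is easier to match and edit with plain string tools, where a curly quote and a straight quote are otherwise easy to confuse.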
111
scientific-skills/docx/scripts/office/validate.py
Executable file
@@ -0,0 +1,111 @@
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [-v] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    # The original file, when given, decides the document type.
    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
|
||||
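The repair-then-validate flow shared by `pack.py` and `validate.py` can be sketched with stand-in validators. `StubValidator` and `repair_then_validate` are hypothetical names introduced only for this illustration; the real validator classes are not reproduced here.

```python
class StubValidator:
    """Stand-in exposing the two methods the flow relies on (hypothetical)."""

    def __init__(self, name: str, repairs: int, passes: bool):
        self.name = name
        self.repairs = repairs
        self.passes = passes
        self.validated = False

    def repair(self) -> int:
        # Report how many issues were fixed.
        return self.repairs

    def validate(self) -> bool:
        self.validated = True
        return self.passes


def repair_then_validate(validators, auto_repair: bool = True) -> tuple[int, bool]:
    """Run every repair first, then validate until the first failure."""
    total_repairs = sum(v.repair() for v in validators) if auto_repair else 0
    # all() over a generator stops at the first validator that returns False.
    success = all(v.validate() for v in validators)
    return total_repairs, success


schema = StubValidator("schema", repairs=2, passes=False)
redlining = StubValidator("redlining", repairs=1, passes=True)
TOTAL, OK = repair_then_validate([schema, redlining])

if __name__ == "__main__":
    print(TOTAL, OK, redlining.validated)
```

Because `all()` short-circuits, a validator listed after a failing one never runs. If a complete report is wanted, one option is to collect the results in a list first, for example `results = [v.validate() for v in validators]`, and then call `all(results)`.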
@@ -5,72 +5,62 @@ Base validator with common validation logic for document files.
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
import defusedxml.minidom
|
||||
import lxml.etree
|
||||
|
||||
|
||||
class BaseSchemaValidator:
|
||||
"""Base validator with common validation logic for document files."""
|
||||
|
||||
# Elements whose 'id' attributes must be unique within their file
|
||||
# Format: element_name -> (attribute_name, scope)
|
||||
# scope can be 'file' (unique within file) or 'global' (unique across all files)
|
||||
IGNORED_VALIDATION_ERRORS = [
|
||||
"hyphenationZone",
|
||||
"purl.org/dc/terms",
|
||||
]
|
||||
|
||||
UNIQUE_ID_REQUIREMENTS = {
|
||||
# Word elements
|
||||
"comment": ("id", "file"), # Comment IDs in comments.xml
|
||||
"commentrangestart": ("id", "file"), # Must match comment IDs
|
||||
"commentrangeend": ("id", "file"), # Must match comment IDs
|
||||
"bookmarkstart": ("id", "file"), # Bookmark start IDs
|
||||
"bookmarkend": ("id", "file"), # Bookmark end IDs
|
||||
# Note: ins and del (track changes) can share IDs when part of same revision
|
||||
# PowerPoint elements
|
||||
"sldid": ("id", "file"), # Slide IDs in presentation.xml
|
||||
"sldmasterid": ("id", "global"), # Slide master IDs must be globally unique
|
||||
"sldlayoutid": ("id", "global"), # Slide layout IDs must be globally unique
|
||||
"cm": ("authorid", "file"), # Comment author IDs
|
||||
# Excel elements
|
||||
"sheet": ("sheetid", "file"), # Sheet IDs in workbook.xml
|
||||
"definedname": ("id", "file"), # Named range IDs
|
||||
# Drawing/Shape elements (all formats)
|
||||
"cxnsp": ("id", "file"), # Connection shape IDs
|
||||
"sp": ("id", "file"), # Shape IDs
|
||||
"pic": ("id", "file"), # Picture IDs
|
||||
"grpsp": ("id", "file"), # Group shape IDs
|
||||
"comment": ("id", "file"),
|
||||
"commentrangestart": ("id", "file"),
|
||||
"commentrangeend": ("id", "file"),
|
||||
"bookmarkstart": ("id", "file"),
|
||||
"bookmarkend": ("id", "file"),
|
||||
"sldid": ("id", "file"),
|
||||
"sldmasterid": ("id", "global"),
|
||||
"sldlayoutid": ("id", "global"),
|
||||
"cm": ("authorid", "file"),
|
||||
"sheet": ("sheetid", "file"),
|
||||
"definedname": ("id", "file"),
|
||||
"cxnsp": ("id", "file"),
|
||||
"sp": ("id", "file"),
|
||||
"pic": ("id", "file"),
|
||||
"grpsp": ("id", "file"),
|
||||
}
|
||||
|
||||
EXCLUDED_ID_CONTAINERS = {
|
||||
"sectionlst",
|
||||
}
|
||||
|
||||
# Mapping of element names to expected relationship types
|
||||
# Subclasses should override this with format-specific mappings
|
||||
ELEMENT_RELATIONSHIP_TYPES = {}
|
||||
|
||||
# Unified schema mappings for all Office document types
|
||||
SCHEMA_MAPPINGS = {
|
||||
# Document type specific schemas
|
||||
"word": "ISO-IEC29500-4_2016/wml.xsd", # Word documents
|
||||
"ppt": "ISO-IEC29500-4_2016/pml.xsd", # PowerPoint presentations
|
||||
"xl": "ISO-IEC29500-4_2016/sml.xsd", # Excel spreadsheets
|
||||
# Common file types
|
||||
"word": "ISO-IEC29500-4_2016/wml.xsd",
|
||||
"ppt": "ISO-IEC29500-4_2016/pml.xsd",
|
||||
"xl": "ISO-IEC29500-4_2016/sml.xsd",
|
||||
"[Content_Types].xml": "ecma/fouth-edition/opc-contentTypes.xsd",
|
||||
"app.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd",
|
||||
"core.xml": "ecma/fouth-edition/opc-coreProperties.xsd",
|
||||
"custom.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd",
|
||||
".rels": "ecma/fouth-edition/opc-relationships.xsd",
|
||||
# Word-specific files
|
||||
"people.xml": "microsoft/wml-2012.xsd",
|
||||
"commentsIds.xml": "microsoft/wml-cid-2016.xsd",
|
||||
"commentsExtensible.xml": "microsoft/wml-cex-2018.xsd",
|
||||
"commentsExtended.xml": "microsoft/wml-2012.xsd",
|
||||
# Chart files (common across document types)
|
||||
"chart": "ISO-IEC29500-4_2016/dml-chart.xsd",
|
||||
# Theme files (common across document types)
|
||||
"theme": "ISO-IEC29500-4_2016/dml-main.xsd",
|
||||
# Drawing and media files
|
||||
"drawing": "ISO-IEC29500-4_2016/dml-main.xsd",
|
||||
}
|
||||
|
||||
# Unified namespace constants
|
||||
MC_NAMESPACE = "http://schemas.openxmlformats.org/markup-compatibility/2006"
|
||||
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"
|
||||
|
||||
# Common OOXML namespaces used across validators
|
||||
PACKAGE_RELATIONSHIPS_NAMESPACE = (
|
||||
"http://schemas.openxmlformats.org/package/2006/relationships"
|
||||
)
|
||||
@@ -81,10 +71,8 @@ class BaseSchemaValidator:
|
||||
"http://schemas.openxmlformats.org/package/2006/content-types"
|
||||
)
|
||||
|
||||
# Folders where we should clean ignorable namespaces
|
||||
MAIN_CONTENT_FOLDERS = {"word", "ppt", "xl"}
|
||||
|
||||
# All allowed OOXML namespaces (superset of all document types)
|
||||
OOXML_NAMESPACES = {
|
||||
"http://schemas.openxmlformats.org/officeDocument/2006/math",
|
||||
"http://schemas.openxmlformats.org/officeDocument/2006/relationships",
|
||||
@@ -103,15 +91,13 @@ class BaseSchemaValidator:
|
||||
"http://www.w3.org/XML/1998/namespace",
|
||||
}
|
||||
|
||||
def __init__(self, unpacked_dir, original_file, verbose=False):
|
||||
def __init__(self, unpacked_dir, original_file=None, verbose=False):
|
||||
self.unpacked_dir = Path(unpacked_dir).resolve()
|
||||
self.original_file = Path(original_file)
|
||||
self.original_file = Path(original_file) if original_file else None
|
||||
self.verbose = verbose
|
||||
|
||||
# Set schemas directory
|
||||
self.schemas_dir = Path(__file__).parent.parent.parent / "schemas"
|
||||
self.schemas_dir = Path(__file__).parent.parent / "schemas"
|
||||
|
||||
# Get all XML and .rels files
|
||||
patterns = ["*.xml", "*.rels"]
|
||||
self.xml_files = [
|
||||
f for pattern in patterns for f in self.unpacked_dir.rglob(pattern)
|
||||
@@ -121,16 +107,44 @@ class BaseSchemaValidator:
|
||||
print(f"Warning: No XML files found in {self.unpacked_dir}")
|
||||
|
||||
def validate(self):
|
||||
"""Run all validation checks and return True if all pass."""
|
||||
raise NotImplementedError("Subclasses must implement the validate method")
|
||||
|
||||
def repair(self) -> int:
|
||||
return self.repair_whitespace_preservation()
|
||||
|
||||
def repair_whitespace_preservation(self) -> int:
|
||||
repairs = 0
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
content = xml_file.read_text(encoding="utf-8")
|
||||
dom = defusedxml.minidom.parseString(content)
|
||||
modified = False
|
||||
|
||||
for elem in dom.getElementsByTagName("*"):
|
||||
if elem.tagName.endswith(":t") and elem.firstChild:
|
||||
text = elem.firstChild.nodeValue
|
||||
if text and (text.startswith((' ', '\t')) or text.endswith((' ', '\t'))):
|
||||
if elem.getAttribute("xml:space") != "preserve":
|
||||
elem.setAttribute("xml:space", "preserve")
|
||||
text_preview = repr(text[:30]) + "..." if len(text) > 30 else repr(text)
|
||||
print(f" Repaired: {xml_file.name}: Added xml:space='preserve' to {elem.tagName}: {text_preview}")
|
||||
repairs += 1
|
||||
modified = True
|
||||
|
||||
if modified:
|
||||
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
|
||||
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return repairs
|
||||
|
||||
def validate_xml(self):
|
||||
"""Validate that all XML files are well-formed."""
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
# Try to parse the XML file
|
||||
lxml.etree.parse(str(xml_file))
|
||||
except lxml.etree.XMLSyntaxError as e:
|
||||
errors.append(
|
||||
@@ -154,13 +168,12 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def validate_namespaces(self):
|
||||
"""Validate that namespace prefixes in Ignorable attributes are declared."""
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
declared = set(root.nsmap.keys()) - {None} # Exclude default namespace
|
||||
declared = set(root.nsmap.keys()) - {None}
|
||||
|
||||
for attr_val in [
|
||||
v for k, v in root.attrib.items() if k.endswith("Ignorable")
|
||||
@@ -184,36 +197,37 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def validate_unique_ids(self):
|
||||
"""Validate that specific IDs are unique according to OOXML requirements."""
|
||||
errors = []
|
||||
global_ids = {} # Track globally unique IDs across all files
|
||||
global_ids = {}
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
file_ids = {} # Track IDs that must be unique within this file
|
||||
file_ids = {}
|
||||
|
||||
# Remove all mc:AlternateContent elements from the tree
|
||||
mc_elements = root.xpath(
|
||||
".//mc:AlternateContent", namespaces={"mc": self.MC_NAMESPACE}
|
||||
)
|
||||
for elem in mc_elements:
|
||||
elem.getparent().remove(elem)
|
||||
|
||||
# Now check IDs in the cleaned tree
|
||||
for elem in root.iter():
|
||||
# Get the element name without namespace
|
||||
tag = (
|
||||
elem.tag.split("}")[-1].lower()
|
||||
if "}" in elem.tag
|
||||
else elem.tag.lower()
|
||||
)
|
||||
|
||||
# Check if this element type has ID uniqueness requirements
|
||||
if tag in self.UNIQUE_ID_REQUIREMENTS:
|
||||
in_excluded_container = any(
|
||||
ancestor.tag.split("}")[-1].lower() in self.EXCLUDED_ID_CONTAINERS
|
||||
for ancestor in elem.iterancestors()
|
||||
)
|
||||
if in_excluded_container:
|
||||
continue
|
||||
|
||||
attr_name, scope = self.UNIQUE_ID_REQUIREMENTS[tag]
|
||||
|
||||
# Look for the specified attribute
|
||||
id_value = None
|
||||
for attr, value in elem.attrib.items():
|
||||
attr_local = (
|
||||
@@ -227,7 +241,6 @@ class BaseSchemaValidator:
|
||||
|
||||
if id_value is not None:
|
||||
if scope == "global":
|
||||
# Check global uniqueness
|
||||
if id_value in global_ids:
|
||||
prev_file, prev_line, prev_tag = global_ids[
|
||||
id_value
|
||||
@@ -244,7 +257,6 @@ class BaseSchemaValidator:
|
||||
tag,
|
||||
)
|
||||
elif scope == "file":
|
||||
# Check file-level uniqueness
|
||||
key = (tag, attr_name)
|
||||
if key not in file_ids:
|
||||
file_ids[key] = {}
|
||||
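Reviewer sketch of the two uniqueness scopes used by `validate_unique_ids`: a "global" ID must be unique across every file, while a "file" ID only has to be unique per `(tag, attribute)` pair inside one file. The `REQUIREMENTS` table and the function names below are hypothetical, the parser is the standard library's `xml.etree.ElementTree` rather than `lxml`, and the `mc:AlternateContent` removal and excluded-container checks from the diff are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical requirements table for illustration: local tag -> (attribute, scope).
REQUIREMENTS = {
    "comment": ("id", "file"),
    "bookmarkstart": ("id", "global"),
}


def local_name(qualified: str) -> str:
    """Strip a '{namespace}' prefix and lowercase, as the validator does."""
    return qualified.split("}")[-1].lower()


def find_duplicate_ids(files: dict[str, str], requirements=REQUIREMENTS) -> list[str]:
    errors = []
    global_ids = {}  # id value -> (file, tag) of first sighting, shared by all files
    for file_name, xml_text in files.items():
        file_ids = {}  # (tag, attribute) -> id values already seen in this file
        for elem in ET.fromstring(xml_text).iter():
            tag = local_name(elem.tag)
            if tag not in requirements:
                continue
            attr_name, scope = requirements[tag]
            id_value = next(
                (v for k, v in elem.attrib.items() if local_name(k) == attr_name),
                None,
            )
            if id_value is None:
                continue
            if scope == "global":
                if id_value in global_ids:
                    first_file, first_tag = global_ids[id_value]
                    errors.append(
                        f"{file_name}: <{tag}> {attr_name}='{id_value}' "
                        f"already used by <{first_tag}> in {first_file}"
                    )
                else:
                    global_ids[id_value] = (file_name, tag)
            else:
                seen = file_ids.setdefault((tag, attr_name), set())
                if id_value in seen:
                    errors.append(
                        f"{file_name}: duplicate {attr_name}='{id_value}' on <{tag}>"
                    )
                seen.add(id_value)
    return errors


files = {
    "a.xml": '<r xmlns:w="urn:w"><w:bookmarkStart w:id="1"/>'
             '<w:comment w:id="5"/><w:comment w:id="5"/></r>',
    "b.xml": '<r xmlns:w="urn:w"><w:bookmarkStart w:id="1"/><w:comment w:id="5"/></r>',
}
for line in find_duplicate_ids(files):
    print(line)
```

The repeated comment ID in `a.xml` is a file-scope violation; the same comment ID appearing once in `b.xml` is fine, but the bookmark ID reused across files is a global-scope violation.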
@@ -275,12 +287,8 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def validate_file_references(self):
|
||||
"""
|
||||
Validate that all .rels files properly reference files and that all files are referenced.
|
||||
"""
|
||||
errors = []
|
||||
|
||||
# Find all .rels files
|
||||
rels_files = list(self.unpacked_dir.rglob("*.rels"))
|
||||
|
||||
if not rels_files:
|
||||
@@ -288,17 +296,15 @@ class BaseSchemaValidator:
|
||||
print("PASSED - No .rels files found")
|
||||
return True
|
||||
|
||||
# Get all files in the unpacked directory (excluding reference files)
|
||||
all_files = []
|
||||
for file_path in self.unpacked_dir.rglob("*"):
|
||||
if (
|
||||
file_path.is_file()
|
||||
and file_path.name != "[Content_Types].xml"
|
||||
and not file_path.name.endswith(".rels")
|
||||
): # This file is not referenced by .rels
|
||||
):
|
||||
all_files.append(file_path.resolve())
|
||||
|
||||
# Track all files that are referenced by any .rels file
|
||||
all_referenced_files = set()
|
||||
|
||||
if self.verbose:
|
||||
@@ -306,16 +312,12 @@ class BaseSchemaValidator:
|
||||
f"Found {len(rels_files)} .rels files and {len(all_files)} target files"
|
||||
)
|
||||
|
||||
# Check each .rels file
|
||||
for rels_file in rels_files:
|
||||
try:
|
||||
# Parse relationships file
|
||||
rels_root = lxml.etree.parse(str(rels_file)).getroot()
|
||||
|
||||
# Get the directory where this .rels file is located
|
||||
rels_dir = rels_file.parent
|
||||
|
||||
# Find all relationships and their targets
|
||||
referenced_files = set()
|
||||
broken_refs = []
|
||||
|
||||
@@ -326,18 +328,15 @@ class BaseSchemaValidator:
|
||||
target = rel.get("Target")
|
||||
if target and not target.startswith(
|
||||
("http", "mailto:")
|
||||
): # Skip external URLs
|
||||
# Resolve the target path relative to the .rels file location
|
||||
if rels_file.name == ".rels":
|
||||
# Root .rels file - targets are relative to unpacked_dir
|
||||
):
|
||||
if target.startswith("/"):
|
||||
target_path = self.unpacked_dir / target.lstrip("/")
|
||||
elif rels_file.name == ".rels":
|
||||
target_path = self.unpacked_dir / target
|
||||
else:
|
||||
# Other .rels files - targets are relative to their parent's parent
|
||||
# e.g., word/_rels/document.xml.rels -> targets relative to word/
|
||||
base_dir = rels_dir.parent
|
||||
target_path = base_dir / target
|
||||
|
||||
# Normalize the path and check if it exists
|
||||
try:
|
||||
target_path = target_path.resolve()
|
||||
if target_path.exists() and target_path.is_file():
|
||||
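Reviewer sketch of the target resolution this hunk introduces: a target starting with `/` is package-absolute, a target in the root `.rels` is relative to the unpacked directory, and any other target is relative to the folder that contains the `_rels` directory. The helper names are hypothetical, and the sketch stops before the `resolve()` and existence checks so that it does not touch the filesystem.

```python
from pathlib import Path


def is_external(target: str) -> bool:
    """External targets are skipped, mirroring the startswith() test in the diff."""
    return target.startswith(("http", "mailto:"))


def resolve_target(unpacked_dir: Path, rels_file: Path, target: str) -> Path:
    if target.startswith("/"):
        # Package-absolute target: anchor at the package root.
        return unpacked_dir / target.lstrip("/")
    if rels_file.name == ".rels":
        # Root relationships file: targets are relative to the package root.
        return unpacked_dir / target
    # e.g. word/_rels/document.xml.rels -> targets are relative to word/
    return rels_file.parent.parent / target


root = Path("pkg")
doc_rels = root / "word" / "_rels" / "document.xml.rels"
print(resolve_target(root, root / "_rels" / ".rels", "word/document.xml"))
print(resolve_target(root, doc_rels, "media/image1.png"))
print(resolve_target(root, doc_rels, "/word/media/image1.png"))
```

The last two calls land on the same part, which is the case the new `target.startswith("/")` branch appears to be there to handle.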
@@ -348,7 +347,6 @@ class BaseSchemaValidator:
|
||||
except (OSError, ValueError):
|
||||
broken_refs.append((target, rel.sourceline))
|
||||
|
||||
# Report broken references
|
||||
if broken_refs:
|
||||
rel_path = rels_file.relative_to(self.unpacked_dir)
|
||||
for broken_ref, line_num in broken_refs:
|
||||
@@ -360,7 +358,6 @@ class BaseSchemaValidator:
|
||||
rel_path = rels_file.relative_to(self.unpacked_dir)
|
||||
errors.append(f" Error parsing {rel_path}: {e}")
|
||||
|
||||
# Check for unreferenced files (files that exist but are not referenced anywhere)
|
||||
unreferenced_files = set(all_files) - all_referenced_files
|
||||
|
||||
if unreferenced_files:
|
||||
@@ -386,31 +383,21 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def validate_all_relationship_ids(self):
|
||||
"""
|
||||
Validate that all r:id attributes in XML files reference existing IDs
|
||||
in their corresponding .rels files, and optionally validate relationship types.
|
||||
"""
|
||||
import lxml.etree
|
||||
|
||||
errors = []
|
||||
|
||||
# Process each XML file that might contain r:id references
|
||||
for xml_file in self.xml_files:
|
||||
# Skip .rels files themselves
|
||||
if xml_file.suffix == ".rels":
|
||||
continue
|
||||
|
||||
# Determine the corresponding .rels file
|
||||
# For dir/file.xml, it's dir/_rels/file.xml.rels
|
||||
rels_dir = xml_file.parent / "_rels"
|
||||
rels_file = rels_dir / f"{xml_file.name}.rels"
|
||||
|
||||
# Skip if there's no corresponding .rels file (that's okay)
|
||||
if not rels_file.exists():
|
||||
continue
|
||||
|
||||
try:
|
||||
# Parse the .rels file to get valid relationship IDs and their types
|
||||
rels_root = lxml.etree.parse(str(rels_file)).getroot()
|
||||
rid_to_type = {}
|
||||
|
||||
@@ -420,47 +407,43 @@ class BaseSchemaValidator:
|
||||
rid = rel.get("Id")
|
||||
rel_type = rel.get("Type", "")
|
||||
if rid:
|
||||
# Check for duplicate rIds
|
||||
if rid in rid_to_type:
|
||||
rels_rel_path = rels_file.relative_to(self.unpacked_dir)
|
||||
errors.append(
|
||||
f" {rels_rel_path}: Line {rel.sourceline}: "
|
||||
f"Duplicate relationship ID '{rid}' (IDs must be unique)"
|
||||
)
|
||||
# Extract just the type name from the full URL
|
||||
type_name = (
|
||||
rel_type.split("/")[-1] if "/" in rel_type else rel_type
|
||||
)
|
||||
rid_to_type[rid] = type_name
|
||||
|
||||
# Parse the XML file to find all r:id references
|
||||
xml_root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
|
||||
# Find all elements with r:id attributes
|
||||
r_ns = self.OFFICE_RELATIONSHIPS_NAMESPACE
|
||||
rid_attrs_to_check = ["id", "embed", "link"]
|
||||
for elem in xml_root.iter():
|
||||
# Check for r:id attribute (relationship ID)
|
||||
rid_attr = elem.get(f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id")
|
||||
if rid_attr:
|
||||
for attr_name in rid_attrs_to_check:
|
||||
rid_attr = elem.get(f"{{{r_ns}}}{attr_name}")
|
||||
if not rid_attr:
|
||||
continue
|
||||
xml_rel_path = xml_file.relative_to(self.unpacked_dir)
|
||||
elem_name = (
|
||||
elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag
|
||||
)
|
||||
|
||||
# Check if the ID exists
|
||||
if rid_attr not in rid_to_type:
|
||||
errors.append(
|
||||
f" {xml_rel_path}: Line {elem.sourceline}: "
|
||||
f"<{elem_name}> references non-existent relationship '{rid_attr}' "
|
||||
f"<{elem_name}> r:{attr_name} references non-existent relationship '{rid_attr}' "
|
||||
f"(valid IDs: {', '.join(sorted(rid_to_type.keys())[:5])}{'...' if len(rid_to_type) > 5 else ''})"
|
||||
)
|
||||
# Check if we have type expectations for this element
|
||||
elif self.ELEMENT_RELATIONSHIP_TYPES:
|
||||
elif attr_name == "id" and self.ELEMENT_RELATIONSHIP_TYPES:
|
||||
expected_type = self._get_expected_relationship_type(
|
||||
elem_name
|
||||
)
|
||||
if expected_type:
|
||||
actual_type = rid_to_type[rid_attr]
|
||||
# Check if the actual type matches or contains the expected type
|
||||
if expected_type not in actual_type.lower():
|
||||
errors.append(
|
||||
f" {xml_rel_path}: Line {elem.sourceline}: "
|
||||
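Reviewer sketch of the widened check in this hunk: the validator now looks at `r:id`, `r:embed`, and `r:link` instead of `r:id` alone. The example below is an assumption-laden illustration using the standard library parser. The two namespace constants are the usual OOXML values and are assumed to match `OFFICE_RELATIONSHIPS_NAMESPACE` and the package relationships namespace, which this diff does not show; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace values; the diff only references them by constant name.
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def dangling_relationship_refs(part_xml: str, rels_xml: str) -> list[tuple[str, str, str]]:
    """Return (element, attribute, rId) for references missing from the .rels file."""
    rels_root = ET.fromstring(rels_xml)
    known = {
        rel.get("Id")
        for rel in rels_root.iter(f"{{{PKG_RELS_NS}}}Relationship")
    }
    missing = []
    for elem in ET.fromstring(part_xml).iter():
        for attr_name in ("id", "embed", "link"):
            rid = elem.get(f"{{{R_NS}}}{attr_name}")
            if rid and rid not in known:
                missing.append((elem.tag.split("}")[-1], attr_name, rid))
    return missing


rels = (
    f'<Relationships xmlns="{PKG_RELS_NS}">'
    '<Relationship Id="rId1" Type="t" Target="media/a.png"/>'
    "</Relationships>"
)
part = (
    f'<w:document xmlns:w="urn:w" xmlns:a="urn:a" xmlns:r="{R_NS}">'
    '<a:blip r:embed="rId1"/><a:blip r:link="rId9"/><w:hyperlink r:id="rId2"/>'
    "</w:document>"
)
print(dangling_relationship_refs(part, rels))
```

Before this change, the dangling `r:link` on the second `blip` would not have been noticed, because only `r:id` was inspected.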
@@ -484,58 +467,41 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def _get_expected_relationship_type(self, element_name):
|
||||
"""
|
||||
Get the expected relationship type for an element.
|
||||
First checks the explicit mapping, then tries pattern detection.
|
||||
"""
|
||||
# Normalize element name to lowercase
|
||||
elem_lower = element_name.lower()
|
||||
|
||||
# Check explicit mapping first
|
||||
if elem_lower in self.ELEMENT_RELATIONSHIP_TYPES:
|
||||
return self.ELEMENT_RELATIONSHIP_TYPES[elem_lower]
|
||||
|
||||
# Try pattern detection for common patterns
|
||||
# Pattern 1: Elements ending in "Id" often expect a relationship of the prefix type
|
||||
if elem_lower.endswith("id") and len(elem_lower) > 2:
|
||||
# e.g., "sldId" -> "sld", "sldMasterId" -> "sldMaster"
|
||||
prefix = elem_lower[:-2] # Remove "id"
|
||||
# Check if this might be a compound like "sldMasterId"
|
||||
prefix = elem_lower[:-2]
|
||||
if prefix.endswith("master"):
|
||||
return prefix.lower()
|
||||
elif prefix.endswith("layout"):
|
||||
return prefix.lower()
|
||||
else:
|
||||
# Simple case like "sldId" -> "slide"
|
||||
# Common transformations
|
||||
if prefix == "sld":
|
||||
return "slide"
|
||||
return prefix.lower()
|
||||
|
||||
# Pattern 2: Elements ending in "Reference" expect a relationship of the prefix type
|
||||
if elem_lower.endswith("reference") and len(elem_lower) > 9:
|
||||
prefix = elem_lower[:-9] # Remove "reference"
|
||||
prefix = elem_lower[:-9]
|
||||
return prefix.lower()
|
||||
|
||||
return None
|
||||
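Reviewer sketch of the heuristic in `_get_expected_relationship_type`, written as a standalone function so it can be tried in isolation. The `explicit` parameter stands in for `ELEMENT_RELATIONSHIP_TYPES` and is a hypothetical name. Because the element name is lowercased first, the "master" and "layout" branches in the diff return the same value as the fallback, so the sketch folds them together.

```python
from typing import Optional


def expected_relationship_type(
    element_name: str, explicit: Optional[dict] = None
) -> Optional[str]:
    explicit = explicit or {}
    elem_lower = element_name.lower()

    # An explicit mapping wins over pattern detection.
    if elem_lower in explicit:
        return explicit[elem_lower]

    # Pattern 1: "<prefix>Id" expects a relationship of the prefix type.
    if elem_lower.endswith("id") and len(elem_lower) > 2:
        prefix = elem_lower[:-2]
        return "slide" if prefix == "sld" else prefix

    # Pattern 2: "<prefix>Reference" expects a relationship of the prefix type.
    if elem_lower.endswith("reference") and len(elem_lower) > 9:
        return elem_lower[:-9]

    return None


for name in ("sldId", "sldMasterId", "headerReference", "p"):
    print(name, "->", expected_relationship_type(name))
```

Note that the result is always lowercase (`sldMasterId` gives `sldmaster`), which is consistent with the caller comparing it against `actual_type.lower()`.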
|
||||
def validate_content_types(self):
|
||||
"""Validate that all content files are properly declared in [Content_Types].xml."""
|
||||
errors = []
|
||||
|
||||
# Find [Content_Types].xml file
|
||||
content_types_file = self.unpacked_dir / "[Content_Types].xml"
|
||||
if not content_types_file.exists():
|
||||
print("FAILED - [Content_Types].xml file not found")
|
||||
return False
|
||||
|
||||
try:
|
||||
# Parse and get all declared parts and extensions
|
||||
root = lxml.etree.parse(str(content_types_file)).getroot()
|
||||
declared_parts = set()
|
||||
declared_extensions = set()
|
||||
|
||||
# Get Override declarations (specific files)
|
||||
for override in root.findall(
|
||||
f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Override"
|
||||
):
|
||||
@@ -543,7 +509,6 @@ class BaseSchemaValidator:
|
||||
if part_name is not None:
|
||||
declared_parts.add(part_name.lstrip("/"))
|
||||
|
||||
# Get Default declarations (by extension)
|
||||
for default in root.findall(
|
||||
f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Default"
|
||||
):
|
||||
@@ -551,19 +516,17 @@ class BaseSchemaValidator:
|
||||
if extension is not None:
|
||||
declared_extensions.add(extension.lower())
|
||||
|
||||
# Root elements that require content type declaration
|
||||
declarable_roots = {
|
||||
"sld",
|
||||
"sldLayout",
|
||||
"sldMaster",
|
||||
"presentation", # PowerPoint
|
||||
"document", # Word
|
||||
"presentation",
|
||||
"document",
|
||||
"workbook",
|
||||
"worksheet", # Excel
|
||||
"theme", # Common
|
||||
"worksheet",
|
||||
"theme",
|
||||
}
|
||||
|
||||
# Common media file extensions that should be declared
|
||||
media_extensions = {
|
||||
"png": "image/png",
|
||||
"jpg": "image/jpeg",
|
||||
@@ -575,17 +538,14 @@ class BaseSchemaValidator:
|
||||
"emf": "image/x-emf",
|
||||
}
|
||||
|
||||
# Get all files in the unpacked directory
|
||||
all_files = list(self.unpacked_dir.rglob("*"))
|
||||
all_files = [f for f in all_files if f.is_file()]
|
||||
|
||||
# Check all XML files for Override declarations
|
||||
for xml_file in self.xml_files:
|
||||
path_str = str(xml_file.relative_to(self.unpacked_dir)).replace(
|
||||
"\\", "/"
|
||||
)
|
||||
|
||||
# Skip non-content files
|
||||
if any(
|
||||
skip in path_str
|
||||
for skip in [".rels", "[Content_Types]", "docProps/", "_rels/"]
|
||||
@@ -602,11 +562,9 @@ class BaseSchemaValidator:
|
||||
)
|
||||
|
||||
except Exception:
|
||||
continue # Skip unparseable files
|
||||
continue
|
||||
|
||||
# Check all non-XML files for Default extension declarations
|
||||
for file_path in all_files:
|
||||
# Skip XML files and metadata files (already checked above)
|
||||
if file_path.suffix.lower() in {".xml", ".rels"}:
|
||||
continue
|
||||
if file_path.name == "[Content_Types].xml":
|
||||
@@ -616,7 +574,6 @@ class BaseSchemaValidator:
|
||||
|
||||
extension = file_path.suffix.lstrip(".").lower()
|
||||
if extension and extension not in declared_extensions:
|
||||
# Check if it's a known media extension that should be declared
|
||||
if extension in media_extensions:
|
||||
relative_path = file_path.relative_to(self.unpacked_dir)
|
||||
errors.append(
|
||||
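Reviewer sketch of the media half of `validate_content_types`: a media file whose extension has no `Default` entry in `[Content_Types].xml` is reported. This is an illustration under assumptions, not the validator itself. The namespace value and the `Extension` attribute name are the standard Open Packaging Conventions ones and are assumed to match the lines hidden between hunks; `MEDIA_EXTENSIONS` is a subset of the table in the diff, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Assumed value of CONTENT_TYPES_NAMESPACE (not shown in this diff).
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Subset of the media table visible in the diff.
MEDIA_EXTENSIONS = {"png": "image/png", "jpg": "image/jpeg", "emf": "image/x-emf"}


def undeclared_media_parts(content_types_xml: str, part_names: list[str]) -> list[tuple[str, str]]:
    """Return (part, suggested content type) for media files lacking a Default entry."""
    root = ET.fromstring(content_types_xml)
    declared = {
        default.get("Extension", "").lower()
        for default in root.iter(f"{{{CT_NS}}}Default")
    }
    missing = []
    for name in part_names:
        extension = PurePosixPath(name).suffix.lstrip(".").lower()
        if extension in MEDIA_EXTENSIONS and extension not in declared:
            missing.append((name, MEDIA_EXTENSIONS[extension]))
    return missing


content_types = (
    f'<Types xmlns="{CT_NS}">'
    '<Default Extension="PNG" ContentType="image/png"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    "</Types>"
)
parts = ["word/media/a.png", "word/media/b.jpg", "word/media/c.bin"]
print(undeclared_media_parts(content_types, parts))
```

Unknown extensions such as `.bin` are ignored, matching the diff, which only reports extensions present in its media table.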
@@ -639,36 +596,28 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def validate_file_against_xsd(self, xml_file, verbose=False):
|
||||
"""Validate a single XML file against XSD schema, comparing with original.
|
||||
|
||||
Args:
|
||||
xml_file: Path to XML file to validate
|
||||
verbose: Enable verbose output
|
||||
|
||||
Returns:
|
||||
tuple: (is_valid, new_errors_set) where is_valid is True/False/None (skipped)
|
||||
"""
|
||||
# Resolve both paths to handle symlinks
|
||||
xml_file = Path(xml_file).resolve()
|
||||
unpacked_dir = self.unpacked_dir.resolve()
|
||||
|
||||
# Validate current file
|
||||
is_valid, current_errors = self._validate_single_file_xsd(
|
||||
xml_file, unpacked_dir
|
||||
)
|
||||
|
||||
if is_valid is None:
|
||||
return None, set() # Skipped
|
||||
return None, set()
|
||||
elif is_valid:
|
||||
return True, set() # Valid, no errors
|
||||
return True, set()
|
||||
|
||||
# Get errors from original file for this specific file
|
||||
original_errors = self._get_original_file_errors(xml_file)
|
||||
|
||||
# Compare with original (both are guaranteed to be sets here)
|
||||
assert current_errors is not None
|
||||
new_errors = current_errors - original_errors
|
||||
|
||||
new_errors = {
|
||||
e for e in new_errors
|
||||
if not any(pattern in e for pattern in self.IGNORED_VALIDATION_ERRORS)
|
||||
}
|
||||
|
||||
if new_errors:
|
||||
if verbose:
|
||||
relative_path = xml_file.relative_to(unpacked_dir)
|
||||
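Reviewer sketch of the baseline comparison in `validate_file_against_xsd`: only errors that are absent from the original document and that match none of the ignored patterns count as new. The function and argument names are hypothetical; the set arithmetic is the same as in the diff.

```python
def new_validation_errors(
    current: set[str], original: set[str], ignored_patterns: list[str]
) -> set[str]:
    """Errors introduced by the edit, minus anything matching an ignored pattern."""
    return {
        error
        for error in current - original
        if not any(pattern in error for pattern in ignored_patterns)
    }


current = {
    "Element 'w:foo': not expected.",
    "Element 'w:bar': not expected.",
    "Attribute 'hyphenationZone' is not allowed.",
}
original = {"Element 'w:foo': not expected."}
print(new_validation_errors(current, original, ["hyphenationZone"]))
```

The design choice here is that a file can still fail the schema and pass this check, as long as every failure was already present before editing or is explicitly ignored.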
@@ -678,7 +627,6 @@ class BaseSchemaValidator:
|
||||
print(f" - {truncated}")
|
||||
return False, new_errors
|
||||
else:
|
||||
# All errors existed in original
|
||||
if verbose:
|
||||
print(
|
||||
f"PASSED - No new errors (original had {len(current_errors)} errors)"
|
||||
@@ -686,7 +634,6 @@ class BaseSchemaValidator:
|
||||
return True, set()
|
||||
|
||||
def validate_against_xsd(self):
|
||||
"""Validate XML files against XSD schemas, showing only new errors compared to original."""
|
||||
new_errors = []
|
||||
original_error_count = 0
|
||||
valid_count = 0
|
||||
@@ -705,19 +652,16 @@ class BaseSchemaValidator:
|
||||
valid_count += 1
|
||||
continue
|
||||
elif is_valid:
|
||||
# Had errors but all existed in original
|
||||
original_error_count += 1
|
||||
valid_count += 1
|
||||
continue
|
||||
|
||||
# Has new errors
|
||||
new_errors.append(f" {relative_path}: {len(new_file_errors)} new error(s)")
|
||||
for error in list(new_file_errors)[:3]: # Show first 3 errors
|
||||
for error in list(new_file_errors)[:3]:
|
||||
new_errors.append(
|
||||
f" - {error[:250]}..." if len(error) > 250 else f" - {error}"
|
||||
)
|
||||
|
||||
# Print summary
|
||||
if self.verbose:
|
||||
print(f"Validated {len(self.xml_files)} files:")
|
||||
print(f" - Valid: {valid_count}")
|
||||
@@ -739,62 +683,47 @@ class BaseSchemaValidator:
|
||||
return True
|
||||
|
||||
def _get_schema_path(self, xml_file):
|
||||
"""Determine the appropriate schema path for an XML file."""
|
||||
# Check exact filename match
|
||||
if xml_file.name in self.SCHEMA_MAPPINGS:
|
||||
return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.name]
|
||||
|
||||
# Check .rels files
|
||||
if xml_file.suffix == ".rels":
|
||||
return self.schemas_dir / self.SCHEMA_MAPPINGS[".rels"]
|
||||
|
||||
# Check chart files
|
||||
if "charts/" in str(xml_file) and xml_file.name.startswith("chart"):
|
||||
return self.schemas_dir / self.SCHEMA_MAPPINGS["chart"]
|
||||
|
||||
# Check theme files
|
||||
if "theme/" in str(xml_file) and xml_file.name.startswith("theme"):
|
||||
return self.schemas_dir / self.SCHEMA_MAPPINGS["theme"]
|
||||
|
||||
# Check if file is in a main content folder and use appropriate schema
|
||||
if xml_file.parent.name in self.MAIN_CONTENT_FOLDERS:
|
||||
return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.parent.name]
|
||||
|
||||
return None
|
||||
|
||||
def _clean_ignorable_namespaces(self, xml_doc):
|
||||
"""Remove attributes and elements not in allowed namespaces."""
|
||||
# Create a clean copy
|
||||
xml_string = lxml.etree.tostring(xml_doc, encoding="unicode")
|
||||
xml_copy = lxml.etree.fromstring(xml_string)
|
||||
|
||||
# Remove attributes not in allowed namespaces
|
||||
for elem in xml_copy.iter():
|
||||
attrs_to_remove = []
|
||||
|
||||
for attr in elem.attrib:
|
||||
# Check if attribute is from a namespace other than allowed ones
|
||||
if "{" in attr:
|
||||
ns = attr.split("}")[0][1:]
|
||||
if ns not in self.OOXML_NAMESPACES:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
# Remove collected attributes
|
||||
for attr in attrs_to_remove:
|
||||
del elem.attrib[attr]
|
||||
|
||||
# Remove elements not in allowed namespaces
|
||||
self._remove_ignorable_elements(xml_copy)
|
||||
|
||||
return lxml.etree.ElementTree(xml_copy)
|
||||
|
||||
def _remove_ignorable_elements(self, root):
|
||||
"""Recursively remove all elements not in allowed namespaces."""
|
||||
elements_to_remove = []
|
||||
|
||||
# Find elements to remove
|
||||
for elem in list(root):
|
||||
# Skip non-element nodes (comments, processing instructions, etc.)
|
||||
if not hasattr(elem, "tag") or callable(elem.tag):
|
||||
continue
|
||||
|
||||
@@ -805,32 +734,25 @@ class BaseSchemaValidator:
|
||||
elements_to_remove.append(elem)
|
||||
continue
|
||||
|
||||
# Recursively clean child elements
|
||||
self._remove_ignorable_elements(elem)
|
||||
|
||||
# Remove collected elements
|
||||
for elem in elements_to_remove:
|
||||
root.remove(elem)
|
||||
|
||||
def _preprocess_for_mc_ignorable(self, xml_doc):
|
||||
"""Preprocess XML to handle mc:Ignorable attribute properly."""
|
||||
# Remove mc:Ignorable attributes before validation
|
||||
root = xml_doc.getroot()
|
||||
|
||||
# Remove mc:Ignorable attribute from root
|
||||
if f"{{{self.MC_NAMESPACE}}}Ignorable" in root.attrib:
|
||||
del root.attrib[f"{{{self.MC_NAMESPACE}}}Ignorable"]
|
||||
|
||||
return xml_doc
|
||||
|
||||
def _validate_single_file_xsd(self, xml_file, base_path):
|
||||
"""Validate a single XML file against XSD schema. Returns (is_valid, errors_set)."""
|
||||
schema_path = self._get_schema_path(xml_file)
|
||||
if not schema_path:
|
||||
return None, None # Skip file
|
||||
return None, None
|
||||
|
||||
try:
|
||||
# Load schema
|
||||
with open(schema_path, "rb") as xsd_file:
|
||||
parser = lxml.etree.XMLParser()
|
||||
xsd_doc = lxml.etree.parse(
|
||||
@@ -838,14 +760,12 @@ class BaseSchemaValidator:
|
||||
)
|
||||
schema = lxml.etree.XMLSchema(xsd_doc)
|
||||
|
||||
# Load and preprocess XML
|
||||
with open(xml_file, "r") as f:
|
||||
xml_doc = lxml.etree.parse(f)
|
||||
|
||||
xml_doc, _ = self._remove_template_tags_from_text_nodes(xml_doc)
|
||||
xml_doc = self._preprocess_for_mc_ignorable(xml_doc)
|
||||
|
||||
# Clean ignorable namespaces if needed
|
||||
relative_path = xml_file.relative_to(base_path)
|
||||
if (
|
||||
relative_path.parts
|
||||
@@ -853,13 +773,11 @@ class BaseSchemaValidator:
|
||||
):
|
||||
xml_doc = self._clean_ignorable_namespaces(xml_doc)
|
||||
|
||||
# Validate
|
||||
if schema.validate(xml_doc):
|
||||
return True, set()
|
||||
else:
|
||||
errors = set()
|
||||
for error in schema.error_log:
|
||||
# Store normalized error message (without line numbers for comparison)
|
||||
errors.add(error.message)
|
||||
return False, errors
|
||||
|
||||
@@ -867,18 +785,12 @@ class BaseSchemaValidator:
|
||||
return False, {str(e)}
|
||||
|
||||
def _get_original_file_errors(self, xml_file):
|
||||
"""Get XSD validation errors from a single file in the original document.
|
||||
if self.original_file is None:
|
||||
return set()
|
||||
|
||||
Args:
|
||||
xml_file: Path to the XML file in unpacked_dir to check
|
||||
|
||||
Returns:
|
||||
set: Set of error messages from the original file
|
||||
"""
|
||||
import tempfile
|
||||
import zipfile
|
||||
|
||||
# Resolve both paths to handle symlinks (e.g., /var vs /private/var on macOS)
|
||||
xml_file = Path(xml_file).resolve()
|
||||
unpacked_dir = self.unpacked_dir.resolve()
|
||||
relative_path = xml_file.relative_to(unpacked_dir)
|
||||
@@ -886,37 +798,23 @@ class BaseSchemaValidator:
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
temp_path = Path(temp_dir)
|
||||
|
||||
# Extract original file
|
||||
with zipfile.ZipFile(self.original_file, "r") as zip_ref:
|
||||
zip_ref.extractall(temp_path)
|
||||
|
||||
# Find corresponding file in original
|
||||
original_xml_file = temp_path / relative_path
|
||||
|
||||
if not original_xml_file.exists():
|
||||
# File didn't exist in original, so no original errors
|
||||
return set()
|
||||
|
||||
# Validate the specific file in original
|
||||
is_valid, errors = self._validate_single_file_xsd(
|
||||
original_xml_file, temp_path
|
||||
)
|
||||
return errors if errors else set()
|
||||
|
||||
def _remove_template_tags_from_text_nodes(self, xml_doc):
|
||||
"""Remove template tags from XML text nodes and collect warnings.
|
||||
|
||||
Template tags follow the pattern {{ ... }} and are used as placeholders
|
||||
for content replacement. They should be removed from text content before
|
||||
XSD validation while preserving XML structure.
|
||||
|
||||
Returns:
|
||||
tuple: (cleaned_xml_doc, warnings_list)
|
||||
"""
|
||||
warnings = []
|
||||
template_pattern = re.compile(r"\{\{[^}]*\}\}")
|
||||
|
||||
# Create a copy of the document to avoid modifying the original
|
||||
xml_string = lxml.etree.tostring(xml_doc, encoding="unicode")
|
||||
xml_copy = lxml.etree.fromstring(xml_string)
|
||||
|
||||
@@ -932,9 +830,7 @@ class BaseSchemaValidator:
|
||||
return template_pattern.sub("", text)
|
||||
return text
|
||||
|
||||
# Process all text nodes in the document
|
||||
for elem in xml_copy.iter():
|
||||
# Skip processing if this is a w:t element
|
||||
if not hasattr(elem, "tag") or callable(elem.tag):
|
||||
continue
|
||||
tag_str = str(elem.tag)
|
||||
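Reviewer sketch of the template-tag stripping in `_remove_template_tags_from_text_nodes`, reduced to the regular expression shown in the diff. The helper name and the tuple it returns are assumptions for illustration; the real method works on text nodes of a copied XML tree and collects warnings.

```python
import re

# Same pattern as in the diff: "{{", any run of non-"}" characters, "}}".
TEMPLATE_PATTERN = re.compile(r"\{\{[^}]*\}\}")


def strip_template_tags(text: str) -> tuple[str, list[str]]:
    """Return the text without {{ ... }} placeholders, plus the placeholders found."""
    found = TEMPLATE_PATTERN.findall(text)
    return TEMPLATE_PATTERN.sub("", text), found


cleaned, tags = strip_template_tags("Dear {{ name }}, total {{amount}}.")
print(repr(cleaned), tags)
```

Because the pattern excludes `}` inside the braces, a placeholder that itself contains a closing brace is left untouched.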
446
scientific-skills/docx/scripts/office/validators/docx.py
Normal file
@@ -0,0 +1,446 @@
|
||||
"""
|
||||
Validator for Word document XML files against XSD schemas.
|
||||
"""
|
||||
|
||||
import random
|
||||
import re
|
||||
import tempfile
|
||||
import zipfile
|
||||
|
||||
import defusedxml.minidom
|
||||
import lxml.etree
|
||||
|
||||
from .base import BaseSchemaValidator
|
||||
|
||||
|
||||
class DOCXSchemaValidator(BaseSchemaValidator):
|
||||
|
||||
WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
|
||||
W14_NAMESPACE = "http://schemas.microsoft.com/office/word/2010/wordml"
|
||||
W16CID_NAMESPACE = "http://schemas.microsoft.com/office/word/2016/wordml/cid"
|
||||
|
||||
ELEMENT_RELATIONSHIP_TYPES = {}
|
||||
|
||||
def validate(self):
|
||||
if not self.validate_xml():
|
||||
return False
|
||||
|
||||
all_valid = True
|
||||
if not self.validate_namespaces():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_unique_ids():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_file_references():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_content_types():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_against_xsd():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_whitespace_preservation():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_deletions():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_insertions():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_all_relationship_ids():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_id_constraints():
|
||||
all_valid = False
|
||||
|
||||
if not self.validate_comment_markers():
|
||||
all_valid = False
|
||||
|
||||
self.compare_paragraph_counts()
|
||||
|
||||
return all_valid
|
||||
|
||||
def validate_whitespace_preservation(self):
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
|
||||
for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"):
|
||||
if elem.text:
|
||||
text = elem.text
|
||||
if re.search(r"^[ \t\n\r]", text) or re.search(
|
||||
r"[ \t\n\r]$", text
|
||||
):
|
||||
xml_space_attr = f"{{{self.XML_NAMESPACE}}}space"
|
||||
if (
|
||||
xml_space_attr not in elem.attrib
|
||||
or elem.attrib[xml_space_attr] != "preserve"
|
||||
):
|
||||
text_preview = (
|
||||
repr(text)[:50] + "..."
|
||||
if len(repr(text)) > 50
|
||||
else repr(text)
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} whitespace preservation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - All whitespace is properly preserved")
|
||||
return True
|
||||
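Reviewer sketch of the rule enforced by `validate_whitespace_preservation`: a `w:t` run whose text starts or ends with whitespace needs `xml:space="preserve"`. The sketch assumes the standard library parser instead of `lxml`, and assumes that `XML_NAMESPACE` in the base class is the standard XML namespace URI; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumed value of the XML_NAMESPACE constant referenced by the validator.
XML_NS = "http://www.w3.org/XML/1998/namespace"
WHITESPACE = " \t\n\r"


def runs_missing_preserve(document_xml: str) -> list[str]:
    """Return the text of w:t elements with edge whitespace but no xml:space='preserve'."""
    offenders = []
    for t_elem in ET.fromstring(document_xml).iter(f"{{{W_NS}}}t"):
        text = t_elem.text
        if not text:
            continue
        has_edge_whitespace = text[0] in WHITESPACE or text[-1] in WHITESPACE
        if has_edge_whitespace and t_elem.get(f"{{{XML_NS}}}space") != "preserve":
            offenders.append(text)
    return offenders


document = (
    f'<w:document xmlns:w="{W_NS}">'
    "<w:t>ok</w:t>"
    "<w:t> lead</w:t>"
    '<w:t xml:space="preserve">trail </w:t>'
    "</w:document>"
)
print(runs_missing_preserve(document))
```

Only the second run is reported: the first has no edge whitespace and the third carries the attribute.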
|
||||
def validate_deletions(self):
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
namespaces = {"w": self.WORD_2006_NAMESPACE}
|
||||
|
||||
for t_elem in root.xpath(".//w:del//w:t", namespaces=namespaces):
|
||||
if t_elem.text:
|
||||
text_preview = (
|
||||
repr(t_elem.text)[:50] + "..."
|
||||
if len(repr(t_elem.text)) > 50
|
||||
else repr(t_elem.text)
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
|
||||
)
|
||||
|
||||
for instr_elem in root.xpath(
|
||||
".//w:del//w:instrText", namespaces=namespaces
|
||||
):
|
||||
text_preview = (
|
||||
repr(instr_elem.text or "")[:50] + "..."
|
||||
if len(repr(instr_elem.text or "")) > 50
|
||||
else repr(instr_elem.text or "")
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {instr_elem.sourceline}: <w:instrText> found within <w:del> (use <w:delInstrText>): {text_preview}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} deletion validation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - No w:t elements found within w:del elements")
|
||||
return True
|
||||
|
||||
def count_paragraphs_in_unpacked(self):
|
||||
count = 0
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
|
||||
count = len(paragraphs)
|
||||
except Exception as e:
|
||||
print(f"Error counting paragraphs in unpacked document: {e}")
|
||||
|
||||
return count
|
||||
|
||||
def count_paragraphs_in_original(self):
|
||||
original = self.original_file
|
||||
if original is None:
|
||||
return 0
|
||||
|
||||
count = 0
|
||||
|
||||
try:
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
with zipfile.ZipFile(original, "r") as zip_ref:
|
||||
zip_ref.extractall(temp_dir)
|
||||
|
||||
doc_xml_path = temp_dir + "/word/document.xml"
|
||||
root = lxml.etree.parse(doc_xml_path).getroot()
|
||||
|
||||
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
|
||||
count = len(paragraphs)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error counting paragraphs in original document: {e}")
|
||||
|
||||
return count
|
||||
|
||||
def validate_insertions(self):
|
||||
errors = []
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name != "document.xml":
|
||||
continue
|
||||
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
namespaces = {"w": self.WORD_2006_NAMESPACE}
|
||||
|
||||
invalid_elements = root.xpath(
|
||||
".//w:ins//w:delText[not(ancestor::w:del)]", namespaces=namespaces
|
||||
)
|
||||
|
||||
for elem in invalid_elements:
|
||||
text_preview = (
|
||||
repr(elem.text or "")[:50] + "..."
|
||||
if len(repr(elem.text or "")) > 50
|
||||
else repr(elem.text or "")
|
||||
)
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - Found {len(errors)} insertion validation violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - No w:delText elements within w:ins elements")
|
||||
return True
|
||||
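Reviewer sketch of the tracked-change rules checked by `validate_deletions` and `validate_insertions`: no `w:t` text or `w:instrText` under `w:del`, and no `w:delText` under `w:ins` unless a `w:del` ancestor is also present. The validator expresses these as XPath queries; the standard library's ElementTree has no `ancestor` axis, so this sketch swaps in a recursive walk that carries the ancestor state. The function name and the labels in the result are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def tracked_change_violations(document_xml: str) -> list[tuple[str, str]]:
    problems = []

    def walk(elem, in_del, in_ins):
        if elem.tag == f"{W}t" and in_del and elem.text:
            problems.append(("w:t inside w:del", elem.text))
        elif elem.tag == f"{W}instrText" and in_del:
            problems.append(("w:instrText inside w:del", elem.text or ""))
        elif elem.tag == f"{W}delText" and in_ins and not in_del:
            problems.append(("w:delText inside w:ins", elem.text or ""))
        for child in elem:
            # Children inherit the flags, plus whatever this element contributes.
            walk(child, in_del or elem.tag == f"{W}del", in_ins or elem.tag == f"{W}ins")

    walk(ET.fromstring(document_xml), False, False)
    return problems


document = (
    f'<w:document xmlns:w="{W_NS}">'
    "<w:del><w:r><w:t>old</w:t></w:r></w:del>"
    "<w:del><w:r><w:delText>fine</w:delText></w:r></w:del>"
    "<w:ins><w:r><w:delText>bad</w:delText></w:r></w:ins>"
    "<w:ins><w:del><w:r><w:delText>ok</w:delText></w:r></w:del></w:ins>"
    "</w:document>"
)
print(tracked_change_violations(document))
```

The last run shows the exemption: deleting text inside an insertion is legal because the `w:delText` still has a `w:del` ancestor.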
|
||||
def compare_paragraph_counts(self):
|
||||
original_count = self.count_paragraphs_in_original()
|
||||
new_count = self.count_paragraphs_in_unpacked()
|
||||
|
||||
diff = new_count - original_count
|
||||
diff_str = f"+{diff}" if diff > 0 else str(diff)
|
||||
print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")
|
||||
|
||||
def _parse_id_value(self, val: str, base: int = 16) -> int:
|
||||
return int(val, base)
|
||||
|
||||
def validate_id_constraints(self):
|
||||
errors = []
|
||||
para_id_attr = f"{{{self.W14_NAMESPACE}}}paraId"
|
||||
durable_id_attr = f"{{{self.W16CID_NAMESPACE}}}durableId"
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
for elem in lxml.etree.parse(str(xml_file)).iter():
|
||||
if val := elem.get(para_id_attr):
|
||||
if self._parse_id_value(val, base=16) >= 0x80000000:
|
||||
errors.append(
|
||||
f" {xml_file.name}:{elem.sourceline}: paraId={val} >= 0x80000000"
|
||||
)
|
||||
|
||||
if val := elem.get(durable_id_attr):
|
||||
if xml_file.name == "numbering.xml":
|
||||
try:
|
||||
if self._parse_id_value(val, base=10) >= 0x7FFFFFFF:
|
||||
errors.append(
|
||||
f" {xml_file.name}:{elem.sourceline}: "
|
||||
f"durableId={val} >= 0x7FFFFFFF"
|
||||
)
|
||||
except ValueError:
|
||||
errors.append(
|
||||
f" {xml_file.name}:{elem.sourceline}: "
|
||||
f"durableId={val} must be decimal in numbering.xml"
|
||||
)
|
||||
else:
|
||||
if self._parse_id_value(val, base=16) >= 0x7FFFFFFF:
|
||||
errors.append(
|
||||
f" {xml_file.name}:{elem.sourceline}: "
|
||||
f"durableId={val} >= 0x7FFFFFFF"
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - {len(errors)} ID constraint violations:")
|
||||
for e in errors:
|
||||
print(e)
|
||||
elif self.verbose:
|
||||
print("PASSED - All paraId/durableId values within constraints")
|
||||
return not errors
|
||||
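Reviewer sketch of the numeric limits in `validate_id_constraints`: `paraId` is parsed as hexadecimal and must stay below `0x80000000`; `durableId` must stay below `0x7FFFFFFF` and is parsed as decimal in `numbering.xml` and as hexadecimal elsewhere. The function name and its return convention (a message or `None`) are hypothetical. Unlike the validator, which swallows parse failures with a broad `except`, this sketch lets an invalid hexadecimal value raise `ValueError`.

```python
from typing import Optional


def id_constraint_error(attr: str, value: str, part_name: str) -> Optional[str]:
    """Return a violation message for a paraId/durableId value, or None if it is fine."""
    if attr == "paraId":
        if int(value, 16) >= 0x80000000:
            return f"paraId={value} >= 0x80000000"
        return None
    if attr == "durableId":
        if part_name == "numbering.xml":
            try:
                number = int(value, 10)
            except ValueError:
                return f"durableId={value} must be decimal in numbering.xml"
        else:
            number = int(value, 16)
        if number >= 0x7FFFFFFF:
            return f"durableId={value} >= 0x7FFFFFFF"
    return None


print(id_constraint_error("paraId", "80000000", "document.xml"))
print(id_constraint_error("durableId", "1A2B", "numbering.xml"))
print(id_constraint_error("durableId", "7FFFFFFE", "commentsIds.xml"))
```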
|
||||
def validate_comment_markers(self):
|
||||
errors = []
|
||||
|
||||
document_xml = None
|
||||
comments_xml = None
|
||||
for xml_file in self.xml_files:
|
||||
if xml_file.name == "document.xml" and "word" in str(xml_file):
|
||||
document_xml = xml_file
|
||||
elif xml_file.name == "comments.xml":
|
||||
comments_xml = xml_file
|
||||
|
||||
if not document_xml:
|
||||
if self.verbose:
|
||||
print("PASSED - No document.xml found (skipping comment validation)")
|
||||
return True
|
||||
|
||||
try:
|
||||
doc_root = lxml.etree.parse(str(document_xml)).getroot()
|
||||
namespaces = {"w": self.WORD_2006_NAMESPACE}
|
||||
|
||||
range_starts = {
|
||||
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
|
||||
for elem in doc_root.xpath(
|
||||
".//w:commentRangeStart", namespaces=namespaces
|
||||
)
|
||||
}
|
||||
range_ends = {
|
||||
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
|
||||
for elem in doc_root.xpath(
|
||||
".//w:commentRangeEnd", namespaces=namespaces
|
||||
)
|
||||
}
|
||||
references = {
|
||||
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
|
||||
for elem in doc_root.xpath(
|
||||
".//w:commentReference", namespaces=namespaces
|
||||
)
|
||||
}
|
||||
|
||||
orphaned_ends = range_ends - range_starts
|
||||
for comment_id in sorted(
|
||||
orphaned_ends, key=lambda x: int(x) if x and x.isdigit() else 0
|
||||
):
|
||||
errors.append(
|
||||
f' document.xml: commentRangeEnd id="{comment_id}" has no matching commentRangeStart'
|
||||
)
|
||||
|
||||
orphaned_starts = range_starts - range_ends
|
||||
for comment_id in sorted(
|
||||
orphaned_starts, key=lambda x: int(x) if x and x.isdigit() else 0
|
||||
):
|
||||
errors.append(
|
||||
f' document.xml: commentRangeStart id="{comment_id}" has no matching commentRangeEnd'
|
||||
)
|
||||
|
||||
comment_ids = set()
|
||||
if comments_xml and comments_xml.exists():
|
||||
comments_root = lxml.etree.parse(str(comments_xml)).getroot()
|
||||
comment_ids = {
|
||||
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
|
||||
for elem in comments_root.xpath(
|
||||
".//w:comment", namespaces=namespaces
|
||||
)
|
||||
}
|
||||
|
||||
marker_ids = range_starts | range_ends | references
|
||||
invalid_refs = marker_ids - comment_ids
|
||||
for comment_id in sorted(
|
||||
invalid_refs, key=lambda x: int(x) if x and x.isdigit() else 0
|
||||
):
|
||||
if comment_id:
|
||||
errors.append(
|
||||
f' document.xml: marker id="{comment_id}" references non-existent comment'
|
||||
)
|
||||
|
||||
except Exception as e:  # Exception already covers lxml.etree.XMLSyntaxError
|
||||
errors.append(f" Error parsing XML: {e}")
|
||||
|
||||
if errors:
|
||||
print(f"FAILED - {len(errors)} comment marker violations:")
|
||||
for error in errors:
|
||||
print(error)
|
||||
return False
|
||||
else:
|
||||
if self.verbose:
|
||||
print("PASSED - All comment markers properly paired")
|
||||
return True
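The marker check above boils down to three set differences. Here is a small self-contained sketch of that idea, using the standard library's `xml.etree.ElementTree` instead of `lxml` so it runs without extra dependencies. The function names and the sample XML strings are illustrative assumptions, not part of the validator.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def marker_ids(root, local_name):
    """Collect the w:id values of every w:<local_name> element under root."""
    ids = (el.get(f"{{{W_NS}}}id") for el in root.iter(f"{{{W_NS}}}{local_name}"))
    return {i for i in ids if i is not None}


def find_marker_problems(document_xml, comments_xml):
    """Compare comment markers in document.xml with comments.xml entries."""
    doc_root = ET.fromstring(document_xml)
    starts = marker_ids(doc_root, "commentRangeStart")
    ends = marker_ids(doc_root, "commentRangeEnd")
    refs = marker_ids(doc_root, "commentReference")
    defined = marker_ids(ET.fromstring(comments_xml), "comment")
    return {
        "orphaned_ends": sorted(ends - starts),
        "orphaned_starts": sorted(starts - ends),
        "undefined": sorted((starts | ends | refs) - defined),
    }


# Illustrative input: comment 0 is well formed, 1 has only an end marker,
# and 2 is referenced but never defined in comments.xml.
DOC = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:commentRangeStart w:id="0"/><w:r><w:t>hi</w:t></w:r>'
    '<w:commentRangeEnd w:id="0"/><w:commentRangeEnd w:id="1"/>'
    '<w:r><w:commentReference w:id="2"/></w:r>'
    "</w:p></w:body></w:document>"
)
COMMENTS = f'<w:comments xmlns:w="{W_NS}"><w:comment w:id="0"/></w:comments>'

result = find_marker_problems(DOC, COMMENTS)
```

The sketch sorts IDs as strings for simplicity, whereas the code above sorts them numerically when they are digits.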
|
||||
|
||||
def repair(self) -> int:
|
||||
repairs = super().repair()
|
||||
repairs += self.repair_durableId()
|
||||
return repairs
|
||||
|
||||
def repair_durableId(self) -> int:
|
||||
repairs = 0
|
||||
|
||||
for xml_file in self.xml_files:
|
||||
try:
|
||||
content = xml_file.read_text(encoding="utf-8")
|
||||
dom = defusedxml.minidom.parseString(content)
|
||||
modified = False
|
||||
|
||||
for elem in dom.getElementsByTagName("*"):
|
||||
if not elem.hasAttribute("w16cid:durableId"):
|
||||
continue
|
||||
|
||||
durable_id = elem.getAttribute("w16cid:durableId")
|
||||
needs_repair = False
|
||||
|
||||
if xml_file.name == "numbering.xml":
|
||||
try:
|
||||
needs_repair = (
|
||||
self._parse_id_value(durable_id, base=10) >= 0x7FFFFFFF
|
||||
)
|
||||
except ValueError:
|
||||
needs_repair = True
|
||||
else:
|
||||
try:
|
||||
needs_repair = (
|
||||
self._parse_id_value(durable_id, base=16) >= 0x7FFFFFFF
|
||||
)
|
||||
except ValueError:
|
||||
needs_repair = True
|
||||
|
||||
if needs_repair:
|
||||
value = random.randint(1, 0x7FFFFFFE)
|
||||
if xml_file.name == "numbering.xml":
|
||||
new_id = str(value)
|
||||
else:
|
||||
new_id = f"{value:08X}"
|
||||
|
||||
elem.setAttribute("w16cid:durableId", new_id)
|
||||
print(
|
||||
f" Repaired: {xml_file.name}: durableId {durable_id} → {new_id}"
|
||||
)
|
||||
repairs += 1
|
||||
modified = True
|
||||
|
||||
if modified:
|
||||
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
|
||||
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return repairs
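The replacement-ID step of the repair can be sketched on its own. This is a minimal illustration with assumed names (`new_durable_id` and its `rng` parameter are hypothetical); it reproduces only the value generation and formatting, not the DOM rewrite.

```python
import random


def new_durable_id(file_name, rng=random):
    """Pick a fresh durableId strictly below 0x7FFFFFFF.

    numbering.xml gets a decimal string; every other part gets
    eight uppercase hex digits, matching the formats used above.
    """
    value = rng.randint(1, 0x7FFFFFFE)
    if file_name == "numbering.xml":
        return str(value)
    return f"{value:08X}"
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible in tests. Note that random IDs are not checked for collisions here, and the code above does not appear to check for them either.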
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise RuntimeError("This module should not be run directly.")
|
||||
@@ -8,14 +8,11 @@ from .base import BaseSchemaValidator
|
||||
|
||||
|
||||
class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
"""Validator for PowerPoint presentation XML files against XSD schemas."""
|
||||
|
||||
# PowerPoint presentation namespace
|
||||
PRESENTATIONML_NAMESPACE = (
|
||||
"http://schemas.openxmlformats.org/presentationml/2006/main"
|
||||
)
|
||||
|
||||
# PowerPoint-specific element to relationship type mappings
|
||||
ELEMENT_RELATIONSHIP_TYPES = {
|
||||
"sldid": "slide",
|
||||
"sldmasterid": "slidemaster",
|
||||
@@ -26,60 +23,46 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
}
|
||||
|
||||
def validate(self):
|
||||
"""Run all validation checks and return True if all pass."""
|
||||
# Test 0: XML well-formedness
|
||||
if not self.validate_xml():
|
||||
return False
|
||||
|
||||
# Test 1: Namespace declarations
|
||||
all_valid = True
|
||||
if not self.validate_namespaces():
|
||||
all_valid = False
|
||||
|
||||
# Test 2: Unique IDs
|
||||
if not self.validate_unique_ids():
|
||||
all_valid = False
|
||||
|
||||
# Test 3: UUID ID validation
|
||||
if not self.validate_uuid_ids():
|
||||
all_valid = False
|
||||
|
||||
# Test 4: Relationship and file reference validation
|
||||
if not self.validate_file_references():
|
||||
all_valid = False
|
||||
|
||||
# Test 5: Slide layout ID validation
|
||||
if not self.validate_slide_layout_ids():
|
||||
all_valid = False
|
||||
|
||||
# Test 6: Content type declarations
|
||||
if not self.validate_content_types():
|
||||
all_valid = False
|
||||
|
||||
# Test 7: XSD schema validation
|
||||
if not self.validate_against_xsd():
|
||||
all_valid = False
|
||||
|
||||
# Test 8: Notes slide reference validation
|
||||
if not self.validate_notes_slide_references():
|
||||
all_valid = False
|
||||
|
||||
# Test 9: Relationship ID reference validation
|
||||
if not self.validate_all_relationship_ids():
|
||||
all_valid = False
|
||||
|
||||
# Test 10: Duplicate slide layout references validation
|
||||
if not self.validate_no_duplicate_slide_layouts():
|
||||
all_valid = False
|
||||
|
||||
return all_valid
|
||||
|
||||
def validate_uuid_ids(self):
|
||||
"""Validate that ID attributes that look like UUIDs contain only hex values."""
|
||||
import lxml.etree
|
||||
|
||||
errors = []
|
||||
# UUID pattern: 8-4-4-4-12 hex digits with optional braces/hyphens
|
||||
uuid_pattern = re.compile(
|
||||
r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$"
|
||||
)
|
||||
@@ -88,15 +71,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
try:
|
||||
root = lxml.etree.parse(str(xml_file)).getroot()
|
||||
|
||||
# Check all elements for ID attributes
|
||||
for elem in root.iter():
|
||||
for attr, value in elem.attrib.items():
|
||||
# Check if this is an ID attribute
|
||||
attr_name = attr.split("}")[-1].lower()
|
||||
if attr_name == "id" or attr_name.endswith("id"):
|
||||
# Check if value looks like a UUID (has the right length and pattern structure)
|
||||
if self._looks_like_uuid(value):
|
||||
# Validate that it contains only hex characters in the right positions
|
||||
if not uuid_pattern.match(value):
|
||||
errors.append(
|
||||
f" {xml_file.relative_to(self.unpacked_dir)}: "
|
||||
@@ -119,19 +98,14 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
return True
|
||||
|
||||
def _looks_like_uuid(self, value):
|
||||
"""Check if a value has the general structure of a UUID."""
|
||||
# Remove common UUID delimiters
|
||||
clean_value = value.strip("{}()").replace("-", "")
|
||||
# Check if it's 32 hex-like characters (could include invalid hex chars)
|
||||
return len(clean_value) == 32 and all(c.isalnum() for c in clean_value)
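The UUID check is a two-stage test: first decide whether a value is shaped like a UUID, then verify that it only contains hex digits. A minimal standalone sketch of that combination follows; the regular expression and the shape test are copied from the code above, while `is_malformed_uuid` is an illustrative name that does not exist in the validator.

```python
import re

# Same pattern as above: 8-4-4-4-12 hex digits, optional braces and hyphens.
UUID_PATTERN = re.compile(
    r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}"
    r"-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$"
)


def looks_like_uuid(value):
    """True if the value has 32 alphanumeric characters once delimiters are removed."""
    clean_value = value.strip("{}()").replace("-", "")
    return len(clean_value) == 32 and all(c.isalnum() for c in clean_value)


def is_malformed_uuid(value):
    """True only for values shaped like a UUID that contain non-hex characters."""
    return looks_like_uuid(value) and UUID_PATTERN.match(value) is None
```

Short identifiers such as `rId7` never reach the regular expression, so ordinary relationship IDs are not flagged.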
|
||||
|
||||
def validate_slide_layout_ids(self):
|
||||
"""Validate that sldLayoutId elements in slide masters reference valid slide layouts."""
|
||||
import lxml.etree
|
||||
|
||||
errors = []
|
||||
|
||||
# Find all slide master files
|
||||
slide_masters = list(self.unpacked_dir.glob("ppt/slideMasters/*.xml"))
|
||||
|
||||
if not slide_masters:
|
||||
@@ -141,10 +115,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
|
||||
for slide_master in slide_masters:
|
||||
try:
|
||||
# Parse the slide master file
|
||||
root = lxml.etree.parse(str(slide_master)).getroot()
|
||||
|
||||
# Find the corresponding _rels file for this slide master
|
||||
rels_file = slide_master.parent / "_rels" / f"{slide_master.name}.rels"
|
||||
|
||||
if not rels_file.exists():
|
||||
@@ -154,10 +126,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
)
|
||||
continue
|
||||
|
||||
# Parse the relationships file
|
||||
rels_root = lxml.etree.parse(str(rels_file)).getroot()
|
||||
|
||||
# Build a set of valid relationship IDs that point to slide layouts
|
||||
valid_layout_rids = set()
|
||||
for rel in rels_root.findall(
|
||||
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
|
||||
@@ -166,7 +136,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
if "slideLayout" in rel_type:
|
||||
valid_layout_rids.add(rel.get("Id"))
|
||||
|
||||
# Find all sldLayoutId elements in the slide master
|
||||
for sld_layout_id in root.findall(
|
||||
f".//{{{self.PRESENTATIONML_NAMESPACE}}}sldLayoutId"
|
||||
):
|
||||
@@ -201,7 +170,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
return True
|
||||
|
||||
def validate_no_duplicate_slide_layouts(self):
|
||||
"""Validate that each slide has exactly one slideLayout reference."""
|
||||
import lxml.etree
|
||||
|
||||
errors = []
|
||||
@@ -211,7 +179,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
try:
|
||||
root = lxml.etree.parse(str(rels_file)).getroot()
|
||||
|
||||
# Find all slideLayout relationships
|
||||
layout_rels = [
|
||||
rel
|
||||
for rel in root.findall(
|
||||
@@ -241,13 +208,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
return True
|
||||
|
||||
def validate_notes_slide_references(self):
|
||||
"""Validate that each notesSlide file is referenced by only one slide."""
|
||||
import lxml.etree
|
||||
|
||||
errors = []
|
||||
notes_slide_references = {} # Track which slides reference each notesSlide
|
||||
notes_slide_references = {}
|
||||
|
||||
# Find all slide relationship files
|
||||
slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels"))
|
||||
|
||||
if not slide_rels_files:
|
||||
@@ -257,10 +222,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
|
||||
for rels_file in slide_rels_files:
|
||||
try:
|
||||
# Parse the relationships file
|
||||
root = lxml.etree.parse(str(rels_file)).getroot()
|
||||
|
||||
# Find all notesSlide relationships
|
||||
for rel in root.findall(
|
||||
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
|
||||
):
|
||||
@@ -268,13 +231,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
if "notesSlide" in rel_type:
|
||||
target = rel.get("Target", "")
|
||||
if target:
|
||||
# Normalize the target path to handle relative paths
|
||||
normalized_target = target.replace("../", "")
|
||||
|
||||
# Track which slide references this notesSlide
|
||||
slide_name = rels_file.stem.replace(
|
||||
".xml", ""
|
||||
) # e.g., "slide1"
|
||||
)
|
||||
|
||||
if normalized_target not in notes_slide_references:
|
||||
notes_slide_references[normalized_target] = []
|
||||
@@ -287,7 +248,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
|
||||
f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}"
|
||||
)
|
||||
|
||||
# Check for duplicate references
|
||||
for target, references in notes_slide_references.items():
|
||||
if len(references) > 1:
|
||||
slide_names = [ref[0] for ref in references]
|
||||
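The notes-slide check groups slides by the notesSlide target they reference and reports any target with more than one referrer. Below is a minimal sketch of that grouping, assuming the `.rels` contents are already in memory as strings. `find_shared_notes_slides` and its input mapping are hypothetical names, and the standard library parser is used in place of `lxml`.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_shared_notes_slides(rels_by_slide):
    """Map each notesSlide target to its slides, keeping only shared targets.

    rels_by_slide: dict of slide name -> XML text of that slide's .rels file.
    """
    references = defaultdict(list)
    for slide_name, rels_xml in rels_by_slide.items():
        root = ET.fromstring(rels_xml)
        for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
            if "notesSlide" in rel.get("Type", ""):
                # Same normalization as above: drop "../" path segments.
                target = rel.get("Target", "").replace("../", "")
                if target:
                    references[target].append(slide_name)
    return {t: slides for t, slides in references.items() if len(slides) > 1}
```

An empty result means every notes slide is referenced by at most one slide.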
@@ -9,62 +9,56 @@ from pathlib import Path
|
||||
|
||||
|
||||
class RedliningValidator:
|
||||
"""Validator for tracked changes in Word documents."""
|
||||
|
||||
def __init__(self, unpacked_dir, original_docx, verbose=False):
|
||||
def __init__(self, unpacked_dir, original_docx, verbose=False, author="Claude"):
|
||||
self.unpacked_dir = Path(unpacked_dir)
|
||||
self.original_docx = Path(original_docx)
|
||||
self.verbose = verbose
|
||||
self.author = author
|
||||
self.namespaces = {
|
||||
"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
|
||||
}
|
||||
|
||||
def repair(self) -> int:
|
||||
return 0
|
||||
|
||||
def validate(self):
|
||||
"""Main validation method that returns True if valid, False otherwise."""
|
||||
# Verify unpacked directory exists and has correct structure
|
||||
modified_file = self.unpacked_dir / "word" / "document.xml"
|
||||
if not modified_file.exists():
|
||||
print(f"FAILED - Modified document.xml not found at {modified_file}")
|
||||
return False
|
||||
|
||||
# First, check if there are any tracked changes by Scientific-Writer to validate
|
||||
try:
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
tree = ET.parse(modified_file)
|
||||
root = tree.getroot()
|
||||
|
||||
# Check for w:del or w:ins tags authored by Scientific-Writer
|
||||
del_elements = root.findall(".//w:del", self.namespaces)
|
||||
ins_elements = root.findall(".//w:ins", self.namespaces)
|
||||
|
||||
# Filter to only include changes by Scientific-Writer
|
||||
sw_del_elements = [
|
||||
author_del_elements = [
|
||||
elem
|
||||
for elem in del_elements
|
||||
if elem.get(f"{{{self.namespaces['w']}}}author") == "Scientific-Writer"
|
||||
if elem.get(f"{{{self.namespaces['w']}}}author") == self.author
|
||||
]
|
||||
sw_ins_elements = [
|
||||
author_ins_elements = [
|
||||
elem
|
||||
for elem in ins_elements
|
||||
if elem.get(f"{{{self.namespaces['w']}}}author") == "Scientific-Writer"
|
||||
if elem.get(f"{{{self.namespaces['w']}}}author") == self.author
|
||||
]
|
||||
|
||||
# Redlining validation is only needed if tracked changes by Scientific-Writer have been used.
|
||||
if not sw_del_elements and not sw_ins_elements:
|
||||
if not author_del_elements and not author_ins_elements:
|
||||
if self.verbose:
|
||||
print("PASSED - No tracked changes by Scientific-Writer found.")
|
||||
print(f"PASSED - No tracked changes by {self.author} found.")
|
||||
return True
|
||||
|
||||
except Exception:
|
||||
# If we can't parse the XML, continue with full validation
|
||||
pass
|
||||
|
||||
# Create temporary directory for unpacking original docx
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
temp_path = Path(temp_dir)
|
||||
|
||||
# Unpack original docx
|
||||
try:
|
||||
with zipfile.ZipFile(self.original_docx, "r") as zip_ref:
|
||||
zip_ref.extractall(temp_path)
|
||||
@@ -79,7 +73,6 @@ class RedliningValidator:
|
||||
)
|
||||
return False
|
||||
|
||||
# Parse both XML files using xml.etree.ElementTree for redlining validation
|
||||
try:
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
@@ -91,16 +84,13 @@ class RedliningValidator:
|
||||
print(f"FAILED - Error parsing XML files: {e}")
|
||||
return False
|
||||
|
||||
# Remove Scientific-Writer's tracked changes from both documents
|
||||
self._remove_sw_tracked_changes(original_root)
|
||||
self._remove_sw_tracked_changes(modified_root)
|
||||
self._remove_author_tracked_changes(original_root)
|
||||
self._remove_author_tracked_changes(modified_root)
|
||||
|
||||
# Extract and compare text content
|
||||
modified_text = self._extract_text_content(modified_root)
|
||||
original_text = self._extract_text_content(original_root)
|
||||
|
||||
if modified_text != original_text:
|
||||
# Show detailed character-level differences for each paragraph
|
||||
error_message = self._generate_detailed_diff(
|
||||
original_text, modified_text
|
||||
)
|
||||
@@ -108,13 +98,12 @@ class RedliningValidator:
|
||||
return False
|
||||
|
||||
if self.verbose:
|
||||
print("PASSED - All changes by Scientific-Writer are properly tracked")
|
||||
print(f"PASSED - All changes by {self.author} are properly tracked")
|
||||
return True
|
||||
|
||||
def _generate_detailed_diff(self, original_text, modified_text):
|
||||
"""Generate detailed word-level differences using git word diff."""
|
||||
error_parts = [
|
||||
"FAILED - Document text doesn't match after removing Scientific-Writer's tracked changes",
|
||||
f"FAILED - Document text doesn't match after removing {self.author}'s tracked changes",
|
||||
"",
|
||||
"Likely causes:",
|
||||
" 1. Modified text inside another author's <w:ins> or <w:del> tags",
|
||||
@@ -127,7 +116,6 @@ class RedliningValidator:
|
||||
"",
|
||||
]
|
||||
|
||||
# Show git word diff
|
||||
git_diff = self._get_git_word_diff(original_text, modified_text)
|
||||
if git_diff:
|
||||
error_parts.extend(["Differences:", "============", git_diff])
|
||||
@@ -137,26 +125,23 @@ class RedliningValidator:
|
||||
return "\n".join(error_parts)
|
||||
|
||||
def _get_git_word_diff(self, original_text, modified_text):
|
||||
"""Generate word diff using git with character-level precision."""
|
||||
try:
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
temp_path = Path(temp_dir)
|
||||
|
||||
# Create two files
|
||||
original_file = temp_path / "original.txt"
|
||||
modified_file = temp_path / "modified.txt"
|
||||
|
||||
original_file.write_text(original_text, encoding="utf-8")
|
||||
modified_file.write_text(modified_text, encoding="utf-8")
|
||||
|
||||
# Try character-level diff first for precise differences
|
||||
result = subprocess.run(
|
||||
[
|
||||
"git",
|
||||
"diff",
|
||||
"--word-diff=plain",
|
||||
"--word-diff-regex=.", # Character-by-character diff
|
||||
"-U0", # Zero lines of context - show only changed lines
|
||||
"--word-diff-regex=.",
|
||||
"-U0",
|
||||
"--no-index",
|
||||
str(original_file),
|
||||
str(modified_file),
|
||||
@@ -166,9 +151,7 @@ class RedliningValidator:
|
||||
)
|
||||
|
||||
if result.stdout.strip():
|
||||
# Clean up the output - remove git diff header lines
|
||||
lines = result.stdout.split("\n")
|
||||
# Skip the header lines (diff --git, index, +++, ---, @@)
|
||||
content_lines = []
|
||||
in_content = False
|
||||
for line in lines:
|
||||
@@ -181,13 +164,12 @@ class RedliningValidator:
|
||||
if content_lines:
|
||||
return "\n".join(content_lines)
|
||||
|
||||
# Fallback to word-level diff if character-level is too verbose
|
||||
result = subprocess.run(
|
||||
[
|
||||
"git",
|
||||
"diff",
|
||||
"--word-diff=plain",
|
||||
"-U0", # Zero lines of context
|
||||
"-U0",
|
||||
"--no-index",
|
||||
str(original_file),
|
||||
str(modified_file),
|
||||
@@ -209,66 +191,52 @@ class RedliningValidator:
|
||||
return "\n".join(content_lines)
|
||||
|
||||
except Exception:  # also covers subprocess.CalledProcessError and FileNotFoundError
|
||||
# Git not available or other error, return None to use fallback
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
def _remove_sw_tracked_changes(self, root):
|
||||
"""Remove tracked changes authored by Scientific-Writer from the XML root."""
|
||||
def _remove_author_tracked_changes(self, root):
|
||||
ins_tag = f"{{{self.namespaces['w']}}}ins"
|
||||
del_tag = f"{{{self.namespaces['w']}}}del"
|
||||
author_attr = f"{{{self.namespaces['w']}}}author"
|
||||
|
||||
# Remove w:ins elements
|
||||
for parent in root.iter():
|
||||
to_remove = []
|
||||
for child in parent:
|
||||
if child.tag == ins_tag and child.get(author_attr) == "Scientific-Writer":
|
||||
if child.tag == ins_tag and child.get(author_attr) == self.author:
|
||||
to_remove.append(child)
|
||||
for elem in to_remove:
|
||||
parent.remove(elem)
|
||||
|
||||
# Unwrap content in w:del elements where author is "Scientific-Writer"
|
||||
deltext_tag = f"{{{self.namespaces['w']}}}delText"
|
||||
t_tag = f"{{{self.namespaces['w']}}}t"
|
||||
|
||||
for parent in root.iter():
|
||||
to_process = []
|
||||
for child in parent:
|
||||
if child.tag == del_tag and child.get(author_attr) == "Scientific-Writer":
|
||||
if child.tag == del_tag and child.get(author_attr) == self.author:
|
||||
to_process.append((child, list(parent).index(child)))
|
||||
|
||||
# Process in reverse order to maintain indices
|
||||
for del_elem, del_index in reversed(to_process):
|
||||
# Convert w:delText to w:t before moving
|
||||
for elem in del_elem.iter():
|
||||
if elem.tag == deltext_tag:
|
||||
elem.tag = t_tag
|
||||
|
||||
# Move all children of w:del to its parent before removing w:del
|
||||
for child in reversed(list(del_elem)):
|
||||
parent.insert(del_index, child)
|
||||
parent.remove(del_elem)
|
||||
|
||||
def _extract_text_content(self, root):
|
||||
"""Extract text content from Word XML, preserving paragraph structure.
|
||||
|
||||
Empty paragraphs are skipped to avoid false positives when tracked
|
||||
insertions add only structural elements without text content.
|
||||
"""
|
||||
p_tag = f"{{{self.namespaces['w']}}}p"
|
||||
t_tag = f"{{{self.namespaces['w']}}}t"
|
||||
|
||||
paragraphs = []
|
||||
for p_elem in root.findall(f".//{p_tag}"):
|
||||
# Get all text elements within this paragraph
|
||||
text_parts = []
|
||||
for t_elem in p_elem.findall(f".//{t_tag}"):
|
||||
if t_elem.text:
|
||||
text_parts.append(t_elem.text)
|
||||
paragraph_text = "".join(text_parts)
|
||||
# Skip empty paragraphs - they don't affect content validation
|
||||
if paragraph_text:
|
||||
paragraphs.append(paragraph_text)
|
||||
|
||||
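The redlining check rests on one idea: if an author's insertions are dropped and their deletions are restored, the modified document should read exactly like the original. The sketch below shows that idea on in-memory XML strings. The function names and sample documents are illustrative assumptions; the real validator works on unpacked files and prints a diff on failure.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def strip_author_changes(root, author):
    """Undo one author's tracked changes in place."""
    # Drop the author's insertions entirely.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == W + "ins" and child.get(W + "author") == author:
                parent.remove(child)
    # Unwrap the author's deletions so the deleted text counts again.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == W + "del" and child.get(W + "author") == author:
                index = list(parent).index(child)
                for elem in list(child.iter(W + "delText")):
                    elem.tag = W + "t"
                for grandchild in reversed(list(child)):
                    parent.insert(index, grandchild)
                parent.remove(child)


def paragraph_texts(root):
    """Text of each non-empty paragraph, in document order."""
    texts = []
    for p in root.iter(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            texts.append(text)
    return texts


def changes_fully_tracked(original_xml, modified_xml, author):
    """True if the documents match once the author's changes are undone."""
    original_root = ET.fromstring(original_xml)
    modified_root = ET.fromstring(modified_xml)
    strip_author_changes(original_root, author)
    strip_author_changes(modified_root, author)
    return paragraph_texts(original_root) == paragraph_texts(modified_root)


def _doc(body):
    """Wrap paragraph markup in a minimal document element (illustrative)."""
    return f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'


ORIGINAL = _doc("<w:p><w:r><w:t>Hello world</w:t></w:r></w:p>")
TRACKED = _doc(
    "<w:p><w:r><w:t>Hello </w:t></w:r>"
    '<w:del w:author="Claude"><w:r><w:delText>world</w:delText></w:r></w:del>'
    '<w:ins w:author="Claude"><w:r><w:t>there</w:t></w:r></w:ins></w:p>'
)
UNTRACKED = _doc("<w:p><w:r><w:t>Hello there</w:t></w:r></w:p>")
```

With these samples, `TRACKED` passes for the author `"Claude"` while `UNTRACKED` fails, because its edit was made directly in the run text instead of inside `w:ins` and `w:del` wrappers.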
@@ -1,3 +1,3 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||
<?xml version="1.0" ?>
|
||||
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" 
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
|
||||
</w:comments>
|
||||
</w:comments>
|
||||
@@ -1,3 +1,3 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||
<?xml version="1.0" ?>
|
||||
<w15:commentsEx xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" 
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
|
||||
</w15:commentsEx>
|
||||
</w15:commentsEx>
|
||||
@@ -1,3 +1,3 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||
<?xml version="1.0" ?>
|
||||
<w16cex:commentsExtensible xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" 
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:cr="http://schemas.microsoft.com/office/comments/2020/reactions" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl cr w16du wp14">
|
||||
</w16cex:commentsExtensible>
|
||||
</w16cex:commentsExtensible>
|
||||
@@ -1,3 +1,3 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||
<?xml version="1.0" ?>
|
||||
<w16cid:commentsIds xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" 
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
|
||||
</w16cid:commentsIds>
|
||||
</w16cid:commentsIds>
|
||||
@@ -1,3 +1,3 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||
<?xml version="1.0" ?>
|
||||
<w15:people xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml">
|
||||
</w15:people>
|
||||
</w15:people>
|
||||
@@ -1,6 +1,6 @@
|
||||
---
|
||||
name: pdf
|
||||
description: PDF manipulation toolkit. Extract text/tables, create PDFs, merge/split, fill forms, for programmatic document processing and analysis.
|
||||
description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
---
|
||||
|
||||
@@ -8,40 +8,7 @@ license: Proprietary. LICENSE.txt has complete terms
|
||||
|
||||
## Overview
|
||||
|
||||
Extract text/tables, create PDFs, merge/split files, fill forms using Python libraries and command-line tools. Apply this skill for programmatic document processing and analysis. For advanced features or form filling, consult reference.md and forms.md.
|
||||
|
||||
## Visual Enhancement with Scientific Schematics
|
||||
|
||||
**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**
|
||||
|
||||
If your document does not already contain schematics or diagrams:
|
||||
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
|
||||
- Simply describe your desired diagram in natural language
|
||||
- Nano Banana Pro will automatically generate, review, and refine the schematic
|
||||
|
||||
**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.
|
||||
|
||||
**How to generate schematics:**
|
||||
```bash
|
||||
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
|
||||
```
|
||||
|
||||
The AI will automatically:
|
||||
- Create publication-quality images with proper formatting
|
||||
- Review and refine through multiple iterations
|
||||
- Ensure accessibility (colorblind-friendly, high contrast)
|
||||
- Save outputs in the figures/ directory
|
||||
|
||||
**When to add schematics:**
|
||||
- PDF processing workflow diagrams
|
||||
- Document manipulation flowcharts
|
||||
- Form processing visualizations
|
||||
- Data extraction pipeline diagrams
|
||||
- Any complex concept that benefits from visualization
|
||||
|
||||
For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
|
||||
|
||||
---
|
||||
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
|
||||
|
||||
## Quick Start
|
||||
|
||||
@@ -199,6 +166,26 @@ story.append(Paragraph("Content for page 2", styles['Normal']))
|
||||
doc.build(story)
|
||||
```
|
||||
|
||||
#### Subscripts and Superscripts
|
||||
|
||||
**IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.
|
||||
|
||||
Instead, use ReportLab's XML markup tags in Paragraph objects:
|
||||
```python
|
||||
from reportlab.platypus import Paragraph
|
||||
from reportlab.lib.styles import getSampleStyleSheet
|
||||
|
||||
styles = getSampleStyleSheet()
|
||||
|
||||
# Subscripts: use <sub> tag
|
||||
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
|
||||
|
||||
# Superscripts: use <super> tag
|
||||
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
|
||||
```
|
||||
|
||||
For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.
|
||||
|
||||
## Command-Line Tools
|
||||
|
||||
### pdftotext (poppler-utils)
|
||||
@@ -317,12 +304,11 @@ with open("encrypted.pdf", "wb") as output:
|
||||
| Create PDFs | reportlab | Canvas or Platypus |
|
||||
| Command line merge | qpdf | `qpdf --empty --pages ...` |
|
||||
| OCR scanned PDFs | pytesseract | Convert to image first |
|
||||
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |
|
||||
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
|
||||
|
||||
## Next Steps
|
||||
|
||||
- For advanced pypdfium2 usage, see reference.md
|
||||
- For JavaScript libraries (pdf-lib), see reference.md
|
||||
- If you need to fill out a PDF form, follow the instructions in forms.md
|
||||
- For troubleshooting guides, see reference.md
|
||||
|
||||
- For advanced pypdfium2 usage, see REFERENCE.md
|
||||
- For JavaScript libraries (pdf-lib), see REFERENCE.md
|
||||
- If you need to fill out a PDF form, follow the instructions in FORMS.md
|
||||
- For troubleshooting guides, see REFERENCE.md
|
||||
294
scientific-skills/pdf/forms.md
Normal file
@@ -0,0 +1,294 @@
|
||||
**CRITICAL: You MUST complete these steps in order. Do not skip ahead to writing code.**
|
||||
|
||||
If you need to fill out a PDF form, first check to see if the PDF has fillable form fields. Run this script from this file's directory:
|
||||
`python scripts/check_fillable_fields.py <file.pdf>`. Depending on the result, go to either the "Fillable fields" or the "Non-fillable fields" section and follow its instructions.
|
||||
|
||||
# Fillable fields
|
||||
If the PDF has fillable form fields:
|
||||
- Run this script from this file's directory: `python scripts/extract_form_field_info.py <input.pdf> <field_info.json>`. It will create a JSON file with a list of fields in this format:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"rect": ([left, bottom, right, top] bounding box in PDF coordinates, y=0 is the bottom of the page),
|
||||
"type": ("text", "checkbox", "radio_group", or "choice"),
|
||||
},
|
||||
// Checkboxes have "checked_value" and "unchecked_value" properties:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "checkbox",
|
||||
"checked_value": (Set the field to this value to check the checkbox),
|
||||
"unchecked_value": (Set the field to this value to uncheck the checkbox),
|
||||
},
|
||||
// Radio groups have a "radio_options" list with the possible choices.
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "radio_group",
|
||||
"radio_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this radio option),
|
||||
"rect": (bounding box for the radio button for this option)
|
||||
},
|
||||
// Other radio options
|
||||
]
|
||||
},
|
||||
// Multiple choice fields have a "choice_options" list with the possible choices:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "choice",
|
||||
"choice_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this option),
|
||||
"text": (display text of the option)
|
||||
},
|
||||
// Other choice options
|
||||
],
|
||||
}
|
||||
]
|
||||
```
|
||||
- Convert the PDF to PNGs (one image for each page) with this script (run from this file's directory):
|
||||
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
|
||||
Then analyze the images to determine the purpose of each form field (make sure to convert the bounding box PDF coordinates to image coordinates).
|
||||
- Create a `field_values.json` file in this format with the values to be entered for each field:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": "last_name", // Must match the field_id from `extract_form_field_info.py`
|
||||
"description": "The user's last name",
|
||||
"page": 1, // Must match the "page" value in field_info.json
|
||||
"value": "Simpson"
|
||||
},
|
||||
{
|
||||
"field_id": "Checkbox12",
|
||||
"description": "Checkbox to be checked if the user is 18 or over",
|
||||
"page": 1,
|
||||
"value": "/On" // If this is a checkbox, use its "checked_value" value to check it. If it's a radio button group, use one of the "value" values in "radio_options".
|
||||
},
|
||||
// more fields
|
||||
]
|
||||
```
|
||||
- Run the `fill_fillable_fields.py` script from this file's directory to create a filled-in PDF:
|
||||
`python scripts/fill_fillable_fields.py <input pdf> <field_values.json> <output pdf>`
|
||||
This script will verify that the field IDs and values you provide are valid; if it prints error messages, correct the appropriate fields and try again.
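
The image-analysis step above asks for the `rect` values from `field_info.json` to be converted from PDF coordinates to image coordinates. A minimal sketch of that conversion follows. It assumes you already know the page size in points and the rendered image size in pixels; the function name and arguments are hypothetical and not part of the bundled scripts.

```python
def pdf_rect_to_image_box(rect, pdf_width, pdf_height, image_width, image_height):
    """Map [left, bottom, right, top] in PDF points (y=0 at the bottom of the page)
    to (left, top, right, bottom) in image pixels (y=0 at the top of the image)."""
    left, bottom, right, top = rect
    x_scale = image_width / pdf_width
    y_scale = image_height / pdf_height
    return (
        left * x_scale,
        (pdf_height - top) * y_scale,     # flip Y: the PDF top edge becomes the smaller pixel row
        right * x_scale,
        (pdf_height - bottom) * y_scale,
    )


# Example: a US Letter page (612 x 792 points) rendered at 1224 x 1584 pixels.
box = pdf_rect_to_image_box([72, 700, 272, 720], 612, 792, 1224, 1584)
print(box)  # (144.0, 144.0, 544.0, 184.0)
```

The Y flip is the step that is easiest to forget: a field near the top of the page has a large PDF `top` value but a small pixel row.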
|
||||
|
||||
# Non-fillable fields
|
||||
If the PDF doesn't have fillable form fields, you'll add text annotations. First try to extract coordinates from the PDF structure (more accurate), then fall back to visual estimation if needed.
|
||||
|
||||
## Step 1: Try Structure Extraction First
|
||||
|
||||
Run this script to extract text labels, lines, and checkboxes with their exact PDF coordinates:
|
||||
`python scripts/extract_form_structure.py <input.pdf> form_structure.json`
|
||||
|
||||
This creates a JSON file containing:
|
||||
- **labels**: Every text element with exact coordinates (x0, top, x1, bottom in PDF points)
|
||||
- **lines**: Horizontal lines that define row boundaries
|
||||
- **checkboxes**: Small square rectangles that are checkboxes (with center coordinates)
|
||||
- **row_boundaries**: Row top/bottom positions calculated from horizontal lines
|
||||
|
||||
**Check the results**: If `form_structure.json` has meaningful labels (text elements that correspond to form fields), use **Approach A: Structure-Based Coordinates**. If the PDF is scanned/image-based and has few or no labels, use **Approach B: Visual Estimation**.
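
One way to make the "meaningful labels" check concrete is sketched below. It assumes that unmapped glyphs show up as `(cid:N)` placeholder text and that a handful of readable labels is enough to justify Approach A; both the pattern and the `min_labels` threshold are illustrative assumptions, not behavior of the bundled scripts.

```python
import re

# Placeholder text that appears when glyphs cannot be mapped to characters (assumed pattern).
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def usable_labels(structure):
    """Return labels from form_structure.json that contain real, readable text."""
    return [
        label
        for label in structure.get("labels", [])
        if label["text"].strip() and not CID_PATTERN.search(label["text"])
    ]


def choose_approach(structure, min_labels=5):
    """Return "A" (structure-based) or "B" (visual estimation).

    `min_labels` is a hypothetical threshold; tune it for your forms.
    """
    return "A" if len(usable_labels(structure)) >= min_labels else "B"
```

Treat the result as a hint only: still look at the labels yourself to confirm they correspond to form fields.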
|
||||
|
||||
---
|
||||
|
||||
## Approach A: Structure-Based Coordinates (Preferred)
|
||||
|
||||
Use this when `extract_form_structure.py` found text labels in the PDF.
|
||||
|
||||
### A.1: Analyze the Structure
|
||||
|
||||
Read form_structure.json and identify:
|
||||
|
||||
1. **Label groups**: Adjacent text elements that form a single label (e.g., "Last" + "Name")
|
||||
2. **Row structure**: Labels with similar `top` values are in the same row
|
||||
3. **Field columns**: Entry areas start after label ends (x0 = label.x1 + gap)
|
||||
4. **Checkboxes**: Use the checkbox coordinates directly from the structure
|
||||
|
||||
**Coordinate system**: form_structure.json uses pdfplumber-style coordinates in PDF points, where y=0 is at the top of the page and y increases downward.
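
The row-structure rule above (labels with similar `top` values share a row) can be sketched as follows. The `tolerance` of 3 points is an assumption, and the helper name is hypothetical; the label dictionaries use the same keys as `form_structure.json`.

```python
def group_labels_into_rows(labels, tolerance=3.0):
    """Group label dicts whose `top` values lie within `tolerance` points of the
    first label in the row. Rows are ordered top to bottom, labels left to right."""
    rows = []
    for label in sorted(labels, key=lambda l: (l["page"], l["top"], l["x0"])):
        current = rows[-1] if rows else None
        if (
            current
            and current[0]["page"] == label["page"]
            and abs(current[0]["top"] - label["top"]) <= tolerance
        ):
            current.append(label)
        else:
            rows.append([label])
    for row in rows:
        row.sort(key=lambda l: l["x0"])  # restore left-to-right order within each row
    return rows


labels = [
    {"page": 1, "text": "Name", "x0": 70.0, "top": 63.4, "x1": 95.0, "bottom": 73.0},
    {"page": 1, "text": "Last", "x0": 43.0, "top": 63.0, "x1": 65.0, "bottom": 73.0},
    {"page": 1, "text": "Date", "x0": 43.0, "top": 90.0, "x1": 65.0, "bottom": 100.0},
]
rows = group_labels_into_rows(labels)
print([[l["text"] for l in row] for row in rows])  # [['Last', 'Name'], ['Date']]
```

Adjacent labels in the same row, such as "Last" and "Name", can then be merged into a single label group by hand or by checking the horizontal gap between them.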
|
||||
|
||||
### A.2: Check for Missing Elements
|
||||
|
||||
The structure extraction may not detect all form elements. Common cases:
|
||||
- **Circular checkboxes**: Only square rectangles are detected as checkboxes
|
||||
- **Complex graphics**: Decorative elements or non-standard form controls
|
||||
- **Faded or light-colored elements**: May not be extracted
|
||||
|
||||
If you see form fields in the PDF images that aren't in form_structure.json, you'll need to use **visual analysis** for those specific fields (see "Hybrid Approach" below).
|
||||
|
||||
### A.3: Create fields.json with PDF Coordinates
|
||||
|
||||
For each field, calculate entry coordinates from the extracted structure:
|
||||
|
||||
**Text fields:**
|
||||
- entry x0 = label x1 + 5 (small gap after label)
|
||||
- entry x1 = next label's x0, or row boundary
|
||||
- entry top = same as label top
|
||||
- entry bottom = row boundary line below, or label bottom + row_height
|
||||
|
||||
**Checkboxes:**
|
||||
- Use the checkbox rectangle coordinates directly from form_structure.json
|
||||
- entry_bounding_box = [checkbox.x0, checkbox.top, checkbox.x1, checkbox.bottom]
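
The text-field rules above can be written as a small helper. This is a sketch under stated assumptions: the function name, the `row_right` fallback, and the default `row_height` of 16 points are hypothetical, and the inputs are label dictionaries in the `form_structure.json` format.

```python
def text_entry_box(label, next_label=None, row_right=None, row_bottom=None,
                   gap=5, row_height=16):
    """Compute [x0, top, x1, bottom] for the entry area that follows `label`.

    x1 comes from the next label in the row if there is one, else from `row_right`.
    bottom comes from the row boundary below if known, else label bottom + row_height.
    """
    if next_label is None and row_right is None:
        raise ValueError("need either next_label or row_right to bound the entry area")
    x0 = label["x1"] + gap
    x1 = next_label["x0"] if next_label is not None else row_right
    top = label["top"]
    bottom = row_bottom if row_bottom is not None else label["bottom"] + row_height
    return [x0, top, x1, bottom]


label = {"x0": 43, "top": 63, "x1": 87, "bottom": 73}
print(text_entry_box(label, next_label={"x0": 260}, row_bottom=79))  # [92, 63, 260, 79]
```

The result is still in the top-origin coordinates of form_structure.json, so it can be placed in `entry_bounding_box` as is.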
|
||||
|
||||
Create fields.json using `pdf_width` and `pdf_height` (their presence signals that the bounding boxes are in PDF coordinates):
|
||||
```json
|
||||
{
|
||||
"pages": [
|
||||
{"page_number": 1, "pdf_width": 612, "pdf_height": 792}
|
||||
],
|
||||
"form_fields": [
|
||||
{
|
||||
"page_number": 1,
|
||||
"description": "Last name entry field",
|
||||
"field_label": "Last Name",
|
||||
"label_bounding_box": [43, 63, 87, 73],
|
||||
"entry_bounding_box": [92, 63, 260, 79],
|
||||
"entry_text": {"text": "Smith", "font_size": 10}
|
||||
},
|
||||
{
|
||||
"page_number": 1,
|
||||
"description": "US Citizen Yes checkbox",
|
||||
"field_label": "Yes",
|
||||
"label_bounding_box": [260, 200, 280, 210],
|
||||
"entry_bounding_box": [285, 197, 292, 205],
|
||||
"entry_text": {"text": "X"}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Important**: Use `pdf_width`/`pdf_height` and coordinates directly from form_structure.json.
|
||||
|
||||
### A.4: Validate Bounding Boxes
|
||||
|
||||
Before filling, check your bounding boxes for errors:
|
||||
`python scripts/check_bounding_boxes.py fields.json`
|
||||
|
||||
This checks for intersecting bounding boxes and entry boxes that are too small for the font size. Fix any reported errors before filling.
|
||||
|
||||
---
|
||||
|
||||
## Approach B: Visual Estimation (Fallback)
|
||||
|
||||
Use this when the PDF is scanned/image-based and structure extraction found no usable text labels (e.g., all text shows as "(cid:X)" patterns).
|
||||
|
||||
### B.1: Convert PDF to Images
|
||||
|
||||
`python scripts/convert_pdf_to_images.py <input.pdf> <images_dir/>`
|
||||
|
||||
### B.2: Initial Field Identification
|
||||
|
||||
Examine each page image to identify form sections and get **rough estimates** of field locations:
|
||||
- Form field labels and their approximate positions
|
||||
- Entry areas (lines, boxes, or blank spaces for text input)
|
||||
- Checkboxes and their approximate locations
|
||||
|
||||
For each field, note approximate pixel coordinates (they don't need to be precise yet).
|
||||
|
||||
### B.3: Zoom Refinement (CRITICAL for accuracy)
|
||||
|
||||
For each field, crop a region around the estimated position to refine coordinates precisely.
|
||||
|
||||
**Create a zoomed crop using ImageMagick:**
|
||||
```bash
|
||||
magick <page_image> -crop <width>x<height>+<x>+<y> +repage <crop_output.png>
|
||||
```
|
||||
|
||||
Where:
|
||||
- `<x>, <y>` = top-left corner of crop region (use your rough estimate minus padding)
|
||||
- `<width>, <height>` = size of crop region (field area plus ~50px padding on each side)
|
||||
|
||||
**Example:** To refine a "Name" field estimated around (100, 150):
|
||||
```bash
|
||||
magick images_dir/page_1.png -crop 300x80+50+120 +repage crops/name_field.png
|
||||
```
|
||||
|
||||
(Note: if the `magick` command isn't available, try `convert` with the same arguments).
|
||||
|
||||
**Examine the cropped image** to determine precise coordinates:
|
||||
1. Identify the exact pixel where the entry area begins (after the label)
|
||||
2. Identify where the entry area ends (before next field or edge)
|
||||
3. Identify the top and bottom of the entry line/box
|
||||
|
||||
**Convert crop coordinates back to full image coordinates:**
|
||||
- full_x = crop_x + crop_offset_x
|
||||
- full_y = crop_y + crop_offset_y
|
||||
|
||||
Example: If the crop started at (50, 120) and the entry box starts at (52, 18) within the crop:
|
||||
- entry_x0 = 52 + 50 = 102
|
||||
- entry_top = 18 + 120 = 138
|
||||
|
||||
**Repeat for each field**, grouping nearby fields into single crops when possible.
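
The crop arithmetic above can be sketched in Python. The helper names and the uniform 50px padding are assumptions for illustration; the geometry string follows the `<width>x<height>+<x>+<y>` form used in the `magick -crop` commands above.

```python
def crop_geometry(x0, y0, x1, y1, padding=50):
    """Return (geometry, offset_x, offset_y) for a crop around a rough field box.

    The crop is the box (x0, y0)-(x1, y1) grown by `padding` pixels on each side,
    clamped so the top-left corner never goes negative. It is not clamped to the
    image's right or bottom edge.
    """
    offset_x = max(0, x0 - padding)
    offset_y = max(0, y0 - padding)
    width = (x1 + padding) - offset_x
    height = (y1 + padding) - offset_y
    return f"{width}x{height}+{offset_x}+{offset_y}", offset_x, offset_y


def crop_to_full(crop_x, crop_y, offset_x, offset_y):
    """Convert a pixel position measured inside the crop back to full-image coordinates."""
    return crop_x + offset_x, crop_y + offset_y


geometry, off_x, off_y = crop_geometry(100, 150, 300, 170)
print(geometry)                        # 300x120+50+100
print(crop_to_full(52, 18, 50, 120))   # (102, 138)
```

Keeping the offsets returned by `crop_geometry` next to each crop file name makes the back-conversion less error-prone when several crops are in play.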
|
||||
|
||||
### B.4: Create fields.json with Refined Coordinates
|
||||
|
||||
Create fields.json using `image_width` and `image_height` (their presence signals that the bounding boxes are in image pixel coordinates):
|
||||
```json
|
||||
{
|
||||
"pages": [
|
||||
{"page_number": 1, "image_width": 1700, "image_height": 2200}
|
||||
],
|
||||
"form_fields": [
|
||||
{
|
||||
"page_number": 1,
|
||||
"description": "Last name entry field",
|
||||
"field_label": "Last Name",
|
||||
"label_bounding_box": [120, 175, 242, 198],
|
||||
"entry_bounding_box": [255, 175, 720, 218],
|
||||
"entry_text": {"text": "Smith", "font_size": 10}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Important**: Use `image_width`/`image_height` and the refined pixel coordinates from the zoom analysis.
|
||||
|
||||
### B.5: Validate Bounding Boxes
|
||||
|
||||
Before filling, check your bounding boxes for errors:
|
||||
`python scripts/check_bounding_boxes.py fields.json`
|
||||
|
||||
This checks for intersecting bounding boxes and entry boxes that are too small for the font size. Fix any reported errors before filling.
|
||||
|
||||
---
|
||||
|
||||
## Hybrid Approach: Structure + Visual
|
||||
|
||||
Use this when structure extraction works for most fields but misses some elements (e.g., circular checkboxes, unusual form controls).
|
||||
|
||||
1. **Use Approach A** for fields that were detected in form_structure.json
|
||||
2. **Convert PDF to images** for visual analysis of missing fields
|
||||
3. **Use zoom refinement** (from Approach B) for the missing fields
|
||||
4. **Combine coordinates**: For fields from structure extraction, use `pdf_width`/`pdf_height`. For visually-estimated fields, you must convert image coordinates to PDF coordinates:
|
||||
- pdf_x = image_x * (pdf_width / image_width)
|
||||
- pdf_y = image_y * (pdf_height / image_height)
|
||||
5. **Use a single coordinate system** in fields.json - convert all to PDF coordinates with `pdf_width`/`pdf_height`
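
The scaling in step 4 can be applied to a whole bounding box at once. The sketch below assumes the target is the top-origin PDF coordinate system used by form_structure.json, so only scaling is needed and no Y flip; the function name and the rounding to one decimal are illustrative choices.

```python
def image_box_to_pdf_box(box, image_width, image_height, pdf_width, pdf_height):
    """Scale [x0, top, x1, bottom] from image pixels to PDF points.

    Both systems here have y=0 at the top, so the box is scaled but not flipped.
    """
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height
    x0, top, x1, bottom = box
    return [
        round(x0 * x_scale, 1),
        round(top * y_scale, 1),
        round(x1 * x_scale, 1),
        round(bottom * y_scale, 1),
    ]


# A 1700 x 2200 pixel page image of a 612 x 792 point page.
print(image_box_to_pdf_box([255, 175, 720, 218], 1700, 2200, 612, 792))
# [91.8, 63.0, 259.2, 78.5]
```

Apply the same conversion to both `label_bounding_box` and `entry_bounding_box` so every field in fields.json ends up in PDF coordinates.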
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Validate Before Filling
|
||||
|
||||
**Always validate bounding boxes before filling:**
|
||||
`python scripts/check_bounding_boxes.py fields.json`
|
||||
|
||||
This checks for:
|
||||
- Intersecting bounding boxes (which would cause overlapping text)
|
||||
- Entry boxes that are too small for the specified font size
|
||||
|
||||
Fix any reported errors in fields.json before proceeding.
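
To show what the two checks amount to, here is a minimal sketch. It is not the code of `check_bounding_boxes.py`: the function names, the message wording, and the default font size of 14 are assumptions, and the height check simply compares box height with font size in the box's own units.

```python
from itertools import combinations


def rects_intersect(a, b):
    """True if boxes [x0, top, x1, bottom] overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_box_problems(form_fields, default_font_size=14):
    """Return human-readable messages for overlapping boxes and undersized entry boxes."""
    problems = []
    boxes = []
    for field in form_fields:
        for kind in ("label", "entry"):
            boxes.append((field, kind, field[f"{kind}_bounding_box"]))

    # Pairwise overlap check, restricted to boxes on the same page.
    for (f1, k1, r1), (f2, k2, r2) in combinations(boxes, 2):
        if f1["page_number"] == f2["page_number"] and rects_intersect(r1, r2):
            problems.append(
                f"{k1} box of '{f1['description']}' intersects {k2} box of '{f2['description']}'"
            )

    # Height check: the entry box should be at least as tall as the text.
    for field in form_fields:
        font_size = field.get("entry_text", {}).get("font_size", default_font_size)
        _, top, _, bottom = field["entry_bounding_box"]
        if bottom - top < font_size:
            problems.append(
                f"entry box of '{field['description']}' is {bottom - top} high, "
                f"too small for font size {font_size}"
            )
    return problems
```

Boxes that only touch along an edge are not counted as intersecting in this sketch, which lets an entry box start exactly where a neighbouring box ends.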
|
||||
|
||||
## Step 3: Fill the Form
|
||||
|
||||
The fill script auto-detects the coordinate system and handles conversion:
|
||||
`python scripts/fill_pdf_form_with_annotations.py <input.pdf> fields.json <output.pdf>`
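
A plausible reading of "auto-detects the coordinate system" is sketched below: the keys on each `pages` entry decide whether a box is scaled from image pixels or taken as PDF points, and in both cases the top-origin box is flipped into the bottom-left-origin rectangle that PDF annotations expect. This is an assumption about the behavior, not the script's actual code, and `to_annotation_rect` is a hypothetical name.

```python
def to_annotation_rect(bbox, page_info, pdf_width, pdf_height):
    """Convert a top-origin [x0, top, x1, bottom] box from fields.json into
    (left, bottom, right, top) with y=0 at the bottom of the page.

    `page_info` is the matching entry from the "pages" list in fields.json;
    `pdf_width` and `pdf_height` are the real page size in points.
    """
    if "pdf_width" in page_info:
        x_scale = y_scale = 1.0          # already in PDF points
    elif "image_width" in page_info:
        x_scale = pdf_width / page_info["image_width"]
        y_scale = pdf_height / page_info["image_height"]
    else:
        raise ValueError("page entry needs pdf_width/pdf_height or image_width/image_height")

    x0, top, x1, bottom = bbox
    return (
        x0 * x_scale,
        pdf_height - bottom * y_scale,   # flip Y for the bottom-left origin
        x1 * x_scale,
        pdf_height - top * y_scale,
    )


page = {"page_number": 1, "pdf_width": 612, "pdf_height": 792}
print(to_annotation_rect([92, 63, 260, 79], page, 612, 792))  # (92.0, 713.0, 260.0, 729.0)
```

This is why fields.json should stick to one coordinate system per page: mixing pixel boxes into a page declared with `pdf_width` would skip the scaling step.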
|
||||
|
||||
## Step 4: Verify Output
|
||||
|
||||
Convert the filled PDF to images and verify text placement:
|
||||
`python scripts/convert_pdf_to_images.py <output.pdf> <verify_images/>`
|
||||
|
||||
If text is mispositioned:
|
||||
- **Approach A**: Check that you're using PDF coordinates from form_structure.json with `pdf_width`/`pdf_height`
|
||||
- **Approach B**: Check that `image_width`/`image_height` match the actual dimensions of the page images and that the coordinates are accurate pixel positions
|
||||
- **Hybrid**: Ensure coordinate conversions are correct for visually-estimated fields
|
||||
@@ -3,8 +3,6 @@ import json
|
||||
import sys
|
||||
|
||||
|
||||
# Script to check that the `fields.json` file that Claude creates when analyzing PDFs
|
||||
# does not have overlapping bounding boxes. See forms.md.
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -14,7 +12,6 @@ class RectAndField:
|
||||
field: dict
|
||||
|
||||
|
||||
# Returns a list of messages that are printed to stdout for Claude to read.
|
||||
def get_bounding_box_messages(fields_json_stream) -> list[str]:
|
||||
messages = []
|
||||
fields = json.load(fields_json_stream)
|
||||
@@ -32,7 +29,6 @@ def get_bounding_box_messages(fields_json_stream) -> list[str]:
|
||||
|
||||
has_error = False
|
||||
for i, ri in enumerate(rects_and_fields):
|
||||
# This is O(N^2); we can optimize if it becomes a problem.
|
||||
for j in range(i + 1, len(rects_and_fields)):
|
||||
rj = rects_and_fields[j]
|
||||
if ri.field["page_number"] == rj.field["page_number"] and rects_intersect(ri.rect, rj.rect):
|
||||
@@ -63,7 +59,6 @@ if __name__ == "__main__":
|
||||
if len(sys.argv) != 2:
|
||||
print("Usage: check_bounding_boxes.py [fields.json]")
|
||||
sys.exit(1)
|
||||
# Input file should be in the `fields.json` format described in forms.md.
|
||||
with open(sys.argv[1]) as f:
|
||||
messages = get_bounding_box_messages(f)
|
||||
for msg in messages:
|
||||
@@ -2,7 +2,6 @@ import sys
|
||||
from pypdf import PdfReader
|
||||
|
||||
|
||||
# Script for Claude to run to determine whether a PDF has fillable form fields. See forms.md.
|
||||
|
||||
|
||||
reader = PdfReader(sys.argv[1])
|
||||
@@ -4,14 +4,12 @@ import sys
|
||||
from pdf2image import convert_from_path
|
||||
|
||||
|
||||
# Converts each page of a PDF to a PNG image.
|
||||
|
||||
|
||||
def convert(pdf_path, output_dir, max_dim=1000):
|
||||
images = convert_from_path(pdf_path, dpi=200)
|
||||
|
||||
for i, image in enumerate(images):
|
||||
# Scale image if needed to keep width/height under `max_dim`
|
||||
width, height = image.size
|
||||
if width > max_dim or height > max_dim:
|
||||
scale_factor = min(max_dim / width, max_dim / height)
|
||||
@@ -4,12 +4,9 @@ import sys
|
||||
from PIL import Image, ImageDraw
|
||||
|
||||
|
||||
# Creates "validation" images with rectangles for the bounding box information that
|
||||
# Claude creates when determining where to add text annotations in PDFs. See forms.md.
|
||||
|
||||
|
||||
def create_validation_image(page_number, fields_json_path, input_path, output_path):
|
||||
# Input file should be in the `fields.json` format described in forms.md.
|
||||
with open(fields_json_path, 'r') as f:
|
||||
data = json.load(f)
|
||||
|
||||
@@ -21,7 +18,6 @@ def create_validation_image(page_number, fields_json_path, input_path, output_pa
|
||||
if field["page_number"] == page_number:
|
||||
entry_box = field['entry_bounding_box']
|
||||
label_box = field['label_bounding_box']
|
||||
# Draw red rectangle over entry bounding box and blue rectangle over the label.
|
||||
draw.rectangle(entry_box, outline='red', width=2)
|
||||
draw.rectangle(label_box, outline='blue', width=2)
|
||||
num_boxes += 2
|
||||
@@ -4,11 +4,8 @@ import sys
|
||||
from pypdf import PdfReader
|
||||
|
||||
|
||||
# Extracts data for the fillable form fields in a PDF and outputs JSON that
|
||||
# Claude uses to fill the fields. See forms.md.
|
||||
|
||||
|
||||
# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods.
|
||||
def get_full_annotation_field_id(annotation):
|
||||
components = []
|
||||
while annotation:
|
||||
@@ -25,12 +22,9 @@ def make_field_dict(field, field_id):
|
||||
if ft == "/Tx":
|
||||
field_dict["type"] = "text"
|
||||
elif ft == "/Btn":
|
||||
field_dict["type"] = "checkbox" # radio groups handled separately
|
||||
field_dict["type"] = "checkbox"
|
||||
states = field.get("/_States_", [])
|
||||
if len(states) == 2:
|
||||
# "/Off" seems to always be the unchecked value, as suggested by
|
||||
# https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
|
||||
# It can be either first or second in the "/_States_" list.
|
||||
if "/Off" in states:
|
||||
field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
|
||||
field_dict["unchecked_value"] = "/Off"
|
||||
@@ -50,15 +44,6 @@ def make_field_dict(field, field_id):
|
||||
return field_dict
|
||||
|
||||
|
||||
# Returns a list of fillable PDF fields:
|
||||
# [
|
||||
# {
|
||||
# "field_id": "name",
|
||||
# "page": 1,
|
||||
# "type": ("text", "checkbox", "radio_group", or "choice")
|
||||
# // Per-type additional fields described in forms.md
|
||||
# },
|
||||
# ]
|
||||
def get_field_info(reader: PdfReader):
|
||||
fields = reader.get_fields()
|
||||
|
||||
@@ -66,19 +51,13 @@ def get_field_info(reader: PdfReader):
|
||||
possible_radio_names = set()
|
||||
|
||||
for field_id, field in fields.items():
|
||||
# Skip if this is a container field with children, except that it might be
|
||||
# a parent group for radio button options.
|
||||
if field.get("/Kids"):
|
||||
if field.get("/FT") == "/Btn":
|
||||
possible_radio_names.add(field_id)
|
||||
continue
|
||||
field_info_by_id[field_id] = make_field_dict(field, field_id)
|
||||
|
||||
# Bounding rects are stored in annotations in page objects.
|
||||
|
||||
# Radio button options have a separate annotation for each choice;
|
||||
# all choices have the same field name.
|
||||
# See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
|
||||
radio_fields_by_id = {}
|
||||
|
||||
for page_index, page in enumerate(reader.pages):
|
||||
@@ -90,8 +69,6 @@ def get_field_info(reader: PdfReader):
|
||||
field_info_by_id[field_id]["rect"] = ann.get('/Rect')
|
||||
elif field_id in possible_radio_names:
|
||||
try:
|
||||
# ann['/AP']['/N'] should have two items. One of them is '/Off',
|
||||
# the other is the active value.
|
||||
on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
|
||||
except KeyError:
|
||||
continue
|
||||
@@ -104,17 +81,11 @@ def get_field_info(reader: PdfReader):
|
||||
"page": page_index + 1,
|
||||
"radio_options": [],
|
||||
}
|
||||
# Note: at least on macOS 15.7, Preview.app doesn't show selected
|
||||
# radio buttons correctly. (It does if you remove the leading slash
|
||||
# from the value, but that causes them not to appear correctly in
|
||||
# Chrome/Firefox/Acrobat/etc).
|
||||
radio_fields_by_id[field_id]["radio_options"].append({
|
||||
"value": on_values[0],
|
||||
"rect": rect,
|
||||
})
|
||||
|
||||
# Some PDFs have form field definitions without corresponding annotations,
|
||||
# so we can't tell where they are. Ignore these fields for now.
|
||||
fields_with_location = []
|
||||
for field_info in field_info_by_id.values():
|
||||
if "page" in field_info:
|
||||
@@ -122,7 +93,6 @@ def get_field_info(reader: PdfReader):
|
||||
else:
|
||||
print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")
|
||||
|
||||
# Sort by page number, then Y position (flipped in PDF coordinate system), then X.
|
||||
def sort_key(f):
|
||||
if "radio_options" in f:
|
||||
rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
|
||||
115
scientific-skills/pdf/scripts/extract_form_structure.py
Executable file
@@ -0,0 +1,115 @@
|
||||
"""
|
||||
Extract form structure from a non-fillable PDF.
|
||||
|
||||
This script analyzes the PDF to find:
|
||||
- Text labels with their exact coordinates
|
||||
- Horizontal lines (row boundaries)
|
||||
- Checkboxes (small rectangles)
|
||||
|
||||
Output: A JSON file with the form structure that can be used to generate
|
||||
accurate field coordinates for filling.
|
||||
|
||||
Usage: python extract_form_structure.py <input.pdf> <output.json>
|
||||
"""
|
||||
|
||||
import json
|
||||
import sys
|
||||
import pdfplumber
|
||||
|
||||
|
||||
def extract_form_structure(pdf_path):
|
||||
structure = {
|
||||
"pages": [],
|
||||
"labels": [],
|
||||
"lines": [],
|
||||
"checkboxes": [],
|
||||
"row_boundaries": []
|
||||
}
|
||||
|
||||
with pdfplumber.open(pdf_path) as pdf:
|
||||
for page_num, page in enumerate(pdf.pages, 1):
|
||||
structure["pages"].append({
|
||||
"page_number": page_num,
|
||||
"width": float(page.width),
|
||||
"height": float(page.height)
|
||||
})
|
||||
|
||||
words = page.extract_words()
|
||||
for word in words:
|
||||
structure["labels"].append({
|
||||
"page": page_num,
|
||||
"text": word["text"],
|
||||
"x0": round(float(word["x0"]), 1),
|
||||
"top": round(float(word["top"]), 1),
|
||||
"x1": round(float(word["x1"]), 1),
|
||||
"bottom": round(float(word["bottom"]), 1)
|
||||
})
|
||||
|
||||
for line in page.lines:
|
||||
if abs(float(line["x1"]) - float(line["x0"])) > page.width * 0.5:
|
||||
structure["lines"].append({
|
||||
"page": page_num,
|
||||
"y": round(float(line["top"]), 1),
|
||||
"x0": round(float(line["x0"]), 1),
|
||||
"x1": round(float(line["x1"]), 1)
|
||||
})
|
||||
|
||||
for rect in page.rects:
|
||||
width = float(rect["x1"]) - float(rect["x0"])
|
||||
height = float(rect["bottom"]) - float(rect["top"])
|
||||
if 5 <= width <= 15 and 5 <= height <= 15 and abs(width - height) < 2:
|
||||
structure["checkboxes"].append({
|
||||
"page": page_num,
|
||||
"x0": round(float(rect["x0"]), 1),
|
||||
"top": round(float(rect["top"]), 1),
|
||||
"x1": round(float(rect["x1"]), 1),
|
||||
"bottom": round(float(rect["bottom"]), 1),
|
||||
"center_x": round((float(rect["x0"]) + float(rect["x1"])) / 2, 1),
|
||||
"center_y": round((float(rect["top"]) + float(rect["bottom"])) / 2, 1)
|
||||
})
|
||||
|
||||
lines_by_page = {}
|
||||
for line in structure["lines"]:
|
||||
page = line["page"]
|
||||
if page not in lines_by_page:
|
||||
lines_by_page[page] = []
|
||||
lines_by_page[page].append(line["y"])
|
||||
|
||||
for page, y_coords in lines_by_page.items():
|
||||
y_coords = sorted(set(y_coords))
|
||||
for i in range(len(y_coords) - 1):
|
||||
structure["row_boundaries"].append({
|
||||
"page": page,
|
||||
"row_top": y_coords[i],
|
||||
"row_bottom": y_coords[i + 1],
|
||||
"row_height": round(y_coords[i + 1] - y_coords[i], 1)
|
||||
})
|
||||
|
||||
return structure
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) != 3:
|
||||
print("Usage: extract_form_structure.py <input.pdf> <output.json>")
|
||||
sys.exit(1)
|
||||
|
||||
pdf_path = sys.argv[1]
|
||||
output_path = sys.argv[2]
|
||||
|
||||
print(f"Extracting structure from {pdf_path}...")
|
||||
structure = extract_form_structure(pdf_path)
|
||||
|
||||
with open(output_path, "w") as f:
|
||||
json.dump(structure, f, indent=2)
|
||||
|
||||
print(f"Found:")
|
||||
print(f" - {len(structure['pages'])} pages")
|
||||
print(f" - {len(structure['labels'])} text labels")
|
||||
print(f" - {len(structure['lines'])} horizontal lines")
|
||||
print(f" - {len(structure['checkboxes'])} checkboxes")
|
||||
print(f" - {len(structure['row_boundaries'])} row boundaries")
|
||||
print(f"Saved to {output_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -6,13 +6,11 @@ from pypdf import PdfReader, PdfWriter
|
||||
from extract_form_field_info import get_field_info
|
||||
|
||||
|
||||
# Fills fillable form fields in a PDF. See forms.md.
|
||||
|
||||
|
||||
def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
|
||||
with open(fields_json_path) as f:
|
||||
fields = json.load(f)
|
||||
# Group by page number.
|
||||
fields_by_page = {}
|
||||
for field in fields:
|
||||
if "value" in field:
|
||||
@@ -48,8 +46,6 @@ def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path:
|
||||
for page, field_values in fields_by_page.items():
|
||||
writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)
|
||||
|
||||
# This seems to be necessary for many PDF viewers to format the form values correctly.
|
||||
# It may cause the viewer to show a "save changes" dialog even if the user doesn't make any changes.
|
||||
writer.set_need_appearances_writer(True)
|
||||
|
||||
with open(output_pdf_path, "wb") as f:
|
||||
@@ -75,18 +71,6 @@ def validation_error_for_field_value(field_info, field_value):
|
||||
return None
|
||||
|
||||
|
||||
# pypdf (at least version 5.7.0) has a bug when setting the value for a selection list field.
|
||||
# In _writer.py around line 966:
|
||||
#
|
||||
# if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0:
|
||||
# txt = "\n".join(annotation.get_inherited(FA.Opt, []))
|
||||
#
|
||||
# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like
|
||||
# [["value1", "Text 1"], ["value2", "Text 2"], ...]
|
||||
# This causes `join` to throw a TypeError because it expects an iterable of strings.
|
||||
# The horrible workaround is to patch `get_inherited` to return a list of the value strings.
|
||||
# We call the original method and adjust the return value only if the argument to `get_inherited`
|
||||
# is `FA.Opt` and if the return value is a list of two-element lists.
|
||||
def monkeypatch_pydpf_method():
|
||||
from pypdf.generic import DictionaryObject
|
||||
from pypdf.constants import FieldDictionaryAttributes
|
||||
@@ -5,64 +5,67 @@ from pypdf import PdfReader, PdfWriter
|
||||
from pypdf.annotations import FreeText
|
||||
|
||||
|
||||
# Fills a PDF by adding text annotations defined in `fields.json`. See forms.md.
|
||||
|
||||
|
||||
def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height):
|
||||
"""Transform bounding box from image coordinates to PDF coordinates"""
|
||||
# Image coordinates: origin at top-left, y increases downward
|
||||
# PDF coordinates: origin at bottom-left, y increases upward
|
||||
def transform_from_image_coords(bbox, image_width, image_height, pdf_width, pdf_height):
|
||||
x_scale = pdf_width / image_width
|
||||
y_scale = pdf_height / image_height
|
||||
|
||||
|
||||
left = bbox[0] * x_scale
|
||||
right = bbox[2] * x_scale
|
||||
|
||||
# Flip Y coordinates for PDF
|
||||
|
||||
top = pdf_height - (bbox[1] * y_scale)
|
||||
bottom = pdf_height - (bbox[3] * y_scale)
|
||||
|
||||
|
||||
return left, bottom, right, top
|
||||
|
||||
|
||||
def transform_from_pdf_coords(bbox, pdf_height):
|
||||
left = bbox[0]
|
||||
right = bbox[2]
|
||||
|
||||
pypdf_top = pdf_height - bbox[1]
|
||||
pypdf_bottom = pdf_height - bbox[3]
|
||||
|
||||
return left, pypdf_bottom, right, pypdf_top
|
||||
|
||||
|
||||
def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
|
||||
"""Fill the PDF form with data from fields.json"""
|
||||
|
||||
# `fields.json` format described in forms.md.
|
||||
with open(fields_json_path, "r") as f:
|
||||
fields_data = json.load(f)
|
||||
|
||||
# Open the PDF
|
||||
reader = PdfReader(input_pdf_path)
|
||||
writer = PdfWriter()
|
||||
|
||||
# Copy all pages to writer
|
||||
writer.append(reader)
|
||||
|
||||
# Get PDF dimensions for each page
|
||||
pdf_dimensions = {}
|
||||
for i, page in enumerate(reader.pages):
|
||||
mediabox = page.mediabox
|
||||
pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]
|
||||
|
||||
# Process each form field
|
||||
annotations = []
|
||||
for field in fields_data["form_fields"]:
|
||||
page_num = field["page_number"]
|
||||
|
||||
# Get page dimensions and transform coordinates.
|
||||
|
||||
page_info = next(p for p in fields_data["pages"] if p["page_number"] == page_num)
|
||||
image_width = page_info["image_width"]
|
||||
image_height = page_info["image_height"]
|
||||
pdf_width, pdf_height = pdf_dimensions[page_num]
|
||||
|
||||
if "pdf_width" in page_info:
|
||||
transformed_entry_box = transform_from_pdf_coords(
|
||||
field["entry_bounding_box"],
|
||||
float(pdf_height)
|
||||
)
|
||||
else:
|
||||
image_width = page_info["image_width"]
|
||||
image_height = page_info["image_height"]
|
||||
transformed_entry_box = transform_from_image_coords(
|
||||
field["entry_bounding_box"],
|
||||
image_width, image_height,
|
||||
float(pdf_width), float(pdf_height)
|
||||
)
|
||||
|
||||
transformed_entry_box = transform_coordinates(
|
||||
field["entry_bounding_box"],
|
||||
image_width, image_height,
|
||||
pdf_width, pdf_height
|
||||
)
|
||||
|
||||
# Skip empty fields
|
||||
if "entry_text" not in field or "text" not in field["entry_text"]:
|
||||
continue
|
||||
entry_text = field["entry_text"]
|
||||
@@ -74,8 +77,6 @@ def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
|
||||
font_size = str(entry_text.get("font_size", 14)) + "pt"
|
||||
font_color = entry_text.get("font_color", "000000")
|
||||
|
||||
# Font size/color seems to not work reliably across viewers:
|
||||
# https://github.com/py-pdf/pypdf/issues/2084
|
||||
annotation = FreeText(
|
||||
text=text,
|
||||
rect=transformed_entry_box,
|
||||
@@ -86,10 +87,8 @@ def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
|
||||
background_color=None,
|
||||
)
|
||||
annotations.append(annotation)
|
||||
# page_number is 0-based for pypdf
|
||||
writer.add_annotation(page_number=page_num - 1, annotation=annotation)
|
||||
|
||||
# Save the filled PDF
|
||||
with open(output_pdf_path, "wb") as output:
|
||||
writer.write(output)
|
||||
|
||||
@@ -105,4 +104,4 @@ if __name__ == "__main__":
|
||||
fields_json = sys.argv[2]
|
||||
output_pdf = sys.argv[3]
|
||||
|
||||
fill_pdf_form(input_pdf, fields_json, output_pdf)
|
||||
fill_pdf_form(input_pdf, fields_json, output_pdf)
|
||||
232
scientific-skills/pptx/SKILL.md
Normal file
@@ -0,0 +1,232 @@
|
||||
---
|
||||
name: pptx
|
||||
description: "Use this skill any time a .pptx file is involved in any way — as input, output, or both. This includes: creating slide decks, pitch decks, or presentations; reading, parsing, or extracting text from any .pptx file (even if the extracted content will be used elsewhere, like in an email or summary); editing, modifying, or updating existing presentations; combining or splitting slide files; working with templates, layouts, speaker notes, or comments. Trigger whenever the user mentions \"deck,\" \"slides,\" \"presentation,\" or references a .pptx filename, regardless of what they plan to do with the content afterward. If a .pptx file needs to be opened, created, or touched, use this skill."
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
---
|
||||
|
||||
# PPTX Skill
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Task | Guide |
|
||||
|------|-------|
|
||||
| Read/analyze content | `python -m markitdown presentation.pptx` |
|
||||
| Edit or create from template | Read [editing.md](editing.md) |
|
||||
| Create from scratch | Read [pptxgenjs.md](pptxgenjs.md) |
|
||||
|
||||
---
|
||||
|
||||
## Reading Content
|
||||
|
||||
```bash
|
||||
# Text extraction
|
||||
python -m markitdown presentation.pptx
|
||||
|
||||
# Visual overview
|
||||
python scripts/thumbnail.py presentation.pptx
|
||||
|
||||
# Raw XML
|
||||
python scripts/office/unpack.py presentation.pptx unpacked/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Editing Workflow
|
||||
|
||||
**Read [editing.md](editing.md) for full details.**
|
||||
|
||||
1. Analyze template with `thumbnail.py`
|
||||
2. Unpack → manipulate slides → edit content → clean → pack
|
||||
|
||||
---
|
||||
|
||||
## Creating from Scratch
|
||||
|
||||
**Read [pptxgenjs.md](pptxgenjs.md) for full details.**
|
||||
|
||||
Use when no template or reference presentation is available.
|
||||
|
||||
---
|
||||
|
||||
## Design Ideas
|
||||
|
||||
**Don't create boring slides.** Plain bullets on a white background won't impress anyone. Consider ideas from this list for each slide.
|
||||
|
||||
### Before Starting
|
||||
|
||||
- **Pick a bold, content-informed color palette**: The palette should feel designed for THIS topic. If swapping your colors into a completely different presentation would still "work," you haven't made specific enough choices.
|
||||
- **Dominance over equality**: One color should dominate (60-70% visual weight), with 1-2 supporting tones and one sharp accent. Never give all colors equal weight.
|
||||
- **Dark/light contrast**: Dark backgrounds for title + conclusion slides, light for content ("sandwich" structure). Or commit to dark throughout for a premium feel.
|
||||
- **Commit to a visual motif**: Pick ONE distinctive element and repeat it — rounded image frames, icons in colored circles, thick single-side borders. Carry it across every slide.
|
||||
|
||||
### Color Palettes
|
||||
|
||||
Choose colors that match your topic — don't default to generic blue. Use these palettes as inspiration:
|
||||
|
||||
| Theme | Primary | Secondary | Accent |
|
||||
|-------|---------|-----------|--------|
|
||||
| **Midnight Executive** | `1E2761` (navy) | `CADCFC` (ice blue) | `FFFFFF` (white) |
|
||||
| **Forest & Moss** | `2C5F2D` (forest) | `97BC62` (moss) | `F5F5F5` (cream) |
|
||||
| **Coral Energy** | `F96167` (coral) | `F9E795` (gold) | `2F3C7E` (navy) |
|
||||
| **Warm Terracotta** | `B85042` (terracotta) | `E7E8D1` (sand) | `A7BEAE` (sage) |
|
||||
| **Ocean Gradient** | `065A82` (deep blue) | `1C7293` (teal) | `21295C` (midnight) |
|
||||
| **Charcoal Minimal** | `36454F` (charcoal) | `F2F2F2` (off-white) | `212121` (black) |
|
||||
| **Teal Trust** | `028090` (teal) | `00A896` (seafoam) | `02C39A` (mint) |
|
||||
| **Berry & Cream** | `6D2E46` (berry) | `A26769` (dusty rose) | `ECE2D0` (cream) |
|
||||
| **Sage Calm** | `84B59F` (sage) | `69A297` (eucalyptus) | `50808E` (slate) |
|
||||
| **Cherry Bold** | `990011` (cherry) | `FCF6F5` (off-white) | `2F3C7E` (navy) |
|
||||
|
||||
### For Each Slide
|
||||
|
||||
**Every slide needs a visual element** — image, chart, icon, or shape. Text-only slides are forgettable.
|
||||
|
||||
**Layout options:**
|
||||
- Two-column (text left, illustration on right)
|
||||
- Icon + text rows (icon in colored circle, bold header, description below)
|
||||
- 2x2 or 2x3 grid (image on one side, grid of content blocks on other)
|
||||
- Half-bleed image (full left or right side) with content overlay
|
||||
|
||||
**Data display:**
|
||||
- Large stat callouts (big numbers 60-72pt with small labels below)
|
||||
- Comparison columns (before/after, pros/cons, side-by-side options)
|
||||
- Timeline or process flow (numbered steps, arrows)
|
||||
|
||||
**Visual polish:**
|
||||
- Icons in small colored circles next to section headers
|
||||
- Italic accent text for key stats or taglines
|
||||
|
||||
### Typography
|
||||
|
||||
**Choose an interesting font pairing** — don't default to Arial. Pick a header font with personality and pair it with a clean body font.
|
||||
|
||||
| Header Font | Body Font |
|
||||
|-------------|-----------|
|
||||
| Georgia | Calibri |
|
||||
| Arial Black | Arial |
|
||||
| Calibri | Calibri Light |
|
||||
| Cambria | Calibri |
|
||||
| Trebuchet MS | Calibri |
|
||||
| Impact | Arial |
|
||||
| Palatino | Garamond |
|
||||
| Consolas | Calibri |
|
||||
|
||||
| Element | Size |
|
||||
|---------|------|
|
||||
| Slide title | 36-44pt bold |
|
||||
| Section header | 20-24pt bold |
|
||||
| Body text | 14-16pt |
|
||||
| Captions | 10-12pt muted |
|
||||
|
||||
### Spacing
|
||||
|
||||
- 0.5" minimum margins
|
||||
- 0.3-0.5" between content blocks
|
||||
- Leave breathing room—don't fill every inch
|
||||
|
||||
### Avoid (Common Mistakes)
|
||||
|
||||
- **Don't repeat the same layout** — vary columns, cards, and callouts across slides
|
||||
- **Don't center body text** — left-align paragraphs and lists; center only titles
|
||||
- **Don't skimp on size contrast** — titles need 36pt+ to stand out from 14-16pt body
|
||||
- **Don't default to blue** — pick colors that reflect the specific topic
|
||||
- **Don't mix spacing randomly** — choose 0.3" or 0.5" gaps and use consistently
|
||||
- **Don't style one slide and leave the rest plain** — commit fully or keep it simple throughout
|
||||
- **Don't create text-only slides** — add images, icons, charts, or visual elements; avoid plain title + bullets
|
||||
- **Don't forget text box padding** — when aligning lines or shapes with text edges, set `margin: 0` on the text box or offset the shape to account for padding
|
||||
- **Don't use low-contrast elements** — icons AND text need strong contrast against the background; avoid light text on light backgrounds or dark text on dark backgrounds
|
||||
- **NEVER use accent lines under titles** — these are a hallmark of AI-generated slides; use whitespace or background color instead
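
The low-contrast rule above can be checked numerically before rendering. The sketch below is not part of this skill: it assumes the WCAG 2.x relative-luminance and contrast-ratio formulas as the yardstick, and the helper names `relative_luminance` and `contrast_ratio` are hypothetical. It takes the same 6-digit hex strings used in the palette table.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of a 6-digit hex color such as '1E2761'."""
    hex_color = hex_color.lstrip("#")
    # Split into R, G, B and scale each channel to the 0..1 range.
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve to get linear-light values.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    red, green, blue = linear
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    # Midnight Executive: navy text on a white background.
    print(round(contrast_ratio("1E2761", "FFFFFF"), 2))
```

WCAG AA asks for a ratio of at least 4.5 for normal-size text; treating that as the cutoff for slide text is an assumption here, not something this guide specifies.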
|
||||
|
||||
---
|
||||
|
||||
## QA (Required)
|
||||
|
||||
**Assume there are problems. Your job is to find them.**
|
||||
|
||||
Your first render is almost never correct. Approach QA as a bug hunt, not a confirmation step. If you found zero issues on first inspection, you weren't looking hard enough.
|
||||
|
||||
### Content QA
|
||||
|
||||
```bash
|
||||
python -m markitdown output.pptx
|
||||
```
|
||||
|
||||
Check for missing content, typos, wrong order.
|
||||
|
||||
**When using templates, check for leftover placeholder text:**
|
||||
|
||||
```bash
|
||||
python -m markitdown output.pptx | grep -iE "xxxx|lorem|ipsum|this.*(page|slide).*layout"
|
||||
```
|
||||
|
||||
If grep returns results, fix them before declaring success.
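
When the check has to run inside a Python script rather than a shell, the same pattern can be applied to text that `markitdown` has already extracted. This is a minimal sketch, assuming the extracted text is available as a string; the function name `find_placeholder_lines` and the sample text are hypothetical.

```python
import re

# Same pattern as the grep command above, compiled case-insensitively (grep -i).
PLACEHOLDER_PATTERN = re.compile(
    r"xxxx|lorem|ipsum|this.*(page|slide).*layout", re.IGNORECASE
)


def find_placeholder_lines(extracted_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that still contain placeholder text."""
    hits = []
    for number, line in enumerate(extracted_text.splitlines(), start=1):
        if PLACEHOLDER_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical extracted text, for illustration only.
sample = (
    "Quarterly Results\n"
    "Lorem ipsum dolor sit amet\n"
    "Revenue grew 12%\n"
    "This slide layout has two columns"
)

if __name__ == "__main__":
    for number, line in find_placeholder_lines(sample):
        print(f"line {number}: {line}")
```

An empty result corresponds to grep returning nothing; any hit should be fixed before the deck is considered done.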
|
||||
|
||||
### Visual QA
|
||||
|
||||
**⚠️ USE SUBAGENTS** — even for 2-3 slides. You've been staring at the code and will see what you expect, not what's there. Subagents have fresh eyes.
|
||||
|
||||
Convert slides to images (see [Converting to Images](#converting-to-images)), then use this prompt:
|
||||
|
||||
```
|
||||
Visually inspect these slides. Assume there are issues — find them.
|
||||
|
||||
Look for:
|
||||
- Overlapping elements (text through shapes, lines through words, stacked elements)
|
||||
- Text overflow or cut off at edges/box boundaries
|
||||
- Decorative lines positioned for single-line text but title wrapped to two lines
|
||||
- Source citations or footers colliding with content above
|
||||
- Elements too close (< 0.3" gaps) or cards/sections nearly touching
|
||||
- Uneven gaps (large empty area in one place, cramped in another)
|
||||
- Insufficient margin from slide edges (< 0.5")
|
||||
- Columns or similar elements not aligned consistently
|
||||
- Low-contrast text (e.g., light gray text on cream-colored background)
|
||||
- Low-contrast icons (e.g., dark icons on dark backgrounds without a contrasting circle)
|
||||
- Text boxes too narrow causing excessive wrapping
|
||||
- Leftover placeholder content
|
||||
|
||||
For each slide, list issues or areas of concern, even if minor.
|
||||
|
||||
Read and analyze these images:
|
||||
1. /path/to/slide-01.jpg (Expected: [brief description])
|
||||
2. /path/to/slide-02.jpg (Expected: [brief description])
|
||||
|
||||
Report ALL issues found, including minor ones.
|
||||
```
|
||||
|
||||
### Verification Loop
|
||||
|
||||
1. Generate slides → Convert to images → Inspect
|
||||
2. **List issues found** (if none found, look again more critically)
|
||||
3. Fix issues
|
||||
4. **Re-verify affected slides** — one fix often creates another problem
|
||||
5. Repeat until a full pass reveals no new issues
|
||||
|
||||
**Do not declare success until you've completed at least one fix-and-verify cycle.**
|
||||
|
||||
---
|
||||
|
||||
## Converting to Images
|
||||
|
||||
Convert presentations to individual slide images for visual inspection:
|
||||
|
||||
```bash
|
||||
python scripts/office/soffice.py --headless --convert-to pdf output.pptx
|
||||
pdftoppm -jpeg -r 150 output.pdf slide
|
||||
```
|
||||
|
||||
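This creates `slide-01.jpg`, `slide-02.jpg`, etc. `pdftoppm` zero-pads the page number to the digit count of the last page, so a deck with fewer than 10 slides produces `slide-1.jpg`, `slide-2.jpg`, and so on.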
|
||||
|
||||
To re-render specific slides after fixes:
|
||||
|
||||
```bash
|
||||
pdftoppm -jpeg -r 150 -f N -l N output.pdf slide-fixed
|
||||
```
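
If the render step is driven from Python, the two `pdftoppm` invocations above can be built by one helper. This is a sketch only: `pdftoppm_command` and `render_slides` are hypothetical names, not scripts shipped with this skill, and it assumes `pdftoppm` is on the `PATH`.

```python
import subprocess


def pdftoppm_command(pdf_path, prefix, dpi=150, first_page=None, last_page=None):
    """Build the pdftoppm argument list for a full or partial render."""
    command = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        command += ["-l", str(last_page)]  # last page to render
    return command + [pdf_path, prefix]


def render_slides(pdf_path, prefix, **options):
    """Run pdftoppm and raise CalledProcessError if it exits non-zero."""
    subprocess.run(pdftoppm_command(pdf_path, prefix, **options), check=True)


if __name__ == "__main__":
    # Show the commands without executing them.
    print(" ".join(pdftoppm_command("output.pdf", "slide")))
    print(" ".join(pdftoppm_command("output.pdf", "slide-fixed", first_page=3, last_page=3)))
```

Passing the same number as `first_page` and `last_page` mirrors the `-f N -l N` form used to re-render a single slide after a fix.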
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
- `pip install "markitdown[pptx]"` - text extraction
|
||||
- `pip install Pillow` - thumbnail grids
|
||||
- `npm install -g pptxgenjs` - creating from scratch
|
||||
- LibreOffice (`soffice`) - PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
|
||||
- Poppler (`pdftoppm`) - PDF to images
|
||||
Some files were not shown because too many files have changed in this diff.