Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:09:27 +08:00 · 2026-03-14 10:27:11 +01:00
parent 575f1e586f
commit 436c8608f2
7 changed files with 1975 additions and 0 deletions
--- a/scientific-skills/polars-bio/references/pileup_operations.md
+++ b/scientific-skills/polars-bio/references/pileup_operations.md
@@ -0,0 +1,176 @@
+# Pileup Operations
+
+## Overview
+
+polars-bio provides the `pb.depth()` function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a **LazyFrame** by default.
+
+## pb.depth()
+
+Compute read depth from alignment files.
+
+### Basic Usage
+
+```python
+import polars_bio as pb
+
+# Compute depth across entire BAM file (returns LazyFrame)
+depth_lf = pb.depth("aligned.bam")
+depth_df = depth_lf.collect()
+
+# Get DataFrame directly
+depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `path` | str | required | Path to BAM or CRAM file |
+| `filter_flag` | int | `1796` | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
+| `min_mapping_quality` | int | `0` | Minimum mapping quality to include reads |
+| `binary_cigar` | bool | `True` | Use binary CIGAR for faster processing |
+| `dense_mode` | str | `"auto"` | Dense output mode |
+| `use_zero_based` | bool | `None` | Coordinate system (None = use global setting) |
+| `per_base` | bool | `False` | Per-base depth (True) vs block depth (False) |
+| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
+
+### Output Schema (Block Mode, default)
+
+When `per_base=False` (default), adjacent positions with the same depth are grouped into blocks:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `contig` | String | Chromosome/contig name |
+| `pos_start` | Int64 | Block start position |
+| `pos_end` | Int64 | Block end position |
+| `coverage` | Int16 | Read depth |
+
+### Output Schema (Per-Base Mode)
+
+When `per_base=True`, each position is reported individually:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `contig` | String | Chromosome/contig name |
+| `pos` | Int64 | Position |
+| `coverage` | Int16 | Read depth at position |
+
+### filter_flag
+
+The default `filter_flag=1796` excludes reads with these SAM flags:
+- 4: unmapped
+- 256: secondary alignment
+- 512: failed QC
+- 1024: PCR/optical duplicate
+
+### CIGAR-Aware Computation
+
+`pb.depth()` correctly handles CIGAR operations:
+- **M/X/=** (match/mismatch): Counted as coverage
+- **D** (deletion): Counted as coverage (reads span the deletion)
+- **N** (skipped region): Not counted (e.g., spliced alignments)
+- **I** (insertion): Not counted at reference positions
+- **S/H** (soft/hard clipping): Not counted
+
+## Examples
+
+### Whole-Genome Depth
+
+```python
+import polars_bio as pb
+import polars as pl
+
+# Compute depth genome-wide (block mode)
+depth = pb.depth("sample.bam", output_type="polars.DataFrame")
+
+# Summary statistics
+depth.select(
+    pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
+    pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
+    pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
+)
+```
+
+### Per-Base Depth
+
+```python
+import polars_bio as pb
+
+# Per-base depth (one row per position)
+depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
+```
+
+### Depth with Quality Filters
+
+```python
+import polars_bio as pb
+
+# Only count well-mapped reads
+depth = pb.depth(
+    "sample.bam",
+    min_mapping_quality=20,
+    output_type="polars.DataFrame",
+)
+```
+
+### Custom Flag Filter
+
+```python
+import polars_bio as pb
+
+# Only exclude unmapped (4) and duplicate (1024) reads
+depth = pb.depth(
+    "sample.bam",
+    filter_flag=4 + 1024,
+    output_type="polars.DataFrame",
+)
+```
+
+## Integration with Interval Operations
+
+Depth results can be used with polars-bio interval operations. Note that depth output uses `contig`/`pos_start`/`pos_end` column names, so use `cols` parameters to map them:
+
+```python
+import polars_bio as pb
+import polars as pl
+
+# Compute depth
+depth = pb.depth("sample.bam", output_type="polars.DataFrame")
+
+# Rename columns to match interval operation conventions
+depth_intervals = depth.rename({
+    "contig": "chrom",
+    "pos_start": "start",
+    "pos_end": "end",
+})
+
+# Find regions with adequate coverage
+adequate = depth_intervals.filter(pl.col("coverage") >= 30)
+
+# Merge adjacent adequate-coverage blocks
+merged = pb.merge(adequate, output_type="polars.DataFrame")
+
+# Find gaps in coverage (complement)
+genome = pl.DataFrame({
+    "chrom": ["chr1"],
+    "start": [0],
+    "end": [248956422],
+})
+gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
+```
+
+### Using cols Parameters Instead of Renaming
+
+```python
+import polars_bio as pb
+
+depth = pb.depth("sample.bam", output_type="polars.DataFrame")
+targets = pb.read_bed("targets.bed")
+
+# Use cols1 to specify depth column names
+overlapping = pb.overlap(
+    depth, targets,
+    cols1=["contig", "pos_start", "pos_end"],
+    output_type="polars.DataFrame",
+)
+```