Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:09:27 +08:00 · 2026-03-14 10:27:11 +01:00
parent 575f1e586f
commit 436c8608f2
7 changed files with 1975 additions and 0 deletions
--- a/scientific-skills/polars-bio/references/interval_operations.md
+++ b/scientific-skills/polars-bio/references/interval_operations.md
@@ -0,0 +1,362 @@
+# Genomic Interval Operations
+
+## Overview
+
+polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: `chrom`, `start`, `end` by default) and return a **LazyFrame** by default. Pass `output_type="polars.DataFrame"` for eager results.
+
+## Operations Summary
+
+| Operation | Inputs | Description |
+|-----------|--------|-------------|
+| `overlap` | two DataFrames | Find pairs of overlapping intervals |
+| `count_overlaps` | two DataFrames | Count overlaps per interval in the first set |
+| `nearest` | two DataFrames | Find nearest intervals between two sets |
+| `merge` | one DataFrame | Merge overlapping/bookended intervals |
+| `cluster` | one DataFrame | Assign cluster IDs to overlapping intervals |
+| `coverage` | two DataFrames | Compute per-interval coverage counts |
+| `complement` | one DataFrame + genome | Find gaps between intervals |
+| `subtract` | two DataFrames | Remove overlapping portions |
+
+## overlap
+
+Find pairs of overlapping intervals between two DataFrames.
+
+### Functional API
+
+```python
+import polars as pl
+import polars_bio as pb
+
+df1 = pl.DataFrame({
+    "chrom": ["chr1", "chr1", "chr1"],
+    "start": [1, 5, 22],
+    "end":   [6, 9, 30],
+})
+
+df2 = pl.DataFrame({
+    "chrom": ["chr1", "chr1"],
+    "start": [3, 25],
+    "end":   [8, 28],
+})
+
+# Returns LazyFrame by default
+result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
+result_df = result_lf.collect()
+
+# Or get DataFrame directly
+result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")
+```
+
+### Method-Chaining API (LazyFrame only)
+
+```python
+result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df1` | DataFrame/LazyFrame/str | required | First (probe) interval set |
+| `df2` | DataFrame/LazyFrame/str | required | Second (build) interval set |
+| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for overlapping column names |
+| `on_cols` | list[str] | `None` | Additional columns to join on (beyond genomic coords) |
+| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
+| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
+| `algorithm` | str | `"Coitrees"` | Interval algorithm |
+| `low_memory` | bool | `False` | Low memory mode |
+| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |
+
+### Output Schema
+
+Returns columns from both inputs with suffixes applied:
+- `chrom_1`, `start_1`, `end_1` (from df1)
+- `chrom_2`, `start_2`, `end_2` (from df2)
+- Any additional columns from df1 and df2
+
+Column dtypes are `String` for chrom and `Int64` for start/end.
+
+## count_overlaps
+
+Count the number of overlapping intervals from df2 for each interval in df1.
+
+```python
+# Functional
+counts = pb.count_overlaps(df1, df2)
+
+# Method-chaining (LazyFrame)
+counts = df1.lazy().pb.count_overlaps(df2)
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
+| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
+| `suffixes` | tuple[str, str] | `("", "_")` | Suffixes for column names |
+| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
+| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
+| `on_cols` | list[str] | `None` | Additional join columns |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `naive_query` | bool | `True` | Use naive query strategy |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns df1 columns with an additional `count` column (Int64).
+
+## nearest
+
+Find the nearest interval in df2 for each interval in df1.
+
+```python
+# Find nearest (default: k=1, any direction)
+nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")
+
+# Find k nearest
+nearest = pb.nearest(df1, df2, k=3)
+
+# Exclude overlapping intervals from results
+nearest = pb.nearest(df1, df2, overlap=False)
+
+# Without distance column
+nearest = pb.nearest(df1, df2, distance=False)
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
+| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
+| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
+| `on_cols` | list[str] | `None` | Additional join columns |
+| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
+| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
+| `k` | int | `1` | Number of nearest neighbors to find |
+| `overlap` | bool | `True` | Include overlapping intervals in results |
+| `distance` | bool | `True` | Include distance column in output |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns columns from both DataFrames (with suffixes) plus a `distance` column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if `distance=False`.
+
+## merge
+
+Merge overlapping and bookended intervals within a single DataFrame.
+
+```python
+import polars as pl
+import polars_bio as pb
+
+df = pl.DataFrame({
+    "chrom": ["chr1", "chr1", "chr1", "chr2"],
+    "start": [1, 4, 20, 1],
+    "end":   [6, 9, 30, 10],
+})
+
+# Functional
+merged = pb.merge(df, output_type="polars.DataFrame")
+
+# Method-chaining (LazyFrame)
+merged = df.lazy().pb.merge().collect()
+
+# Merge intervals within a minimum distance
+merged = pb.merge(df, min_dist=10)
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df` | DataFrame/LazyFrame/str | required | Interval set to merge |
+| `min_dist` | int | `0` | Minimum distance between intervals to merge (0 = must overlap or be bookended) |
+| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
+| `on_cols` | list[str] | `None` | Additional grouping columns |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `chrom` | String | Chromosome |
+| `start` | Int64 | Merged interval start |
+| `end` | Int64 | Merged interval end |
+| `n_intervals` | Int64 | Number of intervals merged |
+
+## cluster
+
+Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.
+
+```python
+# Functional
+clustered = pb.cluster(df, output_type="polars.DataFrame")
+
+# Method-chaining (LazyFrame)
+clustered = df.lazy().pb.cluster().collect()
+
+# With minimum distance
+clustered = pb.cluster(df, min_dist=5)
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df` | DataFrame/LazyFrame/str | required | Interval set |
+| `min_dist` | int | `0` | Minimum distance for clustering |
+| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns the original columns plus:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `cluster` | Int64 | Cluster ID (intervals in the same cluster overlap) |
+| `cluster_start` | Int64 | Start of the cluster extent |
+| `cluster_end` | Int64 | End of the cluster extent |
+
+## coverage
+
+Compute per-interval coverage counts. This is a **two-input** operation: for each interval in df1, count the coverage from df2.
+
+```python
+# Functional
+cov = pb.coverage(df1, df2, output_type="polars.DataFrame")
+
+# Method-chaining (LazyFrame)
+cov = df1.lazy().pb.coverage(df2).collect()
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df1` | DataFrame/LazyFrame/str | required | Query intervals |
+| `df2` | DataFrame/LazyFrame/str | required | Coverage source intervals |
+| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
+| `on_cols` | list[str] | `None` | Additional join columns |
+| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
+| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns columns from df1 plus a `coverage` column (Int64).
+
+## complement
+
+Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.
+
+```python
+import polars as pl
+import polars_bio as pb
+
+df = pl.DataFrame({
+    "chrom": ["chr1", "chr1"],
+    "start": [100, 500],
+    "end":   [200, 600],
+})
+
+genome = pl.DataFrame({
+    "chrom": ["chr1"],
+    "start": [0],
+    "end":   [1000],
+})
+
+# Functional
+gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")
+
+# Method-chaining (LazyFrame)
+gaps = df.lazy().pb.complement(genome).collect()
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df` | DataFrame/LazyFrame/str | required | Interval set |
+| `view_df` | DataFrame/LazyFrame | `None` | Genome with chrom, start, end defining chromosome extents |
+| `cols` | list[str] | `["chrom", "start", "end"]` | Column names in df |
+| `view_cols` | list[str] | `None` | Column names in view_df |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns a DataFrame with `chrom` (String), `start` (Int64), `end` (Int64) columns representing gaps between intervals.
+
+## subtract
+
+Remove portions of intervals in df1 that overlap with intervals in df2.
+
+```python
+# Functional
+result = pb.subtract(df1, df2, output_type="polars.DataFrame")
+
+# Method-chaining (LazyFrame)
+result = df1.lazy().pb.subtract(df2).collect()
+```
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `df1` | DataFrame/LazyFrame/str | required | Intervals to subtract from |
+| `df2` | DataFrame/LazyFrame/str | required | Intervals to subtract |
+| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
+| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
+| `output_type` | str | `"polars.LazyFrame"` | Output format |
+| `projection_pushdown` | bool | `True` | Enable projection pushdown |
+
+### Output Schema
+
+Returns `chrom` (String), `start` (Int64), `end` (Int64) representing the remaining portions of df1 intervals after subtraction.
+
+## Performance Considerations
+
+### Probe-Build Architecture
+
+Two-input operations (`overlap`, `nearest`, `count_overlaps`, `coverage`, `subtract`) use a probe-build join:
+- **Probe** (first DataFrame): Iterated over, row by row
+- **Build** (second DataFrame): Indexed into an interval tree for fast lookup
+
+For best performance, pass the **larger** DataFrame as the probe (first argument) and the **smaller** one as the build (second argument).
+
+### Parallelism
+
+By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:
+
+```python
+import os
+import polars_bio as pb
+
+pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
+```
+
+### Streaming Execution
+
+DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.
+
+### When to Use Lazy Evaluation
+
+Use `scan_*` functions and lazy DataFrames for:
+- Files larger than available RAM
+- When only a subset of results is needed
+- Pipeline operations where intermediate results can be optimized away
+
+```python
+# Lazy pipeline
+lf1 = pb.scan_bed("large1.bed")
+lf2 = pb.scan_bed("large2.bed")
+result = pb.overlap(lf1, lf2).collect()
+```