Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Marek Wieiwórka
2026-03-14 10:27:11 +01:00
parent 575f1e586f
commit 436c8608f2
7 changed files with 1975 additions and 0 deletions

View File

@@ -0,0 +1,362 @@
# Genomic Interval Operations
## Overview
polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: `chrom`, `start`, `end` by default) and return a **LazyFrame** by default. Pass `output_type="polars.DataFrame"` for eager results.
## Operations Summary
| Operation | Inputs | Description |
|-----------|--------|-------------|
| `overlap` | two DataFrames | Find pairs of overlapping intervals |
| `count_overlaps` | two DataFrames | Count overlaps per interval in the first set |
| `nearest` | two DataFrames | Find nearest intervals between two sets |
| `merge` | one DataFrame | Merge overlapping/bookended intervals |
| `cluster` | one DataFrame | Assign cluster IDs to overlapping intervals |
| `coverage` | two DataFrames | Compute per-interval coverage counts |
| `complement` | one DataFrame + genome | Find gaps between intervals |
| `subtract` | two DataFrames | Remove overlapping portions |
## overlap
Find pairs of overlapping intervals between two DataFrames.
### Functional API
```python
import polars as pl
import polars_bio as pb
df1 = pl.DataFrame({
"chrom": ["chr1", "chr1", "chr1"],
"start": [1, 5, 22],
"end": [6, 9, 30],
})
df2 = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [3, 25],
"end": [8, 28],
})
# Returns LazyFrame by default
result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
result_df = result_lf.collect()
# Or get DataFrame directly
result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")
```
### Method-Chaining API (LazyFrame only)
```python
result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | First (probe) interval set |
| `df2` | DataFrame/LazyFrame/str | required | Second (build) interval set |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for overlapping column names |
| `on_cols` | list[str] | `None` | Additional columns to join on (beyond genomic coords) |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `algorithm` | str | `"Coitrees"` | Interval algorithm |
| `low_memory` | bool | `False` | Low memory mode |
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |
### Output Schema
Returns columns from both inputs with suffixes applied:
- `chrom_1`, `start_1`, `end_1` (from df1)
- `chrom_2`, `start_2`, `end_2` (from df2)
- Any additional columns from df1 and df2
Column dtypes are `String` for chrom and `Int64` for start/end.
## count_overlaps
Count the number of overlapping intervals from df2 for each interval in df1.
```python
# Functional
counts = pb.count_overlaps(df1, df2)
# Method-chaining (LazyFrame)
counts = df1.lazy().pb.count_overlaps(df2)
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
| `suffixes` | tuple[str, str] | `("", "_")` | Suffixes for column names |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `on_cols` | list[str] | `None` | Additional join columns |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `naive_query` | bool | `True` | Use naive query strategy |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns df1 columns with an additional `count` column (Int64).
## nearest
Find the nearest interval in df2 for each interval in df1.
```python
# Find nearest (default: k=1, any direction)
nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")
# Find k nearest
nearest = pb.nearest(df1, df2, k=3)
# Exclude overlapping intervals from results
nearest = pb.nearest(df1, df2, overlap=False)
# Without distance column
nearest = pb.nearest(df1, df2, distance=False)
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
| `on_cols` | list[str] | `None` | Additional join columns |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `k` | int | `1` | Number of nearest neighbors to find |
| `overlap` | bool | `True` | Include overlapping intervals in results |
| `distance` | bool | `True` | Include distance column in output |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns columns from both DataFrames (with suffixes) plus a `distance` column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if `distance=False`.
## merge
Merge overlapping and bookended intervals within a single DataFrame.
```python
import polars as pl
import polars_bio as pb
df = pl.DataFrame({
"chrom": ["chr1", "chr1", "chr1", "chr2"],
"start": [1, 4, 20, 1],
"end": [6, 9, 30, 10],
})
# Functional
merged = pb.merge(df, output_type="polars.DataFrame")
# Method-chaining (LazyFrame)
merged = df.lazy().pb.merge().collect()
# Merge intervals within a minimum distance
merged = pb.merge(df, min_dist=10)
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set to merge |
| `min_dist` | int | `0` | Minimum distance between intervals to merge (0 = must overlap or be bookended) |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
| `on_cols` | list[str] | `None` | Additional grouping columns |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome |
| `start` | Int64 | Merged interval start |
| `end` | Int64 | Merged interval end |
| `n_intervals` | Int64 | Number of intervals merged |
## cluster
Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.
```python
# Functional
clustered = pb.cluster(df, output_type="polars.DataFrame")
# Method-chaining (LazyFrame)
clustered = df.lazy().pb.cluster().collect()
# With minimum distance
clustered = pb.cluster(df, min_dist=5)
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set |
| `min_dist` | int | `0` | Minimum distance for clustering |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns the original columns plus:
| Column | Type | Description |
|--------|------|-------------|
| `cluster` | Int64 | Cluster ID (intervals in the same cluster overlap) |
| `cluster_start` | Int64 | Start of the cluster extent |
| `cluster_end` | Int64 | End of the cluster extent |
## coverage
Compute per-interval coverage counts. This is a **two-input** operation: for each interval in df1, count the coverage from df2.
```python
# Functional
cov = pb.coverage(df1, df2, output_type="polars.DataFrame")
# Method-chaining (LazyFrame)
cov = df1.lazy().pb.coverage(df2).collect()
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query intervals |
| `df2` | DataFrame/LazyFrame/str | required | Coverage source intervals |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
| `on_cols` | list[str] | `None` | Additional join columns |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns columns from df1 plus a `coverage` column (Int64).
## complement
Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.
```python
import polars as pl
import polars_bio as pb
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 500],
"end": [200, 600],
})
genome = pl.DataFrame({
"chrom": ["chr1"],
"start": [0],
"end": [1000],
})
# Functional
gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")
# Method-chaining (LazyFrame)
gaps = df.lazy().pb.complement(genome).collect()
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set |
| `view_df` | DataFrame/LazyFrame | `None` | Genome with chrom, start, end defining chromosome extents |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names in df |
| `view_cols` | list[str] | `None` | Column names in view_df |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns a DataFrame with `chrom` (String), `start` (Int64), `end` (Int64) columns representing gaps between intervals.
## subtract
Remove portions of intervals in df1 that overlap with intervals in df2.
```python
# Functional
result = pb.subtract(df1, df2, output_type="polars.DataFrame")
# Method-chaining (LazyFrame)
result = df1.lazy().pb.subtract(df2).collect()
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Intervals to subtract from |
| `df2` | DataFrame/LazyFrame/str | required | Intervals to subtract |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |
### Output Schema
Returns `chrom` (String), `start` (Int64), `end` (Int64) representing the remaining portions of df1 intervals after subtraction.
## Performance Considerations
### Probe-Build Architecture
Two-input operations (`overlap`, `nearest`, `count_overlaps`, `coverage`, `subtract`) use a probe-build join:
- **Probe** (first DataFrame): Iterated over, row by row
- **Build** (second DataFrame): Indexed into an interval tree for fast lookup
For best performance, pass the **larger** DataFrame as the probe (first argument) and the **smaller** one as the build (second argument).
### Parallelism
By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:
```python
import os
import polars_bio as pb
pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
```
### Streaming Execution
DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.
### When to Use Lazy Evaluation
Use `scan_*` functions and lazy DataFrames for:
- Files larger than available RAM
- When only a subset of results is needed
- Pipeline operations where intermediate results can be optimized away
```python
# Lazy pipeline
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")
result = pb.overlap(lf1, lf2).collect()
```