Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:27:11 +01:00

Genomic Interval Operations

Overview

polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: chrom, start, end by default) and return a LazyFrame by default. Pass output_type="polars.DataFrame" for eager results.

Operations Summary

| Operation | Inputs | Description |
| --- | --- | --- |
| overlap | two DataFrames | Find pairs of overlapping intervals |
| count_overlaps | two DataFrames | Count overlaps per interval in the first set |
| nearest | two DataFrames | Find nearest intervals between two sets |
| merge | one DataFrame | Merge overlapping/bookended intervals |
| cluster | one DataFrame | Assign cluster IDs to overlapping intervals |
| coverage | two DataFrames | Compute per-interval coverage counts |
| complement | one DataFrame + genome | Find gaps between intervals |
| subtract | two DataFrames | Remove overlapping portions |

overlap

Find pairs of overlapping intervals between two DataFrames.

Functional API

import polars as pl
import polars_bio as pb

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end":   [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end":   [8, 28],
})

# Returns LazyFrame by default
result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")

Method-Chaining API (LazyFrame only)

result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df1 | DataFrame/LazyFrame/str | required | First (probe) interval set |
| df2 | DataFrame/LazyFrame/str | required | Second (build) interval set |
| suffixes | tuple[str, str] | ("_1", "_2") | Suffixes for overlapping column names |
| on_cols | list[str] | None | Additional columns to join on (beyond genomic coords) |
| cols1 | list[str] | ["chrom", "start", "end"] | Column names in df1 |
| cols2 | list[str] | ["chrom", "start", "end"] | Column names in df2 |
| algorithm | str | "Coitrees" | Interval algorithm |
| low_memory | bool | False | Low memory mode |
| output_type | str | "polars.LazyFrame" | Output format: "polars.LazyFrame", "polars.DataFrame", "pandas.DataFrame" |
| projection_pushdown | bool | True | Enable projection pushdown optimization |

Output Schema

Returns columns from both inputs with suffixes applied:

  • chrom_1, start_1, end_1 (from df1)
  • chrom_2, start_2, end_2 (from df2)
  • Any additional columns from df1 and df2

Column dtypes are String for chrom and Int64 for start/end.
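
To make the output schema concrete, the pairing logic can be sketched in plain Python on the df1/df2 example above. This does not call polars-bio: the helper name overlap_pairs and the half-open [start, end) overlap test are assumptions made for illustration, and the library's own coordinate convention may treat touching intervals differently.

```python
def overlap_pairs(rows1, rows2, suffixes=("_1", "_2")):
    """Brute-force reference: pair each interval in rows1 with every overlapping interval in rows2."""
    s1, s2 = suffixes
    out = []
    for c1, a1, b1 in rows1:
        for c2, a2, b2 in rows2:
            # Assumption: half-open intervals overlap when each starts before the other ends.
            if c1 == c2 and a1 < b2 and a2 < b1:
                out.append({
                    "chrom" + s1: c1, "start" + s1: a1, "end" + s1: b1,
                    "chrom" + s2: c2, "start" + s2: a2, "end" + s2: b2,
                })
    return out


rows1 = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
rows2 = [("chr1", 3, 8), ("chr1", 25, 28)]
pairs = overlap_pairs(rows1, rows2)
for pair in pairs:
    print(pair)
```

For this data the sketch yields three pairs: the first two df1 intervals each pair with (3, 8), and (22, 30) pairs with (25, 28).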

count_overlaps

Count the number of overlapping intervals from df2 for each interval in df1.

# Functional
counts = pb.count_overlaps(df1, df2)

# Method-chaining (LazyFrame)
counts = df1.lazy().pb.count_overlaps(df2)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df1 | DataFrame/LazyFrame/str | required | Query interval set |
| df2 | DataFrame/LazyFrame/str | required | Target interval set |
| suffixes | tuple[str, str] | ("", "_") | Suffixes for column names |
| cols1 | list[str] | ["chrom", "start", "end"] | Column names in df1 |
| cols2 | list[str] | ["chrom", "start", "end"] | Column names in df2 |
| on_cols | list[str] | None | Additional join columns |
| output_type | str | "polars.LazyFrame" | Output format |
| naive_query | bool | True | Use naive query strategy |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns df1 columns with an additional count column (Int64).
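
As a rough illustration of the count column, here is a plain-Python reference that counts, for each query interval, how many target intervals overlap it. The function name count_overlaps_ref, the sample data, and the half-open overlap test are assumptions; whether the library keeps zero-count rows is not stated above, so treat that part as behavior of this sketch only.

```python
def count_overlaps_ref(rows1, rows2):
    """For each (chrom, start, end) in rows1, count the overlapping intervals in rows2."""
    counts = []
    for c1, a1, b1 in rows1:
        # Assumption: half-open intervals, same overlap test as a brute-force join.
        n = sum(1 for c2, a2, b2 in rows2 if c1 == c2 and a1 < b2 and a2 < b1)
        counts.append((c1, a1, b1, n))
    return counts


# Hypothetical data chosen so that the counts differ per row.
queries = [("chr1", 1, 10), ("chr1", 50, 60), ("chr2", 1, 10)]
targets = [("chr1", 3, 8), ("chr1", 5, 12), ("chr1", 25, 28)]
counts = count_overlaps_ref(queries, targets)
for row in counts:
    print(row)
```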

nearest

Find the nearest interval in df2 for each interval in df1.

# Find nearest (default: k=1, any direction)
nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")

# Find k nearest
nearest = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals from results
nearest = pb.nearest(df1, df2, overlap=False)

# Without distance column
nearest = pb.nearest(df1, df2, distance=False)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df1 | DataFrame/LazyFrame/str | required | Query interval set |
| df2 | DataFrame/LazyFrame/str | required | Target interval set |
| suffixes | tuple[str, str] | ("_1", "_2") | Suffixes for column names |
| on_cols | list[str] | None | Additional join columns |
| cols1 | list[str] | ["chrom", "start", "end"] | Column names in df1 |
| cols2 | list[str] | ["chrom", "start", "end"] | Column names in df2 |
| k | int | 1 | Number of nearest neighbors to find |
| overlap | bool | True | Include overlapping intervals in results |
| distance | bool | True | Include distance column in output |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns columns from both DataFrames (with suffixes) plus a distance column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if distance=False.
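
The roles of k, overlap, and the distance column can be sketched with a small brute-force reference. The names interval_distance and k_nearest, the sample data, and the half-open gap formula are assumptions for illustration; the library may measure distance differently under another coordinate convention.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Gap between two intervals on one chromosome; 0 when they overlap."""
    if a_start < b_end and b_start < a_end:
        return 0
    # Assumption: half-open coordinates, so the gap is later start minus earlier end.
    if b_start >= a_end:
        return b_start - a_end
    return a_start - b_end


def k_nearest(query, targets, k=1, overlap=True):
    """Return up to k (distance, target) tuples for one query, closest first."""
    chrom, start, end = query
    scored = []
    for t_chrom, t_start, t_end in targets:
        if t_chrom != chrom:
            continue  # neighbors are only searched on the same chromosome
        is_overlap = start < t_end and t_start < end
        if is_overlap and not overlap:
            continue  # mirrors overlap=False: skip overlapping targets
        dist = interval_distance(start, end, t_start, t_end)
        scored.append((dist, (t_chrom, t_start, t_end)))
    scored.sort()
    return scored[:k]


query = ("chr1", 10, 20)
targets = [("chr1", 1, 5), ("chr1", 22, 30), ("chr1", 15, 18), ("chr2", 10, 20)]
print(k_nearest(query, targets))                 # closest single target
print(k_nearest(query, targets, k=2))            # two closest targets
print(k_nearest(query, targets, overlap=False))  # closest non-overlapping target
```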

merge

Merge overlapping and bookended intervals within a single DataFrame.

import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1, 4, 20, 1],
    "end":   [6, 9, 30, 10],
})

# Functional
merged = pb.merge(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
merged = df.lazy().pb.merge().collect()

# Also merge intervals separated by a gap of up to min_dist (returns a LazyFrame)
merged = pb.merge(df, min_dist=10)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | DataFrame/LazyFrame/str | required | Interval set to merge |
| min_dist | int | 0 | Largest gap between intervals that are still merged (0 = must overlap or be bookended) |
| cols | list[str] | ["chrom", "start", "end"] | Column names |
| on_cols | list[str] | None | Additional grouping columns |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

| Column | Type | Description |
| --- | --- | --- |
| chrom | String | Chromosome |
| start | Int64 | Merged interval start |
| end | Int64 | Merged interval end |
| n_intervals | Int64 | Number of intervals merged |
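
The effect of min_dist and the n_intervals column can be sketched with a sort-and-sweep reference in plain Python. The helper merge_ref and its gap rule (merge when the next start is at most min_dist past the running end) are assumptions for illustration, not the library's implementation.

```python
def merge_ref(rows, min_dist=0):
    """Sweep sorted (chrom, start, end) rows and fuse neighbors whose gap is <= min_dist."""
    merged = []
    for chrom, start, end in sorted(rows):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start - last[2] <= min_dist:
            last[2] = max(last[2], end)  # extend the running interval
            last[3] += 1                 # one more input interval absorbed
        else:
            merged.append([chrom, start, end, 1])
    return [tuple(m) for m in merged]


rows = [("chr1", 1, 6), ("chr1", 4, 9), ("chr1", 20, 30), ("chr2", 1, 10)]
default = merge_ref(rows)            # only overlapping or bookended intervals fuse
wide = merge_ref(rows, min_dist=15)  # the gap of 11 between 9 and 20 is now bridged
print(default)
print(wide)
```

Each output tuple is (chrom, start, end, n_intervals); intervals on different chromosomes are never merged in this sketch.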

cluster

Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.

# Functional
clustered = pb.cluster(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
clustered = df.lazy().pb.cluster().collect()

# With minimum distance
clustered = pb.cluster(df, min_dist=5)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | DataFrame/LazyFrame/str | required | Interval set |
| min_dist | int | 0 | Largest gap between intervals that are still placed in the same cluster |
| cols | list[str] | ["chrom", "start", "end"] | Column names |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns the original columns plus:

| Column | Type | Description |
| --- | --- | --- |
| cluster | Int64 | Cluster ID (intervals in the same cluster overlap) |
| cluster_start | Int64 | Start of the cluster extent |
| cluster_end | Int64 | End of the cluster extent |
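
Unlike merge, cluster keeps every input row and only annotates it. A plain-Python sketch of that idea follows; the helper cluster_ref, the zero-based cluster numbering, and the gap rule are assumptions for illustration, and the library's actual ID values may differ.

```python
def cluster_ref(rows, min_dist=0):
    """Annotate each (chrom, start, end) row with (cluster, cluster_start, cluster_end)."""
    order = sorted(range(len(rows)), key=lambda i: rows[i])
    cluster_of = [None] * len(rows)
    extents = []  # one [chrom, start, end] entry per cluster, grown during the sweep
    for i in order:
        chrom, start, end = rows[i]
        if extents and extents[-1][0] == chrom and start - extents[-1][2] <= min_dist:
            extents[-1][2] = max(extents[-1][2], end)  # row joins the current cluster
        else:
            extents.append([chrom, start, end])        # row opens a new cluster
        cluster_of[i] = len(extents) - 1
    # Emit rows in their original order, with the final extent of their cluster.
    return [
        rows[i] + (cluster_of[i], extents[cluster_of[i]][1], extents[cluster_of[i]][2])
        for i in range(len(rows))
    ]


rows = [("chr1", 1, 6), ("chr1", 4, 9), ("chr1", 20, 30), ("chr2", 1, 10)]
clustered = cluster_ref(rows)
for row in clustered:
    print(row)
```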

coverage

Compute per-interval coverage counts. This is a two-input operation: for each interval in df1, count the coverage from df2.

# Functional
cov = pb.coverage(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
cov = df1.lazy().pb.coverage(df2).collect()

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df1 | DataFrame/LazyFrame/str | required | Query intervals |
| df2 | DataFrame/LazyFrame/str | required | Coverage source intervals |
| suffixes | tuple[str, str] | ("_1", "_2") | Suffixes for column names |
| on_cols | list[str] | None | Additional join columns |
| cols1 | list[str] | ["chrom", "start", "end"] | Column names in df1 |
| cols2 | list[str] | ["chrom", "start", "end"] | Column names in df2 |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns columns from df1 plus a coverage column (Int64).

complement

Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.

import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end":   [200, 600],
})

genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end":   [1000],
})

# Functional
gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
gaps = df.lazy().pb.complement(genome).collect()

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | DataFrame/LazyFrame/str | required | Interval set |
| view_df | DataFrame/LazyFrame | None | Genome with chrom, start, end defining chromosome extents |
| cols | list[str] | ["chrom", "start", "end"] | Column names in df |
| view_cols | list[str] | None | Column names in view_df |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns a DataFrame with chrom (String), start (Int64), end (Int64) columns representing gaps between intervals.
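
The gap-finding logic can be sketched in plain Python on the example above. The helper complement_ref and the half-open coordinate handling are assumptions for illustration; the library's boundaries may differ by one under a closed-interval convention.

```python
def complement_ref(rows, genome):
    """Return the parts of each genome region not covered by any interval in rows."""
    gaps = []
    for chrom, g_start, g_end in genome:
        cursor = g_start  # leftmost position not yet known to be covered
        for _, start, end in sorted(r for r in rows if r[0] == chrom):
            if start > cursor and cursor < g_end:
                gaps.append((chrom, cursor, min(start, g_end)))
            cursor = max(cursor, end)
        if cursor < g_end:
            gaps.append((chrom, cursor, g_end))  # trailing gap up to the region end
    return gaps


rows = [("chr1", 100, 200), ("chr1", 500, 600)]
genome = [("chr1", 0, 1000)]
gaps = complement_ref(rows, genome)
print(gaps)
```

Under these assumptions the example leaves three gaps: before the first interval, between the two intervals, and after the second one.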

subtract

Remove portions of intervals in df1 that overlap with intervals in df2.

# Functional
result = pb.subtract(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
result = df1.lazy().pb.subtract(df2).collect()

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df1 | DataFrame/LazyFrame/str | required | Intervals to subtract from |
| df2 | DataFrame/LazyFrame/str | required | Intervals to subtract |
| cols1 | list[str] | ["chrom", "start", "end"] | Column names in df1 |
| cols2 | list[str] | ["chrom", "start", "end"] | Column names in df2 |
| output_type | str | "polars.LazyFrame" | Output format |
| projection_pushdown | bool | True | Enable projection pushdown |

Output Schema

Returns chrom (String), start (Int64), end (Int64) representing the remaining portions of df1 intervals after subtraction.
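
What "remaining portions" means can be sketched in plain Python using the df1/df2 values from the overlap example. The helper subtract_ref and the half-open coordinate handling are assumptions for illustration; note that one input interval can be split into several output rows.

```python
def subtract_ref(rows1, rows2):
    """Cut every part of rows2 out of each interval in rows1; keep what is left."""
    out = []
    for chrom, start, end in rows1:
        cursor = start  # leftmost position of this interval not yet removed or emitted
        for _, s2, e2 in sorted(r for r in rows2 if r[0] == chrom):
            if e2 <= cursor or s2 >= end:
                continue  # this subtrahend does not touch the remaining part
            if s2 > cursor:
                out.append((chrom, cursor, s2))  # piece that survives before the cut
            cursor = max(cursor, e2)
        if cursor < end:
            out.append((chrom, cursor, end))     # piece that survives after the last cut
    return out


rows1 = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
rows2 = [("chr1", 3, 8), ("chr1", 25, 28)]
remaining = subtract_ref(rows1, rows2)
print(remaining)
```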

Performance Considerations

Probe-Build Architecture

Two-input operations (overlap, nearest, count_overlaps, coverage, subtract) use a probe-build join:

  • Probe (first DataFrame): Iterated over, row by row
  • Build (second DataFrame): Indexed into an interval tree for fast lookup

For best performance, pass the larger DataFrame as the probe (first argument) and the smaller one as the build (second argument).
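
The probe-build split can be sketched as follows: the second input is indexed once, then each row of the first input is looked up against that index. This is a simplified stand-in that uses a sorted list and bisect rather than a real interval tree; the names build_index and probe, and the half-open overlap test, are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(build_rows):
    """Build side: group intervals by chromosome and sort them by start, once."""
    by_chrom = defaultdict(list)
    for chrom, start, end in build_rows:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _ in intervals], intervals)
    return index


def probe(index, probe_rows):
    """Probe side: stream rows one by one and look each up in the prebuilt index."""
    for chrom, start, end in probe_rows:
        starts, intervals = index.get(chrom, ([], []))
        # Only build intervals that start before the probe ends can overlap it.
        hi = bisect_left(starts, end)
        for s, e in intervals[:hi]:
            if e > start:
                yield (chrom, start, end, s, e)


index = build_index([("chr1", 3, 8), ("chr1", 25, 28)])  # second (build) input
hits = list(probe(index, [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]))  # first (probe) input
print(hits)
```

Only the build side is materialized in memory here, while probe rows are consumed from an iterator one at a time.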

Parallelism

By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:

import os
import polars_bio as pb

# os.cpu_count() can return None, so fall back to a single partition
pb.set_option("datafusion.execution.target_partitions", os.cpu_count() or 1)

Streaming Execution

DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.

When to Use Lazy Evaluation

Use scan_* functions and LazyFrames when:

  • Files are larger than available RAM
  • Only a subset of the results is needed
  • A pipeline has intermediate results that the optimizer can eliminate

# Lazy pipeline
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")
result = pb.overlap(lf1, lf2).collect()