Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:27:11 +01:00

Migrating from bioframe to polars-bio

Overview

polars-bio provides counterparts to bioframe's core interval operations, mostly under the same names, with reported speedups of 6.5-38x on real-world genomic benchmarks. It is not a literal drop-in replacement, because four things change: it uses Polars DataFrames instead of pandas, it runs on a Rust/DataFusion backend instead of bioframe's Python (pandas/NumPy) implementation, it supports streaming for large genomes, and its operations return a LazyFrame by default.

Operation Mapping

| bioframe | polars-bio | Notes |
| --- | --- | --- |
| bioframe.overlap(df1, df2) | pb.overlap(df1, df2) | Returns LazyFrame; .collect() for DataFrame |
| bioframe.closest(df1, df2) | pb.nearest(df1, df2) | Renamed; uses k, overlap, distance params |
| bioframe.count_overlaps(df1, df2) | pb.count_overlaps(df1, df2) | Default suffixes differ: ("", "_") vs bioframe's |
| bioframe.merge(df) | pb.merge(df) | Output includes n_intervals column |
| bioframe.cluster(df) | pb.cluster(df) | Output cols: cluster, cluster_start, cluster_end |
| bioframe.coverage(df1, df2) | pb.coverage(df1, df2) | Two-input in both libraries |
| bioframe.complement(df, chromsizes) | pb.complement(df, view_df=genome) | Genome as DataFrame, not Series |
| bioframe.subtract(df1, df2) | pb.subtract(df1, df2) | Same semantics |
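
The mapping above can also be used as a small audit helper that lists the bioframe call sites in existing code. This is a hypothetical sketch using only the standard library: the helper name find_bioframe_calls and the regular expression are illustrative assumptions, and it only recognises calls written literally as bioframe.<name>(, so aliased imports are missed.

```python
import re
from typing import List, Optional, Tuple

# bioframe function -> polars-bio counterpart, copied from the mapping table.
BIOFRAME_TO_POLARS_BIO = {
    "overlap": "pb.overlap",
    "closest": "pb.nearest",
    "count_overlaps": "pb.count_overlaps",
    "merge": "pb.merge",
    "cluster": "pb.cluster",
    "coverage": "pb.coverage",
    "complement": "pb.complement",
    "subtract": "pb.subtract",
}

# Matches calls such as `bioframe.closest(`; `import bioframe as bf` is not detected.
_CALL = re.compile(r"\bbioframe\.(\w+)\s*\(")


def find_bioframe_calls(source: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line_number, bioframe_name, polars_bio_name or None) per call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _CALL.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, BIOFRAME_TO_POLARS_BIO.get(name)))
    return hits


if __name__ == "__main__":
    example = "a = bioframe.closest(df1, df2)\nb = bioframe.fetch_chromsizes('hg38')\n"
    for lineno, old, new in find_bioframe_calls(example):
        print(lineno, old, "->", new or "no counterpart listed in the table")
```

Calls reported with None are functions the table does not cover and need a manual decision.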

Key API Differences

DataFrames: pandas vs Polars

bioframe (pandas):

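import bioframe
import pandas as pd

df1 = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end":   [5, 20],
})
# Second interval set to overlap against
df2 = pd.DataFrame({
    "chrom": ["chr1"],
    "start": [3],
    "end":   [12],
})

result = bioframe.overlap(df1, df2)
# result is a pandas DataFrame; with bioframe's default suffixes ("", "_"),
# df1 keeps its column names and df2's columns get a trailing underscore
result["start_"]  # pandas column access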

polars-bio (Polars):

import polars_bio as pb
import polars as pl

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end":   [5, 20],
})
# Second interval set to overlap against
df2 = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [3],
    "end":   [12],
})

result = pb.overlap(df1, df2)  # Returns LazyFrame
result_df = result.collect()   # Materialize to DataFrame
result_df.select("start_1")   # Polars column access

Return Types: LazyFrame by Default

All polars-bio operations return a LazyFrame by default. To get a DataFrame, call .collect() on the result or pass output_type="polars.DataFrame":

# bioframe: always returns DataFrame
result = bioframe.overlap(df1, df2)

# polars-bio: returns LazyFrame, collect for DataFrame
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

Genome/Chromsizes

bioframe:

chromsizes = bioframe.fetch_chromsizes("hg38")  # Returns pandas Series
complement = bioframe.complement(df, chromsizes)

polars-bio:

genome = pl.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [0, 0],
    "end":   [248956422, 242193529],
})
complement = pb.complement(df, view_df=genome)
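
For genomes with many chromosomes, the view frame can be built from an existing name-to-length mapping rather than typed by hand. The following is a minimal sketch, assuming a plain dict of chromosome lengths (a pandas Series can be converted with .to_dict()). The helper name chromsizes_to_view is hypothetical and not part of either library.

```python
from typing import Dict, List


def chromsizes_to_view(chromsizes: Dict[str, int]) -> Dict[str, List]:
    """Turn {chrom: length} into column lists for a chrom/start/end frame."""
    chroms = list(chromsizes)
    return {
        "chrom": chroms,
        "start": [0] * len(chroms),                    # each view starts at 0
        "end": [int(chromsizes[c]) for c in chroms],   # end = chromosome length
    }


view_columns = chromsizes_to_view({"chr1": 248956422, "chr2": 242193529})
# view_columns can be passed to pl.DataFrame(view_columns) and the resulting
# frame used as view_df=... in pb.complement.
```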

closest vs nearest

bioframe:

result = bioframe.closest(df1, df2)

polars-bio:

# Basic nearest
result = pb.nearest(df1, df2)

# Find k nearest neighbors
result = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals
result = pb.nearest(df1, df2, overlap=False)

# Without distance column
result = pb.nearest(df1, df2, distance=False)

Method-Chaining (polars-bio only)

polars-bio adds a .pb accessor on LazyFrame for method chaining:

# bioframe: sequential function calls
merged = bioframe.merge(bioframe.overlap(df1, df2))

# polars-bio: fluent pipeline (must use LazyFrame)
# Note: overlap adds suffixes, so rename before merge
merged = (
    df1.lazy()
    .pb.overlap(df2)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)
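
The rename step in the pipeline above generalises to any suffix pair. The following is a minimal sketch of a hypothetical helper, unsuffix_mapping, that builds the old-to-new column mapping. The column names and the "_1" suffix come from the example above; the helper itself is an assumption, not a polars-bio function.

```python
from typing import Dict, Sequence


def unsuffix_mapping(
    suffix: str, cols: Sequence[str] = ("chrom", "start", "end")
) -> Dict[str, str]:
    """Map suffixed overlap columns (e.g. 'start_1') back to their base names."""
    return {f"{col}{suffix}": col for col in cols}


mapping = unsuffix_mapping("_1")
# mapping == {"chrom_1": "chrom", "start_1": "start", "end_1": "end"}
```

In Polars the mapping can drive the select step, for example .select([pl.col(old).alias(new) for old, new in mapping.items()]); in pandas it fits rename(columns=mapping).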

Performance Comparison

Benchmarks on real-world genomic datasets (from the polars-bio paper, Bioinformatics 2025):

| Operation | bioframe (baseline) | polars-bio | Speedup |
| --- | --- | --- | --- |
| overlap | 1.0x | 6.5x | 6.5x |
| nearest | 1.0x | 38x | 38x |
| merge | 1.0x | 8.2x | 8.2x |
| coverage | 1.0x | 12x | 12x |

Speedups come from:

  • Rust-based interval tree implementation
  • Apache DataFusion query engine
  • Apache Arrow columnar memory format
  • Parallel execution (when configured)
  • Streaming/out-of-core support
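
Speedups depend on data size and hardware, so it can be worth timing both libraries on your own intervals. The following is a minimal, library-agnostic sketch using only the standard library. The names best_of and speedup are hypothetical helpers, and the callables are expected to wrap the real calls. For polars-bio, include .collect() inside the callable so that the lazy query is actually executed.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    repeats: int = 5,
) -> float:
    """Ratio baseline_time / candidate_time; values above 1 favour candidate."""
    return best_of(baseline, repeats) / best_of(candidate, repeats)
```

A typical call would look like speedup(lambda: bioframe.overlap(a_pd, b_pd), lambda: pb.overlap(a_pl, b_pl).collect()), where a_pd, b_pd, a_pl, and b_pl stand for your own pandas and Polars frames.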

Migration Code Examples

Example 1: Basic Overlap Pipeline

Before (bioframe):

import bioframe
import pandas as pd

df1 = pd.read_csv("peaks.bed", sep="\t", names=["chrom", "start", "end"])
df2 = pd.read_csv("genes.bed", sep="\t", names=["chrom", "start", "end", "name"])

overlaps = bioframe.overlap(df1, df2, suffixes=("_peak", "_gene"))
filtered = overlaps[overlaps["start_gene"] > 10000]
peak_cols = filtered[["chrom_peak", "start_peak", "end_peak"]].rename(
    columns={"chrom_peak": "chrom", "start_peak": "start", "end_peak": "end"}
)
merged = bioframe.merge(peak_cols)

After (polars-bio):

import polars_bio as pb
import polars as pl

df1 = pb.read_bed("peaks.bed")
df2 = pb.read_bed("genes.bed")

overlaps = pb.overlap(df1, df2, suffixes=("_peak", "_gene"), output_type="polars.DataFrame")
filtered = overlaps.filter(pl.col("start_gene") > 10000)
merged = pb.merge(
    filtered.select(
        pl.col("chrom_peak").alias("chrom"),
        pl.col("start_peak").alias("start"),
        pl.col("end_peak").alias("end"),
    ),
    output_type="polars.DataFrame",
)

Example 2: Large-Scale Streaming

Before (bioframe) — limited to in-memory:

import bioframe
import pandas as pd

# Must load entire file into memory
df1 = pd.read_csv("huge_intervals.bed", sep="\t", names=["chrom", "start", "end"])
result = bioframe.merge(df1)  # Memory-bound

After (polars-bio) — streaming:

import polars_bio as pb

# Lazy scan, streaming execution
lf = pb.scan_bed("huge_intervals.bed")
# Recent Polars releases deprecate streaming=True in favour of engine="streaming"
result = pb.merge(lf).collect(streaming=True)

pandas Compatibility Mode

For gradual migration, install with pandas support:

pip install "polars-bio[pandas]"

This enables conversion between pandas and Polars DataFrames:

import polars_bio as pb
import polars as pl

# pandas_df is an existing pandas DataFrame; other_df is a Polars DataFrame of intervals
# Convert pandas DataFrame to Polars for polars-bio
polars_df = pl.from_pandas(pandas_df)
result = pb.overlap(polars_df, other_df).collect()

# Convert back to pandas if needed
pandas_result = result.to_pandas()

# Or request pandas output directly
pandas_result = pb.overlap(polars_df, other_df, output_type="pandas.DataFrame")

Migration Checklist

  1. Replace import bioframe with import polars_bio as pb
  2. Replace import pandas as pd with import polars as pl
  3. Convert DataFrame creation from pd.DataFrame to pl.DataFrame
  4. Replace bioframe.closest with pb.nearest
  5. Add .collect() after operations (they return LazyFrame by default)
  6. Update column access from df["col"] to df.select("col") or pl.col("col")
  7. Replace pandas filtering df[df["col"] > x] with df.filter(pl.col("col") > x)
  8. Update chromsizes from Series to DataFrame with chrom, start, end; pass as view_df=
  9. Add pb.set_option("datafusion.execution.target_partitions", N) for parallelism
  10. Replace pd.read_csv for BED files with pb.read_bed or pb.scan_bed
  11. Note cluster output column is cluster (not cluster_id), plus cluster_start, cluster_end
  12. Note merge output includes n_intervals column
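
After working through the checklist, it helps to confirm that the migrated pipeline yields the same intervals as the original. The following is a minimal sketch, assuming both results have been converted to lists of row dicts (DataFrame.to_dict("records") in pandas, DataFrame.to_dicts() in Polars). The helper same_intervals is hypothetical; it compares only the three coordinate columns and ignores row order and extra columns such as n_intervals.

```python
from collections import Counter
from typing import Dict, Iterable


def same_intervals(
    rows_a: Iterable[Dict],
    rows_b: Iterable[Dict],
    chrom: str = "chrom",
    start: str = "start",
    end: str = "end",
) -> bool:
    """True if both row collections hold the same multiset of intervals."""
    def key(row):
        # str()/int() hide dtype differences such as numpy.int64 vs int.
        return (str(row[chrom]), int(row[start]), int(row[end]))

    return Counter(map(key, rows_a)) == Counter(map(key, rows_b))
```

A mismatch on a small test file is a cue to re-check suffixes, renamed columns, and the view frame before moving on to full-size data.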