Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrating from bioframe to polars-bio
Overview
polars-bio covers bioframe's core interval operations under nearly identical function names, with reported speedups of 6.5-38x on real-world genomic benchmarks. It is not strictly drop-in. The main differences are: Polars DataFrames instead of pandas, a Rust/DataFusion backend instead of bioframe's Python/pandas implementation, streaming support for large genomes, and LazyFrame returns by default.
Operation Mapping
| bioframe | polars-bio | Notes |
|---|---|---|
| `bioframe.overlap(df1, df2)` | `pb.overlap(df1, df2)` | Returns LazyFrame; `.collect()` for DataFrame |
| `bioframe.closest(df1, df2)` | `pb.nearest(df1, df2)` | Renamed; uses `k`, `overlap`, `distance` params |
| `bioframe.count_overlaps(df1, df2)` | `pb.count_overlaps(df1, df2)` | Default suffixes differ: `("", "_")` vs bioframe's |
| `bioframe.merge(df)` | `pb.merge(df)` | Output includes `n_intervals` column |
| `bioframe.cluster(df)` | `pb.cluster(df)` | Output cols: `cluster`, `cluster_start`, `cluster_end` |
| `bioframe.coverage(df1, df2)` | `pb.coverage(df1, df2)` | Two-input in both libraries |
| `bioframe.complement(df, chromsizes)` | `pb.complement(df, view_df=genome)` | Genome as DataFrame, not Series |
| `bioframe.subtract(df1, df2)` | `pb.subtract(df1, df2)` | Same semantics |
Key API Differences
DataFrames: pandas vs Polars
bioframe (pandas):
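import bioframe
import pandas as pd
df1 = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end": [5, 20],
})
df2 = pd.DataFrame({
    "chrom": ["chr1"],
    "start": [3],
    "end": [12],
})  # second interval set (illustrative values)
result = bioframe.overlap(df1, df2)
# result is a pandas DataFrame
result["start_"]  # pandas column access; df2 columns carry bioframe's default "_" suffix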
polars-bio (Polars):
import polars_bio as pb
import polars as pl
df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end": [5, 20],
})
df2 = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [3],
    "end": [12],
})  # second interval set (illustrative values)
result = pb.overlap(df1, df2)  # Returns LazyFrame
result_df = result.collect()  # Materialize to DataFrame
result_df.select("start_1")  # Polars column access
Return Types: LazyFrame by Default
All polars-bio operations return a LazyFrame by default. Call `.collect()` to materialize it, or pass `output_type="polars.DataFrame"` to get a DataFrame directly:
# bioframe: always returns DataFrame
result = bioframe.overlap(df1, df2)
# polars-bio: returns LazyFrame, collect for DataFrame
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()
# Or get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
Genome/Chromsizes
bioframe:
chromsizes = bioframe.fetch_chromsizes("hg38") # Returns pandas Series
complement = bioframe.complement(df, chromsizes)
polars-bio:
genome = pl.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [0, 0],
    "end": [248956422, 242193529],
})
complement = pb.complement(df, view_df=genome)
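If you already hold chromosome sizes as a name-to-length mapping (for example the pandas Series that bioframe returns), the conversion to the chrom/start/end layout can be sketched as below. `chromsizes_to_view_columns` is a hypothetical helper, not part of either library, and it assumes every chromosome starts at 0 as in the example above.

```python
from collections.abc import Mapping


def chromsizes_to_view_columns(chromsizes: Mapping[str, int]) -> dict[str, list]:
    """Turn {chrom: length} into column lists for a chrom/start/end table.

    Hypothetical helper, not part of bioframe or polars-bio. Any object
    with an .items() method works, including a pandas Series of sizes.
    """
    columns: dict[str, list] = {"chrom": [], "start": [], "end": []}
    for chrom, length in chromsizes.items():
        columns["chrom"].append(str(chrom))
        columns["start"].append(0)          # assumption: each chromosome starts at 0
        columns["end"].append(int(length))  # end is the chromosome length
    return columns


sizes = {"chr1": 248956422, "chr2": 242193529}
view_columns = chromsizes_to_view_columns(sizes)
print(view_columns)
```

Wrapping the result as `pl.DataFrame(view_columns)` should give a frame with the same shape as `genome` above, ready to pass as `view_df=`.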
closest vs nearest
bioframe:
result = bioframe.closest(df1, df2)
polars-bio:
# Basic nearest
result = pb.nearest(df1, df2)
# Find k nearest neighbors
result = pb.nearest(df1, df2, k=3)
# Exclude overlapping intervals
result = pb.nearest(df1, df2, overlap=False)
# Without distance column
result = pb.nearest(df1, df2, distance=False)
Method-Chaining (polars-bio only)
polars-bio adds a .pb accessor on LazyFrame for method chaining:
# bioframe: sequential function calls
merged = bioframe.merge(bioframe.overlap(df1, df2))
# polars-bio: fluent pipeline (must use LazyFrame)
# Note: overlap adds suffixes, so rename before merge
merged = (
    df1.lazy()
    .pb.overlap(df2)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)
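The rename step between overlap and merge recurs whenever suffixed output feeds another operation. One way to avoid retyping the aliases is to build the mapping once; `suffix_rename_map` below is a hypothetical helper, and it assumes the suffixed names follow the plain `name + suffix` pattern shown above.

```python
def suffix_rename_map(suffix: str, columns=("chrom", "start", "end")) -> dict[str, str]:
    """Map suffixed column names back to plain ones, e.g. chrom_1 -> chrom.

    Hypothetical helper for illustration; not part of polars-bio.
    """
    return {f"{name}{suffix}": name for name in columns}


first_input = suffix_rename_map("_1")
peaks = suffix_rename_map("_peak")
print(first_input)
print(peaks)
```

With Polars, the mapping can then replace the three aliases, along the lines of `lf.select(list(first_input)).rename(first_input)` before calling `.pb.merge()`.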
Performance Comparison
Benchmarks on real-world genomic datasets (from the polars-bio paper, Bioinformatics 2025):
| Operation | bioframe (baseline) | polars-bio (relative speed) | Speedup |
|---|---|---|---|
| overlap | 1.0x | 6.5x | 6.5x |
| nearest | 1.0x | 38x | 38x |
| merge | 1.0x | 8.2x | 8.2x |
| coverage | 1.0x | 12x | 12x |
Speedups come from:
- Rust-based interval tree implementation
- Apache DataFusion query engine
- Apache Arrow columnar memory format
- Parallel execution (when configured)
- Streaming/out-of-core support
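To check whether these figures carry over to your own data, a small timing harness can help. The sketch below uses only the standard library; `best_time` and `speedup` are hypothetical helper names, and the two lambdas are stand-in workloads rather than real bioframe or polars-bio calls.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time of `fn` over `repeats` runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline: Callable[[], object],
            candidate: Callable[[], object],
            repeats: int = 5) -> float:
    """Ratio baseline_time / candidate_time; above 1 means the candidate is faster."""
    return best_time(baseline, repeats) / best_time(candidate, repeats)


# Stand-in workloads so the sketch runs anywhere; swap in real calls,
# e.g. a bioframe.overlap call versus a pb.overlap(...).collect() call.
slow = lambda: sum(i * i for i in range(200_000))
fast = lambda: sum(i * i for i in range(20_000))
ratio = speedup(slow, fast)
print(f"speedup: {ratio:.1f}x")
```

When timing polars-bio, keep `.collect()` inside the callable: operations return a LazyFrame, so without it only query-plan construction is measured.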
Migration Code Examples
Example 1: Basic Overlap Pipeline
Before (bioframe):
import bioframe
import pandas as pd
df1 = pd.read_csv("peaks.bed", sep="\t", names=["chrom", "start", "end"])
df2 = pd.read_csv("genes.bed", sep="\t", names=["chrom", "start", "end", "name"])
overlaps = bioframe.overlap(df1, df2, suffixes=("_peak", "_gene"))
filtered = overlaps[overlaps["start_gene"] > 10000]
peak_cols = {"chrom_peak": "chrom", "start_peak": "start", "end_peak": "end"}
merged = bioframe.merge(filtered[list(peak_cols)].rename(columns=peak_cols))
After (polars-bio):
import polars_bio as pb
import polars as pl
df1 = pb.read_bed("peaks.bed")
df2 = pb.read_bed("genes.bed")
overlaps = pb.overlap(df1, df2, suffixes=("_peak", "_gene"), output_type="polars.DataFrame")
filtered = overlaps.filter(pl.col("start_gene") > 10000)
merged = pb.merge(
    filtered.select(
        pl.col("chrom_peak").alias("chrom"),
        pl.col("start_peak").alias("start"),
        pl.col("end_peak").alias("end"),
    ),
    output_type="polars.DataFrame",
)
Example 2: Large-Scale Streaming
Before (bioframe) — limited to in-memory:
import bioframe
import pandas as pd
# Must load entire file into memory
df1 = pd.read_csv("huge_intervals.bed", sep="\t", names=["chrom", "start", "end"])
result = bioframe.merge(df1) # Memory-bound
After (polars-bio) — streaming:
import polars_bio as pb
# Lazy scan, streaming execution
lf = pb.scan_bed("huge_intervals.bed")
result = pb.merge(lf).collect(engine="streaming")  # older Polars releases: collect(streaming=True)
pandas Compatibility Mode
For gradual migration, install with pandas support:
pip install "polars-bio[pandas]"
This enables conversion between pandas and Polars DataFrames:
import polars_bio as pb
import polars as pl
# Convert pandas DataFrame to Polars for polars-bio
polars_df = pl.from_pandas(pandas_df)
result = pb.overlap(polars_df, other_df).collect()
# Convert back to pandas if needed
pandas_result = result.to_pandas()
# Or request pandas output directly
pandas_result = pb.overlap(polars_df, other_df, output_type="pandas.DataFrame")
Migration Checklist
- Replace `import bioframe` with `import polars_bio as pb`
- Replace `import pandas as pd` with `import polars as pl`
- Convert DataFrame creation from `pd.DataFrame` to `pl.DataFrame`
- Replace `bioframe.closest` with `pb.nearest`
- Add `.collect()` after operations (they return LazyFrame by default)
- Update column access from `df["col"]` to `df.select("col")` or `pl.col("col")`
- Replace pandas filtering `df[df["col"] > x]` with `df.filter(pl.col("col") > x)`
- Update chromsizes from Series to DataFrame with `chrom`, `start`, `end`; pass as `view_df=`
- Add `pb.set_option("datafusion.execution.target_partitions", N)` for parallelism
- Replace `pd.read_csv` for BED files with `pb.read_bed` or `pb.scan_bed`
- Note `cluster` output column is `cluster` (not `cluster_id`), plus `cluster_start`, `cluster_end`
- Note `merge` output includes `n_intervals` column
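As a starting point for working through the checklist, a short script can list the bioframe calls in a source file alongside the polars-bio name from the Operation Mapping table. This is a rough sketch: `find_bioframe_calls` is a hypothetical helper, it only matches calls written literally as `bioframe.<name>(`, and it suggests names without rewriting arguments (for example, `complement` still needs its chromsizes argument changed to `view_df=`).

```python
import re

# bioframe function -> polars-bio function, taken from the Operation Mapping table.
RENAMES = {
    "overlap": "overlap",
    "closest": "nearest",
    "count_overlaps": "count_overlaps",
    "merge": "merge",
    "cluster": "cluster",
    "coverage": "coverage",
    "complement": "complement",
    "subtract": "subtract",
}

# Matches calls spelled as bioframe.<name>( ; aliased imports are not detected.
CALL = re.compile(r"\bbioframe\.(\w+)\s*\(")


def find_bioframe_calls(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, bioframe call, suggested polars-bio call) triples."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CALL.finditer(line):
            name = match.group(1)
            suggestion = f"pb.{RENAMES[name]}" if name in RENAMES else "no mapping listed"
            hits.append((lineno, f"bioframe.{name}", suggestion))
    return hits


example = '''import bioframe
result = bioframe.closest(df1, df2)
sizes = bioframe.fetch_chromsizes("hg38")
'''
report = find_bioframe_calls(example)
for lineno, old, new in report:
    print(f"line {lineno}: {old} -> {new}")
```

Calls reported as "no mapping listed" are the ones to handle by hand, such as replacing `fetch_chromsizes` output with a chrom/start/end DataFrame.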