mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-03-27 07:09:27 +08:00

Files

Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-14 10:27:11 +01:00

12 KiB

Raw Blame History

Genomic Interval Operations

Overview

polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: chrom, start, end by default) and return a LazyFrame by default. Pass output_type="polars.DataFrame" for eager results.

Operations Summary

Operation	Inputs	Description
`overlap`	two DataFrames	Find pairs of overlapping intervals
`count_overlaps`	two DataFrames	Count overlaps per interval in the first set
`nearest`	two DataFrames	Find nearest intervals between two sets
`merge`	one DataFrame	Merge overlapping/bookended intervals
`cluster`	one DataFrame	Assign cluster IDs to overlapping intervals
`coverage`	two DataFrames	Compute per-interval coverage counts
`complement`	one DataFrame + genome	Find gaps between intervals
`subtract`	two DataFrames	Remove overlapping portions

overlap

Find pairs of overlapping intervals between two DataFrames.

Functional API

import polars as pl
import polars_bio as pb

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end":   [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end":   [8, 28],
})

# Returns LazyFrame by default
result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")

Method-Chaining API (LazyFrame only)

result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()

Parameters

Parameter	Type	Default	Description
`df1`	DataFrame/LazyFrame/str	required	First (probe) interval set
`df2`	DataFrame/LazyFrame/str	required	Second (build) interval set
`suffixes`	tuple[str, str]	`("_1", "_2")`	Suffixes for overlapping column names
`on_cols`	list[str]	`None`	Additional columns to join on (beyond genomic coords)
`cols1`	list[str]	`["chrom", "start", "end"]`	Column names in df1
`cols2`	list[str]	`["chrom", "start", "end"]`	Column names in df2
`algorithm`	str	`"Coitrees"`	Interval algorithm
`low_memory`	bool	`False`	Low memory mode
`output_type`	str	`"polars.LazyFrame"`	Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"`
`projection_pushdown`	bool	`True`	Enable projection pushdown optimization

Output Schema

Returns columns from both inputs with suffixes applied:

chrom_1, start_1, end_1 (from df1)
chrom_2, start_2, end_2 (from df2)
Any additional columns from df1 and df2

Column dtypes are String for chrom and Int64 for start/end.

count_overlaps

Count the number of overlapping intervals from df2 for each interval in df1.

# Functional
counts = pb.count_overlaps(df1, df2)

# Method-chaining (LazyFrame)
counts = df1.lazy().pb.count_overlaps(df2)

Parameters

Parameter	Type	Default	Description
`df1`	DataFrame/LazyFrame/str	required	Query interval set
`df2`	DataFrame/LazyFrame/str	required	Target interval set
`suffixes`	tuple[str, str]	`("", "_")`	Suffixes for column names
`cols1`	list[str]	`["chrom", "start", "end"]`	Column names in df1
`cols2`	list[str]	`["chrom", "start", "end"]`	Column names in df2
`on_cols`	list[str]	`None`	Additional join columns
`output_type`	str	`"polars.LazyFrame"`	Output format
`naive_query`	bool	`True`	Use naive query strategy
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns df1 columns with an additional count column (Int64).

nearest

Find the nearest interval in df2 for each interval in df1.

# Find nearest (default: k=1, any direction)
nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")

# Find k nearest
nearest = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals from results
nearest = pb.nearest(df1, df2, overlap=False)

# Without distance column
nearest = pb.nearest(df1, df2, distance=False)

Parameters

Parameter	Type	Default	Description
`df1`	DataFrame/LazyFrame/str	required	Query interval set
`df2`	DataFrame/LazyFrame/str	required	Target interval set
`suffixes`	tuple[str, str]	`("_1", "_2")`	Suffixes for column names
`on_cols`	list[str]	`None`	Additional join columns
`cols1`	list[str]	`["chrom", "start", "end"]`	Column names in df1
`cols2`	list[str]	`["chrom", "start", "end"]`	Column names in df2
`k`	int	`1`	Number of nearest neighbors to find
`overlap`	bool	`True`	Include overlapping intervals in results
`distance`	bool	`True`	Include distance column in output
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns columns from both DataFrames (with suffixes) plus a distance column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if distance=False.

merge

Merge overlapping and bookended intervals within a single DataFrame.

import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1, 4, 20, 1],
    "end":   [6, 9, 30, 10],
})

# Functional
merged = pb.merge(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
merged = df.lazy().pb.merge().collect()

# Merge intervals within a minimum distance
merged = pb.merge(df, min_dist=10)

Parameters

Parameter	Type	Default	Description
`df`	DataFrame/LazyFrame/str	required	Interval set to merge
`min_dist`	int	`0`	Minimum distance between intervals to merge (0 = must overlap or be bookended)
`cols`	list[str]	`["chrom", "start", "end"]`	Column names
`on_cols`	list[str]	`None`	Additional grouping columns
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Column	Type	Description
`chrom`	String	Chromosome
`start`	Int64	Merged interval start
`end`	Int64	Merged interval end
`n_intervals`	Int64	Number of intervals merged

cluster

Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.

# Functional
clustered = pb.cluster(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
clustered = df.lazy().pb.cluster().collect()

# With minimum distance
clustered = pb.cluster(df, min_dist=5)

Parameters

Parameter	Type	Default	Description
`df`	DataFrame/LazyFrame/str	required	Interval set
`min_dist`	int	`0`	Minimum distance for clustering
`cols`	list[str]	`["chrom", "start", "end"]`	Column names
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns the original columns plus:

Column	Type	Description
`cluster`	Int64	Cluster ID (intervals in the same cluster overlap)
`cluster_start`	Int64	Start of the cluster extent
`cluster_end`	Int64	End of the cluster extent

coverage

Compute per-interval coverage counts. This is a two-input operation: for each interval in df1, count the coverage from df2.

# Functional
cov = pb.coverage(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
cov = df1.lazy().pb.coverage(df2).collect()

Parameters

Parameter	Type	Default	Description
`df1`	DataFrame/LazyFrame/str	required	Query intervals
`df2`	DataFrame/LazyFrame/str	required	Coverage source intervals
`suffixes`	tuple[str, str]	`("_1", "_2")`	Suffixes for column names
`on_cols`	list[str]	`None`	Additional join columns
`cols1`	list[str]	`["chrom", "start", "end"]`	Column names in df1
`cols2`	list[str]	`["chrom", "start", "end"]`	Column names in df2
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns columns from df1 plus a coverage column (Int64).

complement

Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.

import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end":   [200, 600],
})

genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end":   [1000],
})

# Functional
gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
gaps = df.lazy().pb.complement(genome).collect()

Parameters

Parameter	Type	Default	Description
`df`	DataFrame/LazyFrame/str	required	Interval set
`view_df`	DataFrame/LazyFrame	`None`	Genome with chrom, start, end defining chromosome extents
`cols`	list[str]	`["chrom", "start", "end"]`	Column names in df
`view_cols`	list[str]	`None`	Column names in view_df
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns a DataFrame with chrom (String), start (Int64), end (Int64) columns representing gaps between intervals.

subtract

Remove portions of intervals in df1 that overlap with intervals in df2.

# Functional
result = pb.subtract(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
result = df1.lazy().pb.subtract(df2).collect()

Parameters

Parameter	Type	Default	Description
`df1`	DataFrame/LazyFrame/str	required	Intervals to subtract from
`df2`	DataFrame/LazyFrame/str	required	Intervals to subtract
`cols1`	list[str]	`["chrom", "start", "end"]`	Column names in df1
`cols2`	list[str]	`["chrom", "start", "end"]`	Column names in df2
`output_type`	str	`"polars.LazyFrame"`	Output format
`projection_pushdown`	bool	`True`	Enable projection pushdown

Output Schema

Returns chrom (String), start (Int64), end (Int64) representing the remaining portions of df1 intervals after subtraction.

Performance Considerations

Probe-Build Architecture

Two-input operations (overlap, nearest, count_overlaps, coverage, subtract) use a probe-build join:

Probe (first DataFrame): Iterated over, row by row
Build (second DataFrame): Indexed into an interval tree for fast lookup

For best performance, pass the larger DataFrame as the probe (first argument) and the smaller one as the build (second argument).

Parallelism

By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:

import os
import polars_bio as pb

pb.set_option("datafusion.execution.target_partitions", os.cpu_count())

Streaming Execution

DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.

When to Use Lazy Evaluation

Use scan_* functions and lazy DataFrames for:

Files larger than available RAM
When only a subset of results is needed
Pipeline operations where intermediate results can be optimized away

# Lazy pipeline
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")
result = pb.overlap(lf1, lf2).collect()

12 KiB Raw Blame History

Genomic Interval Operations

Overview

Operations Summary

overlap

Functional API

Method-Chaining API (LazyFrame only)

Parameters

Output Schema

count_overlaps

Parameters

Output Schema

nearest

Parameters

Output Schema

merge

Parameters

Output Schema

cluster

Parameters

Output Schema

coverage

Parameters

Output Schema

complement

Parameters

Output Schema

subtract

Parameters

Output Schema

Performance Considerations

Probe-Build Architecture

Parallelism

Streaming Execution

When to Use Lazy Evaluation

12 KiB

Raw Blame History