Files
Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:27:11 +01:00

4.8 KiB

Pileup Operations

Overview

polars-bio provides the pb.depth() function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a LazyFrame by default.

pb.depth()

Compute read depth from alignment files.

Basic Usage

import polars_bio as pb

# Compute depth across entire BAM file (returns LazyFrame)
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# Get DataFrame directly
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")

Parameters

Parameter Type Default Description
path str required Path to BAM or CRAM file
filter_flag int 1796 SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail)
min_mapping_quality int 0 Minimum mapping quality to include reads
binary_cigar bool True Use binary CIGAR for faster processing
dense_mode str "auto" Dense output mode
use_zero_based bool None Coordinate system (None = use global setting)
per_base bool False Per-base depth (True) vs block depth (False)
output_type str "polars.LazyFrame" Output format: "polars.LazyFrame", "polars.DataFrame", "pandas.DataFrame"

Output Schema (Block Mode, default)

When per_base=False (default), adjacent positions with the same depth are grouped into blocks:

Column Type Description
contig String Chromosome/contig name
pos_start Int64 Block start position
pos_end Int64 Block end position
coverage Int16 Read depth

Output Schema (Per-Base Mode)

When per_base=True, each position is reported individually:

Column Type Description
contig String Chromosome/contig name
pos Int64 Position
coverage Int16 Read depth at position

filter_flag

The default filter_flag=1796 excludes reads with these SAM flags:

  • 4: unmapped
  • 256: secondary alignment
  • 512: failed QC
  • 1024: PCR/optical duplicate

CIGAR-Aware Computation

pb.depth() correctly handles CIGAR operations:

  • M/X/= (match/mismatch): Counted as coverage
  • D (deletion): Counted as coverage (reads span the deletion)
  • N (skipped region): Not counted (e.g., spliced alignments)
  • I (insertion): Not counted at reference positions
  • S/H (soft/hard clipping): Not counted

Examples

Whole-Genome Depth

import polars_bio as pb
import polars as pl

# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Summary statistics
depth.select(
    pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
    pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
    pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)

Per-Base Depth

import polars_bio as pb

# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")

Depth with Quality Filters

import polars_bio as pb

# Only count well-mapped reads
depth = pb.depth(
    "sample.bam",
    min_mapping_quality=20,
    output_type="polars.DataFrame",
)

Custom Flag Filter

import polars_bio as pb

# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
    "sample.bam",
    filter_flag=4 + 1024,
    output_type="polars.DataFrame",
)

Integration with Interval Operations

Depth results can be used with polars-bio interval operations. Note that depth output uses contig/pos_start/pos_end column names, so use cols parameters to map them:

import polars_bio as pb
import polars as pl

# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
    "contig": "chrom",
    "pos_start": "start",
    "pos_end": "end",
})

# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)

# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")

# Find gaps in coverage (complement)
genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")

Using cols Parameters Instead of Renaming

import polars_bio as pb

depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")

# Use cols1 to specify depth column names
overlapping = pb.overlap(
    depth, targets,
    cols1=["contig", "pos_start", "pos_end"],
    output_type="polars.DataFrame",
)