mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-03-27 07:09:27 +08:00

Files

Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-14 10:27:11 +01:00

4.8 KiB

Raw Permalink Blame History

Pileup Operations

Overview

polars-bio provides the pb.depth() function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a LazyFrame by default.

pb.depth()

Compute read depth from alignment files.

Basic Usage

import polars_bio as pb

# Compute depth across entire BAM file (returns LazyFrame)
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# Get DataFrame directly
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")

Parameters

Parameter	Type	Default	Description
`path`	str	required	Path to BAM or CRAM file
`filter_flag`	int	`1796`	SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail)
`min_mapping_quality`	int	`0`	Minimum mapping quality to include reads
`binary_cigar`	bool	`True`	Use binary CIGAR for faster processing
`dense_mode`	str	`"auto"`	Dense output mode
`use_zero_based`	bool	`None`	Coordinate system (None = use global setting)
`per_base`	bool	`False`	Per-base depth (True) vs block depth (False)
`output_type`	str	`"polars.LazyFrame"`	Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"`

Output Schema (Block Mode, default)

When per_base=False (default), adjacent positions with the same depth are grouped into blocks:

Column	Type	Description
`contig`	String	Chromosome/contig name
`pos_start`	Int64	Block start position
`pos_end`	Int64	Block end position
`coverage`	Int16	Read depth

Output Schema (Per-Base Mode)

When per_base=True, each position is reported individually:

Column	Type	Description
`contig`	String	Chromosome/contig name
`pos`	Int64	Position
`coverage`	Int16	Read depth at position

filter_flag

The default filter_flag=1796 excludes reads with these SAM flags:

4: unmapped
256: secondary alignment
512: failed QC
1024: PCR/optical duplicate

CIGAR-Aware Computation

pb.depth() correctly handles CIGAR operations:

M/X/= (match/mismatch): Counted as coverage
D (deletion): Counted as coverage (reads span the deletion)
N (skipped region): Not counted (e.g., spliced alignments)
I (insertion): Not counted at reference positions
S/H (soft/hard clipping): Not counted

Examples

Whole-Genome Depth

import polars_bio as pb
import polars as pl

# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Summary statistics
depth.select(
    pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
    pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
    pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)

Per-Base Depth

import polars_bio as pb

# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")

Depth with Quality Filters

import polars_bio as pb

# Only count well-mapped reads
depth = pb.depth(
    "sample.bam",
    min_mapping_quality=20,
    output_type="polars.DataFrame",
)

Custom Flag Filter

import polars_bio as pb

# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
    "sample.bam",
    filter_flag=4 + 1024,
    output_type="polars.DataFrame",
)

Integration with Interval Operations

Depth results can be used with polars-bio interval operations. Note that depth output uses contig/pos_start/pos_end column names, so use cols parameters to map them:

import polars_bio as pb
import polars as pl

# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
    "contig": "chrom",
    "pos_start": "start",
    "pos_end": "end",
})

# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)

# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")

# Find gaps in coverage (complement)
genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")

Using cols Parameters Instead of Renaming

import polars_bio as pb

depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")

# Use cols1 to specify depth column names
overlapping = pb.overlap(
    depth, targets,
    cols1=["contig", "pos_start", "pos_end"],
    output_type="polars.DataFrame",
)

4.8 KiB Raw Permalink Blame History