mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4.8 KiB
4.8 KiB
Pileup Operations
Overview
polars-bio provides the pb.depth() function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a LazyFrame by default.
pb.depth()
Compute read depth from alignment files.
Basic Usage
import polars_bio as pb
# Compute depth across entire BAM file (returns LazyFrame)
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()
# Get DataFrame directly
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
path |
str | required | Path to BAM or CRAM file |
filter_flag |
int | 1796 |
SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
min_mapping_quality |
int | 0 |
Minimum mapping quality to include reads |
binary_cigar |
bool | True |
Use binary CIGAR for faster processing |
dense_mode |
str | "auto" |
Dense output mode |
use_zero_based |
bool | None |
Coordinate system (None = use global setting) |
per_base |
bool | False |
Per-base depth (True) vs block depth (False) |
output_type |
str | "polars.LazyFrame" |
Output format: "polars.LazyFrame", "polars.DataFrame", "pandas.DataFrame" |
Output Schema (Block Mode, default)
When per_base=False (default), adjacent positions with the same depth are grouped into blocks:
| Column | Type | Description |
|---|---|---|
contig |
String | Chromosome/contig name |
pos_start |
Int64 | Block start position |
pos_end |
Int64 | Block end position |
coverage |
Int16 | Read depth |
Output Schema (Per-Base Mode)
When per_base=True, each position is reported individually:
| Column | Type | Description |
|---|---|---|
contig |
String | Chromosome/contig name |
pos |
Int64 | Position |
coverage |
Int16 | Read depth at position |
filter_flag
The default filter_flag=1796 excludes reads with these SAM flags:
- 4: unmapped
- 256: secondary alignment
- 512: failed QC
- 1024: PCR/optical duplicate
CIGAR-Aware Computation
pb.depth() correctly handles CIGAR operations:
- M/X/= (match/mismatch): Counted as coverage
- D (deletion): Counted as coverage (reads span the deletion)
- N (skipped region): Not counted (e.g., spliced alignments)
- I (insertion): Not counted at reference positions
- S/H (soft/hard clipping): Not counted
Examples
Whole-Genome Depth
import polars_bio as pb
import polars as pl
# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
# Summary statistics
depth.select(
pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)
Per-Base Depth
import polars_bio as pb
# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
Depth with Quality Filters
import polars_bio as pb
# Only count well-mapped reads
depth = pb.depth(
"sample.bam",
min_mapping_quality=20,
output_type="polars.DataFrame",
)
Custom Flag Filter
import polars_bio as pb
# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
"sample.bam",
filter_flag=4 + 1024,
output_type="polars.DataFrame",
)
Integration with Interval Operations
Depth results can be used with polars-bio interval operations. Note that depth output uses contig/pos_start/pos_end column names, so use cols parameters to map them:
import polars_bio as pb
import polars as pl
# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
"contig": "chrom",
"pos_start": "start",
"pos_end": "end",
})
# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)
# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")
# Find gaps in coverage (complement)
genome = pl.DataFrame({
"chrom": ["chr1"],
"start": [0],
"end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
Using cols Parameters Instead of Renaming
import polars_bio as pb
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")
# Use cols1 to specify depth column names
overlapping = pb.overlap(
depth, targets,
cols1=["contig", "pos_start", "pos_end"],
output_type="polars.DataFrame",
)