claude-scientific-skills/scientific-skills/polars-bio/references/pileup_operations.md

# Pileup Operations

## Overview

polars-bio provides the `pb.depth()` function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a **LazyFrame** by default.

## pb.depth()

Compute read depth from alignment files.

### Basic Usage

```python
import polars_bio as pb

# Compute depth across entire BAM file (returns LazyFrame)
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# Get DataFrame directly
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required | Path to BAM or CRAM file |
| `filter_flag` | int | `1796` | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
| `min_mapping_quality` | int | `0` | Minimum mapping quality to include reads |
| `binary_cigar` | bool | `True` | Use binary CIGAR for faster processing |
| `dense_mode` | str | `"auto"` | Dense output mode |
| `use_zero_based` | bool | `None` | Coordinate system (None = use global setting) |
| `per_base` | bool | `False` | Per-base depth (True) vs block depth (False) |
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |

### Output Schema (Block Mode, default)

When `per_base=False` (default), adjacent positions with the same depth are grouped into blocks:

| Column | Type | Description |
|--------|------|-------------|
| `contig` | String | Chromosome/contig name |
| `pos_start` | Int64 | Block start position |
| `pos_end` | Int64 | Block end position |
| `coverage` | Int16 | Read depth |

### Output Schema (Per-Base Mode)

When `per_base=True`, each position is reported individually:

| Column | Type | Description |
|--------|------|-------------|
| `contig` | String | Chromosome/contig name |
| `pos` | Int64 | Position |
| `coverage` | Int16 | Read depth at position |

### filter_flag

The default `filter_flag=1796` excludes reads with these SAM flags:
- 4: unmapped
- 256: secondary alignment
- 512: failed QC
- 1024: PCR/optical duplicate

### CIGAR-Aware Computation

`pb.depth()` correctly handles CIGAR operations:
- **M/X/=** (match/mismatch): Counted as coverage
- **D** (deletion): Counted as coverage (reads span the deletion)
- **N** (skipped region): Not counted (e.g., spliced alignments)
- **I** (insertion): Not counted at reference positions
- **S/H** (soft/hard clipping): Not counted

## Examples

### Whole-Genome Depth

```python
import polars_bio as pb
import polars as pl

# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Summary statistics
depth.select(
    pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
    pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
    pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)
```

### Per-Base Depth

```python
import polars_bio as pb

# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
```

### Depth with Quality Filters

```python
import polars_bio as pb

# Only count well-mapped reads
depth = pb.depth(
    "sample.bam",
    min_mapping_quality=20,
    output_type="polars.DataFrame",
)
```

### Custom Flag Filter

```python
import polars_bio as pb

# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
    "sample.bam",
    filter_flag=4 + 1024,
    output_type="polars.DataFrame",
)
```

## Integration with Interval Operations

Depth results can be used with polars-bio interval operations. Note that depth output uses `contig`/`pos_start`/`pos_end` column names, so use `cols` parameters to map them:

```python
import polars_bio as pb
import polars as pl

# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
    "contig": "chrom",
    "pos_start": "start",
    "pos_end": "end",
})

# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)

# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")

# Find gaps in coverage (complement)
genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
```

### Using cols Parameters Instead of Renaming

```python
import polars_bio as pb

depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")

# Use cols1 to specify depth column names
overlapping = pb.overlap(
    depth, targets,
    cols1=["contig", "pos_start", "pos_end"],
    output_type="polars.DataFrame",
)
```