mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
177 lines
4.8 KiB
Markdown
177 lines
4.8 KiB
Markdown
# Pileup Operations
|
|
|
|
## Overview
|
|
|
|
polars-bio provides the `pb.depth()` function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a **LazyFrame** by default.
|
|
|
|
## pb.depth()
|
|
|
|
Compute read depth from alignment files.
|
|
|
|
### Basic Usage
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
|
|
# Compute depth across entire BAM file (returns LazyFrame)
|
|
depth_lf = pb.depth("aligned.bam")
|
|
depth_df = depth_lf.collect()
|
|
|
|
# Get DataFrame directly
|
|
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
|
|
```
|
|
|
|
### Parameters
|
|
|
|
| Parameter | Type | Default | Description |
|
|
|-----------|------|---------|-------------|
|
|
| `path` | str | required | Path to BAM or CRAM file |
|
|
| `filter_flag` | int | `1796` | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
|
|
| `min_mapping_quality` | int | `0` | Minimum mapping quality to include reads |
|
|
| `binary_cigar` | bool | `True` | Use binary CIGAR for faster processing |
|
|
| `dense_mode` | str | `"auto"` | Dense output mode |
|
|
| `use_zero_based` | bool | `None` | Coordinate system (None = use global setting) |
|
|
| `per_base` | bool | `False` | Per-base depth (True) vs block depth (False) |
|
|
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
|
|
|
|
### Output Schema (Block Mode, default)
|
|
|
|
When `per_base=False` (default), adjacent positions with the same depth are grouped into blocks:
|
|
|
|
| Column | Type | Description |
|
|
|--------|------|-------------|
|
|
| `contig` | String | Chromosome/contig name |
|
|
| `pos_start` | Int64 | Block start position |
|
|
| `pos_end` | Int64 | Block end position |
|
|
| `coverage` | Int16 | Read depth |
|
|
|
|
### Output Schema (Per-Base Mode)
|
|
|
|
When `per_base=True`, each position is reported individually:
|
|
|
|
| Column | Type | Description |
|
|
|--------|------|-------------|
|
|
| `contig` | String | Chromosome/contig name |
|
|
| `pos` | Int64 | Position |
|
|
| `coverage` | Int16 | Read depth at position |
|
|
|
|
### filter_flag
|
|
|
|
The default `filter_flag=1796` excludes reads with these SAM flags:
|
|
- 4: unmapped
|
|
- 256: secondary alignment
|
|
- 512: failed QC
|
|
- 1024: PCR/optical duplicate
|
|
|
|
### CIGAR-Aware Computation
|
|
|
|
`pb.depth()` correctly handles CIGAR operations:
|
|
- **M/X/=** (match/mismatch): Counted as coverage
|
|
- **D** (deletion): Counted as coverage (reads span the deletion)
|
|
- **N** (skipped region): Not counted (e.g., spliced alignments)
|
|
- **I** (insertion): Not counted at reference positions
|
|
- **S/H** (soft/hard clipping): Not counted
|
|
|
|
## Examples
|
|
|
|
### Whole-Genome Depth
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
import polars as pl
|
|
|
|
# Compute depth genome-wide (block mode)
|
|
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
|
|
|
# Summary statistics
|
|
depth.select(
|
|
pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
|
|
pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
|
|
pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
|
|
)
|
|
```
|
|
|
|
### Per-Base Depth
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
|
|
# Per-base depth (one row per position)
|
|
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
|
|
```
|
|
|
|
### Depth with Quality Filters
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
|
|
# Only count well-mapped reads
|
|
depth = pb.depth(
|
|
"sample.bam",
|
|
min_mapping_quality=20,
|
|
output_type="polars.DataFrame",
|
|
)
|
|
```
|
|
|
|
### Custom Flag Filter
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
|
|
# Only exclude unmapped (4) and duplicate (1024) reads
|
|
depth = pb.depth(
|
|
"sample.bam",
|
|
filter_flag=4 + 1024,
|
|
output_type="polars.DataFrame",
|
|
)
|
|
```
|
|
|
|
## Integration with Interval Operations
|
|
|
|
Depth results can be used with polars-bio interval operations. Note that depth output uses `contig`/`pos_start`/`pos_end` column names, so use `cols` parameters to map them:
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
import polars as pl
|
|
|
|
# Compute depth
|
|
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
|
|
|
# Rename columns to match interval operation conventions
|
|
depth_intervals = depth.rename({
|
|
"contig": "chrom",
|
|
"pos_start": "start",
|
|
"pos_end": "end",
|
|
})
|
|
|
|
# Find regions with adequate coverage
|
|
adequate = depth_intervals.filter(pl.col("coverage") >= 30)
|
|
|
|
# Merge adjacent adequate-coverage blocks
|
|
merged = pb.merge(adequate, output_type="polars.DataFrame")
|
|
|
|
# Find gaps in coverage (complement)
|
|
genome = pl.DataFrame({
|
|
"chrom": ["chr1"],
|
|
"start": [0],
|
|
"end": [248956422],
|
|
})
|
|
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
|
|
```
|
|
|
|
### Using cols Parameters Instead of Renaming
|
|
|
|
```python
|
|
import polars_bio as pb
|
|
|
|
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
|
targets = pb.read_bed("targets.bed")
|
|
|
|
# Use cols1 to specify depth column names
|
|
overlapping = pb.overlap(
|
|
depth, targets,
|
|
cols1=["contig", "pos_start", "pos_end"],
|
|
output_type="polars.DataFrame",
|
|
)
|
|
```
|