mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
176
scientific-skills/polars-bio/references/pileup_operations.md
Normal file
176
scientific-skills/polars-bio/references/pileup_operations.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# Pileup Operations
|
||||
|
||||
## Overview
|
||||
|
||||
polars-bio provides the `pb.depth()` function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a **LazyFrame** by default.
|
||||
|
||||
## pb.depth()
|
||||
|
||||
Compute read depth from alignment files.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
# Compute depth across entire BAM file (returns LazyFrame)
|
||||
depth_lf = pb.depth("aligned.bam")
|
||||
depth_df = depth_lf.collect()
|
||||
|
||||
# Get DataFrame directly
|
||||
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `path` | str | required | Path to BAM or CRAM file |
|
||||
| `filter_flag` | int | `1796` | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
|
||||
| `min_mapping_quality` | int | `0` | Minimum mapping quality to include reads |
|
||||
| `binary_cigar` | bool | `True` | Use binary CIGAR for faster processing |
|
||||
| `dense_mode` | str | `"auto"` | Dense output mode |
|
||||
| `use_zero_based` | bool | `None` | Coordinate system (None = use global setting) |
|
||||
| `per_base` | bool | `False` | Per-base depth (True) vs block depth (False) |
|
||||
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
|
||||
|
||||
### Output Schema (Block Mode, default)
|
||||
|
||||
When `per_base=False` (default), adjacent positions with the same depth are grouped into blocks:
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| `contig` | String | Chromosome/contig name |
|
||||
| `pos_start` | Int64 | Block start position |
|
||||
| `pos_end` | Int64 | Block end position |
|
||||
| `coverage` | Int16 | Read depth |
|
||||
|
||||
### Output Schema (Per-Base Mode)
|
||||
|
||||
When `per_base=True`, each position is reported individually:
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| `contig` | String | Chromosome/contig name |
|
||||
| `pos` | Int64 | Position |
|
||||
| `coverage` | Int16 | Read depth at position |
|
||||
|
||||
### filter_flag
|
||||
|
||||
The default `filter_flag=1796` excludes reads with these SAM flags:
|
||||
- 4: unmapped
|
||||
- 256: secondary alignment
|
||||
- 512: failed QC
|
||||
- 1024: PCR/optical duplicate
|
||||
|
||||
### CIGAR-Aware Computation
|
||||
|
||||
`pb.depth()` correctly handles CIGAR operations:
|
||||
- **M/X/=** (match/mismatch): Counted as coverage
|
||||
- **D** (deletion): Counted as coverage (reads span the deletion)
|
||||
- **N** (skipped region): Not counted (e.g., spliced alignments)
|
||||
- **I** (insertion): Not counted at reference positions
|
||||
- **S/H** (soft/hard clipping): Not counted
|
||||
|
||||
## Examples
|
||||
|
||||
### Whole-Genome Depth
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
import polars as pl
|
||||
|
||||
# Compute depth genome-wide (block mode)
|
||||
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
||||
|
||||
# Summary statistics
|
||||
depth.select(
|
||||
pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
|
||||
pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
|
||||
pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
|
||||
)
|
||||
```
|
||||
|
||||
### Per-Base Depth
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
# Per-base depth (one row per position)
|
||||
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
|
||||
```
|
||||
|
||||
### Depth with Quality Filters
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
# Only count well-mapped reads
|
||||
depth = pb.depth(
|
||||
"sample.bam",
|
||||
min_mapping_quality=20,
|
||||
output_type="polars.DataFrame",
|
||||
)
|
||||
```
|
||||
|
||||
### Custom Flag Filter
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
# Only exclude unmapped (4) and duplicate (1024) reads
|
||||
depth = pb.depth(
|
||||
"sample.bam",
|
||||
filter_flag=4 + 1024,
|
||||
output_type="polars.DataFrame",
|
||||
)
|
||||
```
|
||||
|
||||
## Integration with Interval Operations
|
||||
|
||||
Depth results can be used with polars-bio interval operations. Note that depth output uses `contig`/`pos_start`/`pos_end` column names, so use `cols` parameters to map them:
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
import polars as pl
|
||||
|
||||
# Compute depth
|
||||
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
||||
|
||||
# Rename columns to match interval operation conventions
|
||||
depth_intervals = depth.rename({
|
||||
"contig": "chrom",
|
||||
"pos_start": "start",
|
||||
"pos_end": "end",
|
||||
})
|
||||
|
||||
# Find regions with adequate coverage
|
||||
adequate = depth_intervals.filter(pl.col("coverage") >= 30)
|
||||
|
||||
# Merge adjacent adequate-coverage blocks
|
||||
merged = pb.merge(adequate, output_type="polars.DataFrame")
|
||||
|
||||
# Find gaps in coverage (complement)
|
||||
genome = pl.DataFrame({
|
||||
"chrom": ["chr1"],
|
||||
"start": [0],
|
||||
"end": [248956422],
|
||||
})
|
||||
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
|
||||
```
|
||||
|
||||
### Using cols Parameters Instead of Renaming
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
|
||||
targets = pb.read_bed("targets.bed")
|
||||
|
||||
# Use cols1 to specify depth column names
|
||||
overlapping = pb.overlap(
|
||||
depth, targets,
|
||||
cols1=["contig", "pos_start", "pos_end"],
|
||||
output_type="polars.DataFrame",
|
||||
)
|
||||
```
|
||||
Reference in New Issue
Block a user