mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scientific-skills/polars-bio/SKILL.md (new file, 374 lines)
---
name: polars-bio
description: High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.
license: https://github.com/biodatageeks/polars-bio/blob/main/LICENSE
metadata:
  skill-author: K-Dense Inc.
---

# polars-bio

## Overview

polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built on Polars, Apache Arrow, and Apache DataFusion. It provides a familiar DataFrame-centric API for interval arithmetic (overlap, nearest, merge, coverage, complement, subtract) and reading/writing common bioinformatics formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ).

Key value propositions:
- **6.5-38x faster** than bioframe on real-world genomic benchmarks
- **Streaming/out-of-core** support for large genomes via DataFusion
- **Cloud-native** file I/O (S3, GCS, Azure) with predicate pushdown
- **Two API styles**: functional (`pb.overlap(df1, df2)`) and method-chaining (`df1.lazy().pb.overlap(df2)`)
- **SQL interface** for genomic data via DataFusion SQL engine

## When to Use This Skill

Use this skill when:
- Performing genomic interval operations (overlap, nearest, merge, coverage, complement, subtract)
- Reading/writing bioinformatics file formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ)
- Processing large genomic datasets that don't fit in memory (streaming mode)
- Running SQL queries on genomic data files
- Migrating from bioframe to a faster alternative
- Computing read depth/pileup from BAM/CRAM files
- Working with Polars DataFrames containing genomic intervals

## Quick Start

### Installation

```bash
pip install polars-bio
# or
uv pip install polars-bio
```

For pandas compatibility:
```bash
pip install polars-bio[pandas]
```

### Basic Overlap Example

```python
import polars as pl
import polars_bio as pb

# Create two interval DataFrames
df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end": [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end": [8, 28],
})

# Functional API (returns LazyFrame by default)
result = pb.overlap(df1, df2)
result_df = result.collect()

# Get a DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

# Method-chaining API (via .pb accessor on LazyFrame)
result = df1.lazy().pb.overlap(df2)
result_df = result.collect()
```

### Reading a BED File

```python
import polars_bio as pb

# Eager read (loads entire file)
df = pb.read_bed("regions.bed")

# Lazy scan (streaming, for large files)
lf = pb.scan_bed("regions.bed")
result = lf.collect()
```
## Core Capabilities

### 1. Genomic Interval Operations

polars-bio provides 8 core interval operations for genomic range arithmetic. All operations accept Polars DataFrames with `chrom`, `start`, `end` columns (configurable). All operations return a `LazyFrame` by default (use `output_type="polars.DataFrame"` for eager results).

**Operations:**
- `overlap` / `count_overlaps` - Find or count overlapping intervals between two sets
- `nearest` - Find nearest intervals (with configurable `k`, `overlap`, `distance` params)
- `merge` - Merge overlapping/bookended intervals within a set
- `cluster` - Assign cluster IDs to overlapping intervals
- `coverage` - Compute per-interval coverage counts (two-input operation)
- `complement` - Find gaps between intervals within a genome
- `subtract` - Remove portions of intervals that overlap another set

**Example:**
```python
import polars_bio as pb

# Find overlapping intervals (returns LazyFrame)
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))

# Count overlaps per interval
counts = pb.count_overlaps(df1, df2)

# Merge overlapping intervals
merged = pb.merge(df1)

# Find nearest intervals
nearest = pb.nearest(df1, df2)

# Collect any LazyFrame result to DataFrame
result_df = result.collect()
```

**Reference:** See `references/interval_operations.md` for detailed documentation on all operations, parameters, output schemas, and performance considerations.
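The `merge` operation follows standard genomic-range semantics: intervals that overlap or are bookended (one ends exactly where the next starts) collapse into one, along with a count of how many inputs were absorbed. A plain-Python sketch of that semantics — an illustration of the behavior only, not polars-bio's implementation:

```python
def merge_intervals(intervals):
    """Collapse overlapping or bookended [start, end) intervals on a
    single chromosome, tracking how many inputs each output absorbed."""
    merged = []
    for start, end in sorted(intervals):
        # start <= previous end covers both the overlap and bookended cases
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, n = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), n + 1)
        else:
            merged.append((start, end, 1))  # (start, end, n_intervals)
    return merged
```

The `n_intervals` count mirrors the extra column that polars-bio's `merge` reports (see the bioframe migration reference).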
### 2. Bioinformatics File I/O

Read and write common bioinformatics formats with `read_*`, `scan_*`, `write_*`, and `sink_*` functions. Supports cloud storage (S3, GCS, Azure) and compression (GZIP, BGZF).

**Supported formats:**
- **BED** - Genomic intervals (`read_bed`, `scan_bed`)
- **VCF** - Genetic variants (`read_vcf`, `scan_vcf`, `write_vcf`, `sink_vcf`)
- **BAM** - Aligned reads (`read_bam`, `scan_bam`, `write_bam`, `sink_bam`)
- **CRAM** - Compressed alignments (`read_cram`, `scan_cram`, `write_cram`, `sink_cram`)
- **GFF** - Gene annotations (`read_gff`, `scan_gff`)
- **GTF** - Gene annotations (`read_gtf`, `scan_gtf`)
- **FASTA** - Reference sequences (`read_fasta`, `scan_fasta`)
- **FASTQ** - Sequencing reads (`read_fastq`, `scan_fastq`, `write_fastq`, `sink_fastq`)
- **SAM** - Text alignments (`read_sam`, `scan_sam`, `write_sam`, `sink_sam`)
- **Hi-C pairs** - Chromatin contacts (`read_pairs`, `scan_pairs`)

**Example:**
```python
import polars_bio as pb

# Read VCF file
variants = pb.read_vcf("samples.vcf.gz")

# Lazy scan BAM file (streaming)
alignments = pb.scan_bam("aligned.bam")

# Read GFF annotations
genes = pb.read_gff("annotations.gff3")

# Cloud storage (individual params, not a dict)
df = pb.read_bed("s3://bucket/regions.bed",
                 allow_anonymous=True)
```

**Reference:** See `references/file_io.md` for per-format column schemas, parameters, cloud storage options, and compression support.
### 3. SQL Data Processing

Register bioinformatics files as tables and query them using DataFusion SQL. Combines the power of SQL with polars-bio's genomic-aware readers.

```python
import polars as pl
import polars_bio as pb

# Register files as SQL tables (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")

# Query with SQL (returns LazyFrame)
result = pb.sql("SELECT chrom, start, end, ref, alt FROM variants WHERE qual > 30")
result_df = result.collect()

# Register a Polars DataFrame as a SQL table
pb.from_polars("my_intervals", df)
result = pb.sql("SELECT * FROM my_intervals WHERE chrom = 'chr1'").collect()
```

**Reference:** See `references/sql_processing.md` for register functions, SQL syntax, and examples.

### 4. Pileup Operations

Compute per-base read depth from BAM/CRAM files with CIGAR-aware depth calculation.

```python
import polars_bio as pb

# Compute depth across a BAM file
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# With quality filter
depth_lf = pb.depth("aligned.bam", min_mapping_quality=20)
```

**Reference:** See `references/pileup_operations.md` for parameters and integration patterns.
## Key Concepts

### Coordinate Systems

polars-bio defaults to **1-based** coordinates (genomic convention). This can be changed globally:

```python
import polars_bio as pb

# Switch to 0-based coordinates
pb.set_option("coordinate_system", "0-based")

# Switch back to 1-based (default)
pb.set_option("coordinate_system", "1-based")
```

I/O functions also accept `use_zero_based` to set coordinate metadata on the resulting DataFrame:

```python
# Read BED with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)
```

**Important:** BED files are always 0-based half-open in the file format. polars-bio handles the conversion automatically when reading BED files. Coordinate metadata is attached to DataFrames by I/O functions and propagated through operations.

### Two API Styles

**Functional API** - standalone functions, explicit inputs:
```python
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
merged = pb.merge(df)
```

**Method-chaining API** - via `.pb` accessor on **LazyFrames** (not DataFrames):
```python
result = df1.lazy().pb.overlap(df2)
merged = df.lazy().pb.merge()
```

**Important:** The `.pb` accessor for interval operations is only available on `LazyFrame`. On `DataFrame`, `.pb` provides write operations only (`write_bam`, `write_vcf`, etc.).

Method-chaining enables fluent pipelines:
```python
import polars as pl

# Chain interval operations (note: overlap outputs suffixed columns,
# so rename before merge, which expects chrom/start/end)
result = (
    df1.lazy()
    .pb.overlap(df2)
    .filter(pl.col("start_2") > 1000)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)
```

### Probe-Build Architecture

For two-input operations (overlap, nearest, count_overlaps, coverage), polars-bio uses a probe-build join strategy:
- The **first** DataFrame is the **probe** (iterated over)
- The **second** DataFrame is the **build** (indexed for lookup)

For best performance, pass the larger DataFrame as the first argument (probe) and the smaller one as the second (build).
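The asymmetry can be illustrated with a plain-Python sketch (not polars-bio's actual Rust implementation, which uses interval indexes rather than per-chromosome lists): the build side is indexed once, the probe side is only streamed over, so indexing the smaller input is the cheaper arrangement.

```python
from collections import defaultdict

def overlap_probe_build(probe, build):
    """Join two interval lists [(chrom, start, end), ...] probe-build
    style: index the build side per chromosome, stream the probe side."""
    # Build phase: index the (ideally smaller) second input once
    index = defaultdict(list)
    for chrom, start, end in build:
        index[chrom].append((start, end))
    # Probe phase: one lookup per probe interval; the large side is never indexed
    hits = []
    for chrom, start, end in probe:
        for b_start, b_end in index[chrom]:
            if start < b_end and b_start < end:  # half-open overlap test
                hits.append((chrom, start, end, b_start, b_end))
    return hits
```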
### Column Conventions

By default, polars-bio expects columns named `chrom`, `start`, `end`. Custom column names can be specified via lists:

```python
result = pb.overlap(
    df1, df2,
    cols1=["chromosome", "begin", "finish"],
    cols2=["chr", "pos_start", "pos_end"],
)
```

### Return Types and Collecting Results

All interval operations and `pb.sql()` return a **LazyFrame** by default. Use `.collect()` to materialize results, or pass `output_type="polars.DataFrame"` for eager evaluation:

```python
# Lazy (default) - collect when needed
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()

# Eager - get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
```

### Streaming and Out-of-Core Processing

For datasets larger than available RAM, use `scan_*` functions and streaming execution:

```python
# Scan files lazily
lf = pb.scan_bed("large_intervals.bed")

# Process with streaming
result = lf.collect(streaming=True)
```

DataFusion streaming is enabled by default for interval operations, processing data in batches without loading the full dataset into memory.
## Common Pitfalls

1. **`.pb` accessor on DataFrame vs LazyFrame:** Interval operations (overlap, merge, etc.) are only on `LazyFrame.pb`. `DataFrame.pb` only has write methods. Use `.lazy()` to convert before chaining interval ops.

2. **LazyFrame returns:** All interval operations and `pb.sql()` return `LazyFrame` by default. Don't forget `.collect()`, or use `output_type="polars.DataFrame"`.

3. **Column name mismatches:** polars-bio expects `chrom`, `start`, `end` by default. Use the `cols1`/`cols2` parameters (as lists) if your columns have different names.

4. **Coordinate system metadata:** When constructing DataFrames manually (not via `read_*`/`scan_*`), polars-bio warns about missing coordinate metadata. Use `pb.set_option("coordinate_system", "0-based")` globally, or use I/O functions that set metadata automatically.

5. **Probe-build order matters:** For overlap, nearest, and coverage, the first DataFrame is probed against the second. Swapping arguments changes which intervals appear in the left vs right output columns, and can affect performance.

6. **INT32 position limit:** Genomic positions are stored as 32-bit integers, limiting coordinates to ~2.1 billion. This is sufficient for all known genomes but may be an issue with custom coordinate spaces.

7. **BAM index requirements:** `read_bam` and `scan_bam` require a `.bai` index file alongside the BAM. Create one with `samtools index` if missing.

8. **Parallel execution disabled by default:** DataFusion parallelism defaults to 1 partition. Enable it for large datasets:

   ```python
   pb.set_option("datafusion.execution.target_partitions", 8)
   ```

9. **CRAM has separate functions:** Use `read_cram`/`scan_cram`/`register_cram` for CRAM files (not `read_bam`). CRAM functions require a `reference_path` parameter.
## Best Practices

1. **Use `scan_*` for large files:** Prefer `scan_bed`, `scan_vcf`, etc. over `read_*` for files larger than available RAM. Scan functions enable streaming and predicate pushdown.

2. **Configure parallelism for large datasets:**

   ```python
   import os
   pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
   ```

3. **Use BGZF compression:** BGZF-compressed files (`.bed.gz`, `.vcf.gz`) support parallel block decompression and are significantly faster to read than plain GZIP.

4. **Select columns early:** When only specific columns are needed, select them early to reduce memory usage:

   ```python
   df = pb.read_vcf("large.vcf.gz").select("chrom", "start", "end", "ref", "alt")
   ```

5. **Use cloud paths directly:** Pass S3/GCS/Azure URIs directly to read/scan functions instead of downloading files first:

   ```python
   df = pb.read_bed("s3://my-bucket/regions.bed", allow_anonymous=True)
   ```

6. **Prefer the functional API for single operations and method-chaining for pipelines:** Use `pb.overlap()` for one-off operations and `.lazy().pb.overlap()` when building multi-step pipelines.

## Resources

### references/

Detailed documentation for each major capability:

- **interval_operations.md** - All 8 interval operations with parameters, examples, output schemas, and performance tips. Core reference for genomic range arithmetic.
- **file_io.md** - Supported formats table, per-format column schemas, cloud storage configuration, compression support, and common parameters.
- **sql_processing.md** - Register functions, DataFusion SQL syntax, combining SQL with interval operations, and example queries.
- **pileup_operations.md** - Per-base read depth computation from BAM/CRAM files, parameters, and integration with interval operations.
- **configuration.md** - Global settings (parallelism, coordinate systems, streaming modes), logging, and metadata management.
- **bioframe_migration.md** - Operation mapping table, API differences, performance comparison, migration code examples, and pandas compatibility mode.
scientific-skills/polars-bio/references/bioframe_migration.md (new file, 250 lines)
# Migrating from bioframe to polars-bio

## Overview

polars-bio is a drop-in replacement for bioframe's core interval operations, offering 6.5-38x speedups on real-world genomic benchmarks. The main differences are: Polars DataFrames instead of pandas, a Rust/DataFusion backend instead of pure Python, streaming support for large genomes, and LazyFrame returns by default.

## Operation Mapping

| bioframe | polars-bio | Notes |
|----------|------------|-------|
| `bioframe.overlap(df1, df2)` | `pb.overlap(df1, df2)` | Returns LazyFrame; `.collect()` for DataFrame |
| `bioframe.closest(df1, df2)` | `pb.nearest(df1, df2)` | Renamed; uses `k`, `overlap`, `distance` params |
| `bioframe.count_overlaps(df1, df2)` | `pb.count_overlaps(df1, df2)` | Default suffixes (`("", "_")`) differ from bioframe's |
| `bioframe.merge(df)` | `pb.merge(df)` | Output includes `n_intervals` column |
| `bioframe.cluster(df)` | `pb.cluster(df)` | Output cols: `cluster`, `cluster_start`, `cluster_end` |
| `bioframe.coverage(df1, df2)` | `pb.coverage(df1, df2)` | Two-input in both libraries |
| `bioframe.complement(df, chromsizes)` | `pb.complement(df, view_df=genome)` | Genome as DataFrame, not Series |
| `bioframe.subtract(df1, df2)` | `pb.subtract(df1, df2)` | Same semantics |
## Key API Differences

### DataFrames: pandas vs Polars

**bioframe (pandas):**
```python
import bioframe
import pandas as pd

df1 = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end": [5, 20],
})

result = bioframe.overlap(df1, df2)
# result is a pandas DataFrame
result["start_1"]  # pandas column access
```

**polars-bio (Polars):**
```python
import polars_bio as pb
import polars as pl

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end": [5, 20],
})

result = pb.overlap(df1, df2)  # Returns LazyFrame
result_df = result.collect()  # Materialize to DataFrame
result_df.select("start_1")  # Polars column access
```

### Return Types: LazyFrame by Default

All polars-bio operations return a **LazyFrame** by default. Use `.collect()` or `output_type="polars.DataFrame"`:

```python
# bioframe: always returns DataFrame
result = bioframe.overlap(df1, df2)

# polars-bio: returns LazyFrame, collect for DataFrame
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
```

### Genome/Chromsizes

**bioframe:**
```python
chromsizes = bioframe.fetch_chromsizes("hg38")  # Returns pandas Series
complement = bioframe.complement(df, chromsizes)
```

**polars-bio:**
```python
genome = pl.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [0, 0],
    "end": [248956422, 242193529],
})
complement = pb.complement(df, view_df=genome)
```

### closest vs nearest

**bioframe:**
```python
result = bioframe.closest(df1, df2)
```

**polars-bio:**
```python
# Basic nearest
result = pb.nearest(df1, df2)

# Find k nearest neighbors
result = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals
result = pb.nearest(df1, df2, overlap=False)

# Without distance column
result = pb.nearest(df1, df2, distance=False)
```
### Method-Chaining (polars-bio only)

polars-bio adds a `.pb` accessor on **LazyFrame** for method chaining:

```python
# bioframe: sequential function calls
merged = bioframe.merge(bioframe.overlap(df1, df2))

# polars-bio: fluent pipeline (must use LazyFrame)
# Note: overlap adds suffixes, so rename before merge
merged = (
    df1.lazy()
    .pb.overlap(df2)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)
```

## Performance Comparison

Benchmarks on real-world genomic datasets (from the polars-bio paper, Bioinformatics 2025):

| Operation | bioframe | polars-bio | Speedup |
|-----------|----------|------------|---------|
| overlap | 1.0x | 6.5x | 6.5x |
| nearest | 1.0x | 38x | 38x |
| merge | 1.0x | 8.2x | 8.2x |
| coverage | 1.0x | 12x | 12x |

Speedups come from:
- Rust-based interval tree implementation
- Apache DataFusion query engine
- Apache Arrow columnar memory format
- Parallel execution (when configured)
- Streaming/out-of-core support

## Migration Code Examples

### Example 1: Basic Overlap Pipeline

**Before (bioframe):**
```python
import bioframe
import pandas as pd

df1 = pd.read_csv("peaks.bed", sep="\t", names=["chrom", "start", "end"])
df2 = pd.read_csv("genes.bed", sep="\t", names=["chrom", "start", "end", "name"])

overlaps = bioframe.overlap(df1, df2, suffixes=("_peak", "_gene"))
filtered = overlaps[overlaps["start_gene"] > 10000]
merged = bioframe.merge(
    filtered[["chrom_peak", "start_peak", "end_peak"]]
    .rename(columns={"chrom_peak": "chrom", "start_peak": "start", "end_peak": "end"})
)
```

**After (polars-bio):**
```python
import polars_bio as pb
import polars as pl

df1 = pb.read_bed("peaks.bed")
df2 = pb.read_bed("genes.bed")

overlaps = pb.overlap(df1, df2, suffixes=("_peak", "_gene"), output_type="polars.DataFrame")
filtered = overlaps.filter(pl.col("start_gene") > 10000)
merged = pb.merge(
    filtered.select(
        pl.col("chrom_peak").alias("chrom"),
        pl.col("start_peak").alias("start"),
        pl.col("end_peak").alias("end"),
    ),
    output_type="polars.DataFrame",
)
```
### Example 2: Large-Scale Streaming

**Before (bioframe) — limited to in-memory:**
```python
import bioframe
import pandas as pd

# Must load entire file into memory
df1 = pd.read_csv("huge_intervals.bed", sep="\t", names=["chrom", "start", "end"])
result = bioframe.merge(df1)  # Memory-bound
```

**After (polars-bio) — streaming:**
```python
import polars_bio as pb

# Lazy scan, streaming execution
lf = pb.scan_bed("huge_intervals.bed")
result = pb.merge(lf).collect(streaming=True)
```

## pandas Compatibility Mode

For gradual migration, install with pandas support:

```bash
pip install polars-bio[pandas]
```

This enables conversion between pandas and Polars DataFrames:

```python
import polars_bio as pb
import polars as pl

# Convert pandas DataFrame to Polars for polars-bio
polars_df = pl.from_pandas(pandas_df)
result = pb.overlap(polars_df, other_df).collect()

# Convert back to pandas if needed
pandas_result = result.to_pandas()

# Or request pandas output directly
pandas_result = pb.overlap(polars_df, other_df, output_type="pandas.DataFrame")
```

## Migration Checklist

1. Replace `import bioframe` with `import polars_bio as pb`
2. Replace `import pandas as pd` with `import polars as pl`
3. Convert DataFrame creation from `pd.DataFrame` to `pl.DataFrame`
4. Replace `bioframe.closest` with `pb.nearest`
5. Add `.collect()` after operations (they return LazyFrame by default)
6. Update column access from `df["col"]` to `df.select("col")` or `pl.col("col")`
7. Replace pandas filtering `df[df["col"] > x]` with `df.filter(pl.col("col") > x)`
8. Update chromsizes from Series to DataFrame with `chrom`, `start`, `end`; pass as `view_df=`
9. Add `pb.set_option("datafusion.execution.target_partitions", N)` for parallelism
10. Replace `pd.read_csv` for BED files with `pb.read_bed` or `pb.scan_bed`
11. Note `cluster` output column is `cluster` (not `cluster_id`), plus `cluster_start`, `cluster_end`
12. Note `merge` output includes `n_intervals` column
scientific-skills/polars-bio/references/configuration.md (new file, 175 lines)
# Configuration

## Overview

polars-bio uses a global configuration system based on `set_option` and `get_option` to control execution behavior, coordinate systems, parallelism, and streaming modes.

## set_option / get_option

```python
import polars_bio as pb

# Set a configuration option
pb.set_option("datafusion.execution.target_partitions", 8)

# Get current value
value = pb.get_option("datafusion.execution.target_partitions")
```

## Parallelism

### DataFusion Target Partitions

Controls the number of parallel execution partitions. Defaults to 1 (single-threaded).

```python
import os
import polars_bio as pb

# Use all available CPU cores
pb.set_option("datafusion.execution.target_partitions", os.cpu_count())

# Set a specific number of partitions
pb.set_option("datafusion.execution.target_partitions", 8)
```

**When to increase parallelism:**
- Processing large files (>1GB)
- Running interval operations on millions of intervals
- Batch processing multiple chromosomes

**When to keep the default (1):**
- Small datasets
- Memory-constrained environments
- Debugging (deterministic execution)

## Coordinate Systems

polars-bio defaults to 1-based coordinates (standard genomic convention).

### Global Coordinate System

```python
import polars_bio as pb

# Switch to 0-based half-open coordinates
pb.set_option("coordinate_system", "0-based")

# Switch back to 1-based (default)
pb.set_option("coordinate_system", "1-based")

# Check current setting
print(pb.get_option("coordinate_system"))
```

### Per-File Override via I/O Functions

I/O functions accept `use_zero_based` to set coordinate metadata on the resulting DataFrame:

```python
# Read with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)
```

**Note:** Interval operations (overlap, nearest, etc.) do **not** accept `use_zero_based`. They read coordinate metadata from the DataFrames, which is set by I/O functions or the global option. When using manually constructed DataFrames, polars-bio warns about missing metadata and falls back to the global setting.

### Setting Metadata on Manual DataFrames

```python
import polars_bio as pb

# Set coordinate metadata on a manually created DataFrame
pb.set_source_metadata(df, format="bed", path="")
```

### File Format Conventions

| Format | Native Coordinate System | polars-bio Conversion |
|--------|--------------------------|-----------------------|
| BED | 0-based half-open | Converted to configured system on read |
| VCF | 1-based | Converted to configured system on read |
| GFF/GTF | 1-based | Converted to configured system on read |
| BAM/SAM | 0-based | Converted to configured system on read |
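The conversions in the table reduce to a one-unit shift of the start coordinate: a 0-based half-open interval [start, end) and the 1-based fully-closed interval [start + 1, end] describe the same bases, so `end` is unchanged in both directions. A small sketch of the arithmetic (the convention itself, not polars-bio code):

```python
def bed_to_one_based(start, end):
    """BED stores 0-based half-open [start, end); the 1-based fully
    closed equivalent is [start + 1, end]."""
    return start + 1, end

def one_based_to_bed(start, end):
    """Inverse: 1-based closed [start, end] back to 0-based half-open."""
    return start - 1, end
```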
## Streaming Execution Modes

polars-bio supports two streaming modes for out-of-core processing:

### DataFusion Streaming

Enabled by default for interval operations. Processes data in batches through the DataFusion execution engine.

```python
# DataFusion streaming is automatic for interval operations
result = pb.overlap(lf1, lf2)  # Streams if inputs are LazyFrames
```

### Polars Streaming

Use Polars' native streaming for post-processing operations:

```python
# Collect with Polars streaming
result = lf.collect(streaming=True)
```

### Combining Both

```python
import polars_bio as pb

# Scan files lazily (DataFusion streaming for I/O)
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")

# Interval operation (DataFusion streaming)
result_lf = pb.overlap(lf1, lf2)

# Collect with Polars streaming for final materialization
result = result_lf.collect(streaming=True)
```

## Logging

Control log verbosity for debugging:

```python
import polars_bio as pb

# Set log level
pb.set_loglevel("debug")  # Detailed execution info
pb.set_loglevel("info")   # Standard messages
pb.set_loglevel("warn")   # Warnings only (default)
```

**Note:** Only `"debug"`, `"info"`, and `"warn"` are valid log levels.

## Metadata Management

polars-bio attaches coordinate system and source metadata to DataFrames produced by I/O functions. This metadata is used by interval operations to determine the coordinate system.

```python
import polars_bio as pb

# Inspect metadata on a DataFrame
metadata = pb.get_metadata(df)

# Print metadata summary
pb.print_metadata_summary(df)

# Print metadata as JSON
pb.print_metadata_json(df)

# Set metadata on a manually created DataFrame
pb.set_source_metadata(df, format="bed", path="regions.bed")

# Register a DataFrame as a SQL table
pb.from_polars("my_table", df)
```

## Complete Configuration Reference

| Option | Default | Description |
|--------|---------|-------------|
| `datafusion.execution.target_partitions` | `1` | Number of parallel execution partitions |
| `coordinate_system` | `"1-based"` | Default coordinate system (`"0-based"` or `"1-based"`) |
# Bioinformatics File I/O

## Overview

polars-bio provides `read_*`, `scan_*`, `write_*`, and `sink_*` functions for common bioinformatics formats. `read_*` loads data eagerly into a DataFrame, while `scan_*` creates a LazyFrame for streaming/out-of-core processing. `write_*` writes from a DataFrame/LazyFrame and returns a row count, while `sink_*` streams from a LazyFrame.

## Supported Formats

| Format | Read | Scan | Register (SQL) | Write | Sink |
|--------|------|------|-----------------|-------|------|
| BED | `read_bed` | `scan_bed` | `register_bed` | — | — |
| VCF | `read_vcf` | `scan_vcf` | `register_vcf` | `write_vcf` | `sink_vcf` |
| BAM | `read_bam` | `scan_bam` | `register_bam` | `write_bam` | `sink_bam` |
| CRAM | `read_cram` | `scan_cram` | `register_cram` | `write_cram` | `sink_cram` |
| GFF | `read_gff` | `scan_gff` | `register_gff` | — | — |
| GTF | `read_gtf` | `scan_gtf` | `register_gtf` | — | — |
| FASTA | `read_fasta` | `scan_fasta` | — | — | — |
| FASTQ | `read_fastq` | `scan_fastq` | `register_fastq` | `write_fastq` | `sink_fastq` |
| SAM | `read_sam` | `scan_sam` | `register_sam` | `write_sam` | `sink_sam` |
| Hi-C pairs | `read_pairs` | `scan_pairs` | `register_pairs` | — | — |
| Generic table | `read_table` | `scan_table` | — | — | — |

## Common Cloud/IO Parameters

All `read_*` and `scan_*` functions share these parameters (instead of a single `storage_options` dict):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required | File path (local, S3, GCS, Azure) |
| `chunk_size` | int | `8` | Number of chunks for parallel reading |
| `concurrent_fetches` | int | `1` | Number of concurrent fetches for cloud storage |
| `allow_anonymous` | bool | `True` | Allow anonymous access to cloud storage |
| `enable_request_payer` | bool | `False` | Enable requester-pays for cloud storage |
| `max_retries` | int | `5` | Maximum retries for cloud operations |
| `timeout` | int | `300` | Timeout in seconds for cloud operations |
| `compression_type` | str | `"auto"` | Compression type (auto-detected from extension) |
| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |
| `use_zero_based` | bool | `None` | Set coordinate system metadata (None = use global setting) |

Not all functions support all parameters: SAM functions lack the cloud parameters, and FASTA/FASTQ lack `predicate_pushdown`.
## BED Format

### read_bed / scan_bed

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.

```python
import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")
```

### Column Schema (BED3)

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome name |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |

Extended BED fields (auto-detected) add: `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`, `blockCount`, `blockSizes`, `blockStarts`.
## VCF Format

### read_vcf / scan_vcf

Read VCF/BCF files. Supports `.vcf`, `.vcf.gz`, `.bcf`.

```python
import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `info_fields` | list[str] | `None` | INFO fields to extract as columns |
| `format_fields` | list[str] | `None` | FORMAT fields to extract as columns |
| `samples` | list[str] | `None` | Samples to include |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome |
| `start` | UInt32 | Start position |
| `end` | UInt32 | End position |
| `id` | String | Variant ID |
| `ref` | String | Reference allele |
| `alt` | String | Alternate allele(s) |
| `qual` | Float32 | Quality score |
| `filter` | String | Filter status |
| `info` | String | INFO field (raw, unless `info_fields` specified) |

### write_vcf / sink_vcf

```python
import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")
```
## BAM Format

### read_bam / scan_bam

Read aligned sequencing reads from BAM files. Requires a `.bai` index file.

```python
import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |
| `infer_tag_types` | bool | `True` | Infer tag column types from data |
| `infer_tag_sample_size` | int | `100` | Number of records to sample for type inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints for tags |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Reference sequence name |
| `start` | Int64 | Alignment start position |
| `end` | Int64 | Alignment end position |
| `name` | String | Read name |
| `flags` | UInt32 | SAM flags |
| `mapping_quality` | UInt32 | Mapping quality |
| `cigar` | String | CIGAR string |
| `sequence` | String | Read sequence |
| `quality_scores` | String | Base quality string |
| `mate_chrom` | String | Mate reference name |
| `mate_start` | Int64 | Mate start position |
| `template_length` | Int64 | Template length |

### write_bam / sink_bam

```python
rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)
```
## CRAM Format

### read_cram / scan_cram

CRAM files have **separate functions** from BAM and require a reference FASTA plus a `.crai` index.

```python
import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")
```

Same additional parameters and column schema as BAM, plus:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `reference_path` | str | `None` | Path to reference FASTA |

### write_cram / sink_cram

```python
rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")
```
## GFF/GTF Format

### read_gff / scan_gff / read_gtf / scan_gtf

GFF3 and GTF have separate functions.

```python
import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `attr_fields` | list[str] | `None` | Attribute fields to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Sequence name |
| `source` | String | Feature source |
| `type` | String | Feature type (gene, exon, etc.) |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
| `score` | Float32 | Score |
| `strand` | String | Strand (+/-/.) |
| `phase` | UInt32 | Phase (0/1/2) |
| `attributes` | String | Attributes string |
## FASTA Format

### read_fasta / scan_fasta

Read reference sequences from FASTA files.

```python
import polars_bio as pb

df = pb.read_fasta("reference.fasta")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Sequence name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
## FASTQ Format

### read_fastq / scan_fastq

Read raw sequencing reads with quality scores.

```python
import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Read name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
| `quality` | String | Quality string (Phred+33 encoded) |
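The `quality` column stores Phred+33-encoded strings. As a quick illustration (plain Python, independent of polars-bio), each character maps to a numeric quality score:

```python
# Decode a Phred+33 quality string: each character encodes Q = ord(char) - 33.
def phred33_to_scores(quality: str) -> list[int]:
    return [ord(c) - 33 for c in quality]

def error_probability(q: int) -> float:
    # Phred definition: P(error) = 10^(-Q/10)
    return 10 ** (-q / 10)

# 'I' -> Q40 (1 in 10,000 error rate), '!' -> Q0 (no confidence)
scores = phred33_to_scores("II5?")  # -> [40, 40, 20, 30]
```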
### write_fastq / sink_fastq

```python
rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")
```
## SAM Format

### read_sam / scan_sam

Read text-format alignment files. Same column schema as BAM. No cloud parameters.

```python
import polars_bio as pb

df = pb.read_sam("alignments.sam")
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract |
| `infer_tag_types` | bool | `True` | Infer tag types |
| `infer_tag_sample_size` | int | `100` | Sample size for inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints |

### write_sam / sink_sam

```python
rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)
```
## Hi-C Pairs

### read_pairs / scan_pairs

Read Hi-C pairs format files for chromatin contact data.

```python
import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `readID` | String | Read identifier |
| `chrom1` | String | Chromosome of first contact |
| `pos1` | Int32 | Position of first contact |
| `chrom2` | String | Chromosome of second contact |
| `pos2` | Int32 | Position of second contact |
| `strand1` | String | Strand of first contact |
| `strand2` | String | Strand of second contact |
## Generic Table Reader

### read_table / scan_table

Read tab-delimited files with a custom schema. Useful for non-standard formats or bioframe-compatible tables.

```python
import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
```
## Cloud Storage

All `read_*` and `scan_*` functions support cloud storage via individual parameters:

### Amazon S3

```python
df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)
```

### Google Cloud Storage

```python
df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)
```

### Azure Blob Storage

```python
df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)
```

**Note:** For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., `AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`).
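As a sketch of what that environment-variable setup might look like (the exact variable names recognized depend on the underlying object-store backend, so treat these as typical examples rather than a polars-bio-specific list):

```shell
# Amazon S3 (standard AWS SDK variables)
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"

# Google Cloud Storage (service-account credentials file)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Azure Blob Storage (account name/key, as used by common object-store backends)
export AZURE_STORAGE_ACCOUNT_NAME="myaccount"
export AZURE_STORAGE_ACCOUNT_KEY="your-account-key"
```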
## Compression Support

polars-bio transparently handles compressed files:

| Compression | Extension | Parallel Decompression |
|-------------|-----------|----------------------|
| GZIP | `.gz` | No |
| BGZF | `.gz` (with BGZF blocks) | Yes |
| Uncompressed | (none) | N/A |

**Recommendation:** Use BGZF compression (e.g., created with `bgzip`) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.
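Because BGZF blocks are ordinary gzip members that carry a `BC` subfield in the gzip "extra" header, you can distinguish BGZF from plain GZIP by inspecting the first bytes of a file. A standalone Python sketch (not part of the polars-bio API), using the fixed 28-byte BGZF end-of-file marker as a convenient test input:

```python
import gzip

# The 28-byte BGZF EOF marker: a valid, empty gzip member whose "extra"
# field carries the BGZF 'BC' subfield with the block size.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def looks_like_bgzf(first_bytes: bytes) -> bool:
    # gzip magic, FEXTRA flag set, and the extra field starting with 'BC'
    return (
        len(first_bytes) >= 14
        and first_bytes[:2] == b"\x1f\x8b"
        and first_bytes[3] & 0x04 != 0
        and first_bytes[12:14] == b"BC"
    )

# BGZF data is still valid gzip, so standard tools can decompress it
assert gzip.decompress(BGZF_EOF) == b""
assert looks_like_bgzf(BGZF_EOF)
assert not looks_like_bgzf(gzip.compress(b"plain gzip"))
```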
## Describe Functions

Inspect file structure without fully reading the file:

```python
import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
```
# Genomic Interval Operations

## Overview

polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: `chrom`, `start`, `end` by default) and return a **LazyFrame** by default. Pass `output_type="polars.DataFrame"` for eager results.

## Operations Summary

| Operation | Inputs | Description |
|-----------|--------|-------------|
| `overlap` | two DataFrames | Find pairs of overlapping intervals |
| `count_overlaps` | two DataFrames | Count overlaps per interval in the first set |
| `nearest` | two DataFrames | Find nearest intervals between two sets |
| `merge` | one DataFrame | Merge overlapping/bookended intervals |
| `cluster` | one DataFrame | Assign cluster IDs to overlapping intervals |
| `coverage` | two DataFrames | Compute per-interval coverage counts |
| `complement` | one DataFrame + genome | Find gaps between intervals |
| `subtract` | two DataFrames | Remove overlapping portions |
## overlap

Find pairs of overlapping intervals between two DataFrames.

### Functional API

```python
import polars as pl
import polars_bio as pb

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end": [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end": [8, 28],
})

# Returns LazyFrame by default
result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")
```

### Method-Chaining API (LazyFrame only)

```python
result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | First (probe) interval set |
| `df2` | DataFrame/LazyFrame/str | required | Second (build) interval set |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for overlapping column names |
| `on_cols` | list[str] | `None` | Additional columns to join on (beyond genomic coords) |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `algorithm` | str | `"Coitrees"` | Interval algorithm |
| `low_memory` | bool | `False` | Low memory mode |
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |
| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |

### Output Schema

Returns columns from both inputs with suffixes applied:

- `chrom_1`, `start_1`, `end_1` (from df1)
- `chrom_2`, `start_2`, `end_2` (from df2)
- Any additional columns from df1 and df2

Column dtypes are `String` for chrom and `Int64` for start/end.
## count_overlaps

Count the number of overlapping intervals from df2 for each interval in df1.

```python
# Functional
counts = pb.count_overlaps(df1, df2)

# Method-chaining (LazyFrame)
counts = df1.lazy().pb.count_overlaps(df2)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
| `suffixes` | tuple[str, str] | `("", "_")` | Suffixes for column names |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `on_cols` | list[str] | `None` | Additional join columns |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `naive_query` | bool | `True` | Use naive query strategy |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns df1 columns with an additional `count` column (Int64).
## nearest

Find the nearest interval in df2 for each interval in df1.

```python
# Find nearest (default: k=1, any direction)
nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")

# Find k nearest
nearest = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals from results
nearest = pb.nearest(df1, df2, overlap=False)

# Without distance column
nearest = pb.nearest(df1, df2, distance=False)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query interval set |
| `df2` | DataFrame/LazyFrame/str | required | Target interval set |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
| `on_cols` | list[str] | `None` | Additional join columns |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `k` | int | `1` | Number of nearest neighbors to find |
| `overlap` | bool | `True` | Include overlapping intervals in results |
| `distance` | bool | `True` | Include distance column in output |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns columns from both DataFrames (with suffixes) plus a `distance` column (Int64) with the distance to the nearest interval (0 if overlapping). The distance column is omitted if `distance=False`.
## merge

Merge overlapping and bookended intervals within a single DataFrame.

```python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1, 4, 20, 1],
    "end": [6, 9, 30, 10],
})

# Functional
merged = pb.merge(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
merged = df.lazy().pb.merge().collect()

# Merge intervals within a minimum distance
merged = pb.merge(df, min_dist=10)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set to merge |
| `min_dist` | int | `0` | Minimum distance between intervals to merge (0 = must overlap or be bookended) |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
| `on_cols` | list[str] | `None` | Additional grouping columns |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome |
| `start` | Int64 | Merged interval start |
| `end` | Int64 | Merged interval end |
| `n_intervals` | Int64 | Number of intervals merged |
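To make the merging rule concrete, here is a minimal plain-Python sketch of the semantics (not the actual implementation, which runs inside DataFusion): sorted intervals on the same chromosome are combined when the gap between them is at most `min_dist`, and `min_dist=0` also merges bookended intervals.

```python
# Conceptual sketch of merge semantics over (chrom, start, end) tuples.
def merge_intervals(intervals, min_dist=0):
    out = []
    for chrom, start, end in sorted(intervals):
        if out and out[-1][0] == chrom and start - out[-1][2] <= min_dist:
            # Gap small enough: extend the previous merged interval
            prev = out[-1]
            out[-1] = (chrom, prev[1], max(prev[2], end), prev[3] + 1)
        else:
            out.append((chrom, start, end, 1))
    return [
        {"chrom": c, "start": s, "end": e, "n_intervals": n}
        for c, s, e, n in out
    ]

# Same data as the pb.merge example above
merged = merge_intervals([
    ("chr1", 1, 6), ("chr1", 4, 9), ("chr1", 20, 30), ("chr2", 1, 10),
])
# -> chr1 1-9 (2 intervals merged), chr1 20-30, chr2 1-10
```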
## cluster

Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.

```python
# Functional
clustered = pb.cluster(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
clustered = df.lazy().pb.cluster().collect()

# With minimum distance
clustered = pb.cluster(df, min_dist=5)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set |
| `min_dist` | int | `0` | Minimum distance for clustering |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns the original columns plus:

| Column | Type | Description |
|--------|------|-------------|
| `cluster` | Int64 | Cluster ID (intervals in the same cluster overlap) |
| `cluster_start` | Int64 | Start of the cluster extent |
| `cluster_end` | Int64 | End of the cluster extent |
## coverage

Compute per-interval coverage counts. This is a **two-input** operation: for each interval in df1, count the coverage from df2.

```python
# Functional
cov = pb.coverage(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
cov = df1.lazy().pb.coverage(df2).collect()
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Query intervals |
| `df2` | DataFrame/LazyFrame/str | required | Coverage source intervals |
| `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names |
| `on_cols` | list[str] | `None` | Additional join columns |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns columns from df1 plus a `coverage` column (Int64).
## complement

Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.

```python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end": [200, 600],
})

genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end": [1000],
})

# Functional
gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
gaps = df.lazy().pb.complement(genome).collect()
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | DataFrame/LazyFrame/str | required | Interval set |
| `view_df` | DataFrame/LazyFrame | `None` | Genome with chrom, start, end defining chromosome extents |
| `cols` | list[str] | `["chrom", "start", "end"]` | Column names in df |
| `view_cols` | list[str] | `None` | Column names in view_df |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns a DataFrame with `chrom` (String), `start` (Int64), and `end` (Int64) columns representing the gaps between intervals.
## subtract

Remove portions of intervals in df1 that overlap with intervals in df2.

```python
# Functional
result = pb.subtract(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
result = df1.lazy().pb.subtract(df2).collect()
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df1` | DataFrame/LazyFrame/str | required | Intervals to subtract from |
| `df2` | DataFrame/LazyFrame/str | required | Intervals to subtract |
| `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 |
| `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 |
| `output_type` | str | `"polars.LazyFrame"` | Output format |
| `projection_pushdown` | bool | `True` | Enable projection pushdown |

### Output Schema

Returns `chrom` (String), `start` (Int64), `end` (Int64) representing the remaining portions of df1 intervals after subtraction.
## Performance Considerations

### Probe-Build Architecture

Two-input operations (`overlap`, `nearest`, `count_overlaps`, `coverage`, `subtract`) use a probe-build join:

- **Probe** (first DataFrame): Iterated over, row by row
- **Build** (second DataFrame): Indexed into an interval tree for fast lookup

For best performance, pass the **larger** DataFrame as the probe (first argument) and the **smaller** one as the build (second argument).
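The probe-build pattern can be sketched in plain Python (a per-chromosome dict stands in for the interval tree; the point is that the build side is indexed once, while the probe side is streamed through it):

```python
from collections import defaultdict

def overlap_pairs(probe, build):
    # Build phase: index the (smaller) build side per chromosome
    index = defaultdict(list)
    for chrom, start, end in build:
        index[chrom].append((start, end))
    # Probe phase: stream the (larger) probe side through the index
    for chrom, p_start, p_end in probe:
        for b_start, b_end in index[chrom]:
            # Half-open overlap test
            if p_start < b_end and b_start < p_end:
                yield (chrom, p_start, p_end, b_start, b_end)

pairs = list(overlap_pairs(
    probe=[("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)],
    build=[("chr1", 3, 8), ("chr1", 25, 28)],
))
# (1,6) and (5,9) each overlap (3,8); (22,30) overlaps (25,28)
```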
### Parallelism

By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:

```python
import os
import polars_bio as pb

pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
```

### Streaming Execution

DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.

### When to Use Lazy Evaluation

Use `scan_*` functions and lazy DataFrames for:

- Files larger than available RAM
- When only a subset of results is needed
- Pipeline operations where intermediate results can be optimized away

```python
# Lazy pipeline
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")
result = pb.overlap(lf1, lf2).collect()
```
# Pileup Operations
|
||||
|
||||
## Overview
|
||||
|
||||
polars-bio provides the `pb.depth()` function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a **LazyFrame** by default.
|
||||
|
||||
## pb.depth()
|
||||
|
||||
Compute read depth from alignment files.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
import polars_bio as pb
|
||||
|
||||
# Compute depth across entire BAM file (returns LazyFrame)
|
||||
depth_lf = pb.depth("aligned.bam")
|
||||
depth_df = depth_lf.collect()
|
||||
|
||||
# Get DataFrame directly
|
||||
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
|
||||
```
|
||||
|
||||
### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required | Path to BAM or CRAM file |
| `filter_flag` | int | `1796` | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
| `min_mapping_quality` | int | `0` | Minimum mapping quality to include reads |
| `binary_cigar` | bool | `True` | Use binary CIGAR for faster processing |
| `dense_mode` | str | `"auto"` | Dense output mode |
| `use_zero_based` | bool | `None` | Coordinate system (None = use global setting) |
| `per_base` | bool | `False` | Per-base depth (True) vs block depth (False) |
| `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` |

### Output Schema (Block Mode, default)

When `per_base=False` (default), adjacent positions with the same depth are grouped into blocks:

| Column | Type | Description |
|--------|------|-------------|
| `contig` | String | Chromosome/contig name |
| `pos_start` | Int64 | Block start position |
| `pos_end` | Int64 | Block end position |
| `coverage` | Int16 | Read depth |

### Output Schema (Per-Base Mode)

When `per_base=True`, each position is reported individually:

| Column | Type | Description |
|--------|------|-------------|
| `contig` | String | Chromosome/contig name |
| `pos` | Int64 | Position |
| `coverage` | Int16 | Read depth at position |

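The two modes carry the same information; a minimal plain-Python sketch of expanding block rows into per-base rows (the `expand_blocks` helper is illustrative, and it assumes half-open `[pos_start, pos_end)` blocks — verify against your configured coordinate system):

```python
# Expand block-mode depth rows into per-base rows (illustrative helper,
# assuming half-open [pos_start, pos_end) intervals)
def expand_blocks(blocks):
    per_base = []
    for contig, pos_start, pos_end, coverage in blocks:
        for pos in range(pos_start, pos_end):
            per_base.append((contig, pos, coverage))
    return per_base

blocks = [("chr1", 100, 103, 5), ("chr1", 103, 105, 7)]
print(expand_blocks(blocks))
```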
### filter_flag

The default `filter_flag=1796` excludes reads with these SAM flags:
- 4: unmapped
- 256: secondary alignment
- 512: failed QC
- 1024: PCR/optical duplicate

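The bitwise logic behind `filter_flag` can be sketched in plain Python (the flag constants and `read_passes` helper are illustrative, not part of the polars-bio API):

```python
# SAM flag bits excluded by the default filter_flag
UNMAPPED = 4
SECONDARY = 256
QC_FAIL = 512
DUPLICATE = 1024

FILTER_FLAG = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE  # bitwise OR = 1796

def read_passes(sam_flag: int, filter_flag: int = FILTER_FLAG) -> bool:
    # A read is counted only if none of the filtered bits are set
    return (sam_flag & filter_flag) == 0

print(read_passes(99))    # ordinary paired, properly mapped read -> True
print(read_passes(1024))  # PCR/optical duplicate -> False
```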
### CIGAR-Aware Computation

`pb.depth()` correctly handles CIGAR operations:
- **M/X/=** (match/mismatch): Counted as coverage
- **D** (deletion): Counted as coverage (reads span the deletion)
- **N** (skipped region): Not counted (e.g., spliced alignments)
- **I** (insertion): Not counted at reference positions
- **S/H** (soft/hard clipping): Not counted

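These rules can be illustrated with a small standalone parser (a sketch of the semantics, not polars-bio's implementation; the `covered_positions` helper is hypothetical):

```python
import re

# Which CIGAR ops contribute to depth, per the rules above
COVERS = {"M", "X", "=", "D"}  # consume reference and count as coverage
SKIPS = {"N"}                  # consume reference but add no coverage

def covered_positions(start: int, cigar: str) -> list[int]:
    """Reference positions a read covers, given its start and CIGAR string."""
    pos, covered = start, []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in COVERS:
            covered.extend(range(pos, pos + n))
            pos += n
        elif op in SKIPS:
            pos += n  # spliced gap: advance on the reference, no coverage
        # I, S, H, P consume no reference positions
    return covered

# A spliced read: 10 aligned bases, a 5 bp intron, 5 more aligned bases
print(len(covered_positions(100, "10M5N5M")))  # 15 covered positions
```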
## Examples

### Whole-Genome Depth

```python
import polars_bio as pb
import polars as pl

# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Summary statistics
depth.select(
    pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
    pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
    pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)
```

### Per-Base Depth

```python
import polars_bio as pb

# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
```

### Depth with Quality Filters

```python
import polars_bio as pb

# Only count well-mapped reads
depth = pb.depth(
    "sample.bam",
    min_mapping_quality=20,
    output_type="polars.DataFrame",
)
```

### Custom Flag Filter

```python
import polars_bio as pb

# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
    "sample.bam",
    filter_flag=4 + 1024,
    output_type="polars.DataFrame",
)
```

## Integration with Interval Operations

Depth results can be used with polars-bio interval operations. Note that depth output uses `contig`/`pos_start`/`pos_end` column names, so use `cols` parameters to map them:

```python
import polars_bio as pb
import polars as pl

# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")

# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
    "contig": "chrom",
    "pos_start": "start",
    "pos_end": "end",
})

# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)

# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")

# Find gaps in coverage (complement)
genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
```

### Using cols Parameters Instead of Renaming

```python
import polars_bio as pb

depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")

# Use cols1 to specify depth column names
overlapping = pb.overlap(
    depth, targets,
    cols1=["contig", "pos_start", "pos_end"],
    output_type="polars.DataFrame",
)
```

224
scientific-skills/polars-bio/references/sql_processing.md
Normal file
@@ -0,0 +1,224 @@
# SQL Data Processing

## Overview

polars-bio integrates Apache DataFusion's SQL engine, enabling SQL queries on bioinformatics files and Polars DataFrames. Register files as tables and query them using standard SQL syntax. All queries return a **LazyFrame** — call `.collect()` to materialize results.

## Register Functions

Register bioinformatics files as SQL tables. **Path is the first argument**, name is an optional keyword:

```python
import polars_bio as pb

# Register various file formats (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")
pb.register_bam("aligned.bam", name="alignments")
pb.register_cram("aligned.cram", name="cram_alignments")
pb.register_gff("genes.gff3", name="annotations")
pb.register_gtf("genes.gtf", name="gtf_annotations")
pb.register_fastq("sample.fastq.gz", name="reads")
pb.register_sam("alignments.sam", name="sam_alignments")
pb.register_pairs("contacts.pairs", name="hic_contacts")
```

### Parameters

All `register_*` functions share these parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required (first positional) | Path to file (local or cloud) |
| `name` | str | `None` | Table name for SQL queries (auto-generated if omitted) |
| `chunk_size` | int | `64` | Chunk size for reading |
| `concurrent_fetches` | int | `8` | Concurrent cloud fetches |
| `allow_anonymous` | bool | `True` | Allow anonymous cloud access |
| `max_retries` | int | `5` | Cloud retry count |
| `timeout` | int | `300` | Cloud timeout in seconds |
| `enable_request_payer` | bool | `False` | Requester-pays cloud |
| `compression_type` | str | `"auto"` | Compression type |

Some register functions have additional format-specific parameters (e.g., `info_fields` on `register_vcf`).

**Note:** `register_fasta` does not exist. Use `scan_fasta` + `from_polars` as a workaround.

## from_polars

Register an existing Polars DataFrame as a SQL-queryable table:

```python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 200],
    "end": [200, 600, 400],
    "name": ["peak1", "peak2", "peak3"],
})

pb.from_polars("my_peaks", df)

# Now query with SQL
result = pb.sql("SELECT * FROM my_peaks WHERE chrom = 'chr1'").collect()
```

**Important:** `register_view` takes a SQL query string, not a DataFrame. Use `from_polars` to register DataFrames.

## register_view

Create a SQL view from a query string:

```python
import polars_bio as pb

# Create a view from a SQL query
pb.register_view("chr1_variants", "SELECT * FROM variants WHERE chrom = 'chr1'")

# Query the view
result = pb.sql("SELECT * FROM chr1_variants WHERE qual > 30").collect()
```

### Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | View name |
| `query` | str | SQL query string defining the view |

## pb.sql()

Execute SQL queries using DataFusion SQL syntax. **Returns a LazyFrame** — call `.collect()` to get a DataFrame.

```python
import polars_bio as pb

# Simple query
result = pb.sql("SELECT chrom, start, end FROM regions WHERE chrom = 'chr1'").collect()

# Aggregation
result = pb.sql("""
    SELECT chrom, COUNT(*) as variant_count, AVG(qual) as avg_qual
    FROM variants
    GROUP BY chrom
    ORDER BY variant_count DESC
""").collect()

# Join tables
result = pb.sql("""
    SELECT v.chrom, v.start, v.end, v.ref, v.alt, r.name
    FROM variants v
    JOIN regions r ON v.chrom = r.chrom
        AND v.start >= r.start
        AND v.end <= r.end
""").collect()
```

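The containment join in the last query expresses the same logic as this plain-Python sketch (toy data, illustrative only):

```python
# Toy tables mirroring the SQL join: keep variants fully contained in a region
variants = [
    ("chr1", 150, 151, "A", "G"),
    ("chr1", 900, 901, "C", "T"),
]
regions = [("chr1", 100, 200, "promoter")]

joined = [
    (vc, vs, ve, ref, alt, name)
    for (vc, vs, ve, ref, alt) in variants
    for (rc, rs, re_, name) in regions
    if vc == rc and vs >= rs and ve <= re_
]
print(joined)  # only the chr1:150 variant lies inside the promoter region
```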
## DataFusion SQL Syntax

polars-bio uses Apache DataFusion's SQL dialect. Key features:

### Filtering

```sql
SELECT * FROM variants WHERE qual > 30 AND filter = 'PASS'
```

### Aggregations

```sql
SELECT chrom, COUNT(*) as n, MIN(start) as min_pos, MAX(end) as max_pos
FROM regions
GROUP BY chrom
HAVING COUNT(*) > 100
```

### Window Functions

```sql
SELECT chrom, start, end,
       ROW_NUMBER() OVER (PARTITION BY chrom ORDER BY start) as row_num,
       LAG(end) OVER (PARTITION BY chrom ORDER BY start) as prev_end
FROM regions
```

### Subqueries

```sql
SELECT * FROM variants
WHERE chrom IN (SELECT DISTINCT chrom FROM regions)
```

### Common Table Expressions (CTEs)

```sql
WITH filtered_variants AS (
    SELECT * FROM variants WHERE qual > 30
),
chr1_regions AS (
    SELECT * FROM regions WHERE chrom = 'chr1'
)
SELECT f.chrom, f.start, f.ref, f.alt
FROM filtered_variants f
JOIN chr1_regions r ON f.start BETWEEN r.start AND r.end
```

## Combining SQL with Interval Operations

SQL queries return LazyFrames that can be used directly with polars-bio interval operations:

```python
import polars_bio as pb

# Register files
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="targets")

# SQL filters return LazyFrames
high_qual = pb.sql("SELECT chrom, start, end FROM variants WHERE qual > 30")
targets = pb.sql("SELECT chrom, start, end FROM targets WHERE chrom = 'chr1'")

# Interval operation on the SQL results
overlapping = pb.overlap(high_qual, targets).collect()
```

## Example Workflows

### Variant Density Analysis

```python
import polars_bio as pb

pb.register_vcf("cohort.vcf.gz", name="variants")
pb.register_bed("genome_windows_1mb.bed", name="windows")

# Count variants per window using SQL
result = pb.sql("""
    SELECT w.chrom, w.start, w.end, COUNT(v.start) as variant_count
    FROM windows w
    LEFT JOIN variants v ON w.chrom = v.chrom
        AND v.start >= w.start
        AND v.start < w.end
    GROUP BY w.chrom, w.start, w.end
    ORDER BY variant_count DESC
""").collect()
```

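The LEFT JOIN + GROUP BY pattern above amounts to counting variant start positions per window, as in this toy plain-Python sketch:

```python
# Count variant start positions per 1 Mb window (toy data, illustrative only)
windows = [("chr1", 0, 1_000_000), ("chr1", 1_000_000, 2_000_000)]
variant_starts = [("chr1", 12), ("chr1", 999_999), ("chr1", 1_500_000)]

counts = {w: 0 for w in windows}
for chrom, start in variant_starts:
    for (w_chrom, w_start, w_end) in windows:
        if chrom == w_chrom and w_start <= start < w_end:
            counts[(w_chrom, w_start, w_end)] += 1

print(counts[("chr1", 0, 1_000_000)])  # 2 variants in the first window
```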
### Gene Annotation Lookup

```python
import polars_bio as pb

pb.register_gff("gencode.gff3", name="genes")

# Find all protein-coding genes on chromosome 1
coding_genes = pb.sql("""
    SELECT chrom, start, end, attributes
    FROM genes
    WHERE type = 'gene'
      AND chrom = 'chr1'
      AND attributes LIKE '%protein_coding%'
    ORDER BY start
""").collect()
```