mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, and count_overlaps operations, plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scientific-skills/polars-bio/references/configuration.md (new file, +175 lines)
# Configuration

## Overview

polars-bio uses a global configuration system based on `set_option` and `get_option` to control execution behavior, coordinate systems, parallelism, and streaming modes.

## set_option / get_option

```python
import polars_bio as pb

# Set a configuration option
pb.set_option("datafusion.execution.target_partitions", 8)

# Get the current value
value = pb.get_option("datafusion.execution.target_partitions")
```

## Parallelism

### DataFusion Target Partitions

Controls the number of parallel execution partitions. Defaults to 1 (single-threaded).

```python
import os

import polars_bio as pb

# Use all available CPU cores
pb.set_option("datafusion.execution.target_partitions", os.cpu_count())

# Or set a specific number of partitions
pb.set_option("datafusion.execution.target_partitions", 8)
```

**When to increase parallelism:**

- Processing large files (>1 GB)
- Running interval operations on millions of intervals
- Batch processing multiple chromosomes

**When to keep the default (1):**

- Small datasets
- Memory-constrained environments
- Debugging (deterministic execution)

## Coordinate Systems

polars-bio defaults to 1-based coordinates (the standard genomic convention).

### Global Coordinate System

```python
import polars_bio as pb

# Switch to 0-based half-open coordinates
pb.set_option("coordinate_system", "0-based")

# Switch back to 1-based (default)
pb.set_option("coordinate_system", "1-based")

# Check the current setting
print(pb.get_option("coordinate_system"))
```

### Per-File Override via I/O Functions

I/O functions accept `use_zero_based` to set coordinate metadata on the resulting DataFrame:

```python
# Read with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)
```

**Note:** Interval operations (overlap, nearest, etc.) do **not** accept `use_zero_based`. They read coordinate metadata from the DataFrames, which is set by I/O functions or the global option. When given manually constructed DataFrames, polars-bio warns about missing metadata and falls back to the global setting.

### Setting Metadata on Manual DataFrames

```python
import polars_bio as pb

# Set coordinate metadata on a manually created DataFrame
pb.set_source_metadata(df, format="bed", path="")
```

### File Format Conventions

| Format  | Native Coordinate System | polars-bio Conversion |
|---------|--------------------------|-----------------------|
| BED     | 0-based half-open        | Converted to configured system on read |
| VCF     | 1-based                  | Converted to configured system on read |
| GFF/GTF | 1-based                  | Converted to configured system on read |
| BAM     | 0-based                  | Converted to configured system on read |
| SAM     | 1-based                  | Converted to configured system on read |

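The conversions in the table reduce to simple index arithmetic: going from 0-based half-open to 1-based inclusive shifts the start up by one and leaves the end unchanged. A plain-Python sketch of that arithmetic (not the library's implementation):

```python
def zero_to_one_based(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based inclusive [start+1, end]."""
    return start + 1, end


def one_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end


# A BED interval chr1 0 100 (0-based half-open) is chr1:1-100 in 1-based terms
print(zero_to_one_based(0, 100))  # (1, 100)
print(one_to_zero_based(1, 100))  # (0, 100)
```

Note that the interval length (`end - start` in 0-based half-open, `end - start + 1` in 1-based inclusive) is preserved by both conversions.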
## Streaming Execution Modes

polars-bio supports two streaming modes for out-of-core processing.

### DataFusion Streaming

Enabled by default for interval operations. Processes data in batches through the DataFusion execution engine.

```python
# DataFusion streaming is automatic for interval operations
result = pb.overlap(lf1, lf2)  # Streams if inputs are LazyFrames
```

### Polars Streaming

Use Polars' native streaming engine for post-processing operations:

```python
# Collect with Polars streaming
result = lf.collect(streaming=True)
```

### Combining Both

```python
import polars_bio as pb

# Scan files lazily (DataFusion streaming for I/O)
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")

# Interval operation (DataFusion streaming)
result_lf = pb.overlap(lf1, lf2)

# Collect with Polars streaming for final materialization
result = result_lf.collect(streaming=True)
```

## Logging

Control log verbosity for debugging:

```python
import polars_bio as pb

# Set the log level
pb.set_loglevel("debug")  # Detailed execution info
pb.set_loglevel("info")   # Standard messages
pb.set_loglevel("warn")   # Warnings only (default)
```

**Note:** Only `"debug"`, `"info"`, and `"warn"` are valid log levels.

## Metadata Management

polars-bio attaches coordinate-system and source metadata to DataFrames produced by I/O functions. Interval operations use this metadata to determine the coordinate system.

```python
import polars_bio as pb

# Inspect metadata on a DataFrame
metadata = pb.get_metadata(df)

# Print a metadata summary
pb.print_metadata_summary(df)

# Print metadata as JSON
pb.print_metadata_json(df)

# Set metadata on a manually created DataFrame
pb.set_source_metadata(df, format="bed", path="regions.bed")

# Register a DataFrame as a SQL table
pb.from_polars("my_table", df)
```

## Complete Configuration Reference

| Option | Default | Description |
|--------|---------|-------------|
| `datafusion.execution.target_partitions` | `1` | Number of parallel execution partitions |
| `coordinate_system` | `"1-based"` | Default coordinate system (`"0-based"` or `"1-based"`) |