claude-scientific-skills/scientific-skills/polars-bio/references/file_io.md
Marek Wiewiórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:27:11 +01:00


# Bioinformatics File I/O
## Overview
polars-bio provides `read_*`, `scan_*`, `write_*`, and `sink_*` functions for common bioinformatics formats. `read_*` loads data eagerly into a DataFrame, while `scan_*` creates a LazyFrame for streaming/out-of-core processing. `write_*` writes from a DataFrame/LazyFrame and returns a row count, while `sink_*` streams a LazyFrame to disk without materializing it.
## Supported Formats
| Format | Read | Scan | Register (SQL) | Write | Sink |
|--------|------|------|-----------------|-------|------|
| BED | `read_bed` | `scan_bed` | `register_bed` | — | — |
| VCF | `read_vcf` | `scan_vcf` | `register_vcf` | `write_vcf` | `sink_vcf` |
| BAM | `read_bam` | `scan_bam` | `register_bam` | `write_bam` | `sink_bam` |
| CRAM | `read_cram` | `scan_cram` | `register_cram` | `write_cram` | `sink_cram` |
| GFF | `read_gff` | `scan_gff` | `register_gff` | — | — |
| GTF | `read_gtf` | `scan_gtf` | `register_gtf` | — | — |
| FASTA | `read_fasta` | `scan_fasta` | — | — | — |
| FASTQ | `read_fastq` | `scan_fastq` | `register_fastq` | `write_fastq` | `sink_fastq` |
| SAM | `read_sam` | `scan_sam` | `register_sam` | `write_sam` | `sink_sam` |
| Hi-C pairs | `read_pairs` | `scan_pairs` | `register_pairs` | — | — |
| Generic table | `read_table` | `scan_table` | — | — | — |
## Common Cloud/IO Parameters
Most `read_*` and `scan_*` functions share these parameters, exposed individually rather than through a single `storage_options` dict:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required | File path (local, S3, GCS, Azure) |
| `chunk_size` | int | `8` | Number of chunks for parallel reading |
| `concurrent_fetches` | int | `1` | Number of concurrent fetches for cloud storage |
| `allow_anonymous` | bool | `True` | Allow anonymous access to cloud storage |
| `enable_request_payer` | bool | `False` | Enable requester-pays for cloud storage |
| `max_retries` | int | `5` | Maximum retries for cloud operations |
| `timeout` | int | `300` | Timeout in seconds for cloud operations |
| `compression_type` | str | `"auto"` | Compression type (auto-detected from extension) |
| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |
| `use_zero_based` | bool | `None` | Set coordinate system metadata (None = use global setting) |
Not all functions support all parameters. SAM functions lack cloud parameters. FASTA/FASTQ lack `predicate_pushdown`.
## BED Format
### read_bed / scan_bed
Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.
```python
import polars_bio as pb
# Eager read
df = pb.read_bed("regions.bed")
# Lazy scan
lf = pb.scan_bed("regions.bed")
```
### Column Schema (BED3)
| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome name |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
Extended BED fields (auto-detected) add: `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`, `blockCount`, `blockSizes`, `blockStarts`.
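Because BED is 0-based half-open while VCF and GFF/GTF are 1-based inclusive, converting between the two conventions is a ±1 adjustment on the start coordinate. A minimal sketch (these helpers are illustrative, not part of the polars-bio API):

```python
def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based inclusive [start+1, end]."""
    return start + 1, end

def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end
```

For example, BED record `chr1 0 100` covers bases 1 through 100 in 1-based coordinates, and both intervals have length 100.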
## VCF Format
### read_vcf / scan_vcf
Read VCF/BCF files. Supports `.vcf`, `.vcf.gz`, `.bcf`.
```python
import polars_bio as pb
# Read VCF
df = pb.read_vcf("variants.vcf.gz")
# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])
# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])
```
### Additional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `info_fields` | list[str] | `None` | INFO fields to extract as columns |
| `format_fields` | list[str] | `None` | FORMAT fields to extract as columns |
| `samples` | list[str] | `None` | Samples to include |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome |
| `start` | UInt32 | Start position |
| `end` | UInt32 | End position |
| `id` | String | Variant ID |
| `ref` | String | Reference allele |
| `alt` | String | Alternate allele(s) |
| `qual` | Float32 | Quality score |
| `filter` | String | Filter status |
| `info` | String | INFO field (raw, unless `info_fields` specified) |
### write_vcf / sink_vcf
```python
import polars_bio as pb
# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")
# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")
```
## BAM Format
### read_bam / scan_bam
Read aligned sequencing reads from BAM files. Requires a `.bai` index file.
```python
import polars_bio as pb
# Read BAM
df = pb.read_bam("aligned.bam")
# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")
# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])
```
### Additional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |
| `infer_tag_types` | bool | `True` | Infer tag column types from data |
| `infer_tag_sample_size` | int | `100` | Number of records to sample for type inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints for tags |
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Reference sequence name |
| `start` | Int64 | Alignment start position |
| `end` | Int64 | Alignment end position |
| `name` | String | Read name |
| `flags` | UInt32 | SAM flags |
| `mapping_quality` | UInt32 | Mapping quality |
| `cigar` | String | CIGAR string |
| `sequence` | String | Read sequence |
| `quality_scores` | String | Base quality string |
| `mate_chrom` | String | Mate reference name |
| `mate_start` | Int64 | Mate start position |
| `template_length` | Int64 | Template length |
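The `flags` column is the standard SAM bitfield, so filtering on read status needs only bit tests. A minimal sketch (the flag constants come from the SAM specification, not from polars-bio):

```python
# SAM flag bits (SAM specification, section 1.4)
UNMAPPED = 0x4
REVERSE = 0x10
DUPLICATE = 0x400

def is_primary_mapped(flags: int) -> bool:
    """True if the read is mapped and not marked as a duplicate."""
    return not (flags & UNMAPPED) and not (flags & DUPLICATE)
```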
### write_bam / sink_bam
```python
rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)
pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)
```
## CRAM Format
### read_cram / scan_cram
CRAM files have **separate functions** from BAM and require a reference FASTA plus a `.crai` index.
```python
import polars_bio as pb
# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")
# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")
```
Same additional parameters and column schema as BAM, plus:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `reference_path` | str | `None` | Path to reference FASTA |
### write_cram / sink_cram
```python
rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")
```
## GFF/GTF Format
### read_gff / scan_gff / read_gtf / scan_gtf
GFF3 and GTF have separate functions.
```python
import polars_bio as pb
# Read GFF3
df = pb.read_gff("annotations.gff3")
# Read GTF
df = pb.read_gtf("genes.gtf")
# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])
```
### Additional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `attr_fields` | list[str] | `None` | Attribute fields to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Sequence name |
| `source` | String | Feature source |
| `type` | String | Feature type (gene, exon, etc.) |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
| `score` | Float32 | Score |
| `strand` | String | Strand (+/-/.) |
| `phase` | UInt32 | Phase (0/1/2) |
| `attributes` | String | Attributes string |
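When `attr_fields` is not given, the raw `attributes` string can still be split manually. A sketch for GFF3-style `key=value;` pairs (GTF uses `key "value";` syntax instead, so this helper is GFF3-only and not part of polars-bio):

```python
def parse_gff3_attributes(attrs: str) -> dict[str, str]:
    """Split a raw GFF3 attributes string into a dict.

    GFF3 attributes are semicolon-separated key=value pairs,
    e.g. 'ID=gene0001;Name=BRCA1'.
    """
    return dict(
        field.split("=", 1)
        for field in attrs.strip().split(";")
        if "=" in field
    )
```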
## FASTA Format
### read_fasta / scan_fasta
Read reference sequences from FASTA files.
```python
import polars_bio as pb
df = pb.read_fasta("reference.fasta")
```
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Sequence name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
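A typical downstream computation on the `sequence` column is base composition; a pure-Python sketch (not a polars-bio function):

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G/C bases, case-insensitive; 0.0 for empty input."""
    s = sequence.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0
```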
## FASTQ Format
### read_fastq / scan_fastq
Read raw sequencing reads with quality scores.
```python
import polars_bio as pb
df = pb.read_fastq("reads.fastq.gz")
```
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Read name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
| `quality` | String | Quality string (Phred+33 encoded) |
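Since the `quality` string is Phred+33 encoded, each character decodes to `ord(c) - 33`. An illustrative helper (not part of polars-bio):

```python
def mean_phred_quality(quality: str) -> float:
    """Mean Phred score of a Phred+33-encoded quality string."""
    if not quality:
        return 0.0
    return sum(ord(c) - 33 for c in quality) / len(quality)
```

For instance, a run of `I` characters decodes to Phred 40, a common cap for modern Illumina base calls.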
### write_fastq / sink_fastq
```python
rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")
```
## SAM Format
### read_sam / scan_sam
Read text-format alignment files. Same column schema as BAM. No cloud parameters.
```python
import polars_bio as pb
df = pb.read_sam("alignments.sam")
```
### Additional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract |
| `infer_tag_types` | bool | `True` | Infer tag types |
| `infer_tag_sample_size` | int | `100` | Sample size for inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints |
### write_sam / sink_sam
```python
rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)
```
## Hi-C Pairs
### read_pairs / scan_pairs
Read Hi-C pairs format files for chromatin contact data.
```python
import polars_bio as pb
df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")
```
### Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `readID` | String | Read identifier |
| `chrom1` | String | Chromosome of first contact |
| `pos1` | Int32 | Position of first contact |
| `chrom2` | String | Chromosome of second contact |
| `pos2` | Int32 | Position of second contact |
| `strand1` | String | Strand of first contact |
| `strand2` | String | Strand of second contact |
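A common derived quantity is the genomic separation of each contact, which is only defined for cis (same-chromosome) pairs. An illustrative helper (not part of polars-bio):

```python
from typing import Optional

def cis_distance(chrom1: str, pos1: int, chrom2: str, pos2: int) -> Optional[int]:
    """Separation in bp for cis contacts; None for trans contacts."""
    if chrom1 != chrom2:
        return None  # trans contact: linear distance is undefined
    return abs(pos2 - pos1)
```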
## Generic Table Reader
### read_table / scan_table
Read tab-delimited files with custom schema. Useful for non-standard formats or bioframe-compatible tables.
```python
import polars_bio as pb
df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
```
## Cloud Storage
All `read_*` and `scan_*` functions support cloud storage via individual parameters:
### Amazon S3
```python
df = pb.read_bed(
"s3://bucket/regions.bed",
allow_anonymous=False,
max_retries=10,
timeout=600,
)
```
### Google Cloud Storage
```python
df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)
```
### Azure Blob Storage
```python
df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)
```
**Note:** For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., `AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`).
## Compression Support
polars-bio transparently handles compressed files:
| Compression | Extension | Parallel Decompression |
|-------------|-----------|----------------------|
| GZIP | `.gz` | No |
| BGZF | `.gz` (with BGZF blocks) | Yes |
| Uncompressed | (none) | N/A |
**Recommendation:** Use BGZF compression (e.g., created with `bgzip`) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.
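Because BGZF is ordinary gzip with a fixed extra-field marker, the two can be told apart from the first 18 bytes of a file. A heuristic sketch following the BGZF block layout in the SAM/BAM specification (not a polars-bio function):

```python
def looks_like_bgzf(header: bytes) -> bool:
    """Heuristic BGZF check on the first 18+ bytes of a file.

    BGZF blocks are gzip members with the FEXTRA flag set and a
    two-byte 'BC' extra-subfield identifier (SAM spec, section 4.1).
    """
    if len(header) < 18 or header[:2] != b"\x1f\x8b":
        return False                  # not gzip at all
    if not header[3] & 0x04:
        return False                  # plain gzip: FEXTRA flag not set
    return header[12:14] == b"BC"     # BGZF extra-subfield identifier
```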
## Describe Functions
Inspect file structure without fully reading:
```python
import polars_bio as pb
# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
```