# Bioinformatics File I/O

## Overview

polars-bio provides `read_*`, `scan_*`, `write_*`, and `sink_*` functions for common bioinformatics formats. `read_*` loads data eagerly into a DataFrame, while `scan_*` creates a LazyFrame for streaming/out-of-core processing. `write_*` writes from a DataFrame or LazyFrame and returns a row count, while `sink_*` streams from a LazyFrame.
Supported Formats
| Format | Read | Scan | Register (SQL) | Write | Sink |
|---|---|---|---|---|---|
| BED | read_bed |
scan_bed |
register_bed |
— | — |
| VCF | read_vcf |
scan_vcf |
register_vcf |
write_vcf |
sink_vcf |
| BAM | read_bam |
scan_bam |
register_bam |
write_bam |
sink_bam |
| CRAM | read_cram |
scan_cram |
register_cram |
write_cram |
sink_cram |
| GFF | read_gff |
scan_gff |
register_gff |
— | — |
| GTF | read_gtf |
scan_gtf |
register_gtf |
— | — |
| FASTA | read_fasta |
scan_fasta |
— | — | — |
| FASTQ | read_fastq |
scan_fastq |
register_fastq |
write_fastq |
sink_fastq |
| SAM | read_sam |
scan_sam |
register_sam |
write_sam |
sink_sam |
| Hi-C pairs | read_pairs |
scan_pairs |
register_pairs |
— | — |
| Generic table | read_table |
scan_table |
— | — | — |
## Common Cloud/IO Parameters

All `read_*` and `scan_*` functions share these parameters (instead of a single `storage_options` dict):

| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | str | required | File path (local, S3, GCS, Azure) |
| `chunk_size` | int | 8 | Number of chunks for parallel reading |
| `concurrent_fetches` | int | 1 | Number of concurrent fetches for cloud storage |
| `allow_anonymous` | bool | True | Allow anonymous access to cloud storage |
| `enable_request_payer` | bool | False | Enable requester-pays for cloud storage |
| `max_retries` | int | 5 | Maximum retries for cloud operations |
| `timeout` | int | 300 | Timeout in seconds for cloud operations |
| `compression_type` | str | "auto" | Compression type (auto-detected from extension) |
| `projection_pushdown` | bool | True | Enable projection pushdown optimization |
| `use_zero_based` | bool | None | Set coordinate system metadata (None = use global setting) |

Not all functions support all parameters: the SAM functions lack the cloud parameters, and FASTA/FASTQ lack `predicate_pushdown`.
## BED Format

### read_bed / scan_bed

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.

```python
import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")
```

#### Column Schema (BED3)

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Chromosome name |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |

Extended BED fields (auto-detected) add: `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`, `blockCount`, `blockSizes`, `blockStarts`.
## VCF Format

### read_vcf / scan_vcf

Read VCF/BCF files. Supports `.vcf`, `.vcf.gz`, and `.bcf`.

```python
import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])
```

#### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `info_fields` | list[str] | None | INFO fields to extract as columns |
| `format_fields` | list[str] | None | FORMAT fields to extract as columns |
| `samples` | list[str] | None | Samples to include |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Chromosome |
| `start` | UInt32 | Start position |
| `end` | UInt32 | End position |
| `id` | String | Variant ID |
| `ref` | String | Reference allele |
| `alt` | String | Alternate allele(s) |
| `qual` | Float32 | Quality score |
| `filter` | String | Filter status |
| `info` | String | Raw INFO string (unless `info_fields` is specified) |
### write_vcf / sink_vcf

```python
import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")
```
## BAM Format

### read_bam / scan_bam

Read aligned sequencing reads from BAM files. Requires a `.bai` index file.

```python
import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])
```

#### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tag_fields` | list[str] | None | SAM tags to extract as columns |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |
| `infer_tag_types` | bool | True | Infer tag column types from data |
| `infer_tag_sample_size` | int | 100 | Number of records to sample for type inference |
| `tag_type_hints` | list[str] | None | Explicit type hints for tags |

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Reference sequence name |
| `start` | Int64 | Alignment start position |
| `end` | Int64 | Alignment end position |
| `name` | String | Read name |
| `flags` | UInt32 | SAM flags |
| `mapping_quality` | UInt32 | Mapping quality |
| `cigar` | String | CIGAR string |
| `sequence` | String | Read sequence |
| `quality_scores` | String | Base quality string |
| `mate_chrom` | String | Mate reference name |
| `mate_start` | Int64 | Mate start position |
| `template_length` | Int64 | Template length |
### write_bam / sink_bam

```python
# Write a DataFrame to BAM; returns the number of rows written
rows_written = pb.write_bam(df, "output.bam")

# Optionally sort records on write
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

# Stream a LazyFrame to BAM
pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)
```
## CRAM Format

### read_cram / scan_cram

CRAM files have separate functions from BAM and require a reference FASTA and a `.crai` index.

```python
import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")
```

Same additional parameters and column schema as BAM, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `reference_path` | str | None | Path to reference FASTA |

### write_cram / sink_cram

```python
rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")
```
## GFF/GTF Format

### read_gff / scan_gff / read_gtf / scan_gtf

GFF3 and GTF have separate functions.

```python
import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])
```

#### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `attr_fields` | list[str] | None | Attribute fields to extract as columns |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Sequence name |
| `source` | String | Feature source |
| `type` | String | Feature type (gene, exon, etc.) |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
| `score` | Float32 | Score |
| `strand` | String | Strand (+/-/.) |
| `phase` | UInt32 | Phase (0/1/2) |
| `attributes` | String | Attributes string |
## FASTA Format

### read_fasta / scan_fasta

Read reference sequences from FASTA files.

```python
import polars_bio as pb

df = pb.read_fasta("reference.fasta")
```

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `name` | String | Sequence name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
## FASTQ Format

### read_fastq / scan_fastq

Read raw sequencing reads with quality scores.

```python
import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")
```

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `name` | String | Read name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
| `quality` | String | Quality string (Phred+33 encoded) |
### write_fastq / sink_fastq

```python
rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")
```
## SAM Format

### read_sam / scan_sam

Read text-format alignment files. Same column schema as BAM. No cloud parameters.

```python
import polars_bio as pb

df = pb.read_sam("alignments.sam")
```

#### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tag_fields` | list[str] | None | SAM tags to extract |
| `infer_tag_types` | bool | True | Infer tag types |
| `infer_tag_sample_size` | int | 100 | Sample size for inference |
| `tag_type_hints` | list[str] | None | Explicit type hints |

### write_sam / sink_sam

```python
rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)
```
## Hi-C Pairs

### read_pairs / scan_pairs

Read Hi-C pairs-format files for chromatin contact data.

```python
import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")
```

#### Column Schema

| Column | Type | Description |
|---|---|---|
| `readID` | String | Read identifier |
| `chrom1` | String | Chromosome of first contact |
| `pos1` | Int32 | Position of first contact |
| `chrom2` | String | Chromosome of second contact |
| `pos2` | Int32 | Position of second contact |
| `strand1` | String | Strand of first contact |
| `strand2` | String | Strand of second contact |
## Generic Table Reader

### read_table / scan_table

Read tab-delimited files with a custom schema. Useful for non-standard formats or bioframe-compatible tables.

```python
import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
```
## Cloud Storage

All `read_*` and `scan_*` functions support cloud storage via the individual parameters listed above:

### Amazon S3

```python
df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)
```

### Google Cloud Storage

```python
df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)
```

### Azure Blob Storage

```python
df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)
```

**Note:** For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., `AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`).
## Compression Support

polars-bio transparently handles compressed files:

| Compression | Extension | Parallel Decompression |
|---|---|---|
| GZIP | `.gz` | No |
| BGZF | `.gz` (with BGZF blocks) | Yes |
| Uncompressed | (none) | N/A |

**Recommendation:** Use BGZF compression (e.g., files created with `bgzip`) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.
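Since both formats share the `.gz` extension, it can be useful to check which one a file actually is. Per the BGZF specification (in the SAM spec), a BGZF block is a gzip member with the FEXTRA flag set and an extra subfield tagged `BC`. A minimal sniffer (`is_bgzf` is an illustrative helper, not part of polars-bio), probed here with the standard 28-byte BGZF EOF block and a plain-gzip file:

```python
import gzip
import struct
import tempfile

def is_bgzf(path: str) -> bool:
    """Detect BGZF: gzip magic + FEXTRA flag + a 'BC' extra subfield."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12 or header[:2] != b"\x1f\x8b" or not header[3] & 0x04:
            return False  # not gzip, or no extra field -> plain GZIP at best
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C'
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == 0x42 and si2 == 0x43:
            return True
        pos += 4 + slen
    return False

# The 28-byte BGZF EOF marker block from the SAM spec makes a handy probe
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as f:
    f.write(BGZF_EOF)
    bgzf_path = f.name
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as f:
    f.write(gzip.compress(b"plain gzip data"))
    gz_path = f.name

print(is_bgzf(bgzf_path), is_bgzf(gz_path))
```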
## Describe Functions

Inspect file structure without fully reading:

```python
import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
```