Files
Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O
Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:27:11 +01:00

12 KiB

Bioinformatics File I/O

Overview

polars-bio provides read_*, scan_*, write_*, and sink_* functions for common bioinformatics formats. read_* loads data eagerly into a DataFrame, while scan_* creates a LazyFrame for streaming/out-of-core processing. write_* writes from DataFrame/LazyFrame and returns a row count, while sink_* streams from a LazyFrame.

Supported Formats

Format Read Scan Register (SQL) Write Sink
BED read_bed scan_bed register_bed
VCF read_vcf scan_vcf register_vcf write_vcf sink_vcf
BAM read_bam scan_bam register_bam write_bam sink_bam
CRAM read_cram scan_cram register_cram write_cram sink_cram
GFF read_gff scan_gff register_gff
GTF read_gtf scan_gtf register_gtf
FASTA read_fasta scan_fasta
FASTQ read_fastq scan_fastq register_fastq write_fastq sink_fastq
SAM read_sam scan_sam register_sam write_sam sink_sam
Hi-C pairs read_pairs scan_pairs register_pairs
Generic table read_table scan_table

Common Cloud/IO Parameters

All read_* and scan_* functions share these parameters (instead of a single storage_options dict):

Parameter Type Default Description
path str required File path (local, S3, GCS, Azure)
chunk_size int 8 Number of chunks for parallel reading
concurrent_fetches int 1 Number of concurrent fetches for cloud storage
allow_anonymous bool True Allow anonymous access to cloud storage
enable_request_payer bool False Enable requester-pays for cloud storage
max_retries int 5 Maximum retries for cloud operations
timeout int 300 Timeout in seconds for cloud operations
compression_type str "auto" Compression type (auto-detected from extension)
projection_pushdown bool True Enable projection pushdown optimization
use_zero_based bool None Set coordinate system metadata (None = use global setting)

Not all functions support all parameters. SAM functions lack cloud parameters. FASTA/FASTQ lack predicate_pushdown.

BED Format

read_bed / scan_bed

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.

import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")

Column Schema (BED3)

Column Type Description
chrom String Chromosome name
start Int64 Start position
end Int64 End position

Extended BED fields (auto-detected) add: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.

VCF Format

read_vcf / scan_vcf

Read VCF/BCF files. Supports .vcf, .vcf.gz, .bcf.

import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])

Additional Parameters

Parameter Type Default Description
info_fields list[str] None INFO fields to extract as columns
format_fields list[str] None FORMAT fields to extract as columns
samples list[str] None Samples to include
predicate_pushdown bool True Enable predicate pushdown

Column Schema

Column Type Description
chrom String Chromosome
start UInt32 Start position
end UInt32 End position
id String Variant ID
ref String Reference allele
alt String Alternate allele(s)
qual Float32 Quality score
filter String Filter status
info String INFO field (raw, unless info_fields specified)

write_vcf / sink_vcf

import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")

BAM Format

read_bam / scan_bam

Read aligned sequencing reads from BAM files. Requires a .bai index file.

import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])

Additional Parameters

Parameter Type Default Description
tag_fields list[str] None SAM tags to extract as columns
predicate_pushdown bool True Enable predicate pushdown
infer_tag_types bool True Infer tag column types from data
infer_tag_sample_size int 100 Number of records to sample for type inference
tag_type_hints list[str] None Explicit type hints for tags

Column Schema

Column Type Description
chrom String Reference sequence name
start Int64 Alignment start position
end Int64 Alignment end position
name String Read name
flags UInt32 SAM flags
mapping_quality UInt32 Mapping quality
cigar String CIGAR string
sequence String Read sequence
quality_scores String Base quality string
mate_chrom String Mate reference name
mate_start Int64 Mate start position
template_length Int64 Template length

write_bam / sink_bam

rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)

CRAM Format

read_cram / scan_cram

CRAM files have separate functions from BAM. Require a reference FASTA and .crai index.

import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")

Same additional parameters and column schema as BAM, plus:

Parameter Type Default Description
reference_path str None Path to reference FASTA

write_cram / sink_cram

rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")

GFF/GTF Format

read_gff / scan_gff / read_gtf / scan_gtf

GFF3 and GTF have separate functions.

import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])

Additional Parameters

Parameter Type Default Description
attr_fields list[str] None Attribute fields to extract as columns
predicate_pushdown bool True Enable predicate pushdown

Column Schema

Column Type Description
chrom String Sequence name
source String Feature source
type String Feature type (gene, exon, etc.)
start Int64 Start position
end Int64 End position
score Float32 Score
strand String Strand (+/-/.)
phase UInt32 Phase (0/1/2)
attributes String Attributes string

FASTA Format

read_fasta / scan_fasta

Read reference sequences from FASTA files.

import polars_bio as pb

df = pb.read_fasta("reference.fasta")

Column Schema

Column Type Description
name String Sequence name
description String Description line
sequence String Nucleotide sequence

FASTQ Format

read_fastq / scan_fastq

Read raw sequencing reads with quality scores.

import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")

Column Schema

Column Type Description
name String Read name
description String Description line
sequence String Nucleotide sequence
quality String Quality string (Phred+33 encoded)

write_fastq / sink_fastq

rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")

SAM Format

read_sam / scan_sam

Read text-format alignment files. Same column schema as BAM. No cloud parameters.

import polars_bio as pb

df = pb.read_sam("alignments.sam")

Additional Parameters

Parameter Type Default Description
tag_fields list[str] None SAM tags to extract
infer_tag_types bool True Infer tag types
infer_tag_sample_size int 100 Sample size for inference
tag_type_hints list[str] None Explicit type hints

write_sam / sink_sam

rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)

Hi-C Pairs

read_pairs / scan_pairs

Read Hi-C pairs format files for chromatin contact data.

import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")

Column Schema

Column Type Description
readID String Read identifier
chrom1 String Chromosome of first contact
pos1 Int32 Position of first contact
chrom2 String Chromosome of second contact
pos2 Int32 Position of second contact
strand1 String Strand of first contact
strand2 String Strand of second contact

Generic Table Reader

read_table / scan_table

Read tab-delimited files with custom schema. Useful for non-standard formats or bioframe-compatible tables.

import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})

Cloud Storage

All read_* and scan_* functions support cloud storage via individual parameters:

Amazon S3

df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)

Google Cloud Storage

df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)

Azure Blob Storage

df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)

Note: For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS).

Compression Support

polars-bio transparently handles compressed files:

Compression Extension Parallel Decompression
GZIP .gz No
BGZF .gz (with BGZF blocks) Yes
Uncompressed (none) N/A

Recommendation: Use BGZF compression (e.g., created with bgzip) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.

Describe Functions

Inspect file structure without fully reading:

import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")