mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-03-27 07:09:27 +08:00


Marek Wieiwórka 436c8608f2 Add polars-bio skill for genomic interval operations and bioinformatics I/O

Adds a new skill covering polars-bio (v0.26.0), a high-performance library
for genomic interval arithmetic and file I/O built on Polars, Arrow, and
DataFusion. All code examples verified against the actual API at runtime.

SKILL.md covers overlap, nearest, merge, coverage, complement, subtract,
cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF,
BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats.

References: interval_operations, file_io, sql_processing, pileup_operations,
configuration, bioframe_migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-14 10:27:11 +01:00


Bioinformatics File I/O

Overview

polars-bio provides read_*, scan_*, write_*, and sink_* functions for common bioinformatics formats. read_* loads data eagerly into a DataFrame, while scan_* creates a LazyFrame for streaming/out-of-core processing. write_* accepts a DataFrame or LazyFrame and returns the number of rows written, while sink_* streams a LazyFrame to the output file.

Supported Formats

Format	Read	Scan	Register (SQL)	Write	Sink
BED	`read_bed`	`scan_bed`	`register_bed`	—	—
VCF	`read_vcf`	`scan_vcf`	`register_vcf`	`write_vcf`	`sink_vcf`
BAM	`read_bam`	`scan_bam`	`register_bam`	`write_bam`	`sink_bam`
CRAM	`read_cram`	`scan_cram`	`register_cram`	`write_cram`	`sink_cram`
GFF	`read_gff`	`scan_gff`	`register_gff`	—	—
GTF	`read_gtf`	`scan_gtf`	`register_gtf`	—	—
FASTA	`read_fasta`	`scan_fasta`	—	—	—
FASTQ	`read_fastq`	`scan_fastq`	`register_fastq`	`write_fastq`	`sink_fastq`
SAM	`read_sam`	`scan_sam`	`register_sam`	`write_sam`	`sink_sam`
Hi-C pairs	`read_pairs`	`scan_pairs`	`register_pairs`	—	—
Generic table	`read_table`	`scan_table`	—	—	—

Common Cloud/IO Parameters

Most read_* and scan_* functions accept these parameters as individual keyword arguments (instead of a single storage_options dict):

Parameter	Type	Default	Description
`path`	str	required	File path (local, S3, GCS, Azure)
`chunk_size`	int	`8`	Number of chunks for parallel reading
`concurrent_fetches`	int	`1`	Number of concurrent fetches for cloud storage
`allow_anonymous`	bool	`True`	Allow anonymous access to cloud storage
`enable_request_payer`	bool	`False`	Enable requester-pays for cloud storage
`max_retries`	int	`5`	Maximum retries for cloud operations
`timeout`	int	`300`	Timeout in seconds for cloud operations
`compression_type`	str	`"auto"`	Compression type (auto-detected from extension)
`projection_pushdown`	bool	`True`	Enable projection pushdown optimization
`use_zero_based`	bool or None	`None`	Set coordinate system metadata (None = use global setting)

Not all functions support every parameter. SAM functions take no cloud parameters, and FASTA/FASTQ functions do not accept predicate_pushdown (a per-format parameter listed in the sections below).

BED Format

read_bed / scan_bed

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.

import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")

Column Schema (BED3)

Column	Type	Description
`chrom`	String	Chromosome name
`start`	Int64	Start position
`end`	Int64	End position

Extended BED fields (auto-detected) add: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.
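
To make the positional column naming and the 0-based half-open convention concrete, here is a minimal pure-Python sketch. It does not use polars-bio and is not how the library detects columns; `parse_bed_line` and `to_one_based_closed` are hypothetical helpers written only for illustration.

```python
# Standard BED field names in file order (BED3 uses the first three, BED12 all twelve).
BED_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line: str) -> dict:
    """Map a tab-separated BED line onto the standard field names by position."""
    values = line.rstrip("\n").split("\t")
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError(f"expected 3 to 12 columns, got {len(values)}")
    record = dict(zip(BED_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def to_one_based_closed(start: int, end: int) -> tuple:
    """Convert a 0-based half-open interval to 1-based closed coordinates."""
    return start + 1, end


rec = parse_bed_line("chr1\t100\t200\tpeak1\t960\t+")
print(rec["name"], rec["end"] - rec["start"])          # peak1 100
print(to_one_based_closed(rec["start"], rec["end"]))   # (101, 200)
```

In half-open coordinates the interval length is simply end minus start, which is why the 100..200 record above covers 100 bases.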

VCF Format

read_vcf / scan_vcf

Read VCF/BCF files. Supports .vcf, .vcf.gz, .bcf.

import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])

Additional Parameters

Parameter	Type	Default	Description
`info_fields`	list[str]	`None`	INFO fields to extract as columns
`format_fields`	list[str]	`None`	FORMAT fields to extract as columns
`samples`	list[str]	`None`	Samples to include
`predicate_pushdown`	bool	`True`	Enable predicate pushdown

Column Schema

Column	Type	Description
`chrom`	String	Chromosome
`start`	UInt32	Start position
`end`	UInt32	End position
`id`	String	Variant ID
`ref`	String	Reference allele
`alt`	String	Alternate allele(s)
`qual`	Float32	Quality score
`filter`	String	Filter status
`info`	String	INFO field (raw, unless `info_fields` specified)
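
The raw `info` column holds semicolon-separated `KEY=VALUE` entries, with bare keys acting as flags. The sketch below shows, in plain Python, what pulling selected INFO keys out of that string amounts to. It is an illustration of the VCF text format only, not polars-bio's implementation, and `parse_info` is a hypothetical helper.

```python
def parse_info(info: str) -> dict:
    """Split a raw VCF INFO string into a dict; flag entries (no '=') map to True."""
    if info in ("", "."):  # "." is the VCF missing-value marker
        return {}
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


info = parse_info("AF=0.25;DP=118;DB")
# Keep only the keys of interest, similar in spirit to info_fields=["AF", "DP"].
selected = {key: info.get(key) for key in ["AF", "DP"]}
print(selected)  # {'AF': '0.25', 'DP': '118'}
```

Values stay as strings here; typed extraction would need the type declared for each key in the VCF header.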

write_vcf / sink_vcf

import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")

BAM Format

read_bam / scan_bam

Read aligned sequencing reads from BAM files. Requires a .bai index file.

import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])

Additional Parameters

Parameter	Type	Default	Description
`tag_fields`	list[str]	`None`	SAM tags to extract as columns
`predicate_pushdown`	bool	`True`	Enable predicate pushdown
`infer_tag_types`	bool	`True`	Infer tag column types from data
`infer_tag_sample_size`	int	`100`	Number of records to sample for type inference
`tag_type_hints`	list[str]	`None`	Explicit type hints for tags

Column Schema

Column	Type	Description
`chrom`	String	Reference sequence name
`start`	Int64	Alignment start position
`end`	Int64	Alignment end position
`name`	String	Read name
`flags`	UInt32	SAM flags
`mapping_quality`	UInt32	Mapping quality
`cigar`	String	CIGAR string
`sequence`	String	Read sequence
`quality_scores`	String	Base quality string
`mate_chrom`	String	Mate reference name
`mate_start`	Int64	Mate start position
`template_length`	Int64	Template length

write_bam / sink_bam

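import polars_bio as pb

# Write a DataFrame to BAM; the return value is the number of rows written
rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

# Stream a LazyFrame to BAM
pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)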

CRAM Format

read_cram / scan_cram

CRAM files have their own functions, separate from BAM. They require a reference FASTA and a .crai index.

import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")

Same additional parameters and column schema as BAM, plus:

Parameter	Type	Default	Description
`reference_path`	str	`None`	Path to reference FASTA

write_cram / sink_cram

import polars_bio as pb

rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")

GFF/GTF Format

read_gff / scan_gff / read_gtf / scan_gtf

GFF3 and GTF have separate functions.

import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])

Additional Parameters

Parameter	Type	Default	Description
`attr_fields`	list[str]	`None`	Attribute fields to extract as columns
`predicate_pushdown`	bool	`True`	Enable predicate pushdown

Column Schema

Column	Type	Description
`chrom`	String	Sequence name
`source`	String	Feature source
`type`	String	Feature type (gene, exon, etc.)
`start`	Int64	Start position
`end`	Int64	End position
`score`	Float32	Score
`strand`	String	Strand (+/-/.)
`phase`	UInt32	Phase (0/1/2)
`attributes`	String	Attributes string
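
GFF3 and GTF encode the `attributes` column differently: GFF3 uses `key=value` pairs with percent-encoding, while GTF uses `key "value";` pairs. The following plain-Python sketch shows the two layouts side by side. Both helpers are hypothetical and simplified (they assume no semicolons inside quoted GTF values) and do not reflect how polars-bio parses attributes internally.

```python
from urllib.parse import unquote


def parse_gff3_attributes(attributes: str) -> dict:
    """Parse 'ID=gene1;Name=EDEN' style attributes (GFF3)."""
    result = {}
    for entry in attributes.strip().split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        result[key.strip()] = unquote(value.strip())  # GFF3 values are percent-encoded
    return result


def parse_gtf_attributes(attributes: str) -> dict:
    """Parse 'gene_id "X"; gene_name "Y";' style attributes (GTF)."""
    result = {}
    for entry in attributes.strip().split(";"):
        entry = entry.strip()
        if not entry:
            continue
        key, _, value = entry.partition(" ")
        result[key] = value.strip().strip('"')
    return result


print(parse_gff3_attributes("ID=gene0001;Name=EDEN;Note=a%3Bb"))
# {'ID': 'gene0001', 'Name': 'EDEN', 'Note': 'a;b'}
print(parse_gtf_attributes('gene_id "ENSG01"; gene_name "ABC";'))
# {'gene_id': 'ENSG01', 'gene_name': 'ABC'}
```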

FASTA Format

read_fasta / scan_fasta

Read reference sequences from FASTA files.

import polars_bio as pb

df = pb.read_fasta("reference.fasta")

Column Schema

Column	Type	Description
`name`	String	Sequence name
`description`	String	Description line
`sequence`	String	Nucleotide sequence
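
As a rough picture of how a FASTA record maps onto these three columns, here is a pure-Python sketch. It assumes the header is split at the first space into name and description, which is the common convention but is an assumption about polars-bio's behavior; `parse_fasta` is a hypothetical helper.

```python
def parse_fasta(text: str) -> list:
    """Turn FASTA text into records with name, description, and sequence keys."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed convention: first token is the name, the rest is the description.
            name, _, description = line[1:].partition(" ")
            records.append({"name": name, "description": description, "sequence": ""})
        elif records:
            # Sequence lines may be wrapped; join them into one string.
            records[-1]["sequence"] += line
    return records


fasta_text = ">chr1 test contig\nACGT\nTTGA\n>chr2\nGG\n"
for record in parse_fasta(fasta_text):
    print(record["name"], len(record["sequence"]))
# chr1 8
# chr2 2
```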

FASTQ Format

read_fastq / scan_fastq

Read raw sequencing reads with quality scores.

import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")

Column Schema

Column	Type	Description
`name`	String	Read name
`description`	String	Description line
`sequence`	String	Nucleotide sequence
`quality`	String	Quality string (Phred+33 encoded)
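
Because the `quality` column is a Phred+33 string, each character has to be decoded before it can be used numerically. A minimal sketch in plain Python, with hypothetical helper names:

```python
def phred33_scores(quality: str) -> list:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = phred33_scores("II5!")
print(scores)                    # [40, 40, 20, 0]
print(sum(scores) / len(scores)) # 25.0
```

A score of 20 corresponds to an error probability of about 1 in 100, and 40 to about 1 in 10,000.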

write_fastq / sink_fastq

import polars_bio as pb

rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")

SAM Format

read_sam / scan_sam

Read text-format alignment files. Same column schema as BAM. No cloud parameters.

import polars_bio as pb

df = pb.read_sam("alignments.sam")

Additional Parameters

Parameter	Type	Default	Description
`tag_fields`	list[str]	`None`	SAM tags to extract
`infer_tag_types`	bool	`True`	Infer tag types
`infer_tag_sample_size`	int	`100`	Sample size for inference
`tag_type_hints`	list[str]	`None`	Explicit type hints

write_sam / sink_sam

import polars_bio as pb

rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)

Hi-C Pairs

read_pairs / scan_pairs

Read Hi-C pairs format files for chromatin contact data.

import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")

Column Schema

Column	Type	Description
`readID`	String	Read identifier
`chrom1`	String	Chromosome of first contact
`pos1`	Int32	Position of first contact
`chrom2`	String	Chromosome of second contact
`pos2`	Int32	Position of second contact
`strand1`	String	Strand of first contact
`strand2`	String	Strand of second contact
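
To show how a line of a .pairs file lines up with the schema above, the sketch below parses the text format in plain Python and separates cis (same chromosome) from trans contacts. It assumes the seven standard columns in this order and `#`-prefixed header lines; `parse_pairs` and `is_cis` are hypothetical helpers, not part of polars-bio.

```python
PAIRS_COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]


def parse_pairs(text: str) -> list:
    """Parse .pairs text into dicts, skipping '#' header lines."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")[: len(PAIRS_COLUMNS)]  # ignore optional extra columns
        record = dict(zip(PAIRS_COLUMNS, values))
        record["pos1"] = int(record["pos1"])
        record["pos2"] = int(record["pos2"])
        records.append(record)
    return records


def is_cis(record: dict) -> bool:
    """A contact is cis when both ends fall on the same chromosome."""
    return record["chrom1"] == record["chrom2"]


pairs_text = (
    "## pairs format v1.0\n"
    "r1\tchr1\t1000\tchr1\t5000\t+\t-\n"
    "r2\tchr1\t2000\tchr2\t300\t-\t-\n"
)
contacts = parse_pairs(pairs_text)
print([is_cis(c) for c in contacts])  # [True, False]
```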

Generic Table Reader

read_table / scan_table

Read tab-delimited files with a custom schema. Useful for non-standard formats or bioframe-compatible tables.

import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
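
The schema dict maps column names to Python types. As a rough mental model only, the standard-library sketch below applies such a dict to headerless tab-separated text. Whether polars-bio expects a header row, and how it casts values, is not stated here, so treat `read_tsv_with_schema` as a hypothetical stand-in rather than a description of read_table.

```python
import csv
import io


def read_tsv_with_schema(text: str, schema: dict) -> list:
    """Read headerless TSV text, naming and casting columns per the schema dict."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    names = list(schema)  # dicts preserve insertion order, so this is the column order
    rows = []
    for values in reader:
        if not values:
            continue
        rows.append({name: schema[name](value) for name, value in zip(names, values)})
    return rows


schema = {"chrom": str, "start": int, "end": int, "name": str}
rows = read_tsv_with_schema("chr1\t10\t20\tfeatA\nchr2\t5\t15\tfeatB\n", schema)
print(rows[0])  # {'chrom': 'chr1', 'start': 10, 'end': 20, 'name': 'featA'}
```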

Cloud Storage

Most read_* and scan_* functions support cloud storage via individual parameters (SAM functions, for example, do not):

Amazon S3

df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)

Google Cloud Storage

df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)

Azure Blob Storage

df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)

Note: For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS).
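
Before calling a reader with allow_anonymous=False, it can help to check that the expected credentials are actually set. The sketch below is one hypothetical way to do that with the standard library. The variable list covers only the two names in the note plus AWS_SECRET_ACCESS_KEY, and it is an assumption that these are sufficient for polars-bio; other mechanisms (profiles, instance roles, Azure settings) are not covered.

```python
import os
from urllib.parse import urlparse

# Assumption: a minimal set of environment variables per URL scheme.
CREDENTIAL_ENV_VARS = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_credentials(path: str, env=None) -> list:
    """Return the credential variables that are unset for the path's URL scheme."""
    env = os.environ if env is None else env
    scheme = urlparse(path).scheme
    return [name for name in CREDENTIAL_ENV_VARS.get(scheme, []) if not env.get(name)]


# Example with an explicit environment mapping instead of the real os.environ.
fake_env = {"AWS_ACCESS_KEY_ID": "example-key-id"}
print(missing_credentials("s3://bucket/regions.bed", fake_env))
# ['AWS_SECRET_ACCESS_KEY']
```

Local paths have no scheme, so the function returns an empty list for them.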

Compression Support

polars-bio transparently handles compressed files:

Compression	Extension	Parallel Decompression
GZIP	`.gz`	No
BGZF	`.gz` (with BGZF blocks)	Yes
Uncompressed	(none)	N/A

Recommendation: Use BGZF compression (e.g., created with bgzip) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.
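
Since BGZF and plain GZIP files share the `.gz` extension, the only way to tell them apart is to inspect the gzip header: BGZF blocks set the FEXTRA flag and carry a `BC` extra subfield, as defined in the SAM/BAM specification. The standard-library sketch below performs that check on the first bytes of a file. `is_bgzf` is a hypothetical helper and says nothing about how polars-bio itself detects compression.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a gzip member carrying the BGZF 'BC' subfield."""
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False  # not gzip with deflate
    if not header[3] & 0x04:
        return False  # FLG.FEXTRA is not set, so there is no extra field
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12 : 12 + xlen]
    offset = 0
    while offset + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, offset)
        if si1 == 66 and si2 == 67 and slen == 2:  # 'B', 'C', 2-byte block size
            return True
        offset += 4 + slen
    return False


# The 28-byte empty BGZF block that terminates every BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)

print(is_bgzf(BGZF_EOF))                # True
print(is_bgzf(gzip.compress(b"ACGT")))  # False
```

In practice the first 18 bytes of a file are enough for the check, for example `is_bgzf(open(path, "rb").read(18))`.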

Describe Functions

Inspect a file's schema and metadata without reading the whole file:

import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
