mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Adds a new skill covering polars-bio (v0.26.0), a high-performance library for genomic interval arithmetic and file I/O built on Polars, Arrow, and DataFusion. All code examples verified against the actual API at runtime. SKILL.md covers overlap, nearest, merge, coverage, complement, subtract, cluster, count_overlaps operations plus read/scan/write/sink for BED, VCF, BAM, CRAM, GFF, GTF, FASTA, FASTQ, SAM, and Hi-C pairs formats. References: interval_operations, file_io, sql_processing, pileup_operations, configuration, bioframe_migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Bioinformatics File I/O

## Overview

polars-bio provides `read_*`, `scan_*`, `write_*`, and `sink_*` functions for common bioinformatics formats. `read_*` loads data eagerly into a DataFrame, while `scan_*` creates a LazyFrame for streaming/out-of-core processing. `write_*` writes from a DataFrame or LazyFrame and returns a row count, while `sink_*` streams a LazyFrame to disk.

## Supported Formats

| Format | Read | Scan | Register (SQL) | Write | Sink |
|--------|------|------|----------------|-------|------|
| BED | `read_bed` | `scan_bed` | `register_bed` | — | — |
| VCF | `read_vcf` | `scan_vcf` | `register_vcf` | `write_vcf` | `sink_vcf` |
| BAM | `read_bam` | `scan_bam` | `register_bam` | `write_bam` | `sink_bam` |
| CRAM | `read_cram` | `scan_cram` | `register_cram` | `write_cram` | `sink_cram` |
| GFF | `read_gff` | `scan_gff` | `register_gff` | — | — |
| GTF | `read_gtf` | `scan_gtf` | `register_gtf` | — | — |
| FASTA | `read_fasta` | `scan_fasta` | — | — | — |
| FASTQ | `read_fastq` | `scan_fastq` | `register_fastq` | `write_fastq` | `sink_fastq` |
| SAM | `read_sam` | `scan_sam` | `register_sam` | `write_sam` | `sink_sam` |
| Hi-C pairs | `read_pairs` | `scan_pairs` | `register_pairs` | — | — |
| Generic table | `read_table` | `scan_table` | — | — | — |

## Common Cloud/IO Parameters

All `read_*` and `scan_*` functions share these parameters (instead of a single `storage_options` dict):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | required | File path (local, S3, GCS, Azure) |
| `chunk_size` | int | `8` | Number of chunks for parallel reading |
| `concurrent_fetches` | int | `1` | Number of concurrent fetches for cloud storage |
| `allow_anonymous` | bool | `True` | Allow anonymous access to cloud storage |
| `enable_request_payer` | bool | `False` | Enable requester-pays for cloud storage |
| `max_retries` | int | `5` | Maximum retries for cloud operations |
| `timeout` | int | `300` | Timeout in seconds for cloud operations |
| `compression_type` | str | `"auto"` | Compression type (auto-detected from extension) |
| `projection_pushdown` | bool | `True` | Enable projection pushdown optimization |
| `use_zero_based` | bool | `None` | Set coordinate system metadata (`None` = use global setting) |

Not all functions support every parameter: SAM functions take no cloud parameters, and FASTA/FASTQ readers lack `predicate_pushdown`.

## BED Format

### read_bed / scan_bed

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based, half-open coordinates; polars-bio attaches coordinate-system metadata automatically.

```python
import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")
```

### Column Schema (BED3)

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome name |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |

Extended BED fields (auto-detected) add: `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`, `blockCount`, `blockSizes`, `blockStarts`.
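
As a quick standard-library illustration (independent of polars-bio) of the 0-based, half-open convention noted above — interval length is a plain subtraction, with no off-by-one correction:

```python
# Illustration only (stdlib, not polars-bio): parsing BED4 lines and
# computing interval lengths under 0-based, half-open coordinates.
bed_lines = [
    "chr1\t100\t200\tpromoter",
    "chr1\t150\t250\tenhancer",
]

records = []
for line in bed_lines:
    chrom, start, end, name = line.split("\t")
    start, end = int(start), int(end)
    # Half-open [start, end): length is simply end - start
    records.append({"chrom": chrom, "start": start, "end": end,
                    "length": end - start, "name": name})

print([r["length"] for r in records])  # [100, 100]
```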
## VCF Format

### read_vcf / scan_vcf

Read VCF/BCF files. Supports `.vcf`, `.vcf.gz`, and `.bcf`.

```python
import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `info_fields` | list[str] | `None` | INFO fields to extract as columns |
| `format_fields` | list[str] | `None` | FORMAT fields to extract as columns |
| `samples` | list[str] | `None` | Samples to include |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Chromosome |
| `start` | UInt32 | Start position |
| `end` | UInt32 | End position |
| `id` | String | Variant ID |
| `ref` | String | Reference allele |
| `alt` | String | Alternate allele(s) |
| `qual` | Float32 | Quality score |
| `filter` | String | Filter status |
| `info` | String | INFO field (raw, unless `info_fields` is specified) |

### write_vcf / sink_vcf

```python
import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")
```
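
For orientation, the raw `info` column follows VCF's `key=value;…` layout. A standard-library sketch (not a polars-bio API) of splitting it into the per-key values that `info_fields` extraction materializes as columns:

```python
# Stdlib illustration (not a polars-bio API): split a raw VCF INFO
# string into a dict, the shape that info_fields extraction produces.
def parse_info(info: str) -> dict:
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        # Flag entries (e.g. "DB") have no "=": store as True
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

print(parse_info("AF=0.5;DP=20;DB"))  # {'AF': '0.5', 'DP': '20', 'DB': True}
```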
## BAM Format

### read_bam / scan_bam

Read aligned sequencing reads from BAM files. Requires a `.bai` index file.

```python
import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |
| `infer_tag_types` | bool | `True` | Infer tag column types from data |
| `infer_tag_sample_size` | int | `100` | Number of records to sample for type inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints for tags |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Reference sequence name |
| `start` | Int64 | Alignment start position |
| `end` | Int64 | Alignment end position |
| `name` | String | Read name |
| `flags` | UInt32 | SAM flags |
| `mapping_quality` | UInt32 | Mapping quality |
| `cigar` | String | CIGAR string |
| `sequence` | String | Read sequence |
| `quality_scores` | String | Base quality string |
| `mate_chrom` | String | Mate reference name |
| `mate_start` | Int64 | Mate start position |
| `template_length` | Int64 | Template length |

### write_bam / sink_bam

```python
rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)
```
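
The `end` column relates to `start` through the reference span of the CIGAR string (per the SAM specification, operations `M`, `D`, `N`, `=`, and `X` consume reference positions). A standard-library sketch of that relationship, not polars-bio code:

```python
import re

# Stdlib illustration (not polars-bio code): alignment end equals start
# plus the reference-consuming length of the CIGAR string.
REF_CONSUMING = set("MDN=X")  # ops that advance the reference position

def alignment_end(start: int, cigar: str) -> int:
    span = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_CONSUMING)
    return start + span

print(alignment_end(100, "50M2D48M"))  # 200 (insertions/clips don't count)
```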
## CRAM Format

### read_cram / scan_cram

CRAM files have **separate functions** from BAM. They require a reference FASTA and a `.crai` index.

```python
import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")
```

Same additional parameters and column schema as BAM, plus:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `reference_path` | str | `None` | Path to reference FASTA |

### write_cram / sink_cram

```python
rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")
```
## GFF/GTF Format

### read_gff / scan_gff / read_gtf / scan_gtf

GFF3 and GTF have separate functions.

```python
import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `attr_fields` | list[str] | `None` | Attribute fields to extract as columns |
| `predicate_pushdown` | bool | `True` | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `chrom` | String | Sequence name |
| `source` | String | Feature source |
| `type` | String | Feature type (gene, exon, etc.) |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
| `score` | Float32 | Score |
| `strand` | String | Strand (+/-/.) |
| `phase` | UInt32 | Phase (0/1/2) |
| `attributes` | String | Attributes string |
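
The raw `attributes` column in GFF3 uses semicolon-separated `key=value` pairs. A standard-library sketch (not a polars-bio API) of the split that `attr_fields` extraction performs:

```python
# Stdlib illustration (not a polars-bio API): parse a GFF3 attributes
# string ("key=value" pairs separated by ";") into a dict.
def parse_gff3_attributes(attrs: str) -> dict:
    pairs = (item.partition("=") for item in attrs.split(";") if item)
    return {key: value for key, _, value in pairs}

print(parse_gff3_attributes("ID=gene0001;Name=ABC1"))
# {'ID': 'gene0001', 'Name': 'ABC1'}
```

GTF attributes use a different `key "value";` syntax, which is one reason the two formats have separate functions.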
## FASTA Format

### read_fasta / scan_fasta

Read reference sequences from FASTA files.

```python
import polars_bio as pb

df = pb.read_fasta("reference.fasta")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Sequence name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
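
To make the schema concrete, here is a minimal standard-library parser (not polars-bio code) showing how a FASTA header line maps onto the `name`/`description`/`sequence` columns above:

```python
# Stdlib illustration (not polars-bio code): the first word of a FASTA
# header becomes `name`, the rest `description`, wrapped lines join
# into a single `sequence` string.
def parse_fasta(text: str) -> list[dict]:
    records = []
    for block in text.lstrip(">").split("\n>"):
        header, _, body = block.partition("\n")
        name, _, description = header.partition(" ")
        records.append({"name": name, "description": description,
                        "sequence": body.replace("\n", "")})
    return records

fasta = ">chr1 test contig\nACGT\nACGT\n>chr2\nTTTT\n"
recs = parse_fasta(fasta)
print([r["name"] for r in recs])  # ['chr1', 'chr2']
```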
## FASTQ Format

### read_fastq / scan_fastq

Read raw sequencing reads with quality scores.

```python
import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Read name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
| `quality` | String | Quality string (Phred+33 encoded) |

### write_fastq / sink_fastq

```python
rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")
```
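
Since the `quality` column is Phred+33 encoded, each character encodes a score as its ASCII code minus 33. A one-line standard-library decoder (not a polars-bio API):

```python
# Stdlib illustration (not a polars-bio API): decode a Phred+33
# quality string into integer scores, one per base.
def phred33_scores(quality: str) -> list[int]:
    return [ord(c) - 33 for c in quality]

print(phred33_scores("II!5"))  # [40, 40, 0, 20]
```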
## SAM Format

### read_sam / scan_sam

Read text-format alignment files. Same column schema as BAM; no cloud parameters.

```python
import polars_bio as pb

df = pb.read_sam("alignments.sam")
```

### Additional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tag_fields` | list[str] | `None` | SAM tags to extract |
| `infer_tag_types` | bool | `True` | Infer tag types |
| `infer_tag_sample_size` | int | `100` | Sample size for inference |
| `tag_type_hints` | list[str] | `None` | Explicit type hints |

### write_sam / sink_sam

```python
rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)
```
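
The `flags` column (shared with the BAM schema) is a bitfield defined by the SAM specification. A standard-library sketch (not polars-bio code) decoding a few common bits:

```python
# Stdlib illustration (not polars-bio code): decode common SAM FLAG
# bits; bit meanings are from the SAM specification.
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x10: "reverse", 0x40: "first_in_pair", 0x80: "second_in_pair",
}

def decode_flags(flags: int) -> list[str]:
    return [name for bit, name in SAM_FLAGS.items() if flags & bit]

print(decode_flags(83))  # ['paired', 'proper_pair', 'reverse', 'first_in_pair']
```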
## Hi-C Pairs

### read_pairs / scan_pairs

Read Hi-C pairs format files for chromatin contact data.

```python
import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")
```

### Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `readID` | String | Read identifier |
| `chrom1` | String | Chromosome of first contact |
| `pos1` | Int32 | Position of first contact |
| `chrom2` | String | Chromosome of second contact |
| `pos2` | Int32 | Position of second contact |
| `strand1` | String | Strand of first contact |
| `strand2` | String | Strand of second contact |
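
A body line of a `.pairs` file is tab-separated in the column order of the schema above. A standard-library sketch (not polars-bio code) of parsing one line:

```python
# Stdlib illustration (not polars-bio code): split a .pairs body line
# into the columns listed above, casting positions to int.
COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]

def parse_pairs_line(line: str) -> dict:
    record = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    record["pos1"] = int(record["pos1"])
    record["pos2"] = int(record["pos2"])
    return record

rec = parse_pairs_line("read1\tchr1\t100\tchr2\t250\t+\t-\n")
print(rec["chrom1"], rec["pos2"])  # chr1 250
```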
## Generic Table Reader

### read_table / scan_table

Read tab-delimited files with a custom schema. Useful for non-standard formats or bioframe-compatible tables.

```python
import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
```
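
Conceptually, the `schema` argument maps column names to types in file order. A standard-library sketch (not polars-bio code) of applying such a `{column: type}` mapping to a headerless TSV:

```python
import csv
import io

# Stdlib illustration (not polars-bio code): apply a {column: type}
# schema while reading a headerless TSV.
schema = {"chrom": str, "start": int, "end": int}

tsv = "chr1\t100\t200\nchr2\t50\t75\n"
reader = csv.reader(io.StringIO(tsv), delimiter="\t")
rows = [{col: cast(value) for (col, cast), value in zip(schema.items(), row)}
        for row in reader]

print(rows[1])  # {'chrom': 'chr2', 'start': 50, 'end': 75}
```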
## Cloud Storage

All `read_*` and `scan_*` functions support cloud storage via individual parameters:

### Amazon S3

```python
df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)
```

### Google Cloud Storage

```python
df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)
```

### Azure Blob Storage

```python
df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)
```

**Note:** For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., `AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`).
## Compression Support

polars-bio transparently handles compressed files:

| Compression | Extension | Parallel Decompression |
|-------------|-----------|------------------------|
| GZIP | `.gz` | No |
| BGZF | `.gz` (with BGZF blocks) | Yes |
| Uncompressed | (none) | N/A |

**Recommendation:** Use BGZF compression (e.g., created with `bgzip`) for large files. BGZF supports parallel block decompression, which significantly improves read performance compared to plain GZIP.
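
Because both variants share the `.gz` extension, the distinction lives in the gzip header: BGZF files are valid gzip streams whose extra field carries a `BC` subfield. A standard-library sketch (not polars-bio code) of that detection:

```python
import gzip

# Stdlib illustration (not polars-bio code): distinguish plain gzip
# from BGZF by the gzip magic bytes and the "BC" extra subfield.
def detect_compression(data: bytes) -> str:
    if data[:2] != b"\x1f\x8b":
        return "uncompressed"
    if data[3] & 0x04:  # FEXTRA flag set: scan the extra subfields
        xlen = int.from_bytes(data[10:12], "little")
        extra, pos = data[12:12 + xlen], 0
        while pos + 4 <= len(extra):
            si = extra[pos:pos + 2]
            slen = int.from_bytes(extra[pos + 2:pos + 4], "little")
            if si == b"BC":
                return "bgzf"
            pos += 4 + slen
    return "gzip"

plain = gzip.compress(b"chr1\t100\t200\n")
print(detect_compression(plain))  # gzip (Python's gzip module doesn't write BGZF)
```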
## Describe Functions

Inspect file structure without fully reading the file:

```python
import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
```