Add critical VCF ingestion requirements

- VCFs must be single-sample (multi-sample not supported)
- Index files (.csi or .tbi) are required for all VCF/BCF files
- Add indexing examples with bcftools and tabix
- Document requirements prominently in both main skill and ingestion guide
Jeremy Leipzig
2026-02-24 11:07:20 -07:00
parent 07e8e0e284
commit 3f76537f75
2 changed files with 24 additions and 1 deletion


@@ -63,7 +63,10 @@ import tiledbvcf
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w",
cfg=tiledbvcf.ReadConfig(memory_budget=1024))
-# Ingest VCF files (can be run incrementally)
+# Ingest VCF files (must be single-sample with indexes)
+# Requirements:
+# - VCFs must be single-sample (not multi-sample)
+# - Must have indexes: .csi (bcftools) or .tbi (tabix)
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
```
@@ -100,6 +103,10 @@ ds.export(
Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.
+**Requirements:**
+- **Single-sample VCFs only**: Multi-sample VCFs are not supported
+- **Index files required**: VCF/BCF files must have indexes (.csi or .tbi)
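
Both requirements can be checked before ingestion. The following is a minimal sketch, not part of TileDB-VCF: the helper names (`has_index`, `count_samples`, `check_ingest_inputs`) are hypothetical. It assumes the index sits next to the data file under the default name written by `bcftools index` or `tabix` (`<file>.csi` or `<file>.tbi`), and it reads the sample count from the `#CHROM` header line, so it covers plain-text and gzip/bgzip VCFs but not BCF.

```python
import gzip
from pathlib import Path

# Default index names written next to the data file (assumption: not relocated).
INDEX_SUFFIXES = (".csi", ".tbi")


def has_index(vcf_path):
    """True if <vcf_path>.csi or <vcf_path>.tbi exists next to the file."""
    return any(Path(f"{vcf_path}{suffix}").exists() for suffix in INDEX_SUFFIXES)


def count_samples(vcf_path):
    """Count sample columns on the #CHROM header line of a text or gzipped VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # 8 fixed columns, then FORMAT, then one column per sample.
                return max(len(line.rstrip("\n").split("\t")) - 9, 0)
            if not line.startswith("#"):
                break  # reached data records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {vcf_path}")


def check_ingest_inputs(vcf_paths):
    """Return (path, problem) pairs; an empty list means every check passed."""
    problems = []
    for path in vcf_paths:
        if not has_index(path):
            problems.append((str(path), "missing .csi/.tbi index"))
        if count_samples(path) != 1:
            problems.append((str(path), "not single-sample"))
    return problems
```

Calling `check_ingest_inputs(["sample1.vcf.gz", "sample2.vcf.gz"])` before `ds.ingest_samples(...)` and stopping on a non-empty result reports every offending file in one pass.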
**Common operations:**
- Create new datasets with optimized array schemas
- Ingest single or multiple VCF/BCF files in parallel