Add critical VCF ingestion requirements

- VCFs must be single-sample (multi-sample not supported)
- Index files (.csi or .tbi) are required for all VCF/BCF files
- Add indexing examples with bcftools and tabix
- Document requirements prominently in both main skill and ingestion guide
Jeremy Leipzig
2026-02-24 11:07:20 -07:00
parent 07e8e0e284
commit 3f76537f75
2 changed files with 24 additions and 1 deletion

@@ -63,7 +63,10 @@ import tiledbvcf
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w",
                       cfg=tiledbvcf.ReadConfig(memory_budget=1024))
-# Ingest VCF files (can be run incrementally)
+# Ingest VCF files (must be single-sample with indexes)
+# Requirements:
+# - VCFs must be single-sample (not multi-sample)
+# - Must have indexes: .csi (bcftools) or .tbi (tabix)
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
```
@@ -100,6 +103,10 @@ ds.export(
Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.
**Requirements:**
- **Single-sample VCFs only**: Multi-sample VCFs are not supported
- **Index files required**: VCF/BCF files must have indexes (.csi or .tbi)
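One way to check the single-sample requirement before ingestion is to count the sample columns in the VCF header. The sketch below is illustrative and not part of TileDB-VCF: `count_samples` is a hypothetical helper, it handles text VCFs only (plain or gzip/bgzip-compressed, not BCF), and it assumes the standard tab-delimited `#CHROM` header line with nine fixed columns ending in `FORMAT`.

```python
import gzip
from pathlib import Path

# Fixed header columns: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
FIXED_COLUMNS = 9


def count_samples(vcf_path):
    """Return the number of sample columns in a text VCF (.vcf or .vcf.gz).

    Hypothetical helper for illustration; BCF input is not handled.
    """
    path = Path(vcf_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\n").split("\t")
                # Sites-only VCFs have no FORMAT column, so clamp at zero.
                return max(len(columns) - FIXED_COLUMNS, 0)
            if not line.startswith("#"):
                break  # Reached data records without seeing a header line.
    raise ValueError(f"no #CHROM header line found in {path}")
```

A file is a candidate for ingestion under the requirement above when `count_samples(path) == 1`; any larger count indicates a multi-sample VCF that would need to be split first.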
**Common operations:**
- Create new datasets with optimized array schemas
- Ingest single or multiple VCF/BCF files in parallel

@@ -2,6 +2,22 @@
Complete guide to creating TileDB-VCF datasets and ingesting VCF/BCF files with optimal performance and reliability.
## Important Requirements
**Before ingesting VCF files, ensure they meet these requirements:**
- **Single-sample VCFs only**: Multi-sample VCFs are not supported by TileDB-VCF
- **Index files required**: All VCF/BCF files must have corresponding index files:
  - `.csi` files (created with `bcftools index`)
  - `.tbi` files (created with `tabix`)
```bash
# Create indexes if they don't exist
bcftools index sample.vcf.gz # Creates sample.vcf.gz.csi
# OR
tabix -p vcf sample.vcf.gz # Creates sample.vcf.gz.tbi
```
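Before calling `ingest_samples`, it can help to confirm that every input already has an index. The following is a minimal sketch, not part of TileDB-VCF: `missing_indexes` is a hypothetical helper, and it assumes each index sits next to its VCF/BCF file with `.csi` or `.tbi` appended to the full file name, as in the commands above.

```python
from pathlib import Path

# Index suffixes named in the requirements above.
INDEX_SUFFIXES = (".csi", ".tbi")


def missing_indexes(vcf_paths):
    """Return the input paths that have neither a .csi nor a .tbi file beside them.

    Hypothetical helper for illustration; assumes the index file name is the
    VCF/BCF file name with the suffix appended (e.g. sample.vcf.gz.csi).
    """
    missing = []
    for vcf in vcf_paths:
        has_index = any(Path(str(vcf) + suffix).exists() for suffix in INDEX_SUFFIXES)
        if not has_index:
            missing.append(str(vcf))
    return missing
```

For example, `missing_indexes(["sample1.vcf.gz", "sample2.vcf.gz"])` returns the subset of those files that still need `bcftools index` or `tabix` run on them; an empty list means every file has an index.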
## Dataset Creation
### Basic Dataset Creation