mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add critical VCF ingestion requirements
- VCFs must be single-sample (multi-sample not supported)
- Index files (.csi or .tbi) are required for all VCF/BCF files
- Add indexing examples with bcftools and tabix
- Document requirements prominently in both main skill and ingestion guide
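As a rough illustration of the single-sample requirement above, the sketch below reads the `#CHROM` header line of a plain-text or bgzip-compressed VCF and counts its sample columns. The helper names `sample_names` and `is_single_sample` are hypothetical and not part of TileDB-VCF, and binary BCF files are not handled here.

```python
import gzip
from pathlib import Path


def sample_names(vcf_path):
    """Return the sample names declared on the #CHROM header line of a VCF."""
    path = Path(vcf_path)
    # bgzip output is gzip-compatible, so gzip.open can read .vcf.gz files.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\n").split("\t")
                # Eight fixed columns plus FORMAT come before the samples.
                return columns[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {path}")


def is_single_sample(vcf_path):
    """True when the VCF declares exactly one sample column."""
    return len(sample_names(vcf_path)) == 1
```

Files for which `is_single_sample` returns `False` would need to be split per sample before being passed to `ingest_samples`.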
@@ -63,7 +63,10 @@ import tiledbvcf
 ds = tiledbvcf.Dataset(uri="my_dataset", mode="w",
                        cfg=tiledbvcf.ReadConfig(memory_budget=1024))
 
-# Ingest VCF files (can be run incrementally)
+# Ingest VCF files (must be single-sample with indexes)
+# Requirements:
+# - VCFs must be single-sample (not multi-sample)
+# - Must have indexes: .csi (bcftools) or .tbi (tabix)
 ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
 ```
 
@@ -100,6 +103,10 @@ ds.export(
 
 Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.
 
+**Requirements:**
+- **Single-sample VCFs only**: Multi-sample VCFs are not supported
+- **Index files required**: VCF/BCF files must have indexes (.csi or .tbi)
+
 **Common operations:**
 - Create new datasets with optimized array schemas
 - Ingest single or multiple VCF/BCF files in parallel
@@ -2,6 +2,22 @@
 
 Complete guide to creating TileDB-VCF datasets and ingesting VCF/BCF files with optimal performance and reliability.
 
+## Important Requirements
+
+**Before ingesting VCF files, ensure they meet these requirements:**
+
+- **Single-sample VCFs only**: Multi-sample VCFs are not supported by TileDB-VCF
+- **Index files required**: All VCF/BCF files must have corresponding index files:
+  - `.csi` files (created with `bcftools index`)
+  - `.tbi` files (created with `tabix`)
+
+```bash
+# Create indexes if they don't exist
+bcftools index sample.vcf.gz  # Creates sample.vcf.gz.csi
+# OR
+tabix -p vcf sample.vcf.gz    # Creates sample.vcf.gz.tbi
+```
+
 ## Dataset Creation
 
 ### Basic Dataset Creation
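The index requirement added in this hunk can be checked before calling `ingest_samples`. The following is a minimal sketch, assuming each index sits next to its data file with `.csi` or `.tbi` appended to the full file name, as in the `bcftools index` and `tabix` examples; `find_index` and `missing_indexes` are hypothetical helper names, not part of the tiledbvcf API.

```python
from pathlib import Path


def find_index(vcf_path):
    """Return the adjacent .csi or .tbi index path, or None if neither exists."""
    for suffix in (".csi", ".tbi"):
        # Append to the full name: sample.vcf.gz -> sample.vcf.gz.csi
        candidate = Path(str(vcf_path) + suffix)
        if candidate.is_file():
            return candidate
    return None


def missing_indexes(vcf_paths):
    """List the input files that have no index file next to them."""
    return [str(p) for p in vcf_paths if find_index(p) is None]
```

Calling `missing_indexes(["sample1.vcf.gz", "sample2.vcf.gz"])` before ingestion and stopping when the returned list is non-empty surfaces unindexed files up front, with a clear list of what still needs `bcftools index` or `tabix`.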