Add scVelo RNA velocity analysis workflow and IQ-TREE reference documentation

- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization.
- Added a script for running RNA velocity analysis with customizable parameters and output options.
- Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation.
- Included references for velocity models and their mathematical framework, along with a comparison of different models.
- Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
This commit is contained in:
huangkuanlin
2026-03-03 07:15:36 -05:00
parent b271271df4
commit 7f94783fab
27 changed files with 6961 additions and 0 deletions

View File

@@ -0,0 +1,395 @@
---
name: gnomad-database
description: Query gnomAD (Genome Aggregation Database) for population allele frequencies, variant constraint scores (pLI, LOEUF), and loss-of-function intolerance. Essential for variant pathogenicity interpretation, rare disease genetics, and identifying loss-of-function intolerant genes.
license: CC0-1.0
metadata:
skill-author: Kuan-lin Huang
---
# gnomAD Database
## Overview
The Genome Aggregation Database (gnomAD) is the largest publicly available collection of human genetic variation, aggregated from large-scale sequencing projects. gnomAD v4 contains exome sequences from 730,947 individuals and genome sequences from 76,215 individuals across diverse ancestries. It provides population allele frequencies, variant consequence annotations, and gene-level constraint metrics that are essential for interpreting the clinical significance of genetic variants.
**Key resources:**
- gnomAD browser: https://gnomad.broadinstitute.org/
- GraphQL API: https://gnomad.broadinstitute.org/api
- Data downloads: https://gnomad.broadinstitute.org/downloads
- Documentation: https://gnomad.broadinstitute.org/help
## When to Use This Skill
Use gnomAD when:
- **Variant frequency lookup**: Checking if a variant is rare, common, or absent in the general population
- **Pathogenicity assessment**: Rare variants (MAF < 1%) are candidates for disease causation; gnomAD helps filter benign common variants
- **Loss-of-function intolerance**: Using pLI and LOEUF scores to assess whether a gene tolerates protein-truncating variants
- **Population-stratified frequencies**: Comparing allele frequencies across ancestries (African/African American, Admixed American, Ashkenazi Jewish, East Asian, Finnish, Middle Eastern, Non-Finnish European, South Asian)
- **ClinVar/ACMG variant classification**: gnomAD frequency data feeds into BA1/BS1 evidence codes for variant classification
- **Constraint analysis**: Identifying genes depleted of missense or loss-of-function variation (z-scores, pLI, LOEUF)
## Core Capabilities
### 1. gnomAD GraphQL API
gnomAD uses a GraphQL API accessible at `https://gnomad.broadinstitute.org/api`. Most queries fetch variants by gene or specific genomic position.
**Datasets available:**
- `gnomad_r4` — gnomAD v4 exomes (recommended default, GRCh38)
- `gnomad_r4_genomes` — gnomAD v4 genomes (GRCh38)
- `gnomad_r3` — gnomAD v3 genomes (GRCh38)
- `gnomad_r2_1` — gnomAD v2 exomes (GRCh37)
**Reference genomes:**
- `GRCh38` — default for v3/v4
- `GRCh37` — for v2
### 2. Querying Variants by Gene
```python
import requests
def query_gnomad_gene(gene_symbol, dataset="gnomad_r4", reference_genome="GRCh38"):
"""Fetch variants in a gene from gnomAD."""
url = "https://gnomad.broadinstitute.org/api"
query = """
query GeneVariants($gene_symbol: String!, $dataset: DatasetId!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
gene_symbol
variants(dataset: $dataset) {
variant_id
pos
ref
alt
consequence
genome {
af
ac
an
ac_hom
populations {
id
ac
an
af
}
}
exome {
af
ac
an
ac_hom
}
lof
lof_flags
lof_filter
}
}
}
"""
variables = {
"gene_symbol": gene_symbol,
"dataset": dataset,
"reference_genome": reference_genome
}
response = requests.post(url, json={"query": query, "variables": variables})
return response.json()
# Example
result = query_gnomad_gene("BRCA1")
gene_data = result["data"]["gene"]
variants = gene_data["variants"]
# Filter to rare PTVs
rare_ptvs = [
v for v in variants
if v.get("lof") == "LC" or v.get("consequence") in ["stop_gained", "frameshift_variant"]
and v.get("genome", {}).get("af", 1) < 0.001
]
print(f"Found {len(rare_ptvs)} rare PTVs in {gene_data['gene_symbol']}")
```
### 3. Querying a Specific Variant
```python
import requests
def query_gnomad_variant(variant_id, dataset="gnomad_r4"):
"""Fetch details for a specific variant (e.g., '1-55516888-G-GA')."""
url = "https://gnomad.broadinstitute.org/api"
query = """
query VariantDetails($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
chrom
pos
ref
alt
genome {
af
ac
an
ac_hom
populations {
id
ac
an
af
}
}
exome {
af
ac
an
ac_hom
populations {
id
ac
an
af
}
}
consequence
lof
rsids
in_silico_predictors {
id
value
flags
}
clinvar_variation_id
}
}
"""
response = requests.post(
url,
json={"query": query, "variables": {"variantId": variant_id, "dataset": dataset}}
)
return response.json()
# Example: query a specific variant
result = query_gnomad_variant("17-43094692-G-A") # BRCA1 missense
variant = result["data"]["variant"]
if variant:
genome_af = variant.get("genome", {}).get("af", "N/A")
exome_af = variant.get("exome", {}).get("af", "N/A")
print(f"Variant: {variant['variant_id']}")
print(f" Consequence: {variant['consequence']}")
print(f" Genome AF: {genome_af}")
print(f" Exome AF: {exome_af}")
print(f" LoF: {variant.get('lof')}")
```
### 4. Gene Constraint Scores
gnomAD constraint scores assess how tolerant a gene is to variation relative to expectation:
```python
import requests
def query_gnomad_constraint(gene_symbol, reference_genome="GRCh38"):
"""Fetch constraint scores for a gene."""
url = "https://gnomad.broadinstitute.org/api"
query = """
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
gene_symbol
gnomad_constraint {
exp_lof
exp_mis
exp_syn
obs_lof
obs_mis
obs_syn
oe_lof
oe_mis
oe_syn
oe_lof_lower
oe_lof_upper
lof_z
mis_z
syn_z
pLI
}
}
}
"""
response = requests.post(
url,
json={"query": query, "variables": {"gene_symbol": gene_symbol, "reference_genome": reference_genome}}
)
return response.json()
# Example
result = query_gnomad_constraint("KCNQ2")
gene = result["data"]["gene"]
constraint = gene["gnomad_constraint"]
print(f"Gene: {gene['gene_symbol']}")
print(f" pLI: {constraint['pLI']:.3f} (>0.9 = LoF intolerant)")
print(f" LOEUF: {constraint['oe_lof_upper']:.3f} (<0.35 = highly constrained)")
print(f" Obs/Exp LoF: {constraint['oe_lof']:.3f}")
print(f" Missense Z: {constraint['mis_z']:.3f}")
```
**Constraint score interpretation:**
| Score | Range | Meaning |
|-------|-------|---------|
| `pLI` | 01 | Probability of LoF intolerance; >0.9 = highly intolerant |
| `LOEUF` | 0∞ | LoF observed/expected upper bound; <0.35 = constrained |
| `oe_lof` | 0∞ | Observed/expected ratio for LoF variants |
| `mis_z` | −∞ to ∞ | Missense constraint z-score; >3.09 = constrained |
| `syn_z` | −∞ to ∞ | Synonymous z-score (control; should be near 0) |
### 5. Population Frequency Analysis
```python
import requests
import pandas as pd
def get_population_frequencies(variant_id, dataset="gnomad_r4"):
"""Extract per-population allele frequencies for a variant."""
url = "https://gnomad.broadinstitute.org/api"
query = """
query PopFreqs($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
genome {
populations {
id
ac
an
af
ac_hom
}
}
}
}
"""
response = requests.post(
url,
json={"query": query, "variables": {"variantId": variant_id, "dataset": dataset}}
)
data = response.json()
populations = data["data"]["variant"]["genome"]["populations"]
df = pd.DataFrame(populations)
df = df[df["an"] > 0].copy()
df["af"] = df["ac"] / df["an"]
df = df.sort_values("af", ascending=False)
return df
# Population IDs in gnomAD v4:
# afr = African/African American
# ami = Amish
# amr = Admixed American
# asj = Ashkenazi Jewish
# eas = East Asian
# fin = Finnish
# mid = Middle Eastern
# nfe = Non-Finnish European
# sas = South Asian
# remaining = Other
```
### 6. Structural Variants (gnomAD-SV)
gnomAD also contains a structural variant dataset:
```python
import requests
def query_gnomad_sv(gene_symbol):
"""Query structural variants overlapping a gene."""
url = "https://gnomad.broadinstitute.org/api"
query = """
query SVsByGene($gene_symbol: String!) {
gene(gene_symbol: $gene_symbol, reference_genome: GRCh38) {
structural_variants {
variant_id
type
chrom
pos
end
af
ac
an
}
}
}
"""
response = requests.post(url, json={"query": query, "variables": {"gene_symbol": gene_symbol}})
return response.json()
```
## Query Workflows
### Workflow 1: Variant Pathogenicity Assessment
1. **Check population frequency** — Is the variant rare enough to be pathogenic?
- Use gnomAD AF < 1% for recessive, < 0.1% for dominant conditions
- Check ancestry-specific frequencies (a variant rare overall may be common in one population)
2. **Assess functional impact** — LoF variants have highest prior probability
- Check `lof` field: `HC` = high-confidence LoF, `LC` = low-confidence
- Check `lof_flags` for issues like "NAGNAG_SITE", "PHYLOCSF_WEAK"
3. **Apply ACMG criteria:**
- BA1: AF > 5% → Benign Stand-Alone
- BS1: AF > disease prevalence threshold → Benign Supporting
- PM2: Absent or very rare in gnomAD → Pathogenic Moderate
### Workflow 2: Gene Prioritization in Rare Disease
1. Query constraint scores for candidate genes
2. Filter for pLI > 0.9 (haploinsufficient) or LOEUF < 0.35
3. Cross-reference with observed LoF variants in the gene
4. Integrate with ClinVar and disease databases
### Workflow 3: Population Genetics Research
1. Identify variant of interest from GWAS or clinical data
2. Query per-population frequencies
3. Compare frequency differences across ancestries
4. Test for enrichment in specific founder populations
## Best Practices
- **Use gnomAD v4 (gnomad_r4)** for the most current data; use v2 (gnomad_r2_1) only for GRCh37 compatibility
- **Handle null responses**: Variants not observed in gnomAD are not necessarily pathogenic — absence is informative
- **Distinguish exome vs. genome data**: Genome data has more uniform coverage; exome data is larger but may have coverage gaps
- **Rate limit GraphQL queries**: Add delays between requests; batch queries when possible
- **Homozygous counts** (`ac_hom`) are relevant for recessive disease analysis
- **LOEUF is preferred over pLI** for gene constraint (less sensitive to sample size)
## Data Access
- **Browser**: https://gnomad.broadinstitute.org/ — interactive variant and gene browsing
- **GraphQL API**: https://gnomad.broadinstitute.org/api — programmatic access
- **Downloads**: https://gnomad.broadinstitute.org/downloads — VCF, Hail tables, constraint tables
- **Google Cloud**: gs://gcp-public-data--gnomad/
## Additional Resources
- **gnomAD website**: https://gnomad.broadinstitute.org/
- **gnomAD blog**: https://gnomad.broadinstitute.org/news
- **Downloads**: https://gnomad.broadinstitute.org/downloads
- **API explorer**: https://gnomad.broadinstitute.org/api (interactive GraphiQL)
- **Constraint documentation**: https://gnomad.broadinstitute.org/help/constraint
- **Citation**: Karczewski KJ et al. (2020) Nature. PMID: 32461654; Chen S et al. (2024) Nature. PMID: 38conservation
- **GitHub**: https://github.com/broadinstitute/gnomad-browser

View File

@@ -0,0 +1,219 @@
# gnomAD GraphQL Query Reference
## API Endpoint
```
POST https://gnomad.broadinstitute.org/api
Content-Type: application/json
Body: { "query": "<graphql_query>", "variables": { ... } }
```
## Dataset Identifiers
| ID | Description | Reference Genome |
|----|-------------|-----------------|
| `gnomad_r4` | gnomAD v4 exomes (730K individuals) | GRCh38 |
| `gnomad_r4_genomes` | gnomAD v4 genomes (76K individuals) | GRCh38 |
| `gnomad_r3` | gnomAD v3 genomes (76K individuals) | GRCh38 |
| `gnomad_r2_1` | gnomAD v2 exomes (125K individuals) | GRCh37 |
| `gnomad_r2_1_non_cancer` | v2 non-cancer subset | GRCh37 |
| `gnomad_cnv_r4` | Copy number variants | GRCh38 |
## Core Query Templates
### 1. Variants in a Gene
```graphql
query GeneVariants($gene_symbol: String!, $dataset: DatasetId!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
gene_symbol
chrom
start
stop
variants(dataset: $dataset) {
variant_id
pos
ref
alt
consequence
lof
lof_flags
lof_filter
genome {
af
ac
an
ac_hom
populations { id ac an af ac_hom }
}
exome {
af
ac
an
ac_hom
populations { id ac an af ac_hom }
}
rsids
clinvar_variation_id
in_silico_predictors { id value flags }
}
}
}
```
### 2. Single Variant Lookup
```graphql
query VariantDetails($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
chrom
pos
ref
alt
consequence
lof
lof_flags
rsids
genome { af ac an ac_hom populations { id ac an af } }
exome { af ac an ac_hom populations { id ac an af } }
in_silico_predictors { id value flags }
clinvar_variation_id
}
}
```
**Variant ID format:** `{chrom}-{pos}-{ref}-{alt}` (e.g., `17-43094692-G-A`)
### 3. Gene Constraint
```graphql
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
gene_symbol
gnomad_constraint {
exp_lof exp_mis exp_syn
obs_lof obs_mis obs_syn
oe_lof oe_mis oe_syn
oe_lof_lower oe_lof_upper
oe_mis_lower oe_mis_upper
lof_z mis_z syn_z
pLI
flags
}
}
}
```
### 4. Region Query (by genomic position)
```graphql
query RegionVariants($chrom: String!, $start: Int!, $stop: Int!, $dataset: DatasetId!, $reference_genome: ReferenceGenomeId!) {
region(chrom: $chrom, start: $start, stop: $stop, reference_genome: $reference_genome) {
variants(dataset: $dataset) {
variant_id
pos
ref
alt
consequence
genome { af ac an }
exome { af ac an }
}
}
}
```
### 5. ClinVar Variants in Gene
```graphql
query ClinVarVariants($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
clinvar_variants {
variant_id
pos
ref
alt
clinical_significance
clinvar_variation_id
gold_stars
major_consequence
in_gnomad
gnomad_exomes { ac an af }
}
}
}
```
## Population IDs
| ID | Population |
|----|-----------|
| `afr` | African/African American |
| `ami` | Amish |
| `amr` | Admixed American |
| `asj` | Ashkenazi Jewish |
| `eas` | East Asian |
| `fin` | Finnish |
| `mid` | Middle Eastern |
| `nfe` | Non-Finnish European |
| `sas` | South Asian |
| `remaining` | Other/Unassigned |
| `XX` | Female (appended to above, e.g., `afr_XX`) |
| `XY` | Male |
## LoF Annotation Fields
| Field | Values | Meaning |
|-------|--------|---------|
| `lof` | `HC`, `LC`, `null` | High/low-confidence LoF, or not annotated as LoF |
| `lof_flags` | comma-separated strings | Quality flags (e.g., `NAGNAG_SITE`, `NON_CANONICAL_SPLICE_SITE`) |
| `lof_filter` | string or null | Reason for LC classification |
## In Silico Predictor IDs
Common values for `in_silico_predictors[].id`:
- `cadd` — CADD PHRED score
- `revel` — REVEL score
- `spliceai_ds_max` — SpliceAI max delta score
- `pangolin_largest_ds` — Pangolin splicing score
- `polyphen` — PolyPhen-2 prediction
- `sift` — SIFT prediction
## Python Helper
```python
import requests
import time
def gnomad_query(query: str, variables: dict, retries: int = 3) -> dict:
"""Execute a gnomAD GraphQL query with retry logic."""
url = "https://gnomad.broadinstitute.org/api"
headers = {"Content-Type": "application/json"}
for attempt in range(retries):
try:
response = requests.post(
url,
json={"query": query, "variables": variables},
headers=headers,
timeout=60
)
response.raise_for_status()
result = response.json()
if "errors" in result:
print(f"GraphQL errors: {result['errors']}")
return result
return result
except requests.exceptions.RequestException as e:
if attempt < retries - 1:
time.sleep(2 ** attempt) # exponential backoff
else:
raise
return {}
```

View File

@@ -0,0 +1,85 @@
# gnomAD Variant Interpretation Guide
## Allele Frequency Thresholds for Disease Interpretation
### ACMG/AMP Criteria
| Criterion | AF threshold | Classification |
|-----------|-------------|----------------|
| BA1 | > 0.05 (5%) | Benign Stand-Alone |
| BS1 | > disease prevalence | Benign Supporting |
| PM2_Supporting | < 0.0001 (0.01%) for dominant; absent for recessive | Pathogenic Moderate → Supporting |
**Notes:**
- BA1 applies to most conditions; exceptions include autosomal dominant with high penetrance (e.g., LDLR for FH: BA1 threshold is ~0.1%)
- BS1 requires knowing disease prevalence; for rare diseases (1:10,000), BS1 if AF > 0.01%
- Homozygous counts (`ac_hom`) matter for recessive diseases
### Practical Thresholds
| Inheritance | Suggested max AF |
|-------------|-----------------|
| Autosomal Dominant (high penetrance) | < 0.001 (0.1%) |
| Autosomal Dominant (reduced penetrance) | < 0.01 (1%) |
| Autosomal Recessive | < 0.01 (1%) |
| X-linked recessive | < 0.001 in females |
## Absence in gnomAD
A variant **absent in gnomAD** (ac = 0) is evidence of rarity, but interpret carefully:
- gnomAD does not capture all rare variants (sequencing depth, coverage, calling thresholds)
- A variant absent in 730K exomes is very strong evidence of rarity for PM2
- Check coverage at the position: if < 10x, absence is less informative
## Loss-of-Function Variant Assessment
### LOFTEE Classification (lof field)
- **HC (High Confidence):** Predicted to truncate functional protein
- Stop-gained, splice site (±1,2), frameshift variants
- Passes all LOFTEE quality filters
- **LC (Low Confidence):** LoF annotation with quality concerns
- Check `lof_flags` for specific reason
- May still be pathogenic — requires manual review
### Common lof_flags
| Flag | Meaning |
|------|---------|
| `NAGNAG_SITE` | Splice site may be rescued by nearby alternative site |
| `NON_CANONICAL_SPLICE_SITE` | Not a canonical splice donor/acceptor |
| `PHYLOCSF_WEAK` | Weak phylogenetic conservation signal |
| `SMALL_INTRON` | Intron too small to affect splicing |
| `SINGLE_EXON` | Single-exon gene (no splicing) |
| `LAST_EXON` | In last exon (NMD may not apply) |
## Homozygous Observations
The `ac_hom` field counts homozygous (or hemizygous in males for chrX) observations.
**For recessive diseases:**
- If a variant is observed homozygous in healthy individuals in gnomAD, it is strong evidence against pathogenicity (BS2 criterion)
- Even a single homozygous observation can be informative
## Coverage at Position
Always check that gnomAD has adequate coverage at the variant position before concluding absence is meaningful. The gnomAD browser shows coverage tracks, and coverage data can be downloaded from:
- https://gnomad.broadinstitute.org/downloads#v4-coverage
## In Silico Predictor Scores
| Predictor | Score Range | Pathogenic Threshold |
|-----------|-------------|---------------------|
| CADD PHRED | 099 | > 20 deleterious; > 30 highly deleterious |
| REVEL | 01 | > 0.75 likely pathogenic (for missense) |
| SpliceAI max_ds | 01 | > 0.5 likely splice-altering |
| SIFT | 01 | < 0.05 deleterious |
| PolyPhen-2 | 01 | > 0.909 probably damaging |
## Ancestry-Specific Considerations
- A variant rare overall may be a common founder variant in a specific population
- Always check all ancestry-specific AFs, not just the total
- Finnish and Ashkenazi Jewish populations have high rates of founder variants
- Report ancestry-specific frequencies when relevant to patient ancestry