Initial commit for opentargets-database

dfty
2026-01-28 12:45:55 +08:00
commit 1fb527f733
5 changed files with 1732 additions and 0 deletions


@@ -0,0 +1,306 @@
# Evidence Types and Data Sources
## Overview
Evidence represents any event or set of events that identifies a target as a potential causal gene or protein for a disease. Evidence is standardized and mapped to:
- **Ensembl gene IDs** for targets
- **EFO (Experimental Factor Ontology)** for diseases/phenotypes
Evidence is organized into **data types** (broader categories) and **data sources** (specific databases/studies).
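As a small illustration of the mapping described above, the sketch below models a target-disease pair and checks that both identifiers look well formed. The class name, the regular expressions, and the decision to accept any `PREFIX_digits` ontology ID are assumptions made for this example, not part of the Open Targets data model.

```python
import re
from dataclasses import dataclass

# Assumed shapes: human Ensembl gene IDs look like ENSG + 11 digits,
# ontology IDs look like PREFIX_digits (for example EFO_0000249).
ENSEMBL_GENE_RE = re.compile(r"ENSG\d{11}")
ONTOLOGY_ID_RE = re.compile(r"[A-Za-z]+_\d+")


@dataclass(frozen=True)
class EvidenceKey:
    """Hypothetical container for the two IDs every evidence record is mapped to."""
    target_id: str
    disease_id: str

    def is_well_formed(self) -> bool:
        # fullmatch rejects partial matches and trailing characters.
        return bool(
            ENSEMBL_GENE_RE.fullmatch(self.target_id)
            and ONTOLOGY_ID_RE.fullmatch(self.disease_id)
        )


key = EvidenceKey("ENSG00000157764", "EFO_0000249")
print(key.is_well_formed())
```

A check like this only catches malformed strings; it does not confirm that the ID exists in Ensembl or EFO.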
## Evidence Data Types
### 1. Genetic Association
Evidence from human genetics linking genetic variants to disease phenotypes.
#### Data Sources:
**GWAS (Genome-Wide Association Studies)**
- Population-level common variant associations
- Filtered with Locus-to-Gene (L2G) scores >0.05
- Includes fine-mapping and colocalization data
- Sources: NHGRI-EBI GWAS Catalog, FinnGen, UK Biobank
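The L2G cutoff mentioned above can be sketched as a simple filter. The record layout and the `l2g_score` field name are hypothetical, and the gene names are placeholders rather than real results.

```python
# Threshold taken from the text above (scores must exceed 0.05).
L2G_THRESHOLD = 0.05


def filter_by_l2g(rows, threshold=L2G_THRESHOLD):
    """Keep only rows whose L2G score is strictly above the threshold."""
    return [row for row in rows if row["l2g_score"] > threshold]


# Placeholder records, not real GWAS output.
candidates = [
    {"gene": "GENE_A", "l2g_score": 0.82},
    {"gene": "GENE_B", "l2g_score": 0.05},
    {"gene": "GENE_C", "l2g_score": 0.01},
]

kept = filter_by_l2g(candidates)
print([row["gene"] for row in kept])  # → ['GENE_A']
```

Note that the comparison is strict, so a score of exactly 0.05 is dropped in this sketch.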
**Gene Burden Tests**
- Rare variant association analyses
- Aggregate effects of multiple rare variants in a gene
- Particularly relevant for Mendelian and rare diseases
**ClinVar Germline**
- Clinical variant interpretations
- Classifications: pathogenic, likely pathogenic, variant of uncertain significance (VUS), benign
- Expert-reviewed variant-disease associations
**Genomics England PanelApp**
- Expert gene-disease ratings
- Green (high evidence), amber (moderate evidence), red (low evidence)
- Focus on rare diseases and cancer
**Gene2Phenotype**
- Curated gene-disease relationships
- Allelic requirements and inheritance patterns
- Clinical validity assessments
**UniProt Literature & Variants**
- Literature-based gene-disease associations
- Expert-curated from scientific publications
**Orphanet**
- Rare disease gene associations
- Expert-reviewed and maintained
**ClinGen**
- Clinical genome resource classifications
- Gene-disease validity assertions
### 2. Somatic Mutations
Evidence from cancer genomics identifying driver genes and therapeutic targets.
#### Data Sources:
**Cancer Gene Census**
- Expert-curated cancer genes
- Tier classifications (1 = strong evidence, 2 = emerging)
- Mutation types and cancer types
**IntOGen**
- Computational driver gene predictions
- Aggregated from large cohort studies
- Statistical significance of mutations
**ClinVar Somatic**
- Somatic clinical variant interpretations
- Oncogenic/likely oncogenic classifications
**Cancer Biomarkers**
- FDA/EMA approved biomarkers
- Clinical trial biomarkers
- Prognostic and predictive markers
### 3. Known Drugs
Evidence from clinical precedence showing drugs targeting genes for disease indications.
#### Data Source:
**ChEMBL**
- Approved drugs (Phase 4)
- Clinical candidates (Phase 1-3)
- Withdrawn drugs
- Drug-target-indication triplets with mechanism of action
**Clinical Trial Information:**
- `phase`: Maximum clinical trial phase (1, 2, 3, 4)
- `status`: Active, terminated, completed, withdrawn
- `mechanismOfAction`: How drug affects target
### 4. Affected Pathways
Evidence linking genes to disease through pathway perturbations and functional screens.
#### Data Sources:
**CRISPR Screens**
- Genome-scale knockout screens
- Cancer dependency and essentiality data
**Project Score (Cancer Dependency Map)**
- CRISPR-Cas9 fitness screens across cancer cell lines
- Gene essentiality profiles
**SLAPenrich**
- Pathway enrichment analysis
- Somatic mutation pathway impacts
**PROGENy**
- Pathway activity inference
- Signaling pathway perturbations
**Reactome**
- Expert-curated pathway annotations
- Biological pathway representations
**Gene Signatures**
- Expression-based signatures
- Pathway activity patterns
### 5. RNA Expression
Evidence from differential gene expression in disease vs. control tissues.
#### Data Source:
**Expression Atlas**
- Differential expression data
- Baseline expression across tissues/conditions
- RNA-Seq and microarray studies
- Log2 fold-change and p-values
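For readers unfamiliar with the log2 fold-change reported in differential expression results, here is a minimal worked sketch. The function name and the pseudocount are assumptions for illustration; Expression Atlas derives its values from full statistical pipelines, not from a two-number ratio like this.

```python
import math


def log2_fold_change(disease_mean, control_mean, pseudocount=1.0):
    """Log2 ratio of mean expression in disease vs. control.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not expressed in one condition.
    """
    return math.log2((disease_mean + pseudocount) / (control_mean + pseudocount))


# (31 + 1) / (7 + 1) = 4, and log2(4) = 2: a four-fold increase in disease.
print(log2_fold_change(31, 7))  # → 2.0
# Swapping the inputs gives the same magnitude with the opposite sign.
print(log2_fold_change(7, 31))  # → -2.0
```

Positive values indicate higher expression in disease tissue, negative values lower expression.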
### 6. Animal Models
Evidence from in vivo studies showing phenotypes associated with gene perturbations.
#### Data Source:
**IMPC (International Mouse Phenotyping Consortium)**
- Systematic mouse knockout phenotypes
- Phenotype-disease mappings via ontologies
- Standardized phenotyping procedures
### 7. Literature
Evidence from text-mining of biomedical literature.
#### Data Source:
**Europe PMC**
- Co-occurrence of genes and diseases in abstracts
- Normalized citation counts
- Weighted by publication type and recency
## Evidence Scoring
Each evidence source has its own scoring methodology:
### Score Ranges
- Most scores normalized to 0-1 range
- Higher scores indicate stronger evidence
- Scores are NOT confidence levels but relative strength indicators
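To make the idea of a 0-1 range concrete, the sketch below shows one common way to map a raw value onto that range: linear rescaling with clipping. The function name and the bounds are hypothetical, and each data source applies its own method, so treat this as an illustration rather than the scoring used by any particular source.

```python
def rescale(value, low, high):
    """Linearly map value from [low, high] onto [0, 1], clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)


print(rescale(5, 0, 10))   # → 0.5
print(rescale(-3, 0, 10))  # → 0.0
print(rescale(42, 0, 10))  # → 1.0
```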
### Common Scoring Approaches:
**Categorical Classifications:**
- ClinVar: Pathogenic (1.0), Likely pathogenic (0.99), etc.
- Gene2Phenotype: Confirmed/probable ratings
- PanelApp: Green/amber/red classifications
**Statistical Measures:**
- GWAS: L2G scores incorporating multiple lines of evidence
- Gene Burden: Statistical significance of variant aggregation
- Expression: Adjusted p-values and fold-changes
**Clinical Precedence:**
- Known Drugs: Phase weights (Phase 4 = 1.0, Phase 3 = 0.8, etc.)
- Clinical status modifiers
**Computational Predictions:**
- IntOGen: Q-values from driver mutation analysis
- PROGENy/SLAPenrich: Pathway activity/enrichment scores
## Evidence Interpretation Guidelines
### Strengths by Data Type
**Genetic Association** - Strongest human genetic evidence
- Direct link between genetic variation and disease
- Mendelian diseases: high confidence
- GWAS: requires L2G scoring to prioritize the likely causal gene at each locus
- Consider ancestry and population-specific effects
**Somatic Mutations** - Direct evidence in cancer
- Strong for oncology indications
- Driver mutations indicate therapeutic potential
- Consider cancer type specificity
**Known Drugs** - Clinical validation
- Highest confidence: approved drugs (Phase 4)
- Consider mechanism relevance to new indication
- Phase 1-2: early evidence, higher risk
**Affected Pathways** - Mechanistic insights
- Supports biological plausibility
- May not predict clinical success
- Useful for hypothesis generation
**RNA Expression** - Observational evidence
- Correlation, not causation
- May reflect a consequence of the disease rather than a cause
- Useful for biomarker identification
**Animal Models** - Translational evidence
- Strong for understanding biology
- Variable translation to human disease
- Most useful when phenotype matches human disease
**Literature** - Exploratory signal
- Text-mining captures research focus
- May reflect publication bias
- Requires manual literature review for validation
### Important Considerations
1. **Multiple evidence types strengthen confidence** - Convergent evidence from different data types provides stronger support
2. **Under-studied diseases score lower** - Novel or rare diseases may have strong evidence but lower aggregate scores due to limited research
3. **Association scores are not probabilities** - Scores rank relative evidence strength, not success probability
4. **Context matters** - Evidence strength depends on:
   - Disease mechanism understanding
   - Target biology and druggability
   - Clinical precedence in related indications
   - Safety considerations
5. **Data source reliability varies** - Weight expert-curated sources (ClinGen, Gene2Phenotype) more heavily than computational predictions
## Using Evidence in Queries
### Filtering by Data Type
```python
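import requests

# Public Open Targets Platform GraphQL endpoint.
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Argument names (e.g. `datatypes`) are kept as written in this guide;
# check them against the live schema before relying on this query.
query = """
query evidenceByType($ensemblId: String!, $efoId: String!, $dataTypes: [String!]) {
  disease(efoId: $efoId) {
    evidences(ensemblIds: [$ensemblId], datatypes: $dataTypes) {
      rows {
        datasourceId
        score
      }
    }
  }
}
"""

variables = {
    "ensemblId": "ENSG00000157764",
    "efoId": "EFO_0000249",
    "dataTypes": ["genetic_association", "somatic_mutation"],
}

# Send the query; raise on HTTP errors so failures are not silently ignored.
response = requests.post(API_URL, json={"query": query, "variables": variables}, timeout=30)
response.raise_for_status()
rows = response.json()["data"]["disease"]["evidences"]["rows"]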
```
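Once rows come back, a typical next step is to summarize them per data source. The sketch below keeps the highest score seen for each `datasourceId`. The response dictionary is made up to match the shape requested by the query above (it is not real Open Targets data), and the helper name is hypothetical.

```python
from collections import defaultdict

# Made-up response in the shape the query above requests; not real data.
sample_response = {
    "data": {
        "disease": {
            "evidences": {
                "rows": [
                    {"datasourceId": "source_a", "score": 0.9},
                    {"datasourceId": "source_b", "score": 0.4},
                    {"datasourceId": "source_a", "score": 0.6},
                ]
            }
        }
    }
}


def max_score_by_source(response):
    """Return the highest evidence score observed for each data source."""
    rows = response["data"]["disease"]["evidences"]["rows"]
    best = defaultdict(float)  # scores are non-negative, so 0.0 is a safe start
    for row in rows:
        source = row["datasourceId"]
        best[source] = max(best[source], row["score"])
    return dict(best)


summary = max_score_by_source(sample_response)
print(summary)  # → {'source_a': 0.9, 'source_b': 0.4}
```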
### Accessing Data Type Scores
Data type scores aggregate all source scores within that type:
```python
query = """
query associationScores($ensemblId: String!, $efoId: String!) {
  target(ensemblId: $ensemblId) {
    associatedDiseases(efoIds: [$efoId]) {
      rows {
        disease {
          name
        }
        score
        datatypeScores {
          componentId
          score
        }
      }
    }
  }
}
"""
```
```
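To give a feel for how several scores can be aggregated into one, here is a minimal sketch of a harmonic sum, the general approach Open Targets describes for combining evidence scores: scores are sorted in descending order and each is divided by the square of its rank, so additional weaker evidence adds progressively less. The normalization by π²/6 and the absence of per-source weights are simplifying assumptions of this sketch; consult the current platform documentation for the exact procedure.

```python
import math

# Upper limit of sum(1 / i**2) as the number of scores grows (assumed normalizer).
MAX_HARMONIC_SUM = math.pi ** 2 / 6


def harmonic_sum(scores):
    """Sum of scores sorted descending, each divided by its squared rank."""
    ordered = sorted(scores, reverse=True)
    return sum(score / (rank ** 2) for rank, score in enumerate(ordered, start=1))


def normalized_harmonic_sum(scores):
    """Harmonic sum scaled so that the result stays below 1."""
    return harmonic_sum(scores) / MAX_HARMONIC_SUM


# 1.0 / 1 + 0.5 / 4 = 1.125: the second score contributes only a quarter of its value.
print(harmonic_sum([0.5, 1.0]))  # → 1.125
print(normalized_harmonic_sum([1.0, 1.0, 1.0]))
```

This damping is one reason a single strong piece of evidence can outweigh many weak ones.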
## Evidence Quality Assessment
When evaluating evidence:
1. **Check multiple sources** - Single source may be unreliable
2. **Prioritize human genetic evidence** - Strongest disease relevance
3. **Consider clinical precedence** - Known drugs indicate druggability
4. **Assess mechanistic support** - Pathway evidence supports biology
5. **Review literature manually** - For critical decisions, read primary publications
6. **Validate in primary databases** - Cross-reference with ClinVar, ClinGen, etc.
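Parts of this checklist can be automated as a first pass. The sketch below takes a mapping of data type ID to score and reports which types support the association, whether genetic or clinical evidence is present, and whether at least two types converge. The function name, the `min_score` cutoff of 0.1, and the `known_drug` and `literature` IDs are assumptions for illustration; only `genetic_association` and `somatic_mutation` appear as IDs elsewhere in this guide.

```python
def assess_convergence(datatype_scores, min_score=0.1):
    """Summarize which data types support a target-disease association.

    min_score is a hypothetical cutoff chosen for this example, not an
    Open Targets recommendation.
    """
    supported = sorted(
        datatype for datatype, score in datatype_scores.items() if score >= min_score
    )
    return {
        "supporting_types": supported,
        "has_genetic_evidence": "genetic_association" in supported,
        "has_clinical_precedence": "known_drug" in supported,
        "convergent": len(supported) >= 2,
    }


# Made-up scores for illustration only.
report = assess_convergence(
    {"genetic_association": 0.7, "known_drug": 0.05, "literature": 0.3}
)
print(report["supporting_types"])  # → ['genetic_association', 'literature']
print(report["convergent"])        # → True
```

A summary like this does not replace steps 5 and 6; it only helps decide where manual review is most needed.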