Initial commit for opentargets-database
references/evidence_types.md
# Evidence Types and Data Sources

## Overview

Evidence represents any event or set of events that identifies a target as a potential causal gene or protein for a disease. Evidence is standardized and mapped to:

- **Ensembl gene IDs** for targets
- **EFO (Experimental Factor Ontology)** for diseases/phenotypes

Evidence is organized into **data types** (broader categories) and **data sources** (specific databases/studies).

## Evidence Data Types

### 1. Genetic Association

Evidence from human genetics linking genetic variants to disease phenotypes.

#### Data Sources:

**GWAS (Genome-Wide Association Studies)**
- Population-level common variant associations
- Filtered to Locus-to-Gene (L2G) scores > 0.05
- Includes fine-mapping and colocalization data
- Sources: GWAS Catalog, FinnGen, UK Biobank
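
The L2G cutoff mentioned for GWAS evidence can be pictured as a simple filter. The sketch below is illustrative only: the record layout, the `l2g_score` key, and the gene labels are assumptions made for this example, not the platform's data model.

```python
# Minimal sketch of an L2G threshold filter.
# The record layout and the "l2g_score" key are hypothetical.
L2G_THRESHOLD = 0.05  # cutoff stated in this reference


def filter_by_l2g(assignments, threshold=L2G_THRESHOLD):
    """Keep gene assignments whose L2G score is strictly above the threshold."""
    return [a for a in assignments if a["l2g_score"] > threshold]


candidates = [
    {"gene": "GENE_A", "l2g_score": 0.62},
    {"gene": "GENE_B", "l2g_score": 0.05},  # equal to the cutoff, so dropped
    {"gene": "GENE_C", "l2g_score": 0.01},
]

kept = filter_by_l2g(candidates)
print([a["gene"] for a in kept])
```

Whether the boundary value itself should pass is a design choice; the strict comparison here follows the ">" used in the text.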

**Gene Burden Tests**
- Rare variant association analyses
- Aggregate effects of multiple rare variants in a gene
- Particularly relevant for Mendelian and rare diseases

**ClinVar Germline**
- Clinical variant interpretations
- Classifications: pathogenic, likely pathogenic, VUS (variant of uncertain significance), benign
- Expert-reviewed variant-disease associations

**Genomics England PanelApp**
- Expert gene-disease ratings
- Green (high evidence), amber (moderate evidence), red (low evidence)
- Focus on rare diseases and cancer

**Gene2Phenotype**
- Curated gene-disease relationships
- Allelic requirements and inheritance patterns
- Clinical validity assessments

**UniProt Literature & Variants**
- Literature-based gene-disease associations
- Expert-curated from scientific publications

**Orphanet**
- Rare disease gene associations
- Expert-reviewed and maintained

**ClinGen**
- Clinical Genome Resource classifications
- Gene-disease validity assertions

### 2. Somatic Mutations

Evidence from cancer genomics identifying driver genes and therapeutic targets.

#### Data Sources:

**Cancer Gene Census**
- Expert-curated cancer genes
- Tier classifications (1 = strong evidence, 2 = emerging)
- Mutation types and cancer types

**IntOGen**
- Computational driver gene predictions
- Aggregated from large cohort studies
- Statistical significance of mutations

**ClinVar Somatic**
- Somatic clinical variant interpretations
- Oncogenic/likely oncogenic classifications

**Cancer Biomarkers**
- FDA/EMA approved biomarkers
- Clinical trial biomarkers
- Prognostic and predictive markers

### 3. Known Drugs

Evidence from clinical precedence showing drugs targeting genes for disease indications.

#### Data Source:

**ChEMBL**
- Approved drugs (Phase 4)
- Clinical candidates (Phase 1-3)
- Withdrawn drugs
- Drug-target-indication triplets with mechanism of action

**Clinical Trial Information:**
- `phase`: Maximum clinical trial phase (1, 2, 3, 4)
- `status`: Active, terminated, completed, withdrawn
- `mechanismOfAction`: How drug affects target
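
The three clinical trial fields listed above can be held in a small record type for local filtering. This is a sketch under assumptions: the class name, the `drug_name` field, the snake_case attribute names, and the sample records are all made up for illustration and do not reproduce the API's response shape.

```python
from dataclasses import dataclass


@dataclass
class KnownDrugEvidence:
    """Hypothetical container for one drug-target-indication record."""
    drug_name: str            # illustrative field, not taken from the reference
    phase: int                # maximum clinical trial phase (1, 2, 3, 4)
    status: str               # e.g. "Active", "Terminated", "Completed", "Withdrawn"
    mechanism_of_action: str  # mirrors the `mechanismOfAction` field


def late_stage(records, min_phase=3):
    """Return records whose maximum phase is at least min_phase."""
    return [r for r in records if r.phase >= min_phase]


records = [
    KnownDrugEvidence("drug_a", 4, "Completed", "Kinase inhibitor"),
    KnownDrugEvidence("drug_b", 2, "Active", "Receptor antagonist"),
    KnownDrugEvidence("drug_c", 3, "Terminated", "Kinase inhibitor"),
]

print([r.drug_name for r in late_stage(records)])
```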

### 4. Affected Pathways

Evidence linking genes to disease through pathway perturbations and functional screens.

#### Data Sources:

**CRISPR Screens**
- Genome-scale knockout screens
- Cancer dependency and essentiality data

**Project Score (Cancer Dependency Map)**
- CRISPR-Cas9 fitness screens across cancer cell lines
- Gene essentiality profiles

**SLAPenrich**
- Pathway enrichment analysis
- Somatic mutation pathway impacts

**PROGENy**
- Pathway activity inference
- Signaling pathway perturbations

**Reactome**
- Expert-curated pathway annotations
- Biological pathway representations

**Gene Signatures**
- Expression-based signatures
- Pathway activity patterns

### 5. RNA Expression

Evidence from differential gene expression in disease vs. control tissues.

#### Data Source:

**Expression Atlas**
- Differential expression data
- Baseline expression across tissues/conditions
- RNA-Seq and microarray studies
- Log2 fold-change and p-values
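
As a reminder of what a log2 fold-change expresses, the sketch below computes it from two mean expression values. The function name and the pseudocount of 1.0 (a common guard against division by zero) are assumptions for this example; this is not Expression Atlas's own pipeline.

```python
import math


def log2_fold_change(disease_mean, control_mean, pseudocount=1.0):
    """log2 of the disease/control expression ratio, with a pseudocount on both terms."""
    return math.log2((disease_mean + pseudocount) / (control_mean + pseudocount))


up = log2_fold_change(7.0, 1.0)    # (7 + 1) / (1 + 1) = 4, so log2 is 2.0
down = log2_fold_change(1.0, 7.0)  # reciprocal ratio, so log2 is -2.0
print(up, down)
```

A positive value means higher expression in disease tissue and a negative value means lower; each unit corresponds to a doubling or halving.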

### 6. Animal Models

Evidence from in vivo studies showing phenotypes associated with gene perturbations.

#### Data Source:

**IMPC (International Mouse Phenotyping Consortium)**
- Systematic mouse knockout phenotypes
- Phenotype-disease mappings via ontologies
- Standardized phenotyping procedures

### 7. Literature

Evidence from text-mining of biomedical literature.

#### Data Source:

**Europe PMC**
- Co-occurrence of genes and diseases in abstracts
- Normalized citation counts
- Weighted by publication type and recency
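
The grouping of sources into data types described in this section can be captured as a small lookup table. This is a sketch, not an official identifier list: `genetic_association` and `somatic_mutation` are the data type names used in the query example in this reference, the remaining keys are assumed to follow the same naming pattern, and the source labels are the display names used here rather than API identifiers.

```python
# Data types mapped to the sources listed in this reference.
# Keys other than the first two are assumed names; values are display labels.
DATA_TYPE_SOURCES = {
    "genetic_association": [
        "GWAS", "Gene Burden Tests", "ClinVar Germline",
        "Genomics England PanelApp", "Gene2Phenotype",
        "UniProt Literature & Variants", "Orphanet", "ClinGen",
    ],
    "somatic_mutation": [
        "Cancer Gene Census", "IntOGen", "ClinVar Somatic", "Cancer Biomarkers",
    ],
    "known_drug": ["ChEMBL"],
    "affected_pathway": [
        "CRISPR Screens", "Project Score", "SLAPenrich",
        "PROGENy", "Reactome", "Gene Signatures",
    ],
    "rna_expression": ["Expression Atlas"],
    "animal_model": ["IMPC"],
    "literature": ["Europe PMC"],
}


def data_type_of(source):
    """Return the data type a source label belongs to, or None if it is not listed."""
    for data_type, sources in DATA_TYPE_SOURCES.items():
        if source in sources:
            return data_type
    return None


print(data_type_of("IntOGen"))
```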

## Evidence Scoring

Each evidence source has its own scoring methodology.

### Score Ranges
- Most scores are normalized to the 0-1 range
- Higher scores indicate stronger evidence
- Scores are NOT confidence levels but relative strength indicators

### Common Scoring Approaches:

**Categorical Classifications:**
- ClinVar: Pathogenic (1.0), Likely pathogenic (0.99), etc.
- Gene2Phenotype: Confirmed/probable ratings
- PanelApp: Green/amber/red classifications

**Statistical Measures:**
- GWAS: L2G scores incorporating multiple lines of evidence
- Gene Burden: Statistical significance of variant aggregation
- Expression: Adjusted p-values and fold-changes

**Clinical Precedence:**
- Known Drugs: Phase weights (Phase 4 = 1.0, Phase 3 = 0.8, etc.)
- Clinical status modifiers

**Computational Predictions:**
- IntOGen: Q-values from driver mutation analysis
- PROGENy/SLAPenrich: Pathway activity/enrichment scores
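
The score values quoted above can be written down as lookup tables. The sketch below includes only the values this reference states explicitly; anything covered by "etc." is deliberately left out and returns `None` rather than a guessed number. The table and function names are assumptions for illustration.

```python
# Only the values stated in this reference; unstated entries are not filled in.
CLINVAR_SCORES = {"pathogenic": 1.0, "likely pathogenic": 0.99}
PHASE_WEIGHTS = {4: 1.0, 3: 0.8}


def lookup_score(table, key):
    """Return the stated score for key, or None when no value is given."""
    if isinstance(key, str):
        key = key.lower()  # make classification labels case-insensitive
    return table.get(key)


print(lookup_score(CLINVAR_SCORES, "Pathogenic"))
print(lookup_score(PHASE_WEIGHTS, 3))
print(lookup_score(PHASE_WEIGHTS, 2))  # not stated here, so None
```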

## Evidence Interpretation Guidelines

### Strengths by Data Type

**Genetic Association** - Strongest human genetic evidence
- Direct link between genetic variation and disease
- Mendelian diseases: high confidence
- GWAS: requires L2G scoring to prioritize the likely causal gene
- Consider ancestry and population-specific effects

**Somatic Mutations** - Direct evidence in cancer
- Strong for oncology indications
- Driver mutations indicate therapeutic potential
- Consider cancer type specificity

**Known Drugs** - Clinical validation
- Highest confidence: approved drugs (Phase 4)
- Consider mechanism relevance to new indication
- Phase 1-2: early evidence, higher risk

**Affected Pathways** - Mechanistic insights
- Supports biological plausibility
- May not predict clinical success
- Useful for hypothesis generation

**RNA Expression** - Observational evidence
- Correlation, not causation
- May reflect a consequence of disease rather than a cause
- Useful for biomarker identification

**Animal Models** - Translational evidence
- Strong for understanding biology
- Variable translation to human disease
- Most useful when phenotype matches human disease

**Literature** - Exploratory signal
- Text-mining captures research focus
- May reflect publication bias
- Requires manual literature review for validation

### Important Considerations

1. **Multiple evidence types strengthen confidence** - Convergent evidence from different data types provides stronger support

2. **Under-studied diseases score lower** - Novel or rare diseases may have strong evidence but lower aggregate scores due to limited research

3. **Association scores are not probabilities** - Scores rank relative evidence strength, not success probability

4. **Context matters** - Evidence strength depends on:
   - Disease mechanism understanding
   - Target biology and druggability
   - Clinical precedence in related indications
   - Safety considerations

5. **Data source reliability varies** - Weight expert-curated sources (ClinGen, Gene2Phenotype) higher than computational predictions
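
One simple way to look for convergent evidence is to count how many data types contribute a score above some floor. The sketch below assumes a plain dict of data type name to score; the function name, the `min_score` floor of 0.1, and the sample numbers are illustrative choices, not platform defaults.

```python
def supporting_data_types(datatype_scores, min_score=0.0):
    """Return the sorted data types whose score is strictly above min_score."""
    return sorted(dt for dt, score in datatype_scores.items() if score > min_score)


# Made-up scores for one hypothetical target-disease pair.
scores = {
    "genetic_association": 0.71,
    "known_drug": 0.40,
    "rna_expression": 0.05,
    "literature": 0.0,
}

supported = supporting_data_types(scores, min_score=0.1)
print(len(supported), supported)
```

A count like this only flags convergence; it does not replace weighing which sources are behind each score.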

## Using Evidence in Queries

### Filtering by Data Type

```python
# GraphQL query: evidence rows for one target-disease pair,
# restricted to the requested data types.
query = """
  query evidenceByType($ensemblId: String!, $efoId: String!, $dataTypes: [String!]) {
    disease(efoId: $efoId) {
      evidences(ensemblIds: [$ensemblId], datatypes: $dataTypes) {
        rows {
          datasourceId
          score
        }
      }
    }
  }
"""

# Values substituted for the $-prefixed query variables.
variables = {
    "ensemblId": "ENSG00000157764",
    "efoId": "EFO_0000249",
    "dataTypes": ["genetic_association", "somatic_mutation"],
}
```
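
Once a response comes back, the rows can be grouped by source. The sketch below assumes the response mirrors the shape of the query above, wrapped in the usual GraphQL `data` envelope; the function name and the sample payload (including the `source_a` and `source_b` labels) are made up for illustration.

```python
from collections import defaultdict


def scores_by_source(response):
    """Group evidence scores by datasourceId, keeping row order within each source."""
    rows = response["data"]["disease"]["evidences"]["rows"]
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["datasourceId"]].append(row["score"])
    return dict(grouped)


# Hypothetical response payload, shaped like the query's selection set.
sample_response = {
    "data": {
        "disease": {
            "evidences": {
                "rows": [
                    {"datasourceId": "source_a", "score": 0.9},
                    {"datasourceId": "source_b", "score": 0.4},
                    {"datasourceId": "source_a", "score": 0.7},
                ]
            }
        }
    }
}

print(scores_by_source(sample_response))
```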

### Accessing Data Type Scores

Data type scores aggregate all source scores within that type:

```python
# GraphQL query: overall association score plus the per-data-type breakdown.
query = """
  query associationScores($ensemblId: String!, $efoId: String!) {
    target(ensemblId: $ensemblId) {
      associatedDiseases(efoIds: [$efoId]) {
        rows {
          disease {
            name
          }
          score
          datatypeScores {
            componentId
            score
          }
        }
      }
    }
  }
"""
```
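
To see which data types drive an association, the `datatypeScores` entries of one row can be sorted by score. The sketch below assumes a row shaped like the selection set in the query above; the function name and the sample values are hypothetical.

```python
def rank_datatype_scores(row):
    """Return (componentId, score) pairs from one association row, highest score first."""
    pairs = [(c["componentId"], c["score"]) for c in row["datatypeScores"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up association row for illustration.
sample_row = {
    "disease": {"name": "example disease"},
    "score": 0.55,
    "datatypeScores": [
        {"componentId": "literature", "score": 0.12},
        {"componentId": "genetic_association", "score": 0.80},
        {"componentId": "known_drug", "score": 0.45},
    ],
}

for component_id, score in rank_datatype_scores(sample_row):
    print(f"{component_id}: {score:.2f}")
```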

## Evidence Quality Assessment

When evaluating evidence:

1. **Check multiple sources** - A single source may be unreliable
2. **Prioritize human genetic evidence** - Strongest disease relevance
3. **Consider clinical precedence** - Known drugs indicate druggability
4. **Assess mechanistic support** - Pathway evidence supports biology
5. **Review literature manually** - For critical decisions, read primary publications
6. **Validate in primary databases** - Cross-reference with ClinVar, ClinGen, etc.