opentargets-database/references/evidence_types.md
2026-01-28 12:45:55 +08:00

Evidence Types and Data Sources

Overview

Evidence represents any event or set of events that identifies a target as a potential causal gene or protein for a disease. Evidence is standardized and mapped to:

  • Ensembl gene IDs for targets
  • EFO (Experimental Factor Ontology) for diseases/phenotypes

Evidence is organized into data types (broader categories) and data sources (specific databases/studies).

Evidence Data Types

1. Genetic Association

Evidence from human genetics linking genetic variants to disease phenotypes.

Data Sources:

GWAS (Genome-Wide Association Studies)

  • Population-level common variant associations
  • Filtered with Locus-to-Gene (L2G) scores >0.05
  • Includes fine-mapping and colocalization data
  • Sources: GWAS Catalog (NHGRI-EBI), FinnGen, UK Biobank
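
The L2G cutoff above can be applied client-side when post-processing GWAS evidence. This is a minimal sketch, assuming each record is a dict with a hypothetical `l2g_score` key; neither the record shape nor the function name comes from the Open Targets API.

```python
from typing import Iterable, List

L2G_THRESHOLD = 0.05  # cutoff quoted in the bullet list above


def filter_by_l2g(rows: Iterable[dict], threshold: float = L2G_THRESHOLD) -> List[dict]:
    """Keep records whose L2G score is strictly above the threshold."""
    return [row for row in rows if row.get("l2g_score", 0.0) > threshold]


# Hypothetical records; "gene" and "l2g_score" are illustrative keys only.
candidates = [
    {"gene": "GENE_A", "l2g_score": 0.62},
    {"gene": "GENE_B", "l2g_score": 0.05},
    {"gene": "GENE_C", "l2g_score": 0.01},
]
kept = filter_by_l2g(candidates)
print([row["gene"] for row in kept])
```

The comparison is strict, so a record sitting exactly on the cutoff is dropped; switch to `>=` if an inclusive threshold is wanted.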

Gene Burden Tests

  • Rare variant association analyses
  • Aggregate effects of multiple rare variants in a gene
  • Particularly relevant for Mendelian and rare diseases

ClinVar Germline

  • Clinical variant interpretations
  • Classifications: pathogenic, likely pathogenic, VUS, benign
  • Expert-reviewed variant-disease associations

Genomics England PanelApp

  • Expert gene-disease ratings
  • Green (high evidence), amber (moderate evidence), red (low evidence)
  • Focus on rare diseases and cancer

Gene2Phenotype

  • Curated gene-disease relationships
  • Allelic requirements and inheritance patterns
  • Clinical validity assessments

UniProt Literature & Variants

  • Literature-based gene-disease associations
  • Expert-curated from scientific publications

Orphanet

  • Rare disease gene associations
  • Expert-reviewed and maintained

ClinGen

  • Clinical genome resource classifications
  • Gene-disease validity assertions

2. Somatic Mutations

Evidence from cancer genomics identifying driver genes and therapeutic targets.

Data Sources:

Cancer Gene Census

  • Expert-curated cancer genes
  • Tier classifications (1 = strong evidence, 2 = emerging)
  • Mutation types and cancer types

IntOGen

  • Computational driver gene predictions
  • Aggregated from large cohort studies
  • Statistical significance of mutations

ClinVar Somatic

  • Somatic clinical variant interpretations
  • Oncogenic/likely oncogenic classifications

Cancer Biomarkers

  • FDA/EMA approved biomarkers
  • Clinical trial biomarkers
  • Prognostic and predictive markers

3. Known Drugs

Evidence from clinical precedence showing drugs targeting genes for disease indications.

Data Source:

ChEMBL

  • Approved drugs (Phase 4)
  • Clinical candidates (Phase 1-3)
  • Withdrawn drugs
  • Drug-target-indication triplets with mechanism of action

Clinical Trial Information:

  • phase: Maximum clinical trial phase (1, 2, 3, 4)
  • status: Active, terminated, completed, withdrawn
  • mechanismOfAction: How drug affects target
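
To illustrate how the `phase` field might be used, the sketch below reduces a set of drug records to the maximum phase reached per drug. The `drug` key and the sample values are hypothetical; only `phase`, `status`, and `mechanismOfAction` follow the field list above.

```python
from typing import Dict, Iterable


def max_phase_by_drug(rows: Iterable[dict]) -> Dict[str, int]:
    """Return the highest clinical trial phase seen for each drug."""
    best: Dict[str, int] = {}
    for row in rows:
        drug = row["drug"]
        best[drug] = max(best.get(drug, 0), row["phase"])
    return best


# Hypothetical drug-target-indication records for illustration.
records = [
    {"drug": "drug_x", "phase": 2, "status": "completed", "mechanismOfAction": "inhibitor"},
    {"drug": "drug_x", "phase": 4, "status": "completed", "mechanismOfAction": "inhibitor"},
    {"drug": "drug_y", "phase": 1, "status": "terminated", "mechanismOfAction": "antagonist"},
]
phases = max_phase_by_drug(records)
# Drugs that reached Phase 4 in at least one record
approved = sorted(drug for drug, phase in phases.items() if phase == 4)
```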

4. Affected Pathways

Evidence linking genes to disease through pathway perturbations and functional screens.

Data Sources:

CRISPR Screens

  • Genome-scale knockout screens
  • Cancer dependency and essentiality data

Project Score (Cancer Dependency Map)

  • CRISPR-Cas9 fitness screens across cancer cell lines
  • Gene essentiality profiles

SLAPenrich

  • Pathway enrichment analysis
  • Somatic mutation pathway impacts

PROGENy

  • Pathway activity inference
  • Signaling pathway perturbations

Reactome

  • Expert-curated pathway annotations
  • Biological pathway representations

Gene Signatures

  • Expression-based signatures
  • Pathway activity patterns

5. RNA Expression

Evidence from differential gene expression in disease vs. control tissues.

Data Source:

Expression Atlas

  • Differential expression data
  • Baseline expression across tissues/conditions
  • RNA-Seq and microarray studies
  • Log2 fold-change and p-values
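
For readers less familiar with the statistic, a log2 fold-change compares mean expression in disease and control samples on a log scale. The sketch below is illustrative only: the pseudocount and the significance thresholds are assumptions, not values used by Expression Atlas or Open Targets.

```python
import math


def log2_fold_change(disease_mean: float, control_mean: float, pseudocount: float = 1.0) -> float:
    """Log2 ratio of disease to control expression; the pseudocount avoids division by zero."""
    return math.log2((disease_mean + pseudocount) / (control_mean + pseudocount))


def is_differential(log2_fc: float, adj_p: float,
                    min_abs_log2_fc: float = 1.0, max_adj_p: float = 0.05) -> bool:
    """Flag a gene when both effect size and adjusted p-value pass the (assumed) thresholds."""
    return abs(log2_fc) >= min_abs_log2_fc and adj_p <= max_adj_p


# (63 + 1) / (15 + 1) = 4, and log2(4) = 2
lfc = log2_fold_change(disease_mean=63.0, control_mean=15.0)
flagged = is_differential(lfc, adj_p=0.01)
```

A positive value means higher expression in disease; a value of 2 corresponds to a four-fold increase.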

6. Animal Models

Evidence from in vivo studies showing phenotypes associated with gene perturbations.

Data Source:

IMPC (International Mouse Phenotyping Consortium)

  • Systematic mouse knockout phenotypes
  • Phenotype-disease mappings via ontologies
  • Standardized phenotyping procedures

7. Literature

Evidence from text-mining of biomedical literature.

Data Source:

Europe PMC

  • Co-occurrence of genes and diseases in abstracts
  • Normalized citation counts
  • Weighted by publication type and recency
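
The data type and data source hierarchy listed above can be kept as a simple lookup table. The labels below are the display names used in this document, not API identifiers, and the helper name is hypothetical.

```python
from typing import Dict, List, Optional

# Display names taken from the sections above (not API identifiers).
DATA_SOURCES_BY_TYPE: Dict[str, List[str]] = {
    "Genetic Association": [
        "GWAS", "Gene Burden Tests", "ClinVar Germline", "Genomics England PanelApp",
        "Gene2Phenotype", "UniProt Literature & Variants", "Orphanet", "ClinGen",
    ],
    "Somatic Mutations": ["Cancer Gene Census", "IntOGen", "ClinVar Somatic", "Cancer Biomarkers"],
    "Known Drugs": ["ChEMBL"],
    "Affected Pathways": [
        "CRISPR Screens", "Project Score", "SLAPenrich", "PROGENy", "Reactome", "Gene Signatures",
    ],
    "RNA Expression": ["Expression Atlas"],
    "Animal Models": ["IMPC"],
    "Literature": ["Europe PMC"],
}


def data_type_of(source: str) -> Optional[str]:
    """Return the data type a source belongs to, or None if it is not listed."""
    for data_type, sources in DATA_SOURCES_BY_TYPE.items():
        if source in sources:
            return data_type
    return None


print(data_type_of("IntOGen"))
```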

Evidence Scoring

Each evidence source has its own scoring methodology:

Score Ranges

  • Most scores normalized to 0-1 range
  • Higher scores indicate stronger evidence
  • Scores are NOT confidence levels but relative strength indicators

Common Scoring Approaches:

Categorical Classifications:

  • ClinVar: Pathogenic (1.0), Likely pathogenic (0.99), etc.
  • Gene2Phenotype: Confirmed/probable ratings
  • PanelApp: Green/amber/red classifications

Statistical Measures:

  • GWAS: L2G scores incorporating multiple lines of evidence
  • Gene Burden: Statistical significance of variant aggregation
  • Expression: Adjusted p-values and fold-changes

Clinical Precedence:

  • Known Drugs: Phase weights (Phase 4 = 1.0, Phase 3 = 0.7, etc.)
  • Clinical status modifiers

Computational Predictions:

  • IntOGen: Q-values from driver mutation analysis
  • PROGENy/SLAPenrich: Pathway activity/enrichment scores

Evidence Interpretation Guidelines

Strengths by Data Type

Genetic Association - Strongest evidence for causality in humans

  • Direct link between genetic variation and disease
  • Mendelian diseases: high confidence
  • GWAS: associations implicate a locus, so L2G is needed to prioritize the likely causal gene
  • Consider ancestry and population-specific effects

Somatic Mutations - Direct evidence in cancer

  • Strong for oncology indications
  • Driver mutations indicate therapeutic potential
  • Consider cancer type specificity

Known Drugs - Clinical validation

  • Highest confidence: approved drugs (Phase 4)
  • Consider mechanism relevance to new indication
  • Phase 1-2: early evidence, higher risk

Affected Pathways - Mechanistic insights

  • Supports biological plausibility
  • May not predict clinical success
  • Useful for hypothesis generation

RNA Expression - Observational evidence

  • Correlation, not causation
  • May reflect disease consequence vs. cause
  • Useful for biomarker identification

Animal Models - Translational evidence

  • Strong for understanding biology
  • Variable translation to human disease
  • Most useful when phenotype matches human disease

Literature - Exploratory signal

  • Text-mining captures research focus
  • May reflect publication bias
  • Requires manual literature review for validation

Important Considerations

  1. Multiple evidence types strengthen confidence - Convergent evidence from different data types provides stronger support

  2. Under-studied diseases score lower - Novel or rare diseases may have strong evidence but lower aggregate scores due to limited research

  3. Association scores are not probabilities - Scores rank relative evidence strength, not success probability

  4. Context matters - Evidence strength depends on:

    • Disease mechanism understanding
    • Target biology and druggability
    • Clinical precedence in related indications
    • Safety considerations

  5. Data source reliability varies - Weight expert-curated sources (ClinGen, Gene2Phenotype) higher than computational predictions
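
One rough way to check for convergent evidence (consideration 1) is to count how many data types contribute a non-trivial score. This is a sketch under assumptions: the input is a plain mapping from data type name to score, and the `min_score` cutoff of 0.1 is arbitrary rather than an Open Targets convention.

```python
from typing import Dict, List


def supporting_data_types(datatype_scores: Dict[str, float], min_score: float = 0.1) -> List[str]:
    """List the data types whose score meets the (assumed) minimum, sorted by name."""
    return sorted(name for name, score in datatype_scores.items() if score >= min_score)


# Hypothetical per-data-type scores for one target-disease pair.
scores = {
    "genetic_association": 0.71,
    "somatic_mutation": 0.0,
    "known_drug": 0.35,
    "literature": 0.04,
}
support = supporting_data_types(scores)
print(f"{len(support)} data types support the association: {support}")
```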

Using Evidence in Queries

Filtering by Data Type

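# The evidences field filters by data source (datasourceIds) and has no
# data type argument, so request datatypeId on each row and filter on it
# after the call.
query = """
  query evidenceByType($ensemblId: String!, $efoId: String!) {
    disease(efoId: $efoId) {
      evidences(ensemblIds: [$ensemblId]) {
        rows {
          datatypeId
          datasourceId
          score
        }
      }
    }
  }
"""
variables = {
    "ensemblId": "ENSG00000157764",
    "efoId": "EFO_0000249",
}
data_types = {"genetic_association", "somatic_mutation"}


def filter_by_data_type(rows, wanted=data_types):
    """Keep evidence rows whose datatypeId is in the wanted set."""
    return [row for row in rows if row["datatypeId"] in wanted]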
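
Once a response for the query above has been parsed from JSON, the rows can be summarized per data source. The sketch below works on a hand-written stand-in for the response; the source names and scores are made up, and `summarize_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_by_source(response: dict) -> dict:
    """Count evidence rows and track the highest score for each datasourceId."""
    rows = response["data"]["disease"]["evidences"]["rows"]
    summary = defaultdict(lambda: {"count": 0, "max_score": 0.0})
    for row in rows:
        entry = summary[row["datasourceId"]]
        entry["count"] += 1
        entry["max_score"] = max(entry["max_score"], row["score"])
    return dict(summary)


# Stand-in for a parsed GraphQL response; values are illustrative only.
example_response = {
    "data": {
        "disease": {
            "evidences": {
                "rows": [
                    {"datasourceId": "source_a", "score": 0.9},
                    {"datasourceId": "source_a", "score": 0.3},
                    {"datasourceId": "source_b", "score": 0.4},
                ]
            }
        }
    }
}
summary = summarize_by_source(example_response)
```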

Accessing Data Type Scores

Data type scores aggregate all source scores within that type:

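query = """
  query associationScores($ensemblId: String!, $efoId: String!) {
    target(ensemblId: $ensemblId) {
      # The disease filter argument on associatedDiseases is named Bs
      associatedDiseases(Bs: [$efoId]) {
        rows {
          disease {
            name
          }
          score
          datatypeScores {
            id
            score
          }
        }
      }
    }
  }
"""
variables = {
    "ensemblId": "ENSG00000157764",
    "efoId": "EFO_0000249",
}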

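To give a feel for how several evidence scores can be aggregated into one, the sketch below implements a harmonic sum, the general scheme the Open Targets documentation describes for association scoring. The normalization by π²/6 (the limit of Σ 1/i²) and the absence of per-source weights are simplifying assumptions here, so the output will not reproduce platform scores.

```python
import math
from typing import Iterable

# Limit of sum(1 / i**2) as the number of terms grows; used as the assumed maximum.
MAX_HARMONIC_SUM = math.pi ** 2 / 6


def harmonic_sum(scores: Iterable[float]) -> float:
    """Sum scores in descending order, dividing the i-th largest by i squared."""
    ordered = sorted(scores, reverse=True)
    return sum(score / (rank ** 2) for rank, score in enumerate(ordered, start=1))


def normalized_harmonic_sum(scores: Iterable[float]) -> float:
    """Scale the harmonic sum to the 0-1 range using the assumed theoretical maximum."""
    return min(harmonic_sum(scores) / MAX_HARMONIC_SUM, 1.0)


# Hypothetical evidence scores within one data type.
example_scores = [0.2, 1.0, 0.5]
raw = harmonic_sum(example_scores)  # 1.0/1 + 0.5/4 + 0.2/9
aggregate = normalized_harmonic_sum(example_scores)
```

Because each additional score is divided by the square of its rank, the strongest evidence dominates and many weak records cannot inflate the total indefinitely.
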
Evidence Quality Assessment

When evaluating evidence:

  1. Check multiple sources - A single source may be unreliable
  2. Prioritize human genetic evidence - Strongest disease relevance
  3. Consider clinical precedence - Known drugs indicate druggability
  4. Assess mechanistic support - Pathway evidence supports biology
  5. Review literature manually - For critical decisions, read primary publications
  6. Validate in primary databases - Cross-reference with ClinVar, ClinGen, etc.