InterPro Domain Analysis Reference

Entry Types

| Type | Description | Example |
|---|---|---|
| family | Group of related proteins sharing a common evolutionary origin | IPR002117: p53 tumour suppressor family |
| domain | Distinct structural/functional unit that can exist independently | IPR011615: p53, DNA-binding domain |
| homologous_superfamily | Proteins related by structure but not necessarily sequence | IPR009003: Peptidase S1, PA clan |
| repeat | Short sequence unit that occurs in multiple copies | IPR002110: Ankyrin repeat |
| site | Residues important for function (umbrella for the four site types below) | |
| conserved_site | Conserved sequence motif (functional) | IPR003527: MAP kinase, conserved site |
| active_site | Catalytic residues | IPR008271: Serine/threonine-protein kinase, active site |
| binding_site | Residues involved in binding | IPR017441: Protein kinase, ATP binding site |
| ptm | Post-translational modification site | |

Common Domain Accessions

Signaling Domains

| Accession | Name | Function |
|---|---|---|
| IPR000719 | Protein kinase domain | ATP-dependent phosphorylation |
| IPR001245 | Serine-threonine/tyrosine-protein kinase, catalytic domain | Kinase catalytic domain |
| IPR000980 | SH2 domain | Phosphotyrosine binding |
| IPR001452 | SH3 domain | Proline-rich sequence binding |
| IPR011993 | PH-like domain superfamily | Phosphoinositide and protein binding (PH, PTB and related folds) |
| IPR000048 | IQ motif | Calmodulin binding |
| IPR000008 | C2 domain | Ca2+/phospholipid binding |
| IPR001849 | Pleckstrin homology (PH) domain | Phosphoinositide binding |

DNA Binding Domains

| Accession | Name | Function |
|---|---|---|
| IPR013087 | Zinc finger, C2H2-type | DNA binding |
| IPR000571 | Zinc finger, CCCH-type | RNA binding |
| IPR011991 | Winged helix-turn-helix | Transcription factor DNA binding |
| IPR013019 | MH1 domain | SMAD DNA binding |
| IPR001606 | ARID domain | AT-rich DNA binding |
| IPR014756 | E1-E2 ATPase, nucleotide-binding | |

Structural Domains

| Accession | Name | Function |
|---|---|---|
| IPR001357 | BRCT domain | DNA repair protein interaction |
| IPR000536 | Nuclear hormone receptor, ligand-binding | Hormone binding |
| IPR001628 | Zinc finger, nuclear hormone receptor | DNA binding (NHR) |
| IPR003961 | Fibronectin type III | Cell adhesion |
| IPR000742 | EGF-like domain | Receptor-ligand interaction |

Domain Architecture Patterns

Common multi-domain architectures and their biological meanings:

Receptor Tyrosine Kinases

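[EGF-like domain]... - [TM] - [Kinase domain]
e.g., TIE2 (TEK): IPR000742 (EGF-like) + IPR000719 (kinase); EGFR itself binds EGF through receptor L domains and cysteine-rich regions rather than EGF-like domains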

Adapter Proteins

[SH3] - [SH2] - [SH3]
e.g., Grb2; Crk is a related signaling adapter with an [SH2] - [SH3] - [SH3] layout

Nuclear Receptors

[DBD: C4-type zinc finger (IPR001628)] - [Ligand-binding domain (IPR000536)]
e.g., ERα (ESR1)

Kinases

[N-lobe] - [C-lobe, containing the activation loop]
Standard protein kinase fold (IPR000719); the lobes are subregions of a single domain, not separate InterPro entries
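
As a minimal sketch of how architecture strings like the ones above can be derived from domain hits, the snippet below orders hits by start position and joins short labels. The `DomainHit` type, the `LABELS` mapping, and the coordinates are hypothetical and chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class DomainHit(NamedTuple):
    """One domain match on a protein (hypothetical helper type)."""
    accession: str
    start: int
    stop: int


# Illustrative short labels for a few accessions listed in this reference.
LABELS = {
    "IPR001452": "SH3",
    "IPR000980": "SH2",
    "IPR000719": "Kinase",
}


def architecture(hits: Iterable[DomainHit]) -> str:
    """Return an N-to-C terminal architecture string such as '[SH3] - [SH2]'."""
    ordered = sorted(hits, key=lambda h: (h.start, h.stop))
    # Fall back to the raw accession when no short label is known.
    return " - ".join(f"[{LABELS.get(h.accession, h.accession)}]" for h in ordered)


# Made-up coordinates for a Grb2-like adapter, deliberately given out of order.
grb2_like = [
    DomainHit("IPR000980", 60, 152),
    DomainHit("IPR001452", 1, 58),
    DomainHit("IPR001452", 156, 215),
]
print(architecture(grb2_like))  # [SH3] - [SH2] - [SH3]
```

Sorting on start position alone assumes the hits do not overlap; overlapping signatures from different member databases would need to be merged first.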

GO Term Categories

InterPro GO annotations use three ontologies:

| Code | Ontology | Examples |
|---|---|---|
| P | Biological Process | GO:0006468 (protein phosphorylation) |
| F | Molecular Function | GO:0004672 (protein kinase activity) |
| C | Cellular Component | GO:0005886 (plasma membrane) |
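
The one-letter codes can be expanded with a small lookup. The sketch below assumes GO annotations have already been extracted as `(go_id, code)` pairs; that input shape and the function name are assumptions for illustration, not a format InterPro prescribes.

```python
from collections import defaultdict

# One-letter ontology codes as listed in the table above.
GO_ASPECTS = {
    "P": "Biological Process",
    "F": "Molecular Function",
    "C": "Cellular Component",
}


def group_by_aspect(terms):
    """Group (go_id, code) pairs under their full ontology name."""
    grouped = defaultdict(list)
    for go_id, code in terms:
        grouped[GO_ASPECTS[code]].append(go_id)
    return dict(grouped)


terms = [("GO:0006468", "P"), ("GO:0004672", "F"), ("GO:0005886", "C")]
print(group_by_aspect(terms))
```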

InterProScan for Novel Sequences

For protein sequences not in UniProt (novel/predicted sequences), run InterProScan:

# Command-line (install InterProScan locally)
./interproscan.sh -i my_proteins.fasta -f tsv,json -dp

# Options:
# -i: input FASTA
# -f: output formats (tsv, json, xml, gff3, html)
# -dp: disable the precalculated match lookup service so all analyses run locally (use for novel sequences or offline runs)
# --goterms: include GO term mappings
# --pathways: include pathway mappings

# Or use the web service:
# https://www.ebi.ac.uk/interpro/search/sequence/
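
When the scan is launched from a script, the same command line can be assembled as an argument list. This is a sketch only: the function name and its parameters are hypothetical, and it uses just the flags listed above. It builds the command without running it.

```python
def build_interproscan_command(
    fasta,
    formats=("tsv",),
    local_only=True,
    goterms=False,
    pathways=False,
    executable="./interproscan.sh",  # assumed install location, adjust as needed
):
    """Return the InterProScan command as a list suitable for subprocess.run."""
    cmd = [executable, "-i", fasta, "-f", ",".join(formats)]
    if local_only:
        cmd.append("-dp")  # skip the precalculated match lookup
    if goterms:
        cmd.append("--goterms")
    if pathways:
        cmd.append("--pathways")
    return cmd


cmd = build_interproscan_command("my_proteins.fasta", formats=("tsv", "json"))
print(" ".join(cmd))  # ./interproscan.sh -i my_proteins.fasta -f tsv,json -dp
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided InterProScan is installed at the path given in `executable`.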

Output fields (TSV):

  1. Protein accession
  2. Sequence MD5
  3. Sequence length
  4. Analysis (e.g., Pfam, SMART)
  5. Signature accession (e.g., PF00397)
  6. Signature description
  7. Start
  8. Stop
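  9. Score (e-value or raw score, depending on the analysis)
  10. Status (T = true)
  11. Date
  12. InterPro accession (if integrated)
  13. InterPro description
  14. GO annotations (only with --goterms)
  15. Pathway annotations (only with --pathways)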
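
The column layout above maps directly onto a small parser. The following is a sketch using only the standard library; the column keys are names chosen here for illustration, and the sample rows are made up. Rows shorter than 13 columns are padded, and any columns beyond the 13th are ignored.

```python
import csv
import io

# Keys for the first 13 TSV columns, in order (names chosen for this sketch).
COLUMNS = [
    "protein_accession", "md5", "length", "analysis",
    "signature_accession", "signature_description",
    "start", "stop", "score", "status", "date",
    "interpro_accession", "interpro_description",
]


def parse_interproscan_tsv(handle):
    """Parse InterProScan TSV text into a list of dicts, one per match."""
    hits = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        padded = fields + [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, padded))
        for key in ("length", "start", "stop"):
            row[key] = int(row[key])
        # Unintegrated signatures have an empty or "-" InterPro accession.
        if row["interpro_accession"] in ("", "-"):
            row["interpro_accession"] = None
            row["interpro_description"] = None
        hits.append(row)
    return hits


# Made-up sample: one integrated Pfam hit and one unintegrated 11-column row.
sample = (
    "seq1\t0123456789abcdef0123456789abcdef\t350\tPfam\tPF00069\t"
    "Protein kinase domain\t20\t280\t1.2E-50\tT\t01-01-2024\t"
    "IPR000719\tProtein kinase domain\n"
    "seq1\t0123456789abcdef0123456789abcdef\t350\tMobiDBLite\tmobidb-lite\t"
    "consensus disorder prediction\t300\t350\t-\tT\t01-01-2024\n"
)
hits = parse_interproscan_tsv(io.StringIO(sample))
print(hits[0]["interpro_accession"], hits[0]["start"], hits[0]["stop"])
```

The score is kept as a string because some analyses report `-` instead of a number.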

Useful Entry ID Collections

Human Disease-Relevant Domains

# InterPro accession -> short label, grouped by disease area
DISEASE_DOMAINS = {
    # Cancer
    "IPR011615": "p53, DNA-binding domain",
    "IPR012346": "p53/RUNT-type transcription factor, DNA-binding domain superfamily",
    "IPR000719": "Protein kinase domain",
    "IPR004827": "Basic-leucine zipper (bZIP) domain",

    # Neurodegenerative
    "IPR003527": "MAP kinase, conserved site",
    "IPR016024": "Armadillo-type fold",

    # Metabolic
    "IPR006047": "Glycosyl hydrolase, family 13, catalytic domain (amylase)",
    "IPR017853": "Glycoside hydrolase superfamily",
}
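
One way such a collection might be used is to screen per-protein InterPro hits against it. The sketch below redefines a three-entry stand-in for `DISEASE_DOMAINS` so that it runs on its own; the function name and the input shape (protein ID mapped to a list of accessions) are assumptions for illustration.

```python
# Small stand-in for the DISEASE_DOMAINS mapping, so the example is self-contained.
DISEASE_DOMAINS = {
    "IPR000719": "Protein kinase domain",
    "IPR004827": "Basic-leucine zipper (bZIP) domain",
    "IPR016024": "Armadillo-type fold",
}


def flag_disease_domains(hits, catalog=DISEASE_DOMAINS):
    """Map each protein to its sorted (accession, label) pairs found in catalog.

    hits: dict of protein ID -> iterable of InterPro accessions.
    Proteins with no catalog match are left out of the result.
    """
    flagged = {}
    for protein, accessions in hits.items():
        matches = sorted((acc, catalog[acc]) for acc in set(accessions) if acc in catalog)
        if matches:
            flagged[protein] = matches
    return flagged


# Hypothetical protein IDs and hits.
example_hits = {
    "protA": ["IPR000719", "IPR001245", "IPR000719"],
    "protB": ["IPR001452"],
}
print(flag_disease_domains(example_hits))
```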

Commonly Referenced Pfam IDs

| Pfam ID | Domain Name |
|---|---|
| PF00069 | Pkinase (protein kinase) |
| PF00076 | RRM_1 (RNA recognition motif) |
| PF00096 | zf-C2H2 (zinc finger) |
| PF00397 | WW domain |
| PF00400 | WD40 repeat |
| PF00617 | RasGEF domain |
| PF00018 | SH3 domain |
| PF00017 | SH2 domain |
| PF00097 | zf-C3HC4 (RING finger) |
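
Because InterPro and Pfam accessions follow fixed formats (`IPR` plus six digits, `PF` plus five digits), a quick format check can catch typos before a lookup. This is a sketch with a hypothetical function name; it checks only the format, not whether the accession exists in the database.

```python
import re

# Accession formats: InterPro = IPR + 6 digits, Pfam = PF + 5 digits.
ACCESSION_PATTERNS = {
    "interpro": re.compile(r"IPR\d{6}"),
    "pfam": re.compile(r"PF\d{5}"),
}


def classify_accession(accession):
    """Return 'interpro', 'pfam', or None if the format matches neither."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return database
    return None


for acc in ("IPR000719", "PF00069", "IPR719"):
    print(acc, classify_accession(acc))
```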