mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
368 lines
11 KiB
Markdown
368 lines
11 KiB
Markdown
---
|
|
name: kegg-database
|
|
description: "Query KEGG REST API for pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion, compound structures, for systems biology and enrichment analysis."
|
|
---
|
|
|
|
# KEGG Database
|
|
|
|
## Overview
|
|
|
|
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis and molecular interaction networks. This skill enables interaction with KEGG's REST API to query pathways, genes, compounds, enzymes, diseases, and drugs across multiple organisms.
|
|
|
|
**Important**: KEGG API is made available only for academic use by academic users.
|
|
|
|
## Quick Start
|
|
|
|
The skill provides:
|
|
1. Python helper functions (`scripts/kegg_api.py`) for all KEGG REST API operations
|
|
2. Comprehensive reference documentation (`references/kegg_reference.md`) with detailed API specifications
|
|
|
|
When users request KEGG data, determine which operation is needed and use the appropriate function from `scripts/kegg_api.py`.
|
|
|
|
## Core Operations
|
|
|
|
### 1. Database Information (`kegg_info`)
|
|
|
|
Retrieve metadata and statistics about KEGG databases.
|
|
|
|
**When to use**: Understanding database structure, checking available data, getting release information.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_info
|
|
|
|
# Get pathway database info
|
|
info = kegg_info('pathway')
|
|
|
|
# Get organism-specific info
|
|
hsa_info = kegg_info('hsa') # Human genome
|
|
```
|
|
|
|
**Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug`
|
|
|
|
### 2. Listing Entries (`kegg_list`)
|
|
|
|
List entry identifiers and names from KEGG databases.
|
|
|
|
**When to use**: Getting all pathways for an organism, listing genes, retrieving compound catalogs.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_list
|
|
|
|
# List all reference pathways
|
|
pathways = kegg_list('pathway')
|
|
|
|
# List human-specific pathways
|
|
hsa_pathways = kegg_list('pathway', 'hsa')
|
|
|
|
# List specific genes (max 10)
|
|
genes = kegg_list('hsa:10458+hsa:10459')
|
|
```
|
|
|
|
**Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli)
|
|
|
|
### 3. Searching (`kegg_find`)
|
|
|
|
Search KEGG databases by keywords or molecular properties.
|
|
|
|
**When to use**: Finding genes by name/description, searching compounds by formula or mass, discovering entries by keywords.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_find
|
|
|
|
# Keyword search
|
|
results = kegg_find('genes', 'p53')
|
|
shiga_toxin = kegg_find('genes', 'shiga toxin')
|
|
|
|
# Chemical formula search (exact match)
|
|
compounds = kegg_find('compound', 'C7H10N4O2', 'formula')
|
|
|
|
# Molecular weight range search
|
|
drugs = kegg_find('drug', '300-310', 'exact_mass')
|
|
```
|
|
|
|
**Search options**: `formula` (exact match), `exact_mass` (range), `mol_weight` (range)
|
|
|
|
### 4. Retrieving Entries (`kegg_get`)
|
|
|
|
Get complete database entries or specific data formats.
|
|
|
|
**When to use**: Retrieving pathway details, getting gene/protein sequences, downloading pathway maps, accessing compound structures.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_get
|
|
|
|
# Get pathway entry
|
|
pathway = kegg_get('hsa00010') # Glycolysis pathway
|
|
|
|
# Get multiple entries (max 10)
|
|
genes = kegg_get(['hsa:10458', 'hsa:10459'])
|
|
|
|
# Get protein sequence (FASTA)
|
|
sequence = kegg_get('hsa:10458', 'aaseq')
|
|
|
|
# Get nucleotide sequence
|
|
nt_seq = kegg_get('hsa:10458', 'ntseq')
|
|
|
|
# Get compound structure
|
|
mol_file = kegg_get('cpd:C00002', 'mol') # ATP in MOL format
|
|
|
|
# Get pathway as JSON (single entry only)
|
|
pathway_json = kegg_get('hsa05130', 'json')
|
|
|
|
# Get pathway image (single entry only)
|
|
pathway_img = kegg_get('hsa05130', 'image')
|
|
```
|
|
|
|
**Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL format), `kcf` (KCF format), `image` (PNG), `kgml` (XML), `json` (pathway JSON)
|
|
|
|
**Important**: Image, KGML, and JSON formats allow only one entry at a time.
|
|
|
|
### 5. ID Conversion (`kegg_conv`)
|
|
|
|
Convert identifiers between KEGG and external databases.
|
|
|
|
**When to use**: Integrating KEGG data with other databases, mapping gene IDs, converting compound identifiers.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_conv
|
|
|
|
# Convert all human genes to NCBI Gene IDs
|
|
conversions = kegg_conv('ncbi-geneid', 'hsa')
|
|
|
|
# Convert specific gene
|
|
gene_id = kegg_conv('ncbi-geneid', 'hsa:10458')
|
|
|
|
# Convert to UniProt
|
|
uniprot_id = kegg_conv('uniprot', 'hsa:10458')
|
|
|
|
# Convert compounds to PubChem
|
|
pubchem_ids = kegg_conv('pubchem', 'compound')
|
|
|
|
# Reverse conversion (NCBI Gene ID to KEGG)
|
|
kegg_id = kegg_conv('hsa', 'ncbi-geneid')
|
|
```
|
|
|
|
**Supported conversions**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi`
|
|
|
|
### 6. Cross-Referencing (`kegg_link`)
|
|
|
|
Find related entries within and between KEGG databases.
|
|
|
|
**When to use**: Finding pathways containing genes, getting genes in a pathway, mapping genes to KO groups, finding compounds in pathways.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_link
|
|
|
|
# Find pathways linked to human genes
|
|
pathways = kegg_link('pathway', 'hsa')
|
|
|
|
# Get genes in a specific pathway
|
|
genes = kegg_link('genes', 'hsa00010') # Glycolysis genes
|
|
|
|
# Find pathways containing a specific gene
|
|
gene_pathways = kegg_link('pathway', 'hsa:10458')
|
|
|
|
# Find compounds in a pathway
|
|
compounds = kegg_link('compound', 'hsa00010')
|
|
|
|
# Map genes to KO (orthology) groups
|
|
ko_groups = kegg_link('ko', 'hsa:10458')
|
|
```
|
|
|
|
**Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology)
|
|
|
|
### 7. Drug-Drug Interactions (`kegg_ddi`)
|
|
|
|
Check for drug-drug interactions.
|
|
|
|
**When to use**: Analyzing drug combinations, checking for contraindications, pharmacological research.
|
|
|
|
**Usage**:
|
|
```python
|
|
from scripts.kegg_api import kegg_ddi
|
|
|
|
# Check single drug
|
|
interactions = kegg_ddi('D00001')
|
|
|
|
# Check multiple drugs (max 10)
|
|
interactions = kegg_ddi(['D00001', 'D00002', 'D00003'])
|
|
```
|
|
|
|
## Common Analysis Workflows
|
|
|
|
### Workflow 1: Gene to Pathway Mapping
|
|
|
|
**Use case**: Finding pathways associated with genes of interest (e.g., for pathway enrichment analysis).
|
|
|
|
```python
|
|
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
|
|
|
|
# Step 1: Find gene ID by name
|
|
gene_results = kegg_find('genes', 'p53')
|
|
|
|
# Step 2: Link gene to pathways
|
|
pathways = kegg_link('pathway', 'hsa:7157') # TP53 gene
|
|
|
|
# Step 3: Get detailed pathway information
|
|
for pathway_line in pathways.split('\n'):
|
|
if pathway_line:
|
|
pathway_id = pathway_line.split('\t')[1].replace('path:', '')
|
|
pathway_info = kegg_get(pathway_id)
|
|
# Process pathway information
|
|
```
|
|
|
|
### Workflow 2: Pathway Enrichment Context
|
|
|
|
**Use case**: Getting all genes in organism pathways for enrichment analysis.
|
|
|
|
```python
|
|
from scripts.kegg_api import kegg_list, kegg_link
|
|
|
|
# Step 1: List all human pathways
|
|
pathways = kegg_list('pathway', 'hsa')
|
|
|
|
# Step 2: For each pathway, get associated genes
|
|
for pathway_line in pathways.split('\n'):
|
|
if pathway_line:
|
|
pathway_id = pathway_line.split('\t')[0]
|
|
genes = kegg_link('genes', pathway_id)
|
|
# Process genes for enrichment analysis
|
|
```
|
|
|
|
### Workflow 3: Compound to Pathway Analysis
|
|
|
|
**Use case**: Finding metabolic pathways containing compounds of interest.
|
|
|
|
```python
|
|
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
|
|
|
|
# Step 1: Search for compound
|
|
compound_results = kegg_find('compound', 'glucose')
|
|
|
|
# Step 2: Link compound to reactions
|
|
reactions = kegg_link('reaction', 'cpd:C00031') # Glucose
|
|
|
|
# Step 3: Link reactions to pathways
|
|
pathways = kegg_link('pathway', 'rn:R00299') # Specific reaction
|
|
|
|
# Step 4: Get pathway details
|
|
pathway_info = kegg_get('map00010') # Glycolysis
|
|
```
|
|
|
|
### Workflow 4: Cross-Database Integration
|
|
|
|
**Use case**: Integrating KEGG data with UniProt, NCBI, or PubChem databases.
|
|
|
|
```python
|
|
from scripts.kegg_api import kegg_conv, kegg_get
|
|
|
|
# Step 1: Convert KEGG gene IDs to external database IDs
|
|
uniprot_map = kegg_conv('uniprot', 'hsa')
|
|
ncbi_map = kegg_conv('ncbi-geneid', 'hsa')
|
|
|
|
# Step 2: Parse conversion results
|
|
for line in uniprot_map.split('\n'):
|
|
if line:
|
|
kegg_id, uniprot_id = line.split('\t')
|
|
# Use external IDs for integration
|
|
|
|
# Step 3: Get sequences using KEGG
|
|
sequence = kegg_get('hsa:10458', 'aaseq')
|
|
```
|
|
|
|
### Workflow 5: Organism-Specific Pathway Analysis
|
|
|
|
**Use case**: Comparing pathways across different organisms.
|
|
|
|
```python
|
|
from scripts.kegg_api import kegg_list, kegg_get
|
|
|
|
# Step 1: List pathways for multiple organisms
|
|
human_pathways = kegg_list('pathway', 'hsa')
|
|
mouse_pathways = kegg_list('pathway', 'mmu')
|
|
yeast_pathways = kegg_list('pathway', 'sce')
|
|
|
|
# Step 2: Get reference pathway for comparison
|
|
ref_pathway = kegg_get('map00010') # Reference glycolysis
|
|
|
|
# Step 3: Get organism-specific versions
|
|
hsa_glycolysis = kegg_get('hsa00010')
|
|
mmu_glycolysis = kegg_get('mmu00010')
|
|
```
|
|
|
|
## Pathway Categories
|
|
|
|
KEGG organizes pathways into seven major categories. When interpreting pathway IDs or recommending pathways to users:
|
|
|
|
1. **Metabolism** (e.g., `map00010` - Glycolysis, `map00190` - Oxidative phosphorylation)
|
|
2. **Genetic Information Processing** (e.g., `map03010` - Ribosome, `map03040` - Spliceosome)
|
|
3. **Environmental Information Processing** (e.g., `map04010` - MAPK signaling, `map02010` - ABC transporters)
|
|
4. **Cellular Processes** (e.g., `map04140` - Autophagy, `map04210` - Apoptosis)
|
|
5. **Organismal Systems** (e.g., `map04610` - Complement cascade, `map04910` - Insulin signaling)
|
|
6. **Human Diseases** (e.g., `map05200` - Pathways in cancer, `map05010` - Alzheimer disease)
|
|
7. **Drug Development** (chronological and target-based classifications)
|
|
|
|
Reference `references/kegg_reference.md` for detailed pathway lists and classifications.
|
|
|
|
## Important Identifiers and Formats
|
|
|
|
### Pathway IDs
|
|
- `map#####` - Reference pathway (generic, not organism-specific)
|
|
- `hsa#####` - Human pathway
|
|
- `mmu#####` - Mouse pathway
|
|
|
|
### Gene IDs
|
|
- Format: `organism:gene_number` (e.g., `hsa:10458`)
|
|
|
|
### Compound IDs
|
|
- Format: `cpd:C#####` (e.g., `cpd:C00002` for ATP)
|
|
|
|
### Drug IDs
|
|
- Format: `dr:D#####` (e.g., `dr:D00001`)
|
|
|
|
### Enzyme IDs
|
|
- Format: `ec:EC_number` (e.g., `ec:1.1.1.1`)
|
|
|
|
### KO (KEGG Orthology) IDs
|
|
- Format: `ko:K#####` (e.g., `ko:K00001`)
|
|
|
|
## API Limitations
|
|
|
|
Respect these constraints when using the KEGG API:
|
|
|
|
1. **Entry limits**: Maximum 10 entries per operation (except image/kgml/json: 1 entry only)
|
|
2. **Academic use**: API is for academic use only; commercial use requires licensing
|
|
3. **HTTP status codes**: Check for 200 (success), 400 (bad request), 404 (not found)
|
|
4. **Rate limiting**: No explicit limit, but avoid rapid-fire requests
|
|
|
|
## Detailed Reference
|
|
|
|
For comprehensive API documentation, database specifications, organism codes, and advanced usage, refer to `references/kegg_reference.md`. This includes:
|
|
|
|
- Complete list of KEGG databases
|
|
- Detailed API operation syntax
|
|
- All organism codes
|
|
- HTTP status codes and error handling
|
|
- Integration with Biopython and R/Bioconductor
|
|
- Best practices for API usage
|
|
|
|
## Troubleshooting
|
|
|
|
**404 Not Found**: Entry or database doesn't exist; verify IDs and organism codes
|
|
**400 Bad Request**: Syntax error in API call; check parameter formatting
|
|
**Empty results**: Search term may not match entries; try broader keywords
|
|
**Image/KGML errors**: These formats only work with single entries; remove batch processing
|
|
|
|
## Additional Tools
|
|
|
|
For interactive pathway visualization and annotation:
|
|
- **KEGG Mapper**: https://www.kegg.jp/kegg/mapper/
|
|
- **BlastKOALA**: Automated genome annotation
|
|
- **GhostKOALA**: Metagenome/metatranscriptome annotation
|