mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
---
name: pubchem-database
description: "Access and analyze chemical compound data from PubChem database using PubChemPy and PUG-REST API. Use this skill when you need to: search compounds by name/CID/SMILES/InChI/formula, retrieve molecular properties (MW/LogP/TPSA/H-bond counts), perform similarity searches with Tanimoto thresholds, conduct substructure searches for pharmacophores, convert between chemical formats (SMILES/InChI/SDF/JSON), generate 2D structure images, access bioactivity data from assays, get compound synonyms and annotations, screen compounds using Lipinski's Rule of Five, batch process multiple compounds, or find drug-like candidates. Handles 110M+ compounds and 270M+ bioactivities with rate limiting (5 req/sec, 400 req/min). Includes error handling for timeouts, not found errors, and missing properties. Supports both synchronous and asynchronous operations for large similarity/substructure searches."
---

# PubChem Database

## Overview

PubChem is the world's largest freely available chemical database, maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivities from more than 770 data sources. This skill provides guidance for programmatically accessing PubChem data using the PUG-REST API and the PubChemPy Python library.

## Core Capabilities

### 1. Chemical Structure Search

Search for compounds using multiple identifier types:

**By Chemical Name**:
```python
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```

**By CID (Compound ID)**:
```python
compound = pcp.Compound.from_cid(2244)  # Aspirin
```

**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```

**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
```

**By Molecular Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```

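When identifiers arrive as plain strings of mixed type, the namespace argument has to be chosen before calling `pcp.get_compounds`. The sketch below is one hedged way to do that; `guess_namespace` is a hypothetical helper, not part of PubChemPy, and it only separates the cases that can be recognized from syntax alone (CID, InChI, InChIKey). Everything else falls back to `'name'`.

```python
import re

# InChIKeys are 27 characters: blocks of 14, 10, and 1 uppercase letters,
# separated by hyphens.
_INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier: str) -> str:
    """Hypothetical helper: pick a PubChemPy namespace for an identifier."""
    text = identifier.strip()
    if text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if _INCHIKEY_RE.match(text):
        return "inchikey"
    # Names, SMILES, and formulas overlap too much to tell apart reliably
    # (for example "CO" is valid as both), so default to a name lookup.
    return "name"


for example in ["2244", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "aspirin"]:
    print(example, "->", guess_namespace(example))
```

The returned string can be passed as the second argument of `pcp.get_compounds`. SMILES and formula queries still need the namespace stated explicitly by the caller.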
### 2. Property Retrieval

Retrieve molecular properties either from a full compound object or by requesting only the properties you need:

**Using PubChemPy (Recommended)**:
```python
import pubchempy as pcp

# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]

# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp  # Partition coefficient
tpsa = compound.tpsa  # Topological polar surface area
```

**Get Specific Properties**:
```python
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```

**Batch Property Retrieval**:
```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []

for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)

df = pd.DataFrame(all_properties)
# MolecularWeight may come back as a string; convert before doing arithmetic
df['MolecularWeight'] = pd.to_numeric(df['MolecularWeight'])
```

**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).

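When the CIDs are already known, several compounds can share one PUG-REST property request instead of one request per name. The sketch below only builds the request URLs, following the `compound/cid/<cids>/property/<properties>/JSON` pattern; `chunked` and `property_url` are hypothetical helper names, and the batch size of 2 is chosen purely to make the example visible.

```python
from typing import Iterable, Iterator, List

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def property_url(cids: Iterable[int], properties: Iterable[str]) -> str:
    """Build a PUG-REST property URL for a comma-separated list of CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


# Aspirin (2244), ibuprofen (3672), paracetamol (1983), two CIDs per request
urls = [
    property_url(batch, ["MolecularWeight", "XLogP"])
    for batch in chunked([2244, 3672, 1983], 2)
]
for url in urls:
    print(url)
```

Each URL can then be fetched with `requests.get(url)`, which keeps the request count well below what a name-by-name loop would need.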
### 3. Similarity Search

Find structurally similar compounds using Tanimoto similarity:

```python
import pubchempy as pcp

# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)

# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```

**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.

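To make the `Threshold` value concrete: the Tanimoto coefficient of two binary fingerprints is the number of bits set in both, divided by the number of bits set in either. The sketch below computes it on two toy sets of "on" bit positions. These bit sets are invented for illustration and are not real PubChem fingerprints, which are computed server-side.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints: 3 bits in common, 6 distinct bits overall
fingerprint_a = {1, 4, 7, 9}
fingerprint_b = {1, 4, 8, 9, 12}

score = tanimoto(fingerprint_a, fingerprint_b)
print(f"Tanimoto: {score:.2f}")  # 3 / 6 = 0.50
```

On this scale, `Threshold=85` asks for compounds scoring at least 0.85 against the query, so the toy pair above would not be returned.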
### 4. Substructure Search

Find compounds containing a specific structural motif:

```python
import pubchempy as pcp

# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")
```

**Common Substructures**:
- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`

### 5. Format Conversion

Convert between different chemical structure formats:

```python
import pubchempy as pcp

compound = pcp.get_compounds('aspirin', 'name')[0]

# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid

# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
```

### 6. Structure Visualization

Generate 2D structure images:

```python
import pubchempy as pcp

# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)

# Using direct URL (via requests)
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Avoid writing an error page to the PNG file

with open('structure.png', 'wb') as f:
    f.write(response.content)
```

### 7. Synonym Retrieval

Get all known names and synonyms for a compound:

```python
import pubchempy as pcp

synonyms_data = pcp.get_synonyms('aspirin', 'name')

if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']

    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```

### 8. Bioactivity Data Access

Retrieve biological activity data from assays:

```python
import requests

# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"

response = requests.get(url, timeout=30)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```

**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script, which provides:
- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays

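The assay summary arrives as a table rather than a list of records, so a small reshaping step helps before filtering. The sketch below assumes the layout `Table -> Columns -> Column` (header names) and `Table -> Row -> Cell` (values), and assumes a column named `Activity Outcome`; both are assumptions to check against a real response. The `sample` payload is synthetic, and `rows_to_records` and `count_outcomes` are hypothetical helper names.

```python
from collections import Counter
from typing import Any, Dict, List


def rows_to_records(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each row's cells with the column names (layout is assumed)."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


def count_outcomes(records: List[Dict[str, Any]],
                   column: str = "Activity Outcome") -> Counter:
    """Tally records by outcome; the column name is an assumption."""
    return Counter(record.get(column, "Unknown") for record in records)


# Synthetic payload shaped like the assumed response, not real PubChem data
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "2244", "Active"]},
            {"Cell": ["2", "2244", "Inactive"]},
            {"Cell": ["3", "2244", "Inactive"]},
        ],
    }
}

records = rows_to_records(sample)
outcomes = count_outcomes(records)
print(dict(outcomes))
```

With a live response, `data = response.json()` from the request above would take the place of `sample`.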
### 9. Comprehensive Compound Annotations

Access detailed compound information through PUG-View:

```python
import requests

cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"

response = requests.get(url, timeout=30)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```

**Get Specific Section**:
```python
# Get only drug information (spaces in the heading are URL-encoded as '+')
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug+and+Medication+Information"
```

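A PUG-View record is a tree of nested sections, so it helps to list the headings before choosing a `heading` value to request. The sketch below walks that tree recursively, assuming each section carries a `TOCHeading` string and an optional nested `Section` list under `Record`. The `sample` record is synthetic and `list_headings` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple


def list_headings(sections: List[Dict[str, Any]], depth: int = 0) -> List[Tuple[int, str]]:
    """Return (depth, heading) pairs for every section, in document order."""
    headings = []
    for section in sections:
        headings.append((depth, section.get("TOCHeading", "")))
        headings.extend(list_headings(section.get("Section", []), depth + 1))
    return headings


# Synthetic record shaped like the assumed PUG-View layout
sample = {
    "Record": {
        "RecordNumber": 2244,
        "Section": [
            {
                "TOCHeading": "Drug and Medication Information",
                "Section": [{"TOCHeading": "Drug Indication"}],
            },
            {"TOCHeading": "Safety and Hazards"},
        ],
    }
}

headings = list_headings(sample["Record"]["Section"])
for depth, heading in headings:
    print("  " * depth + heading)
```

With a live response, `annotations["Record"]["Section"]` would be passed in place of the sample.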
## Installation Requirements

Install PubChemPy for Python-based access:

```bash
pip install pubchempy
```

For direct API access and bioactivity queries:

```bash
pip install requests
```

Optional for data analysis:

```bash
pip install pandas
```

## Helper Scripts

This skill includes Python scripts for common PubChem tasks:

### scripts/compound_search.py

Provides utility functions for searching and retrieving compound information:

**Key Functions**:
- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information

**Usage**:
```python
from scripts.compound_search import search_by_name, get_compound_properties

# Search for a compound
compounds = search_by_name('ibuprofen')

# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```

### scripts/bioactivity_query.py

Provides functions for retrieving biological activity data:

**Key Functions**:
- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target

**Usage**:
```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities

# Get bioactivity summary
summary = summarize_bioactivities(2244)  # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```

## API Rate Limits and Best Practices

**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds of running time per minute

**Best Practices**:
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts

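The delay recommendation can be enforced in one place rather than scattered as `time.sleep` calls. The sketch below is a minimal client-side throttle; `Throttle` is a hypothetical class, not part of PubChemPy, and the injectable `clock` and `sleep` arguments exist only so the logic can be exercised without real waiting. A 0.25 second interval works out to at most 4 requests per second and 240 per minute, which sits under both request limits listed above.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Hypothetical helper that spaces calls at least `min_interval` apart."""

    def __init__(self, min_interval: float = 0.25,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=0.25)
```

Calling `throttle.wait()` immediately before each `requests.get(...)` or PubChemPy call keeps a loop within the interval. It does not address the running-time limit, which depends on how expensive each query is.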
**Error Handling**:
```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    # An empty result list means the query matched nothing
    print("No results returned")
```

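Timeouts are often transient, so retrying with a growing delay is a common complement to the handlers above. The sketch below is a generic wrapper under that assumption; `with_retries` is a hypothetical helper, and the demo uses Python's built-in `TimeoutError` with a fake function so that it runs without network access.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T],
                 retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                 attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func(); on a listed exception, wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demo with a fake call that fails twice before succeeding
state = {"calls": 0}


def flaky_lookup() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []
result = with_retries(flaky_lookup, retry_on=(TimeoutError,), sleep=delays.append)
print(result, delays)
```

For PubChemPy calls, `retry_on=(pcp.TimeoutError,)` and `func=lambda: pcp.get_compounds('aspirin', 'name')` would be the corresponding arguments. Errors such as `BadRequestError` are better left out of `retry_on`, since repeating an invalid request will not help.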
## Common Workflows

### Workflow 1: Chemical Identifier Conversion Pipeline

Convert between different chemical identifiers:

```python
import pubchempy as pcp

# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]

# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
```

### Workflow 2: Drug-Like Property Screening

Screen compounds using Lipinski's Rule of Five:

```python
import pubchempy as pcp

def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]

    # Lipinski's Rule of Five
    # float() guards against molecular_weight being returned as a string;
    # "is not None" keeps a legitimate XLogP of 0 from being treated as missing
    rules = {
        'MW <= 500': float(compound.molecular_weight) <= 500,
        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }

    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations

rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```

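Separating the rule logic from the network call makes it reusable on property tables that have already been downloaded. The sketch below is one hedged way to do that: `lipinski_report` and `count_violations` are hypothetical helpers, missing values are reported as `None` instead of counted as violations, and the aspirin numbers are illustrative inputs typed in by hand, not fetched.

```python
from typing import Dict, Optional, Union

Number = Union[int, float, str, None]


def lipinski_report(mw: Number, xlogp: Number, hbd: Number, hba: Number
                    ) -> Dict[str, Optional[bool]]:
    """Evaluate each Rule of Five criterion; None means the value is missing."""

    def within(value: Number, limit: float) -> Optional[bool]:
        # float() also accepts numeric strings such as "180.16"
        return None if value is None else float(value) <= limit

    return {
        "MW <= 500": within(mw, 500),
        "LogP <= 5": within(xlogp, 5),
        "HBD <= 5": within(hbd, 5),
        "HBA <= 10": within(hba, 10),
    }


def count_violations(report: Dict[str, Optional[bool]]) -> int:
    """Count criteria that were evaluated and failed."""
    return sum(1 for passed in report.values() if passed is False)


# Illustrative values for aspirin, entered manually
report = lipinski_report(mw="180.16", xlogp=1.2, hbd=1, hba=4)
print(report, count_violations(report))
```

Rows from `pcp.get_properties` can be fed in through their `MolecularWeight`, `XLogP`, `HBondDonorCount`, and `HBondAcceptorCount` keys, using `dict.get` so that absent properties arrive as `None`.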
### Workflow 3: Finding Similar Drug Candidates

Identify structurally similar compounds to a known drug:

```python
import pubchempy as pcp

# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles

# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)

# Filter by drug-like properties
candidates = []
for comp in similar:
    mw = comp.molecular_weight
    xlogp = comp.xlogp
    if mw is None or xlogp is None:
        continue  # Skip compounds with missing properties
    if 200 <= float(mw) <= 600 and -1 <= float(xlogp) <= 5:
        candidates.append(comp)

print(f"Found {len(candidates)} drug-like candidates")
```

### Workflow 4: Batch Compound Property Comparison

Compare properties across multiple compounds:

```python
import pubchempy as pcp
import pandas as pd

compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']

properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")

df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```

### Workflow 5: Substructure-Based Virtual Screening

Screen for compounds containing specific pharmacophores:

```python
import pubchempy as pcp

# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'

# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

# Further filter by properties
# float() guards against molecular_weight being returned as a string
filtered_hits = [
    comp for comp in hits
    if comp.molecular_weight and float(comp.molecular_weight) < 500
]

print(f"Found {len(filtered_hits)} compounds with desired substructure")
```

## Reference Documentation

For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:

- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation

## Troubleshooting

**Compound Not Found**:
- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format

**Timeout Errors**:
- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries

**Empty Property Values**:
- Not all properties are available for all compounds
- Check whether a property exists before using it: `if compound.xlogp is not None:` (a plain truthiness test would also skip a valid value of 0)
- Some properties are only available for certain compound types

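A small accessor can centralize these checks so that missing values and numeric strings are handled the same way everywhere. The sketch below is a hedged example; `numeric_property` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a compound so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any, Optional


def numeric_property(obj: Any, name: str) -> Optional[float]:
    """Return the attribute as a float, or None if absent or non-numeric."""
    value = getattr(obj, name, None)
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Stand-in for a compound: a string weight, a zero XLogP, and a missing TPSA
fake_compound = SimpleNamespace(molecular_weight="180.16", xlogp=0, tpsa=None)

print(numeric_property(fake_compound, "molecular_weight"))
print(numeric_property(fake_compound, "xlogp"))
print(numeric_property(fake_compound, "tpsa"))
```

The zero XLogP comes back as `0.0` and not `None`, which is the case a bare `if compound.xlogp:` test would silently drop.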
**Rate Limit Exceeded**:
- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally

**Similarity/Substructure Search Hangs**:
- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out

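When calling PUG-REST directly instead of through PubChemPy, the polling has to be done by hand. The sketch below assumes the usual asynchronous pattern: the first response contains `Waiting.ListKey`, and the result is then fetched from `compound/listkey/<ListKey>/cids/JSON` until the waiting marker disappears. `poll_listkey` is a hypothetical helper, `fetch_json` is injected so the demo can use canned responses, and the interval and poll limit are arbitrary choices.

```python
import time
from typing import Any, Callable, Dict, List

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def poll_listkey(first_response: Dict[str, Any],
                 fetch_json: Callable[[str], Dict[str, Any]],
                 interval: float = 2.0,
                 max_polls: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Poll a ListKey until results are ready, then return the CID list."""
    response = first_response
    polls = 0
    while "Waiting" in response:
        if polls >= max_polls:
            raise TimeoutError("search did not finish within the poll limit")
        list_key = response["Waiting"]["ListKey"]
        sleep(interval)
        response = fetch_json(f"{PUG_REST_BASE}/compound/listkey/{list_key}/cids/JSON")
        polls += 1
    return response.get("IdentifierList", {}).get("CID", [])


# Demo with canned responses instead of real HTTP calls
canned = iter([
    {"Waiting": {"ListKey": "abc"}},       # still running
    {"IdentifierList": {"CID": [1, 2]}},   # finished
])
requested_urls = []


def fake_fetch(url: str) -> Dict[str, Any]:
    requested_urls.append(url)
    return next(canned)


cids = poll_listkey({"Waiting": {"ListKey": "abc"}}, fake_fetch, sleep=lambda s: None)
print(cids)
```

In real use, `fetch_json` could be `lambda url: requests.get(url, timeout=30).json()`, with `first_response` taken from the initial similarity or substructure request.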
## Additional Resources

- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy