Add Pubmed

This commit is contained in:
Timothy Kassis
2025-10-19 14:28:39 -07:00
parent e99bb7694b
commit bc187a1f3a
9 changed files with 1643 additions and 1663 deletions

@@ -35,7 +35,6 @@
"./scientific-packages/medchem",
"./scientific-packages/molfeat",
"./scientific-packages/polars",
-"./scientific-packages/pubchem-database",
"./scientific-packages/pydeseq2",
"./scientific-packages/pymatgen",
"./scientific-packages/pymc",
@@ -60,7 +59,8 @@
"source": "./",
"strict": false,
"skills": [
-"./scientific-databases/pubchem-database"
+"./scientific-databases/pubchem-database",
+"./scientific-databases/pubmed-database"
]
}
]

@@ -0,0 +1,454 @@
---
name: pubmed-database
description: Toolkit for searching and accessing PubMed, the U.S. National Library of Medicine's free database of biomedical literature. Use this skill when working with medical research literature, searching for scientific articles, constructing advanced search queries with MeSH terms and field tags, accessing articles programmatically via E-utilities API, or conducting systematic literature reviews in life sciences, medicine, or biomedical research.
---
# PubMed Database
## Overview
PubMed is the U.S. National Library of Medicine's comprehensive database providing free access to MEDLINE and life sciences literature. This skill provides expertise in searching PubMed effectively, constructing advanced queries, and accessing data programmatically through the E-utilities API.
## When to Use This Skill
Use this skill when:
- Searching for biomedical or life sciences research articles
- Constructing complex search queries with Boolean operators, field tags, or MeSH terms
- Conducting systematic literature reviews or meta-analyses
- Accessing PubMed data programmatically via the E-utilities API
- Finding articles by specific criteria (author, journal, publication date, article type)
- Retrieving citation information, abstracts, or full-text articles
- Working with PMIDs (PubMed IDs) or DOIs
- Creating automated workflows for literature monitoring or data extraction
## Core Capabilities
### 1. Advanced Search Query Construction
Construct sophisticated PubMed queries using Boolean operators, field tags, and specialized syntax.
**Basic Search Strategies**:
- Combine concepts with Boolean operators (AND, OR, NOT)
- Use field tags to limit searches to specific record parts
- Employ phrase searching with double quotes for exact matches
- Apply wildcards for term variations
- Use proximity searching for terms within specified distances
**Example Queries**:
```
# Recent systematic reviews on diabetes treatment
diabetes mellitus[mh] AND treatment[tiab] AND systematic review[pt] AND 2023:2024[dp]
# Clinical trials comparing two drugs
(metformin[nm] OR insulin[nm]) AND diabetes mellitus, type 2[mh] AND randomized controlled trial[pt]
# Author-specific research
smith ja[au] AND cancer[tiab] AND 2023[dp] AND english[la]
```
**When to consult search_syntax.md**:
- Need comprehensive list of available field tags
- Require detailed explanation of search operators
- Constructing complex proximity searches
- Understanding automatic term mapping behavior
- Need specific syntax for date ranges, wildcards, or special characters
Grep pattern for field tags: `\[au\]|\[ti\]|\[ab\]|\[mh\]|\[pt\]|\[dp\]`
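Queries like the examples in this section can also be assembled in code. The following is a minimal sketch rather than part of PubMed or E-utilities: `tag` and `combine` are hypothetical helper names introduced here for illustration, and the result is just a string to paste into PubMed or pass as the ESearch `term`.

```python
def tag(term: str, field: str) -> str:
    """Attach a PubMed field tag, e.g. tag("cancer", "tiab") gives "cancer[tiab]"."""
    return f"{term}[{field}]"


def combine(operator: str, *parts: str) -> str:
    """Join query parts with AND, OR, or NOT; parenthesize groups of two or more."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    joined = f" {operator} ".join(parts)
    return f"({joined})" if len(parts) > 1 else joined


# Rebuild the "clinical trials comparing two drugs" example from this section.
drugs = combine("OR", tag("metformin", "nm"), tag("insulin", "nm"))
query = " AND ".join([
    drugs,
    tag("diabetes mellitus, type 2", "mh"),
    tag("randomized controlled trial", "pt"),
])
print(query)
```

Keeping the OR group parenthesized before joining with AND avoids relying on PubMed's left-to-right Boolean evaluation.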
### 2. MeSH Terms and Controlled Vocabulary
Use Medical Subject Headings (MeSH) for precise, consistent searching across the biomedical literature.
**MeSH Searching**:
- [mh] tag searches MeSH terms with automatic inclusion of narrower terms
- [majr] tag limits to articles where the topic is the main focus
- Combine MeSH terms with subheadings for specificity (e.g., diabetes mellitus/therapy[mh])
**Common MeSH Subheadings**:
- /diagnosis - Diagnostic methods
- /drug therapy - Pharmaceutical treatment
- /epidemiology - Disease patterns and prevalence
- /etiology - Disease causes
- /prevention & control - Preventive measures
- /therapy - Treatment approaches
**Example**:
```
# Diabetes therapy with specific focus
diabetes mellitus, type 2/drug therapy[mh] AND cardiovascular diseases/prevention & control[mh]
```
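As a sketch of the heading/subheading pattern described above (subheading placed before the tag, as in `diabetes mellitus/therapy[mh]`), a small formatter can keep the syntax consistent. `mesh_term` is a hypothetical helper name, not an NCBI function.

```python
from typing import Optional


def mesh_term(heading: str, subheading: Optional[str] = None, major: bool = False) -> str:
    """Format a MeSH search term.

    heading/subheading goes before the tag; [majr] restricts to major topics,
    [mh] searches the heading with its narrower terms.
    """
    field = "majr" if major else "mh"
    if subheading:
        return f"{heading}/{subheading}[{field}]"
    return f"{heading}[{field}]"


print(mesh_term("diabetes mellitus, type 2", "drug therapy"))
print(mesh_term("cardiovascular diseases", "prevention & control"))
print(mesh_term("hypertension", major=True))
```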
### 3. Article Type and Publication Filtering
Filter results by publication type, date, text availability, and other attributes.
**Publication Types** (use [pt] field tag):
- Clinical Trial
- Meta-Analysis
- Randomized Controlled Trial
- Review
- Systematic Review
- Case Reports
- Guideline
**Date Filtering**:
- Single year: `2024[dp]`
- Date range: `2020:2024[dp]`
- Specific date: `2024/03/15[dp]`
**Text Availability**:
- Free full text: Add `AND free full text[sb]` to query
- Has abstract: Add `AND hasabstract[text]` to query
**Example**:
```
# Recent free full-text RCTs on hypertension
hypertension[mh] AND randomized controlled trial[pt] AND 2023:2024[dp] AND free full text[sb]
```
### 4. Programmatic Access via E-utilities API
Access PubMed data programmatically using the NCBI E-utilities REST API for automation and bulk operations.
**Core API Endpoints**:
1. **ESearch** - Search database and retrieve PMIDs
2. **EFetch** - Download full records in various formats
3. **ESummary** - Get document summaries
4. **EPost** - Upload UIDs for batch processing
5. **ELink** - Find related articles and linked data
**Basic Workflow**:
```python
import requests
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
API_KEY = "YOUR_API_KEY"  # Optional but recommended
# Step 1: Search for articles
search_params = {
    "db": "pubmed",
    "term": "diabetes[tiab] AND 2024[dp]",
    "retmax": 100,
    "retmode": "json",
    "api_key": API_KEY,
}
response = requests.get(f"{BASE_URL}esearch.fcgi", params=search_params, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors such as 429
pmids = response.json()["esearchresult"]["idlist"]
# Step 2: Fetch article details
fetch_params = {
    "db": "pubmed",
    "id": ",".join(pmids),
    "rettype": "abstract",
    "retmode": "text",
    "api_key": API_KEY,
}
response = requests.get(f"{BASE_URL}efetch.fcgi", params=fetch_params, timeout=30)
response.raise_for_status()
abstracts = response.text
```
**Rate Limits**:
- Without API key: 3 requests/second
- With API key: 10 requests/second
- Always include User-Agent header
**Best Practices**:
- Use history server (usehistory=y) for large result sets
- Implement batch operations via EPost for multiple UIDs
- Cache results locally to minimize redundant calls
- Respect rate limits to avoid service disruption
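One way to stay under the limits listed above is to space out calls on the client side. This is a minimal single-threaded sketch; `RateLimiter` is a hypothetical class written for this example, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time


class RateLimiter:
    """Space out calls so that at most `max_per_second` happen each second."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 3 requests/second without an API key, 10 with one (limits stated above).
limiter = RateLimiter(max_per_second=3)
```

Call `limiter.wait()` immediately before each request. Code that issues requests from several threads or processes would need a shared limiter, which this sketch does not provide.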
**When to consult api_reference.md**:
- Need detailed endpoint documentation
- Require parameter specifications for each E-utility
- Constructing batch operations or history server workflows
- Understanding response formats (XML, JSON, text)
- Troubleshooting API errors or rate limit issues
Grep pattern for API endpoints: `esearch|efetch|esummary|epost|elink|einfo`
### 5. Citation Matching and Article Retrieval
Find articles using partial citation information or specific identifiers.
**By Identifier**:
```
# By PMID
12345678[pmid]
# By DOI
10.1056/NEJMoa123456[doi]
# By PMC ID
PMC123456[pmc]
```
**Citation Matching** (via ECitMatch API):
Use journal name, year, volume, page, and author to find PMIDs:
```
Format: journal|year|volume|page|author|key|
Example: science|1987|235|182|palmiter rd|Art2|
```
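The citation string format above can be produced with a small formatter. This is a sketch under stated assumptions: `citation_string` and `build_bdata` are hypothetical helper names, and joining multiple citations with a carriage return for the `bdata` request parameter follows my reading of the ECitMatch documentation, so check it against the API reference before relying on it.

```python
def citation_string(journal, year, volume, first_page, author, key):
    """Format one citation as journal|year|volume|page|author|key|."""
    fields = [journal, str(year), str(volume), str(first_page), author, key]
    return "|".join(fields) + "|"


def build_bdata(citations):
    """Join citation strings with carriage returns (assumed ECitMatch separator)."""
    return "\r".join(citation_string(*c) for c in citations)


citations = [
    ("proc natl acad sci u s a", 1991, 88, 3248, "mann bj", "Art1"),
    ("science", 1987, 235, 182, "palmiter rd", "Art2"),
]
bdata = build_bdata(citations)
print(bdata.replace("\r", "\n"))
```

The `key` field is a caller-chosen label, so it can carry an internal record ID that lets you match returned PMIDs back to your own data.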
**By Author and Metadata**:
```
# First author with year and topic
smith ja[1au] AND 2023[dp] AND cancer[tiab]
# Journal, volume, and page
nature[ta] AND 2024[dp] AND 456[vi] AND 123-130[pg]
```
### 6. Systematic Literature Reviews
Conduct comprehensive literature searches for systematic reviews and meta-analyses.
**PICO Framework** (Population, Intervention, Comparison, Outcome):
Structure clinical research questions systematically:
```
# Example: Diabetes treatment effectiveness
# P: diabetes mellitus, type 2[mh]
# I: metformin[nm]
# C: lifestyle modification[tiab]
# O: glycemic control[tiab]
diabetes mellitus, type 2[mh] AND
(metformin[nm] OR lifestyle modification[tiab]) AND
glycemic control[tiab] AND
randomized controlled trial[pt]
```
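The PICO example above can be generated from its four components. A minimal sketch, assuming the same structure as the example (intervention and comparison ORed together, everything else ANDed); `pico_query` is a hypothetical helper name.

```python
def pico_query(population, intervention, comparison, outcome, study_type=None):
    """Combine PICO components into one PubMed query string.

    Intervention and comparison are ORed so that studies of either arm match;
    the remaining components are ANDed.
    """
    parts = [population, f"({intervention} OR {comparison})", outcome]
    if study_type:
        parts.append(study_type)
    return " AND ".join(parts)


query = pico_query(
    population="diabetes mellitus, type 2[mh]",
    intervention="metformin[nm]",
    comparison="lifestyle modification[tiab]",
    outcome="glycemic control[tiab]",
    study_type="randomized controlled trial[pt]",
)
print(query)
```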
**Comprehensive Search Strategy**:
```
# Include multiple synonyms and MeSH terms
(disease name[tiab] OR disease name[mh] OR synonym[tiab]) AND
(treatment[tiab] OR therapy[tiab] OR intervention[tiab]) AND
(systematic review[pt] OR meta-analysis[pt] OR randomized controlled trial[pt]) AND
2020:2024[dp] AND
english[la]
```
**Search Refinement**:
1. Start broad, review results
2. Add specificity with field tags
3. Apply date and publication type filters
4. Use Advanced Search to view query translation
5. Combine search history for complex queries
**When to consult common_queries.md**:
- Need example queries for specific disease types or research areas
- Require templates for different study designs
- Looking for population-specific query patterns (pediatric, geriatric, etc.)
- Constructing methodology-specific searches
- Need quality filters or best practice patterns
Grep pattern for query examples: `diabetes|cancer|cardiovascular|clinical trial|systematic review`
### 7. Search History and Saved Searches
Use PubMed's search history and My NCBI features for efficient research workflows.
**Search History** (via Advanced Search):
- Maintains up to 100 searches
- Expires after 8 hours of inactivity
- Combine previous searches using # references
- Preview result counts before executing
**Example**:
```
#1: diabetes mellitus[mh]
#2: cardiovascular diseases[mh]
#3: #1 AND #2 AND risk factors[tiab]
```
**My NCBI Features**:
- Save searches indefinitely
- Set up email alerts for new matching articles
- Create collections of saved articles
- Organize research by project or topic
**RSS Feeds**:
Create RSS feeds for any search to monitor new publications in your area of interest.
### 8. Related Articles and Citation Discovery
Find related research and explore citation networks.
**Similar Articles Feature**:
Every PubMed article includes pre-calculated related articles based on:
- Title and abstract similarity
- MeSH term overlap
- Weighted algorithmic matching
**ELink for Related Data**:
```
# Find related articles programmatically
elink.fcgi?dbfrom=pubmed&db=pubmed&id=PMID&cmd=neighbor
```
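The ELink call above can be built with the standard library so that parameters are URL-encoded correctly. This sketch only constructs the URL and sends nothing; `related_articles_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def related_articles_url(pmid, api_key=None):
    """Build the ELink URL that asks for PubMed records related to one PMID."""
    params = {"dbfrom": "pubmed", "db": "pubmed", "id": str(pmid), "cmd": "neighbor"}
    if api_key:
        params["api_key"] = api_key
    return f"{BASE_URL}elink.fcgi?{urlencode(params)}"


print(related_articles_url(12345678))
```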
**Citation Links**:
- LinkOut to full text from publishers
- Links to PubMed Central free articles
- Connections to related databases (GenBank, ClinicalTrials.gov, etc.)
### 9. Export and Citation Management
Export search results in various formats for citation management and further analysis.
**Export Formats**:
- .nbib files for reference managers (Zotero, Mendeley, EndNote)
- AMA, MLA, APA, NLM citation styles
- CSV for data analysis
- XML for programmatic processing
**Clipboard and Collections**:
- Clipboard: Temporary storage for up to 500 items (8-hour expiration)
- Collections: Permanent storage via My NCBI account
**Batch Export via API**:
```
# Export citations in MEDLINE format
efetch.fcgi?db=pubmed&id=PMID1,PMID2&rettype=medline&retmode=text
```
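For more than a handful of PMIDs, the export call above is easier to manage in batches. A minimal sketch: `medline_export_requests` is a hypothetical helper, and the default batch size of 200 is an assumption chosen to keep URLs short, not a documented limit of this endpoint.

```python
def medline_export_requests(pmids, batch_size=200):
    """Yield one EFetch parameter dict per batch of PMIDs (MEDLINE text output)."""
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        yield {
            "db": "pubmed",
            "id": ",".join(batch),
            "rettype": "medline",
            "retmode": "text",
        }


for params in medline_export_requests(["111", "222", "333"], batch_size=2):
    print(params["id"])
```

Each yielded dictionary can be passed as `params` to an HTTP client call against `efetch.fcgi`, with rate limiting applied between batches.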
## Working with Reference Files
This skill includes three comprehensive reference files in the `references/` directory:
### references/api_reference.md
Complete E-utilities API documentation including all nine endpoints, parameters, response formats, and best practices. Consult when:
- Implementing programmatic PubMed access
- Constructing API requests
- Understanding rate limits and authentication
- Working with large datasets via history server
- Troubleshooting API errors
### references/search_syntax.md
Detailed guide to PubMed search syntax including field tags, Boolean operators, wildcards, and special characters. Consult when:
- Constructing complex search queries
- Understanding automatic term mapping
- Using advanced search features (proximity, wildcards)
- Applying filters and limits
- Troubleshooting unexpected search results
### references/common_queries.md
Extensive collection of example queries for various research scenarios, disease types, and methodologies. Consult when:
- Starting a new literature search
- Need templates for specific research areas
- Looking for best practice query patterns
- Conducting systematic reviews
- Searching for specific study designs or populations
**Reference Loading Strategy**:
Load reference files into context as needed based on the specific task. For brief queries or basic searches, the information in this SKILL.md may be sufficient. For complex operations, consult the appropriate reference file.
## Common Workflows
### Workflow 1: Basic Literature Search
1. Identify key concepts and synonyms
2. Construct query with Boolean operators and field tags
3. Review initial results and refine query
4. Apply filters (date, article type, language)
5. Export results for analysis
### Workflow 2: Systematic Review Search
1. Define research question using PICO framework
2. Identify all relevant MeSH terms and synonyms
3. Construct comprehensive search strategy
4. Search multiple databases (include PubMed)
5. Document search strategy and date
6. Export results for screening and review
### Workflow 3: Programmatic Data Extraction
1. Design search query and test in web interface
2. Implement search using ESearch API
3. Use history server for large result sets
4. Retrieve detailed records with EFetch
5. Parse XML/JSON responses
6. Store data locally with caching
7. Implement rate limiting and error handling
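Steps 2 and 3 of this workflow depend on reading the ESearch response correctly. The sketch below extracts the fields needed to continue with the history server from an already-decoded JSON payload. `parse_esearch_json` is a hypothetical helper, and the lowercase key names (`count`, `webenv`, `querykey`) are assumed from typical ESearch JSON output with `usehistory=y`.

```python
def parse_esearch_json(payload):
    """Pull the hit count, PMIDs, and history-server handles out of ESearch JSON."""
    result = payload["esearchresult"]
    return {
        "count": int(result["count"]),          # total matches, returned as a string
        "pmids": result.get("idlist", []),      # only the current page of IDs
        "webenv": result.get("webenv"),         # present when usehistory=y
        "query_key": result.get("querykey"),    # present when usehistory=y
    }


# Illustrative payload in the assumed shape, not a real response.
sample = {
    "esearchresult": {
        "count": "1234",
        "retmax": "2",
        "retstart": "0",
        "querykey": "1",
        "webenv": "MCID_12345",
        "idlist": ["111", "222"],
    }
}
print(parse_esearch_json(sample))
```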
### Workflow 4: Citation Discovery
1. Start with known relevant article
2. Use Similar Articles to find related work
3. Check citing articles (when available)
4. Explore MeSH terms from relevant articles
5. Construct new searches based on discoveries
6. Use ELink to find related database entries
### Workflow 5: Ongoing Literature Monitoring
1. Construct comprehensive search query
2. Test and refine query for precision
3. Save search to My NCBI account
4. Set up email alerts for new matches
5. Create RSS feed for feed reader monitoring
6. Review new articles regularly
## Tips and Best Practices
### Search Strategy
- Start broad, then narrow with field tags and filters
- Include synonyms and MeSH terms for comprehensive coverage
- Use quotation marks for exact phrases
- Check Search Details in Advanced Search to verify query translation
- Combine multiple searches using search history
### API Usage
- Obtain API key for higher rate limits (10 req/sec vs 3 req/sec)
- Use history server for result sets > 500 articles
- Implement exponential backoff for rate limit handling
- Cache results locally to minimize redundant requests
- Always include descriptive User-Agent header
### Quality Filtering
- Prefer systematic reviews and meta-analyses for synthesized evidence
- Use publication type filters to find specific study designs
- Filter by date for most recent research
- Apply language filters as appropriate
- Use free full text filter for immediate access
### Citation Management
- Export early and often to avoid losing search results
- Use .nbib format for compatibility with most reference managers
- Create My NCBI account for permanent collections
- Document search strategies for reproducibility
- Use Collections to organize research by project
## Limitations and Considerations
### Database Coverage
- Primarily biomedical and life sciences literature
- Pre-1975 articles often lack abstracts
- Full author names available from 2002 forward
- Non-English abstracts available but may default to English display
### Search Limitations
- Display limited to 10,000 results maximum
- Search history expires after 8 hours of inactivity
- Clipboard holds max 500 items with 8-hour expiration
- Automatic term mapping may produce unexpected results
### API Considerations
- Rate limits apply (3-10 requests/second)
- Large queries may time out (use history server)
- XML parsing required for detailed data extraction
- API key recommended for production use
### Access Limitations
- PubMed provides citations and abstracts (not always full text)
- Full text access depends on publisher, institutional access, or open access status
- LinkOut availability varies by journal and institution
- Some content requires subscription or payment
## Support Resources
- **PubMed Help**: https://pubmed.ncbi.nlm.nih.gov/help/
- **E-utilities Documentation**: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **NLM Help Desk**: 1-888-FIND-NLM (1-888-346-3656)
- **Technical Support**: eutilities@ncbi.nlm.nih.gov
- **Mailing List**: utilities-announce@ncbi.nlm.nih.gov

@@ -0,0 +1,298 @@
# PubMed E-utilities API Reference
## Overview
The NCBI E-utilities provide programmatic access to PubMed and other Entrez databases through a REST API. The base URL for all E-utilities is:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```
## API Key Requirements
Since December 1, 2018, NCBI has limited E-utility requests made without an API key to 3 per second. An API key is not mandatory, but supplying one raises the limit to 10 requests per second. To obtain an API key, register for an NCBI account and generate a key from your account settings.
Include the API key in requests using the `api_key` parameter:
```
esearch.fcgi?db=pubmed&term=cancer&api_key=YOUR_API_KEY
```
## Rate Limits
- **Without API key**: 3 requests per second
- **With API key**: 10 requests per second
- Always include a User-Agent header in requests
## Core E-utility Tools
### 1. ESearch - Query Databases
**Endpoint**: `esearch.fcgi`
**Purpose**: Search an Entrez database and retrieve a list of UIDs (e.g., PMIDs for PubMed)
**Required Parameters**:
- `db` - Database to search (e.g., pubmed, gene, protein)
- `term` - Search query
**Optional Parameters**:
- `retmax` - Maximum records to return (default: 20, max: 10000)
- `retstart` - Index of first record to return (default: 0)
- `usehistory=y` - Store results on history server for large result sets
- `retmode` - Return format (xml, json)
- `sort` - Sort order (relevance, pub_date, first_author, last_author, journal)
- `field` - Limit search to specific field
- `datetype` - Type of date to use for filtering (pdat for publication date)
- `mindate` - Minimum date (YYYY/MM/DD format)
- `maxdate` - Maximum date (YYYY/MM/DD format)
**Example Request**:
```
esearch.fcgi?db=pubmed&term=breast+cancer&retmax=100&retmode=json&api_key=YOUR_API_KEY
```
**Response Elements**:
- `Count` - Total number of records matching query
- `RetMax` - Number of records returned in this response
- `RetStart` - Index of first returned record
- `IdList` - List of UIDs (PMIDs)
- `WebEnv` - History server environment string (when usehistory=y)
- `QueryKey` - Query key for history server (when usehistory=y)
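When `retmode` is left at the XML default, the response elements listed above can be read with the standard library. This is a sketch: `parse_esearch_xml` is a hypothetical helper, and the sample document is a hand-written illustration in the expected shape (an `eSearchResult` root with `Id` children under `IdList`), not a captured response.

```python
import xml.etree.ElementTree as ET


def parse_esearch_xml(xml_text):
    """Extract count, paging info, PMIDs, and history handles from ESearch XML."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count", default="0")),
        "ret_start": int(root.findtext("RetStart", default="0")),
        "pmids": [el.text for el in root.findall("./IdList/Id")],
        "webenv": root.findtext("WebEnv"),       # None unless usehistory=y
        "query_key": root.findtext("QueryKey"),  # None unless usehistory=y
    }


sample_xml = """<eSearchResult>
  <Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""
print(parse_esearch_xml(sample_xml))
```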
### 2. EFetch - Download Records
**Endpoint**: `efetch.fcgi`
**Purpose**: Retrieve full records from a database in various formats
**Required Parameters**:
- `db` - Database name
- `id` - Comma-separated list of UIDs, or use WebEnv/query_key from ESearch
**Optional Parameters**:
- `rettype` - Record type (abstract, medline, xml, uilist)
- `retmode` - Return mode (text, xml)
- `retstart` - Starting record index
- `retmax` - Maximum records per request
**Example Request**:
```
efetch.fcgi?db=pubmed&id=123456,234567&rettype=abstract&retmode=text&api_key=YOUR_API_KEY
```
**Common rettype Values for PubMed**:
- `abstract` - Abstract text
- `medline` - Full MEDLINE format
- `xml` - PubMed XML format (the documented way to request it is to omit `rettype` and set `retmode=xml`)
- `uilist` - List of UIDs only
### 3. ESummary - Retrieve Document Summaries
**Endpoint**: `esummary.fcgi`
**Purpose**: Get document summaries (DocSum) for a list of UIDs
**Required Parameters**:
- `db` - Database name
- `id` - Comma-separated UIDs or WebEnv/query_key
**Optional Parameters**:
- `retmode` - Return format (xml, json)
- `version` - DocSum version (1.0 or 2.0, default is 1.0)
**Example Request**:
```
esummary.fcgi?db=pubmed&id=123456,234567&retmode=json&version=2.0&api_key=YOUR_API_KEY
```
**DocSum Fields** (vary by database, common PubMed fields):
- Title
- Authors
- Source (journal)
- PubDate
- Volume, Issue, Pages
- DOI
- PmcRefCount (citations in PMC)
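With `retmode=json`, the DocSum fields can be flattened into simple records. A minimal sketch, assuming the usual ESummary JSON layout (a `result` object holding a `uids` list plus one entry per UID) and lowercase field names such as `title`, `source`, and `pubdate`; `summarize` is a hypothetical helper and the sample payload is illustrative only.

```python
def summarize(payload):
    """Turn an ESummary JSON payload into a list of flat article records."""
    result = payload["result"]
    rows = []
    for uid in result.get("uids", []):
        doc = result[uid]
        rows.append({
            "pmid": uid,
            "title": doc.get("title", ""),
            "journal": doc.get("source", ""),
            "pubdate": doc.get("pubdate", ""),
            "authors": [a.get("name", "") for a in doc.get("authors", [])],
        })
    return rows


# Hand-written payload in the assumed shape, not a real record.
sample = {
    "result": {
        "uids": ["111"],
        "111": {
            "uid": "111",
            "title": "Example title",
            "source": "Example J",
            "pubdate": "2024 Jan",
            "authors": [{"name": "Smith JA"}, {"name": "Jones M"}],
        },
    }
}
print(summarize(sample))
```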
### 4. EPost - Upload UIDs
**Endpoint**: `epost.fcgi`
**Purpose**: Upload a list of UIDs to the history server for use in subsequent requests
**Required Parameters**:
- `db` - Database name
- `id` - Comma-separated list of UIDs
**Example Request**:
```
epost.fcgi?db=pubmed&id=123456,234567,345678&api_key=YOUR_API_KEY
```
**Response**:
Returns WebEnv and QueryKey for use in subsequent requests
### 5. ELink - Find Related Data
**Endpoint**: `elink.fcgi`
**Purpose**: Find related records within the same database or in different databases
**Required Parameters**:
- `dbfrom` - Source database
- `db` - Target database (can be same as dbfrom)
- `id` - UID(s) from source database
**Optional Parameters**:
- `cmd` - Link command (neighbor, neighbor_history, prlinks, llinks, etc.)
- `linkname` - Specific link type to retrieve
- `term` - Filter results with search query
- `holding` - Filter by library holdings
**Example Request**:
```
elink.fcgi?dbfrom=pubmed&db=pubmed&id=123456&cmd=neighbor&api_key=YOUR_API_KEY
```
**Common Link Commands**:
- `neighbor` - Return related records
- `neighbor_history` - Post related records to history server
- `prlinks` - Return provider URLs
- `llinks` - Return LinkOut URLs
### 6. EInfo - Database Information
**Endpoint**: `einfo.fcgi`
**Purpose**: Get information about available Entrez databases or specific database fields
**Parameters**:
- `db` - Database name (optional; omit to list all databases)
- `retmode` - Return format (xml, json)
**Example Request**:
```
einfo.fcgi?db=pubmed&retmode=json&api_key=YOUR_API_KEY
```
**Returns**:
- Database description
- Record count
- Last update date
- Available search fields with descriptions
### 7. EGQuery - Global Query
**Endpoint**: `egquery.fcgi`
**Purpose**: Search term counts across all Entrez databases
**Required Parameters**:
- `term` - Search query
**Example Request**:
```
egquery.fcgi?term=cancer&api_key=YOUR_API_KEY
```
### 8. ESpell - Spelling Suggestions
**Endpoint**: `espell.fcgi`
**Purpose**: Get spelling suggestions for queries
**Required Parameters**:
- `db` - Database name
- `term` - Search term with potential misspelling
**Example Request**:
```
espell.fcgi?db=pubmed&term=cancre&api_key=YOUR_API_KEY
```
### 9. ECitMatch - Citation Matching
**Endpoint**: `ecitmatch.cgi`
**Purpose**: Search PubMed citations using journal, year, volume, page, author information
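**Request Format**: GET or POST request with `db=pubmed`, `retmode=xml`, and the citation strings in the `bdata` parameter, separated by carriage returns (`%0D` when URL-encoded)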
**Citation String Format**:
```
journal|year|volume|page|author|key|
```
**Example**:
```
proc natl acad sci u s a|1991|88|3248|mann bj|Art1|
science|1987|235|182|palmiter rd|Art2|
```
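Reading the ECitMatch result is mostly string splitting. The sketch below assumes the service echoes each citation string with the matched PMID appended as a final pipe-delimited field, and that anything non-numeric in that position means no unique match; both points are assumptions to verify against a live response. `parse_ecitmatch` is a hypothetical helper and the sample text is illustrative.

```python
def parse_ecitmatch(text):
    """Map each citation key to its PMID, or None when no PMID was returned."""
    matches = {}
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 7:
            continue  # not a complete journal|year|volume|page|author|key|pmid line
        key, pmid = fields[5], fields[6].strip()
        matches[key] = pmid if pmid.isdigit() else None
    return matches


# Illustrative response text in the assumed shape.
sample = (
    "science|1987|235|182|palmiter rd|Art2|3026048\n"
    "unknown journal|2000|1|1|nobody x|Art3|NOT_FOUND\n"
)
print(parse_ecitmatch(sample))
```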
**Rate Limit**: 3 requests per second with User-Agent header required
## Best Practices
### Use History Server for Large Result Sets
For queries returning more than 500 records, use the history server:
1. **Initial Search with History**:
```
esearch.fcgi?db=pubmed&term=cancer&usehistory=y&retmode=json&api_key=YOUR_API_KEY
```
2. **Retrieve Records in Batches**:
```
efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_12345&retstart=0&retmax=500&retmode=xml&api_key=YOUR_API_KEY
efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_12345&retstart=500&retmax=500&retmode=xml&api_key=YOUR_API_KEY
```
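The two-step pattern above generalizes to a loop over `retstart` offsets. A minimal sketch: `batch_windows` and `history_fetch_params` are hypothetical helper names, and the batch size of 500 simply mirrors the example requests.

```python
def batch_windows(total, batch_size=500):
    """Yield (retstart, retmax) pairs that cover `total` records."""
    for retstart in range(0, total, batch_size):
        yield retstart, min(batch_size, total - retstart)


def history_fetch_params(webenv, query_key, retstart, retmax):
    """Build EFetch parameters for one batch of a history-server result set."""
    return {
        "db": "pubmed",
        "query_key": query_key,
        "WebEnv": webenv,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "xml",
    }


# 1200 is a stand-in for the Count returned by the initial ESearch call.
for retstart, retmax in batch_windows(1200):
    print(history_fetch_params("MCID_12345", "1", retstart, retmax))
```

The last window is trimmed to the remaining records, so the loop never requests past the end of the result set.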
### Batch Operations
Use EPost to upload large lists of UIDs before fetching:
```
# Step 1: Post UIDs
epost.fcgi?db=pubmed&id=123,456,789,...&api_key=YOUR_API_KEY
# Step 2: Fetch using WebEnv/query_key
efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_12345&retmode=xml&api_key=YOUR_API_KEY
```
### Error Handling
Common HTTP status codes:
- `200` - Success
- `400` - Bad request (check parameters)
- `414` - URI too long (use POST or history server)
- `429` - Rate limit exceeded
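A `429` response is usually handled by waiting and retrying with growing delays. The following is a sketch of exponential backoff under stated assumptions: `backoff_delays` and `call_with_retry` are hypothetical helpers, `send` is any zero-argument callable returning an object with a `status_code` attribute (as a `requests` response has), and the delay values are illustrative rather than NCBI recommendations.

```python
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield delays of base, 2*base, 4*base, ... seconds, capped at `cap`."""
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt)


def call_with_retry(send, max_retries=5, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it returns HTTP 429."""
    for delay in backoff_delays(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        sleep(delay)
    return send()  # final attempt; the caller decides what to do if it still fails
```

Only `429` is retried here. A `400` or `414` indicates a problem with the request itself, so repeating it unchanged would not help.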
### Caching
Implement local caching to:
- Reduce redundant API calls
- Stay within rate limits
- Improve response times
- Respect NCBI resources
## Response Formats
### XML (Default)
Most detailed format with full structured data. Each database has its own DTD (Document Type Definition).
### JSON
Available for most utilities with `retmode=json`. Easier to parse in modern applications.
### Text
Plain text format, useful for abstracts and simple data retrieval.
## Support and Resources
- **API Documentation**: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **Mailing List**: utilities-announce@ncbi.nlm.nih.gov
- **Support**: eutilities@ncbi.nlm.nih.gov
- **NLM Help Desk**: 1-888-FIND-NLM (1-888-346-3656)

@@ -0,0 +1,453 @@
# Common PubMed Query Patterns
This reference provides practical examples of common PubMed search patterns for various research scenarios.
## General Research Queries
### Finding Recent Research on a Topic
```
breast cancer[tiab] AND 2023:2024[dp]
```
### Systematic Reviews on a Topic
```
(diabetes[tiab] OR diabetes mellitus[mh]) AND systematic review[pt]
```
### Meta-Analyses
```
hypertension[tiab] AND meta-analysis[pt] AND 2020:2024[dp]
```
### Clinical Trials
```
alzheimer disease[mh] AND randomized controlled trial[pt]
```
### Finding Guidelines
```
asthma[tiab] AND (guideline[pt] OR practice guideline[pt])
```
## Disease-Specific Queries
### Cancer Research
```
# General cancer screening
cancer screening[tiab] AND systematic review[pt] AND 2020:2024[dp]
# Specific cancer type with treatment
lung cancer[tiab] AND immunotherapy[tiab] AND clinical trial[pt]
# Cancer genetics
breast neoplasms[mh] AND BRCA1[tiab] AND genetic testing[tiab]
```
### Cardiovascular Disease
```
# Heart disease prevention
(heart disease[tiab] OR cardiovascular disease[mh]) AND prevention[tiab] AND 2022:2024[dp]
# Stroke treatment
stroke[mh] AND (thrombectomy[tiab] OR thrombolysis[tiab]) AND randomized controlled trial[pt]
# Hypertension management
hypertension/drug therapy[mh] AND comparative effectiveness[tiab]
```
### Infectious Diseases
```
# COVID-19 research
COVID-19[tiab] AND (vaccine[tiab] OR vaccination[tiab]) AND 2023:2024[dp]
# Antibiotic resistance
(antibiotic resistance[tiab] OR drug resistance, bacterial[mh]) AND systematic review[pt]
# Tuberculosis treatment
tuberculosis/drug therapy[mh] AND (multidrug-resistant[tiab] OR MDR-TB[tiab])
```
### Neurological Disorders
```
# Alzheimer's disease
alzheimer disease[mh] AND (diagnosis[sh] OR biomarkers[tiab]) AND 2020:2024[dp]
# Parkinson's disease treatment
parkinson disease[mh] AND treatment[tiab] AND clinical trial[pt]
# Multiple sclerosis
multiple sclerosis[mh] AND disease modifying[tiab] AND review[pt]
```
### Diabetes
```
# Type 2 diabetes management
diabetes mellitus, type 2[mh] AND (lifestyle[tiab] OR diet[tiab]) AND randomized controlled trial[pt]
# Diabetes complications
diabetes mellitus[mh] AND (complications[sh] OR diabetic neuropathy[mh])
# New diabetes drugs
diabetes mellitus, type 2[mh] AND (GLP-1[tiab] OR SGLT2[tiab]) AND 2022:2024[dp]
```
## Drug and Treatment Research
### Drug Efficacy Studies
```
# Compare two drugs
(drug A[nm] OR drug B[nm]) AND condition[mh] AND comparative effectiveness[tiab]
# Drug side effects
medication name[nm] AND (adverse effects[sh] OR side effects[tiab])
# Drug combination therapy
(aspirin[nm] AND clopidogrel[nm]) AND acute coronary syndrome[mh]
```
### Treatment Comparisons
```
# Surgery vs medication
condition[mh] AND (surgery[tiab] OR surgical[tiab]) AND (medication[tiab] OR drug therapy[sh]) AND comparative study[pt]
# Different surgical approaches
procedure[tiab] AND (laparoscopic[tiab] OR open surgery[tiab]) AND outcomes[tiab]
```
### Alternative Medicine
```
# Herbal supplements
(herbal medicine[mh] OR phytotherapy[mh]) AND condition[tiab] AND clinical trial[pt]
# Acupuncture
acupuncture[mh] AND pain[tiab] AND randomized controlled trial[pt]
```
## Diagnostic Research
### Diagnostic Tests
```
# Sensitivity and specificity
test name[tiab] AND condition[tiab] AND (sensitivity[tiab] AND specificity[tiab])
# Diagnostic imaging
(MRI[tiab] OR magnetic resonance imaging[tiab]) AND brain tumor[tiab] AND diagnosis[sh]
# Lab test evaluation
biomarker name[tiab] AND disease[tiab] AND (diagnostic[tiab] OR screening[tiab])
```
### Screening Programs
```
# Cancer screening
cancer type[tiab] AND screening[tiab] AND (cost effectiveness[tiab] OR benefit[tiab])
# Population screening
condition[tiab] AND mass screening[mh] AND public health[tiab]
```
## Population-Specific Queries
### Pediatric Research
```
# Children with specific condition
condition[tiab] AND (child[mh] OR pediatric[tiab]) AND treatment[tiab]
# Age-specific
disease[tiab] AND (infant[mh] OR child, preschool[mh])
# Pediatric dosing
drug name[nm] AND pediatric[tiab] AND (dosing[tiab] OR dose[tiab])
```
### Geriatric Research
```
# Elderly population
condition[tiab] AND (aged[mh] OR elderly[tiab] OR geriatric[tiab])
# Aging and disease
aging[mh] AND disease[tiab] AND mechanism[tiab]
# Polypharmacy
polypharmacy[tiab] AND elderly[tiab] AND adverse effects[tiab]
```
### Pregnant Women
```
# Pregnancy and medications
drug name[nm] AND (pregnancy[mh] OR pregnant women[tiab]) AND safety[tiab]
# Pregnancy complications
pregnancy complication[tiab] AND management[tiab]
```
### Sex-Specific Research
```
# Female-specific
condition[tiab] AND female[mh] AND hormones[tiab]
# Male-specific
disease[tiab] AND male[mh] AND risk factors[tiab]
# Sex differences
condition[tiab] AND (sex factors[mh] OR gender differences[tiab])
```
## Epidemiology and Public Health
### Prevalence Studies
```
disease[tiab] AND (prevalence[tiab] OR epidemiology[sh]) AND country/region[tiab]
```
### Incidence Studies
```
condition[tiab] AND incidence[tiab] AND population[tiab] AND 2020:2024[dp]
```
### Risk Factors
```
disease[mh] AND (risk factors[mh] OR etiology[sh]) AND cohort study[tiab]
```
### Global Health
```
disease[tiab] AND (developing countries[mh] OR low income[tiab]) AND burden[tiab]
```
### Health Disparities
```
condition[tiab] AND (health disparities[tiab] OR health equity[tiab]) AND minority groups[tiab]
```
## Methodology-Specific Queries
### Research Methodology
#### Cohort Studies
```
condition[tiab] AND cohort study[tiab] AND prospective[tiab]
```
#### Case-Control Studies
```
disease[tiab] AND case-control studies[mh] AND risk factors[tiab]
```
#### Cross-Sectional Studies
```
condition[tiab] AND cross-sectional studies[mh] AND prevalence[tiab]
```
### Statistical Methods
```
# Machine learning in medicine
(machine learning[tiab] OR artificial intelligence[tiab]) AND diagnosis[tiab] AND validation[tiab]
# Bayesian analysis
condition[tiab] AND bayes theorem[mh] AND clinical decision[tiab]
```
### Genetic and Molecular Research
```
# GWAS studies
disease[tiab] AND (genome-wide association study[tiab] OR GWAS[tiab])
# Gene expression
gene name[tiab] AND (gene expression[mh] OR mRNA[tiab]) AND disease[tiab]
# Proteomics
condition[tiab] AND proteomics[mh] AND biomarkers[tiab]
# CRISPR research
CRISPR[tiab] AND (gene editing[tiab] OR genome editing[tiab]) AND 2020:2024[dp]
```
## Author and Institution Queries
### Finding Work by Specific Author
```
# Single author
smith ja[au] AND cancer[tiab] AND 2023:2024[dp]
# First author only
jones m[1au] AND cardiology[tiab]
# Multiple authors from same group
(smith ja[au] OR jones m[au] OR wilson k[au]) AND research topic[tiab]
```
### Institution-Specific Research
```
# University affiliation
harvard[affil] AND cancer research[tiab] AND 2023:2024[dp]
# Hospital research
"mayo clinic"[affil] AND clinical trial[pt]
# Country-specific
japan[affil] AND robotics[tiab] AND surgery[tiab]
```
## Journal-Specific Queries
### High-Impact Journals
```
# Specific journal
nature[ta] AND genetics[tiab] AND 2024[dp]
# Multiple journals
(nature[ta] OR science[ta] OR cell[ta]) AND immunology[tiab]
# Journal with ISSN
0028-4793[issn] AND clinical trial[pt]
```
## Citation and Reference Queries
### Finding Specific Articles
```
# By PMID
12345678[pmid]
# By DOI
10.1056/NEJMoa123456[doi]
# By first author and year
smith ja[1au] AND 2023[dp] AND cancer[tiab]
```
### Finding Cited Work
```
# Related articles: use the Similar Articles feature on any PubMed result
# Citing articles: use the "Cited by" links when available
```
## Advanced Combination Queries
### Comprehensive Literature Review
```
(disease name[tiab] OR disease name[mh]) AND
((treatment[tiab] OR therapy[tiab] OR management[tiab]) OR
(diagnosis[tiab] OR screening[tiab]) OR
(epidemiology[tiab] OR prevalence[tiab])) AND
(systematic review[pt] OR meta-analysis[pt] OR review[pt]) AND
2019:2024[dp] AND english[la]
```
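A multi-line query like the one above has to be sent as a single `term` value when it is submitted through the E-utilities ESearch endpoint. The following is a minimal sketch, assuming a hypothetical helper `build_esearch_url` and an illustrative asthma query; it only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=20):
    """Collapse a multi-line query to one line and build an ESearch URL."""
    # Newlines and runs of spaces become single spaces
    term = " ".join(query.split())
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Illustrative query only; substitute your own topic
query = """
(asthma[tiab] OR asthma[mh]) AND
(systematic review[pt] OR meta-analysis[pt]) AND
2019:2024[dp] AND english[la]
"""
url = build_esearch_url(query)
print(url)
```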
### Precision Medicine Query
```
(precision medicine[tiab] OR personalized medicine[tiab] OR pharmacogenomics[mh]) AND
cancer[tiab] AND
(biomarkers[tiab] OR genetic testing[tiab]) AND
clinical application[tiab] AND
2020:2024[dp]
```
### Translational Research
```
(basic science[tiab] OR bench to bedside[tiab] OR translational medical research[mh]) AND
disease[tiab] AND
(clinical trial[pt] OR clinical application[tiab]) AND
2020:2024[dp]
```
## Quality Filters
### High-Quality Evidence
```
condition[tiab] AND
(randomized controlled trial[pt] OR systematic review[pt] OR meta-analysis[pt]) AND
humans[mh] AND
english[la] AND
2020:2024[dp]
```
### Free Full Text Articles
```
topic[tiab] AND free full text[sb] AND 2023:2024[dp]
```
### Articles with Abstracts
```
condition[tiab] AND hasabstract[text] AND review[pt]
```
## Staying Current
### Latest Publications
```
topic[tiab] AND 2024[dp] AND english[la]
```
### Preprints and Early Access
```
topic[tiab] AND (preprint[pt] OR pubstatusaheadofprint OR publisher[sb])
```
### Setting Up Alerts
```
# Create search and save to My NCBI
# Enable email alerts for new matching articles
topic[tiab] AND (randomized controlled trial[pt] OR systematic review[pt])
```
## COVID-19 Specific Queries
### Vaccine Research
```
(COVID-19[tiab] OR SARS-CoV-2[tiab]) AND
(vaccine[tiab] OR vaccination[tiab]) AND
(efficacy[tiab] OR effectiveness[tiab]) AND
2023:2024[dp]
```
### Long COVID
```
(long covid[tiab] OR post-acute covid[tiab] OR PASC[tiab]) AND
(symptoms[tiab] OR treatment[tiab])
```
### COVID Treatment
```
COVID-19[tiab] AND
(antiviral[tiab] OR monoclonal antibody[tiab] OR treatment[tiab]) AND
randomized controlled trial[pt]
```
## Tips for Constructing Queries
### 1. PICO Framework
Use PICO (Population, Intervention, Comparison, Outcome) to structure clinical queries:
```
P: diabetes mellitus, type 2[mh]
I: metformin[nm]
C: lifestyle modification[tiab]
O: glycemic control[tiab]
Query: diabetes mellitus, type 2[mh] AND (metformin[nm] OR lifestyle modification[tiab]) AND glycemic control[tiab]
```
### 2. Iterative Refinement
Start broad, review results, refine:
```
1. diabetes → too broad
2. diabetes mellitus type 2 → better
3. diabetes mellitus, type 2[mh] AND metformin[nm] → more specific
4. diabetes mellitus, type 2[mh] AND metformin[nm] AND randomized controlled trial[pt] → focused
```
### 3. Use Search History
Combine previous searches in Advanced Search:
```
#1: diabetes mellitus, type 2[mh]
#2: cardiovascular disease[mh]
#3: #1 AND #2 AND risk factors[tiab]
```
### 4. Save Effective Searches
Create a My NCBI account to save successful queries for future use and to set up automatic alerts.

@@ -0,0 +1,436 @@
# PubMed Search Syntax and Field Tags
## Boolean Operators
PubMed supports standard Boolean operators to combine search terms:
### AND
Retrieves results containing all search terms. PubMed automatically applies AND between separate concepts.
**Example**:
```
diabetes AND hypertension
```
### OR
Retrieves results containing at least one of the search terms. Useful for synonyms or related concepts.
**Example**:
```
heart attack OR myocardial infarction
```
### NOT
Excludes results containing the specified term. Use cautiously as it may eliminate relevant results.
**Example**:
```
cancer NOT lung
```
**Precedence**: Operations are processed left to right. Use parentheses to control evaluation order:
```
(heart attack OR myocardial infarction) AND treatment
```
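To see why the parentheses matter, left-to-right evaluation can be modelled with Python sets standing in for result lists. This is a toy illustration: the integers are made-up PMIDs and `evaluate_left_to_right` is a hypothetical helper, not part of any PubMed tooling.

```python
# Toy result sets keyed by search term; the integers are made-up PMIDs.
results = {
    "heart attack": {1, 2},
    "myocardial infarction": {2, 3},
    "treatment": {3, 4},
}


def evaluate_left_to_right(tokens):
    """Evaluate [term, op, term, op, term, ...] strictly left to right."""
    current = set(results[tokens[0]])
    for op, term in zip(tokens[1::2], tokens[2::2]):
        other = results[term]
        if op == "AND":
            current &= other
        elif op == "OR":
            current |= other
        elif op == "NOT":
            current -= other
        else:
            raise ValueError(f"unknown operator: {op}")
    return current


# treatment AND heart attack OR myocardial infarction
# is read as (treatment AND heart attack) OR myocardial infarction
loose = evaluate_left_to_right(
    ["treatment", "AND", "heart attack", "OR", "myocardial infarction"]
)

# treatment AND (heart attack OR myocardial infarction)
grouped = results["treatment"] & (
    results["heart attack"] | results["myocardial infarction"]
)

print(sorted(loose), sorted(grouped))
```

Without parentheses the OR is applied last, so every record for the final synonym is returned whether or not it mentions treatment.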
## Phrase Searching
### Double Quotes
Enclose exact phrases in double quotes to search for terms in specific order:
```
"kidney allograft"
"machine learning"
"systematic review"
```
### Field Tags
Alternative method using field tags:
```
kidney allograft[Title]
```
## Wildcards
Use an asterisk (*) to substitute for zero or more characters:
**Rules**:
- Minimum 4 characters before first wildcard
- Matches word variations and plurals
**Examples**:
```
vaccin* → matches vaccine, vaccination, vaccines, vaccinate
pediatr* → matches pediatric, pediatrics, pediatrician
colo*r → matches color, colour
```
**Limitations**:
- Cannot use at beginning of search term
- May retrieve unexpected variations
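The wildcard rules above can be checked before a query is sent. This is a minimal sketch with a hypothetical helper, `check_wildcard_term`; it encodes only the two rules stated in this section (no leading wildcard, at least four characters before the first wildcard) and nothing else about PubMed's parser.

```python
def check_wildcard_term(term):
    """Return a list of problems with a truncated term, per the rules above."""
    problems = []
    if "*" not in term:
        return problems  # nothing to check
    if term.startswith("*"):
        problems.append("wildcard cannot start the term")
    prefix = term.split("*", 1)[0]
    if len(prefix) < 4:
        problems.append("need at least 4 characters before the first wildcard")
    return problems


for candidate in ["vaccin*", "colo*r", "flu*", "*itis"]:
    print(candidate, check_wildcard_term(candidate))
```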
## Proximity Searching
Search for terms within a specified distance from each other. Only available in Title, Title/Abstract, and Affiliation fields.
**Syntax**: `"search terms"[field:~N]`
- N = maximum number of words between terms
**Examples**:
```
"vitamin C"[Title:~3] → vitamin within 3 words of C in title
"breast cancer screening"[TIAB:~5] → terms within 5 words in title/abstract
```
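Proximity expressions are easy to mistype, so it can help to generate them. The sketch below assumes a hypothetical helper, `proximity_query`, and an assumed set of accepted field names covering the three fields listed above (Title, Title/Abstract, Affiliation) together with their short tags.

```python
# Assumed spellings for the three fields that support proximity searching
PROXIMITY_FIELDS = {"ti", "title", "tiab", "title/abstract", "ad", "affiliation"}


def proximity_query(terms, field, distance):
    """Build a "terms"[field:~N] proximity expression."""
    if field.lower() not in PROXIMITY_FIELDS:
        raise ValueError(f"proximity search is not available for field: {field}")
    if distance < 0:
        raise ValueError("distance must be zero or greater")
    if len(terms.split()) < 2:
        raise ValueError("proximity search needs at least two terms")
    return f'"{terms}"[{field}:~{distance}]'


print(proximity_query("vitamin C", "Title", 3))
print(proximity_query("breast cancer screening", "TIAB", 5))
```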
## Search Field Tags
Field tags limit searches to specific parts of PubMed records. Format: `term[tag]`
### Author Searching
| Tag | Field | Example |
|-----|-------|---------|
| [au] | Author | smith j[au] |
| [1au] | First Author | jones m[1au] |
| [lastau] | Last Author | wilson k[lastau] |
| [fau] | Full Author Name | smith john a[fau] |
**Author Search Notes**:
- Full author names searchable from 2002 forward
- Format: last name + initials (e.g., `smith ja[au]`)
- Can search without field tag, but [au] ensures accuracy
**Corporate Authors**:
Search organizations as authors:
```
world health organization[au]
```
### Title and Abstract
| Tag | Field | Example |
|-----|-------|---------|
| [ti] | Title | diabetes[ti] |
| [ab] | Abstract | treatment[ab] |
| [tiab] | Title/Abstract | cancer screening[tiab] |
| [tw] | Text Word | cardiovascular[tw] |
**Notes**:
- [tw] searches title, abstract, and other text fields
- [tiab] is most commonly used for comprehensive searching
### Journal Information
| Tag | Field | Example |
|-----|-------|---------|
| [ta] | Journal Title Abbreviation | Science[ta] |
| [jour] | Journal | New England Journal of Medicine[jour] |
| [ta] | ISSN (searched through the journal field) | 0028-4793[ta] |
### Date Fields
| Tag | Field | Format | Example |
|-----|-------|--------|---------|
| [dp] | Publication Date | YYYY/MM/DD | 2023[dp] |
| [edat] | Entrez Date | YYYY/MM/DD | 2023/01/15[edat] |
| [crdt] | Create Date | YYYY/MM/DD | 2023[crdt] |
| [mhda] | MeSH Date | YYYY/MM/DD | 2023[mhda] |
**Date Ranges**:
Use colon to specify ranges:
```
2020:2023[dp] → publications from 2020 to 2023
2023/01/01:2023/06/30[dp] → first half of 2023
```
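Date ranges can be generated from `datetime.date` values so the `YYYY/MM/DD` format is always consistent. This is a small sketch; the helper name `date_range` is hypothetical, and the default tag `dp` follows the table above.

```python
from datetime import date


def date_range(start, end, tag="dp"):
    """Format a start:end date range with the given date field tag."""
    if end < start:
        raise ValueError("end date is before start date")
    return f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[{tag}]"


first_half = date_range(date(2023, 1, 1), date(2023, 6, 30))
print(first_half)
```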
**Relative Dates**:
PubMed filters provide common ranges:
- Last 1 year
- Last 5 years
- Last 10 years
- Custom date range
### MeSH and Subject Headings
| Tag | Field | Example |
|-----|-------|---------|
| [mh] | MeSH Terms | diabetes mellitus[mh] |
| [majr] | MeSH Major Topic | hypertension[majr] |
| [mesh] | MeSH Terms | cancer[mesh] |
| [sh] | MeSH Subheading | therapy[sh] |
**MeSH Searching**:
- Medical Subject Headings provide controlled vocabulary
- [mh] includes narrower terms automatically
- [majr] limits to articles where topic is main focus
- Combine with subheadings: `diabetes mellitus/therapy[mh]`
**Common MeSH Subheadings**:
- /diagnosis
- /drug therapy
- /epidemiology
- /etiology
- /prevention & control
- /therapy
### Publication Types
| Tag | Field | Example |
|-----|-------|---------|
| [pt] | Publication Type | clinical trial[pt] |
| [ptyp] | Publication Type | review[ptyp] |
**Common Publication Types**:
- Clinical Trial
- Meta-Analysis
- Randomized Controlled Trial
- Review
- Systematic Review
- Case Reports
- Letter
- Editorial
- Guideline
**Example**:
```
cancer AND systematic review[pt]
```
### Other Useful Fields
| Tag | Field | Example |
|-----|-------|---------|
| [la] | Language | english[la] |
| [affil] | Affiliation | harvard[affil] |
| [pmid] | PubMed ID | 12345678[pmid] |
| [pmc] | PMC ID | PMC123456[pmc] |
| [doi] | DOI | 10.1234/example[doi] |
| [gr] | Grant Number | R01CA123456[gr] |
| [isbn] | ISBN | 9780123456789[isbn] |
| [pg] | Pagination | 123-145[pg] |
| [vi] | Volume | 45[vi] |
| [ip] | Issue | 3[ip] |
### Supplemental Concepts
| Tag | Field | Example |
|-----|-------|---------|
| [nm] | Substance Name | aspirin[nm] |
| [ps] | Personal Name as Subject | darwin charles[ps] |
## Automatic Term Mapping (ATM)
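When you search without field tags, PubMed checks each term or phrase against the following sources, in order, and stops at the first match:
1. **Subject translation table**, which includes MeSH terms
2. **Journals translation table** for journal names
3. **Author index** for author names
4. **Investigator (collaborator) index**
Terms that match none of these are searched in All Fields. PubMed searches citation data (title, abstract, indexing), not the full text of articles.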
**Bypass ATM**:
- Use double quotes: `"breast cancer"`
- Use field tags: `breast cancer[tiab]`
**View Translation**:
Use Advanced Search to see how PubMed translated your query in the Search Details box.
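The translated query is also reported by the E-utilities ESearch endpoint when `retmode=json` is requested, in the `querytranslation` field of `esearchresult`. The sketch below parses that field; the sample response is hand-written and abridged for illustration, not real PubMed output, and `query_translation` is a hypothetical helper.

```python
import json


def query_translation(esearch_json_text):
    """Return the translated query from an ESearch JSON response, or ''."""
    payload = json.loads(esearch_json_text)
    return payload.get("esearchresult", {}).get("querytranslation", "")


# Hand-written, abridged stand-in for a response body (not real PubMed output)
sample = json.dumps({
    "esearchresult": {
        "count": "0",
        "idlist": [],
        "querytranslation": '"breast neoplasms"[MeSH Terms] OR "breast cancer"[All Fields]',
    }
})

print(query_translation(sample))
```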
## Filters and Limits
### Article Types
- Clinical Trial
- Meta-Analysis
- Randomized Controlled Trial
- Review
- Systematic Review
### Text Availability
- Free full text
- Full text
- Abstract
### Publication Date
- Last 1 year
- Last 5 years
- Last 10 years
- Custom date range
### Species
- Humans
- Animals (specific species available)
### Sex
- Female
- Male
### Age Groups
- Child (0-18 years)
- Infant (birth-23 months)
- Child, Preschool (2-5 years)
- Child (6-12 years)
- Adolescent (13-18 years)
- Adult (19+ years)
- Aged (65+ years)
- 80 and over
### Languages
- English
- Spanish
- French
- German
- Chinese
- And many others
### Other Filters
- Journal categories
- Subject area
- Article attributes (e.g., has abstract, free PMC article)
## Advanced Search Strategies
### Clinical Queries
PubMed provides specialized filters for clinical research:
**Study Categories**:
- Therapy (narrow/broad)
- Diagnosis (narrow/broad)
- Etiology (narrow/broad)
- Prognosis (narrow/broad)
- Clinical prediction guides
**Medical Genetics**:
- Diagnosis
- Differential diagnosis
- Clinical description
- Management
- Genetic counseling
### Hedges and Filters
Pre-built search strategies for specific purposes:
- Systematic review filters
- Quality filters for study types
- Geographic filters
### Combining Searches
Use Advanced Search to combine previous queries:
```
#1 AND #2
#3 OR #4
#5 NOT #6
```
### Search History
- Saves up to 100 searches
- Expires after 8 hours of inactivity
- Access via Advanced Search page
- Combine using # references
## Best Practices
### 1. Start Broad, Then Narrow
Begin with general terms and add specificity:
```
diabetes → too broad
diabetes mellitus type 2 → better
diabetes mellitus, type 2[mh] AND treatment[tiab] → more specific
```
### 2. Use Synonyms with OR
Include alternative terms:
```
heart attack OR myocardial infarction OR MI
```
### 3. Combine Concepts with AND
Link different aspects of your research question:
```
(heart attack OR myocardial infarction) AND (aspirin OR acetylsalicylic acid) AND prevention
```
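The pattern above (OR within a concept, AND between concepts) is mechanical enough to generate. A minimal sketch, assuming two hypothetical helpers, `or_block` and `combine_concepts`:

```python
def or_block(terms):
    """Join synonyms with OR, parenthesised when there is more than one."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("a concept needs at least one term")
    if len(cleaned) == 1:
        return cleaned[0]
    return "(" + " OR ".join(cleaned) + ")"


def combine_concepts(concepts):
    """AND together one OR-block per concept."""
    return " AND ".join(or_block(terms) for terms in concepts)


query = combine_concepts([
    ["heart attack", "myocardial infarction"],
    ["aspirin", "acetylsalicylic acid"],
    ["prevention"],
])
print(query)
```

Parenthesising every multi-term block means the result does not depend on PubMed's left-to-right evaluation order.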
### 4. Leverage MeSH Terms
Use MeSH for consistent indexing:
```
diabetes mellitus[mh] AND hypertension[mh]
```
### 5. Use Filters Strategically
Apply filters to refine results:
- Publication date for recent research
- Article type for specific study designs
- Free full text for accessible articles
### 6. Review Search Details
Check how PubMed interpreted your search in Advanced Search to ensure accuracy.
### 7. Save Effective Searches
Create a My NCBI account to:
- Save searches
- Set up email alerts
- Create collections
## Common Search Patterns
### Systematic Review Search
```
(breast cancer[tiab] OR breast neoplasms[mh]) AND (screening[tiab] OR early detection[tiab]) AND systematic review[pt]
```
### Clinical Trial Search
```
diabetes mellitus, type 2[mh] AND metformin[nm] AND randomized controlled trial[pt] AND 2020:2024[dp]
```
### Recent Research by Author
```
smith ja[au] AND cancer[tiab] AND 2023:2024[dp] AND english[la]
```
### Drug Treatment Studies
```
hypertension[mh] AND (amlodipine[nm] OR losartan[nm]) AND drug therapy[sh] AND humans[mh]
```
### Geographic-Specific Research
```
malaria[tiab] AND (africa[affil] OR african[tiab]) AND 2020:2024[dp]
```
## Special Characters
| Character | Purpose | Example |
|-----------|---------|---------|
| * | Wildcard | colo*r |
| " " | Phrase search | "breast cancer" |
| ( ) | Group terms | (A OR B) AND C |
| : | Range | 2020:2023[dp] |
| - | Hyphenated terms | COVID-19 |
| / | MeSH subheading | diabetes/therapy[mh] |
## Troubleshooting
### Too Many Results
- Add more specific terms
- Use field tags to limit search scope
- Apply date restrictions
- Use filters for article type
- Add additional concepts with AND
### Too Few Results
- Remove restrictive terms
- Use OR to add synonyms
- Check spelling and terminology
- Remove field tags for broader search
- Expand date range
- Remove filters
### No Results
- Check spelling using ESpell
- Try alternative terminology
- Remove field tags
- Verify correct database (PubMed vs. PMC)
- Broaden search terms
### Unexpected Results
- Review Search Details to see query translation
- Use field tags to prevent automatic term mapping
- Check for common synonyms that may be included
- Refine with additional limiting terms

@@ -1,557 +0,0 @@
---
name: pubchem-database
description: Access chemical compound data from PubChem, the world's largest free chemical database. This skill should be used when retrieving compound properties, searching for chemicals by name/SMILES/InChI, performing similarity or substructure searches, accessing bioactivity data, converting between chemical formats, or generating chemical structure images. Works with over 110 million compounds and 270 million bioactivities through PUG-REST API and PubChemPy library.
---
# PubChem Database
## Overview
PubChem is the world's largest freely available chemical database maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivities from more than 770 data sources. This skill provides guidance for programmatically accessing PubChem data using the PUG-REST API and PubChemPy Python library.
## Core Capabilities
### 1. Chemical Structure Search
Search for compounds using multiple identifier types:
**By Chemical Name**:
```python
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```
**By CID (Compound ID)**:
```python
compound = pcp.Compound.from_cid(2244) # Aspirin
```
**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```
**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
```
**By Molecular Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```
### 2. Property Retrieval
Retrieve molecular properties for compounds using either high-level or low-level approaches:
**Using PubChemPy (Recommended)**:
```python
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
```
**Get Specific Properties**:
```python
# Request only specific properties
properties = pcp.get_properties(
['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
'aspirin',
'name'
)
# Returns list of dictionaries
```
**Batch Property Retrieval**:
```python
import pandas as pd
import pubchempy as pcp
compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)
df = pd.DataFrame(all_properties)
```
**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).
### 3. Similarity Search
Find structurally similar compounds using Tanimoto similarity:
```python
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```
**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
### 4. Substructure Search
Find compounds containing a specific structural motif:
```python
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
pyridine_smiles,
'smiles',
searchtype='substructure',
MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
```
**Common Substructures**:
- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`
### 5. Format Conversion
Convert between different chemical structure formats:
```python
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
```
### 6. Structure Visualization
Generate 2D structure images:
```python
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # Fail loudly instead of writing an error page to disk
with open('structure.png', 'wb') as f:
    f.write(response.content)
```
### 7. Synonym Retrieval
Get all known names and synonyms for a compound:
```python
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```
### 8. Bioactivity Data Access
Retrieve biological activity data from assays:
```python
import requests
import json
# Get bioassay summary for a compound
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```
**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script which provides:
- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays
### 9. Comprehensive Compound Annotations
Access detailed compound information through PUG-View:
```python
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```
**Get Specific Section**:
```python
# Get only drug information
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug and Medication Information"
```
## Installation Requirements
Install PubChemPy for Python-based access:
```bash
pip install pubchempy
```
For direct API access and bioactivity queries:
```bash
pip install requests
```
Optional for data analysis:
```bash
pip install pandas
```
## Helper Scripts
This skill includes Python scripts for common PubChem tasks:
### scripts/compound_search.py
Provides utility functions for searching and retrieving compound information:
**Key Functions**:
- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information
**Usage**:
```python
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```
### scripts/bioactivity_query.py
Provides functions for retrieving biological activity data:
**Key Functions**:
- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target
**Usage**:
```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```
## API Rate Limits and Best Practices
**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute
**Best Practices**:
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts
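The delay recommended above can be enforced with a small throttle. This is a sketch under stated assumptions: the `Throttle` class is hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting. A 0.25 second interval gives at most 4 requests per second, which stays under both the per-second and per-minute limits listed.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.25)
for cid in (2244, 3672):
    throttle.wait()  # call this immediately before each request
    print(f"would request CID {cid} now")
```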
**Error Handling**:
```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError
try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    print("No results returned")
```
## Common Workflows
### Workflow 1: Chemical Identifier Conversion Pipeline
Convert between different chemical identifiers:
```python
import pubchempy as pcp
# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]
# Extract all identifier formats
identifiers = {
'CID': compound.cid,
'Name': compound.iupac_name,
'SMILES': compound.canonical_smiles,
'InChI': compound.inchi,
'InChIKey': compound.inchikey,
'Formula': compound.molecular_formula
}
```
### Workflow 2: Drug-Like Property Screening
Screen compounds using Lipinski's Rule of Five:
```python
import pubchempy as pcp
def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]
    # Lipinski's Rule of Five
    rules = {
        'MW <= 500': compound.molecular_weight <= 500,
        # None marks a missing XLogP; compare against None so a value of 0 still counts
        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations
rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```
### Workflow 3: Finding Similar Drug Candidates
Identify structurally similar compounds to a known drug:
```python
import pubchempy as pcp
# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles
# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)
# Filter by drug-like properties
candidates = []
for comp in similar:
    if comp.molecular_weight and 200 <= comp.molecular_weight <= 600:
        # Compare against None so that an XLogP of exactly 0 is not dropped
        if comp.xlogp is not None and -1 <= comp.xlogp <= 5:
            candidates.append(comp)
print(f"Found {len(candidates)} drug-like candidates")
```
### Workflow 4: Batch Compound Property Comparison
Compare properties across multiple compounds:
```python
import pubchempy as pcp
import pandas as pd
compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")
df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```
### Workflow 5: Substructure-Based Virtual Screening
Screen for compounds containing specific pharmacophores:
```python
import pubchempy as pcp
# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'
# Search for compounds containing this substructure
hits = pcp.get_compounds(
pharmacophore_smiles,
'smiles',
searchtype='substructure',
MaxRecords=100
)
# Further filter by properties
filtered_hits = [
comp for comp in hits
if comp.molecular_weight and comp.molecular_weight < 500
]
print(f"Found {len(filtered_hits)} compounds with desired substructure")
```
## Reference Documentation
For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:
- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation
## Troubleshooting
**Compound Not Found**:
- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format
**Timeout Errors**:
- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries
**Empty Property Values**:
- Not all properties are available for all compounds
- Check if property exists before accessing: `if compound.xlogp is not None:` (a plain truthiness test would also skip a legitimate value of 0)
- Some properties only available for certain compound types
**Rate Limit Exceeded**:
- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally
**Similarity/Substructure Search Hangs**:
- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out
## Additional Resources
- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy

@@ -1,440 +0,0 @@
# PubChem API Reference
## Overview
PubChem is the world's largest freely available chemical database maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivities from more than 770 data sources.
## Database Structure
PubChem consists of three primary subdatabases:
1. **Compound Database**: Unique validated chemical structures with computed properties
2. **Substance Database**: Deposited chemical substance records from data sources
3. **BioAssay Database**: Biological activity test results for chemical compounds
## PubChem PUG-REST API
### Base URL Structure
```
https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input>/<operation>/<output>
```
Components:
- `<input>`: compound/cid, substance/sid, assay/aid, or search specifications
- `<operation>`: Optional operations like property, synonyms, classification, etc.
- `<output>`: Format such as JSON, XML, CSV, PNG, SDF, etc.
### Common Request Patterns
#### 1. Retrieve by Identifier
Get compound by CID (Compound ID):
```
GET /rest/pug/compound/cid/{cid}/property/{properties}/JSON
```
Get compound by name:
```
GET /rest/pug/compound/name/{name}/property/{properties}/JSON
```
Get compound by SMILES:
```
GET /rest/pug/compound/smiles/{smiles}/property/{properties}/JSON
```
Get compound by InChI:
```
GET /rest/pug/compound/inchi/{inchi}/property/{properties}/JSON
```
#### 2. Available Properties
Common molecular properties that can be retrieved:
- `MolecularFormula`
- `MolecularWeight`
- `CanonicalSMILES`
- `IsomericSMILES`
- `InChI`
- `InChIKey`
- `IUPACName`
- `XLogP`
- `ExactMass`
- `MonoisotopicMass`
- `TPSA` (Topological Polar Surface Area)
- `Complexity`
- `Charge`
- `HBondDonorCount`
- `HBondAcceptorCount`
- `RotatableBondCount`
- `HeavyAtomCount`
- `IsotopeAtomCount`
- `AtomStereoCount`
- `BondStereoCount`
- `CovalentUnitCount`
- `Volume3D`
- `XStericQuadrupole3D`
- `YStericQuadrupole3D`
- `ZStericQuadrupole3D`
- `FeatureCount3D`
To retrieve multiple properties, separate them with commas:
```
/property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON
```
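The request patterns above can be assembled with a small helper. This is a sketch only: `property_url` is a hypothetical function, and it percent-encodes the identifier on the assumption that names may contain spaces or other reserved characters.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(namespace, identifier, properties, output="JSON"):
    """Build a PUG-REST property URL for one compound identifier."""
    props = ",".join(properties)  # multiple properties are comma-separated
    ident = quote(str(identifier), safe="")
    return f"{PUG_REST}/compound/{namespace}/{ident}/property/{props}/{output}"


print(property_url("cid", 2244, ["MolecularFormula", "MolecularWeight"]))
print(property_url("name", "acetylsalicylic acid", ["XLogP"]))
```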
#### 3. Structure Search Operations
**Similarity Search**:
```
POST /rest/pug/compound/similarity/smiles/{smiles}/JSON
Parameters: Threshold (default 90%)
```
**Substructure Search**:
```
POST /rest/pug/compound/substructure/smiles/{smiles}/cids/JSON
```
**Superstructure Search**:
```
POST /rest/pug/compound/superstructure/smiles/{smiles}/cids/JSON
```
#### 4. Image Generation
Get 2D structure image:
```
GET /rest/pug/compound/cid/{cid}/PNG
Optional parameters: image_size=small|large
```
#### 5. Format Conversion
Get compound as SDF (Structure-Data File):
```
GET /rest/pug/compound/cid/{cid}/SDF
```
Get the full compound record (also returned as SDF; PUG-REST has no separate MOL output format):
```
GET /rest/pug/compound/cid/{cid}/record/SDF
```
#### 6. Synonym Retrieval
Get all synonyms for a compound:
```
GET /rest/pug/compound/cid/{cid}/synonyms/JSON
```
#### 7. Bioassay Data
Get bioassay data for a compound:
```
GET /rest/pug/compound/cid/{cid}/assaysummary/JSON
```
Get specific assay information:
```
GET /rest/pug/assay/aid/{aid}/description/JSON
```
### Asynchronous Requests
For large queries (similarity/substructure searches), PUG-REST uses an asynchronous pattern:
1. Submit the query (returns ListKey)
2. Check status using the ListKey
3. Retrieve results when ready
Example workflow:
```python
import requests

# Step 1: Submit similarity search
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # Aspirin, used here as an example query
response = requests.post(
    f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/similarity/smiles/{smiles}/cids/JSON",
    data={"Threshold": 90}
)
listkey = response.json()["Waiting"]["ListKey"]
# Step 2: Check status
status_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/{listkey}/cids/JSON"
# Step 3: Poll until ready (with timeout)
# Step 4: Retrieve results from the same URL
```
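Steps 3 and 4 can be sketched as a small polling loop. This is an illustrative sketch, not part of PUG-REST or PubChemPy: `poll_listkey`, `fetch_json`, `interval`, and `timeout` are hypothetical names, and the loop assumes the response keeps a top-level `"Waiting"` key until results are ready, as in the workflow above.

```python
import time


def poll_listkey(fetch_json, status_url, interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll a ListKey URL until results arrive or the timeout elapses.

    fetch_json is a caller-supplied callable (url -> dict), for example
    lambda u: requests.get(u, timeout=30).json()
    """
    waited = 0.0
    while True:
        data = fetch_json(status_url)
        # While the search is still running, the response carries a "Waiting" key.
        if "Waiting" not in data:
            return data
        if waited >= timeout:
            raise TimeoutError(f"No result after {timeout} s")
        sleep(interval)
        waited += interval
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP library and easy to exercise without network access.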
### Usage Limits
**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute

**Best Practices**:
- Use batch requests when possible
- Implement exponential backoff for retries
- Cache results when appropriate
- Use asynchronous pattern for large queries
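The exponential-backoff recommendation above could look roughly like the following. This is a generic sketch rather than a PubChem-provided utility: `with_backoff`, `max_attempts`, and `base_delay` are hypothetical names, and catching `Exception` is a simplification that real code would likely narrow to transient HTTP errors.

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry call() with exponentially growing delays plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
            sleep(delay + random.uniform(0, delay / 2))
```

A typical call would wrap a single request, for example `with_backoff(lambda: requests.get(url, timeout=30))`.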
## PubChemPy Python Library
PubChemPy is a Python wrapper that simplifies PUG-REST API access.
### Installation
```bash
pip install pubchempy
```
### Key Classes
#### Compound Class
Main class for representing chemical compounds:
```python
import pubchempy as pcp
# Get by CID
compound = pcp.Compound.from_cid(2244)
# Access properties
compound.molecular_formula # 'C9H8O4'
compound.molecular_weight # 180.16
compound.iupac_name # '2-acetyloxybenzoic acid'
compound.canonical_smiles # 'CC(=O)OC1=CC=CC=C1C(=O)O'
compound.isomeric_smiles # Same as canonical for non-stereoisomers
compound.inchi # InChI string
compound.inchikey # InChI Key
compound.xlogp # Partition coefficient
compound.tpsa # Topological polar surface area
```
#### Search Methods
**By Name**:
```python
compounds = pcp.get_compounds('aspirin', 'name')
# Returns list of Compound objects
```
**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```
**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)', 'inchi')[0]
```
**By Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds with this formula
```
**Similarity Search**:
```python
results = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles',
                            searchtype='similarity',
                            Threshold=90)
```
**Substructure Search**:
```python
results = pcp.get_compounds('c1ccccc1', 'smiles',
                            searchtype='substructure')
# Returns all compounds containing benzene ring
```
#### Property Retrieval
Get specific properties for multiple compounds:
```python
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```
Get properties as pandas DataFrame:
```python
import pandas as pd
df = pd.DataFrame(properties)
```
#### Synonyms
Get all synonyms for a compound:
```python
synonyms = pcp.get_synonyms('aspirin', 'name')
# Returns list of dictionaries with CID and synonym lists
```
#### Download Formats
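Save a compound to a file in various formats (`pcp.download` takes the output path as its second argument and writes to disk rather than returning the data):
```python
# Save as SDF
pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)
# Save as JSON
pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)
# Save as PNG image
pcp.download('PNG', 'aspirin.png', 2244, 'cid', overwrite=True)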
```
### Error Handling
```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compounds = pcp.get_compounds('nonexistent', 'name')
    if not compounds:
        # get_compounds returns an empty list when nothing matches
        print("Compound not found")
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request")
except TimeoutError:
    print("Request timed out")
```
## PUG-View API
PUG-View provides access to full textual annotations and specialized reports.
### Key Endpoints
Get compound annotations:
```
GET /rest/pug_view/data/compound/{cid}/JSON
```
Get specific annotation sections:
```
GET /rest/pug_view/data/compound/{cid}/JSON?heading={section_name}
```
Available sections include:
- Chemical and Physical Properties
- Drug and Medication Information
- Pharmacology and Biochemistry
- Safety and Hazards
- Toxicity
- Literature
- Patents
- Biomolecular Interactions and Pathways
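Because section names such as "Safety and Hazards" contain spaces, the `heading` query parameter should be URL-encoded. The sketch below shows one way to build the URL; `pug_view_url` is a hypothetical helper name, not part of any PubChem client.

```python
from urllib.parse import quote, urlencode

PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def pug_view_url(cid, heading=None):
    """Build a PUG-View compound URL, optionally limited to one section."""
    url = f"{PUG_VIEW}/data/compound/{int(cid)}/JSON"
    if heading:
        # quote_via=quote encodes spaces as %20 instead of '+'
        url += "?" + urlencode({"heading": heading}, quote_via=quote)
    return url


print(pug_view_url(2244, "Safety and Hazards"))
```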
## Common Workflows
### 1. Chemical Identifier Conversion
Convert from name to SMILES to InChI:
```python
import pubchempy as pcp
compound = pcp.get_compounds('caffeine', 'name')[0]
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
```
### 2. Batch Property Retrieval
Get properties for multiple compounds:
```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
properties = []
for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    properties.extend(props)

df = pd.DataFrame(properties)
```
### 3. Finding Similar Compounds
Find structurally similar compounds to a query:
```python
# Start with a known compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85
)

# Get properties for similar compounds
for compound in similar[:10]:  # First 10 results
    print(f"{compound.cid}: {compound.iupac_name}, MW: {compound.molecular_weight}")
```
### 4. Substructure Screening
Find all compounds containing a specific substructure:
```python
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
```
### 5. Bioactivity Data Retrieval
```python
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    bioassay_data = response.json()
    # Process bioassay information
```
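The "process bioassay information" step can be sketched as converting the summary table into a list of dictionaries. This sketch assumes the response is laid out as `Table` → `Columns` → `Column` for headers and `Table` → `Row` → `Cell` for values; `table_to_records` is a hypothetical helper, and the sample data below is invented for illustration only.

```python
def table_to_records(data):
    """Convert an assay-summary style table into a list of row dictionaries."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    # Pair each cell with its column header, row by row.
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


# Made-up sample in the assumed layout (not real PubChem data)
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "Activity Outcome"]},
        "Row": [{"Cell": [1, "Active"]}, {"Cell": [2, "Inactive"]}],
    }
}
records = table_to_records(sample)
print(records)
```

Each record can then be filtered by column name or loaded into a DataFrame with `pd.DataFrame(records)`.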
## Tips and Best Practices
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results**: Store frequently accessed data locally
3. **Batch requests**: Combine multiple queries when possible
4. **Handle rate limits**: Implement delays between requests
5. **Use appropriate search types**: Similarity for related compounds, substructure for motif finding
6. **Leverage PubChemPy**: Higher-level abstraction simplifies common tasks
7. **Handle missing data**: Not all properties are available for all compounds
8. **Use asynchronous pattern**: For large similarity/substructure searches
9. **Specify output format**: Choose JSON for programmatic access, SDF for cheminformatics tools
10. **Read documentation**: Full PUG-REST documentation available at https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
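
Tip 4 (delays between requests) can be implemented with a small throttle. This is a sketch under stated assumptions: the `Throttle` class and its parameters are hypothetical, and the 0.2 s default simply reflects the limit of 5 requests per second.

```python
import time


class Throttle:
    """Ensure successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps a single-threaded client under the per-second limit; it does not by itself enforce the per-minute limits.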
## Additional Resources
- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy
- IUPAC Tutorial: https://iupac.github.io/WFChemCookbook/datasources/pubchem_pugrest.html

@@ -1,367 +0,0 @@
#!/usr/bin/env python3
"""
PubChem Bioactivity Data Retrieval
This script provides functions for retrieving biological activity data
from PubChem for compounds and assays.
"""
import sys
import json
import time
from typing import Dict, List, Optional

try:
    import requests
except ImportError:
    print("Error: requests is not installed. Install it with: pip install requests")
    sys.exit(1)

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"

# Rate limiting: 5 requests per second maximum
REQUEST_DELAY = 0.21  # seconds between requests


def rate_limited_request(url: str, method: str = 'GET', **kwargs) -> Optional[requests.Response]:
    """
    Make a rate-limited request to PubChem API.

    Args:
        url: Request URL
        method: HTTP method ('GET' or 'POST')
        **kwargs: Additional arguments for requests

    Returns:
        Response object or None on error
    """
    time.sleep(REQUEST_DELAY)
    try:
        if method.upper() == 'GET':
            response = requests.get(url, **kwargs)
        else:
            response = requests.post(url, **kwargs)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None


def get_bioassay_summary(cid: int) -> Optional[Dict]:
    """
    Get bioassay summary for a compound.

    Args:
        cid: PubChem Compound ID

    Returns:
        Dictionary containing bioassay summary data
    """
    url = f"{BASE_URL}/compound/cid/{cid}/assaysummary/JSON"
    response = rate_limited_request(url)
    if response and response.status_code == 200:
        return response.json()
    return None
def get_compound_bioactivities(
    cid: int,
    activity_outcome: Optional[str] = None
) -> List[Dict]:
    """
    Get bioactivity data for a compound.

    Args:
        cid: PubChem Compound ID
        activity_outcome: Filter by activity ('active', 'inactive', 'inconclusive')

    Returns:
        List of bioactivity records
    """
    data = get_bioassay_summary(cid)
    if not data:
        return []
    activities = []
    table = data.get('Table', {})
    for row in table.get('Row', []):
        activity = {}
        for i, cell in enumerate(row.get('Cell', [])):
            column_name = table['Columns']['Column'][i]
            activity[column_name] = cell
        if activity_outcome:
            if activity.get('Activity Outcome', '').lower() == activity_outcome.lower():
                activities.append(activity)
        else:
            activities.append(activity)
    return activities


def get_assay_description(aid: int) -> Optional[Dict]:
    """
    Get detailed description for a specific assay.

    Args:
        aid: PubChem Assay ID (AID)

    Returns:
        Dictionary containing assay description
    """
    url = f"{BASE_URL}/assay/aid/{aid}/description/JSON"
    response = rate_limited_request(url)
    if response and response.status_code == 200:
        return response.json()
    return None


def get_assay_targets(aid: int) -> List[str]:
    """
    Get biological targets for an assay.

    Args:
        aid: PubChem Assay ID

    Returns:
        List of target names
    """
    description = get_assay_description(aid)
    if not description:
        return []
    targets = []
    assay_data = description.get('PC_AssayContainer', [{}])[0]
    assay = assay_data.get('assay', {})
    # Extract target information
    descr = assay.get('descr', {})
    for target in descr.get('target', []):
        mol_id = target.get('mol_id', '')
        name = target.get('name', '')
        if name:
            targets.append(name)
        elif mol_id:
            targets.append(f"GI:{mol_id}")
    return targets
def search_assays_by_target(
    target_name: str,
    max_results: int = 100
) -> List[int]:
    """
    Search for assays targeting a specific gene.

    Args:
        target_name: Gene symbol of the target (e.g., 'EGFR', 'TP53')
        max_results: Maximum number of results

    Returns:
        List of Assay IDs (AIDs)
    """
    # PUG-REST requires a target type in the path; gene symbol is used here
    # (other accepted types are gi, proteinname, geneid, and accession)
    url = f"{BASE_URL}/assay/target/genesymbol/{target_name}/aids/JSON"
    response = rate_limited_request(url)
    if response and response.status_code == 200:
        data = response.json()
        aids = data.get('IdentifierList', {}).get('AID', [])
        return aids[:max_results]
    return []
def get_active_compounds_in_assay(aid: int, max_results: int = 1000) -> List[int]:
    """
    Get list of active compounds in an assay.

    Args:
        aid: PubChem Assay ID
        max_results: Maximum number of results

    Returns:
        List of Compound IDs (CIDs) that showed activity
    """
    url = f"{BASE_URL}/assay/aid/{aid}/cids/JSON?cids_type=active"
    response = rate_limited_request(url)
    if response and response.status_code == 200:
        data = response.json()
        cids = data.get('IdentifierList', {}).get('CID', [])
        return cids[:max_results]
    return []


def get_compound_annotations(cid: int, section: Optional[str] = None) -> Optional[Dict]:
    """
    Get comprehensive compound annotations from PUG-View.

    Args:
        cid: PubChem Compound ID
        section: Specific section to retrieve (e.g., 'Pharmacology and Biochemistry')

    Returns:
        Dictionary containing annotation data
    """
    url = f"{PUG_VIEW_URL}/data/compound/{cid}/JSON"
    if section:
        url += f"?heading={section}"
    response = rate_limited_request(url)
    if response and response.status_code == 200:
        return response.json()
    return None


def get_drug_information(cid: int) -> Optional[Dict]:
    """
    Get drug and medication information for a compound.

    Args:
        cid: PubChem Compound ID

    Returns:
        Dictionary containing drug information
    """
    return get_compound_annotations(cid, section="Drug and Medication Information")


def get_safety_hazards(cid: int) -> Optional[Dict]:
    """
    Get safety and hazard information for a compound.

    Args:
        cid: PubChem Compound ID

    Returns:
        Dictionary containing safety information
    """
    return get_compound_annotations(cid, section="Safety and Hazards")
def summarize_bioactivities(cid: int) -> Dict:
    """
    Generate a summary of bioactivity data for a compound.

    Args:
        cid: PubChem Compound ID

    Returns:
        Dictionary with bioactivity summary statistics
    """
    activities = get_compound_bioactivities(cid)
    summary = {
        'total_assays': len(activities),
        'active': 0,
        'inactive': 0,
        'inconclusive': 0,
        'unspecified': 0,
        'assay_types': {}
    }
    for activity in activities:
        outcome = activity.get('Activity Outcome', '').lower()
        # Check 'inactive' first: the substring 'active' also matches 'inactive'
        if 'inactive' in outcome:
            summary['inactive'] += 1
        elif 'active' in outcome:
            summary['active'] += 1
        elif 'inconclusive' in outcome:
            summary['inconclusive'] += 1
        else:
            summary['unspecified'] += 1
    return summary
def find_compounds_by_bioactivity(
    target: str,
    threshold: Optional[float] = None,
    max_compounds: int = 100
) -> List[Dict]:
    """
    Find compounds with bioactivity against a specific target.

    Args:
        target: Target name (e.g., 'EGFR')
        threshold: Activity threshold (if applicable)
        max_compounds: Maximum number of compounds to return

    Returns:
        List of dictionaries with compound information and activity data
    """
    # Step 1: Find assays for the target
    assay_ids = search_assays_by_target(target, max_results=10)
    if not assay_ids:
        print(f"No assays found for target: {target}")
        return []
    # Step 2: Get active compounds from these assays
    compound_set = set()
    compound_data = []
    for aid in assay_ids[:5]:  # Limit to first 5 assays
        active_cids = get_active_compounds_in_assay(aid, max_results=max_compounds)
        for cid in active_cids:
            if cid not in compound_set and len(compound_data) < max_compounds:
                compound_set.add(cid)
                compound_data.append({
                    'cid': cid,
                    'aid': aid,
                    'target': target
                })
        if len(compound_data) >= max_compounds:
            break
    return compound_data


def main():
    """Example usage of bioactivity query functions."""
    # Example 1: Get bioassay summary for aspirin (CID 2244)
    print("Example 1: Getting bioassay summary for aspirin (CID 2244)...")
    summary = summarize_bioactivities(2244)
    print(json.dumps(summary, indent=2))

    # Example 2: Get active bioactivities for a compound
    print("\nExample 2: Getting active bioactivities for aspirin...")
    activities = get_compound_bioactivities(2244, activity_outcome='active')
    print(f"Found {len(activities)} active bioactivities")
    if activities:
        print(f"First activity: {activities[0].get('Assay Name', 'N/A')}")

    # Example 3: Get assay information
    print("\nExample 3: Getting assay description...")
    if activities:
        aid = activities[0].get('AID', 0)
        targets = get_assay_targets(aid)
        print(f"Assay {aid} targets: {', '.join(targets) if targets else 'N/A'}")

    # Example 4: Search for compounds targeting EGFR
    print("\nExample 4: Searching for EGFR inhibitors...")
    egfr_compounds = find_compounds_by_bioactivity('EGFR', max_compounds=5)
    print(f"Found {len(egfr_compounds)} compounds with EGFR activity")
    for comp in egfr_compounds[:5]:
        print(f"  CID {comp['cid']} (from AID {comp['aid']})")


if __name__ == '__main__':
    main()

@@ -1,297 +0,0 @@
#!/usr/bin/env python3
"""
PubChem Compound Search Utility
This script provides functions for searching and retrieving compound information
from PubChem using the PubChemPy library.
"""
import sys
import json
from typing import List, Dict, Optional, Union

try:
    import pubchempy as pcp
except ImportError:
    print("Error: pubchempy is not installed. Install it with: pip install pubchempy")
    sys.exit(1)


def search_by_name(name: str, max_results: int = 10) -> List[pcp.Compound]:
    """
    Search for compounds by name.

    Args:
        name: Chemical name to search for
        max_results: Maximum number of results to return

    Returns:
        List of Compound objects
    """
    try:
        compounds = pcp.get_compounds(name, 'name')
        return compounds[:max_results]
    except Exception as e:
        print(f"Error searching for '{name}': {e}")
        return []


def search_by_smiles(smiles: str) -> Optional[pcp.Compound]:
    """
    Search for a compound by SMILES string.

    Args:
        smiles: SMILES string

    Returns:
        Compound object or None if not found
    """
    try:
        compounds = pcp.get_compounds(smiles, 'smiles')
        return compounds[0] if compounds else None
    except Exception as e:
        print(f"Error searching for SMILES '{smiles}': {e}")
        return None


def get_compound_by_cid(cid: int) -> Optional[pcp.Compound]:
    """
    Retrieve a compound by its CID (Compound ID).

    Args:
        cid: PubChem Compound ID

    Returns:
        Compound object or None if not found
    """
    try:
        return pcp.Compound.from_cid(cid)
    except Exception as e:
        print(f"Error retrieving CID {cid}: {e}")
        return None
def get_compound_properties(
    identifier: Union[str, int],
    namespace: str = 'name',
    properties: Optional[List[str]] = None
) -> Dict:
    """
    Get specific properties for a compound.

    Args:
        identifier: Compound identifier (name, SMILES, CID, etc.)
        namespace: Type of identifier ('name', 'smiles', 'cid', 'inchi', etc.)
        properties: List of properties to retrieve. If None, returns common properties.

    Returns:
        Dictionary of properties
    """
    if properties is None:
        properties = [
            'MolecularFormula',
            'MolecularWeight',
            'CanonicalSMILES',
            'IUPACName',
            'XLogP',
            'TPSA',
            'HBondDonorCount',
            'HBondAcceptorCount'
        ]
    try:
        result = pcp.get_properties(properties, identifier, namespace)
        return result[0] if result else {}
    except Exception as e:
        print(f"Error getting properties for '{identifier}': {e}")
        return {}


def similarity_search(
    smiles: str,
    threshold: int = 90,
    max_records: int = 10
) -> List[pcp.Compound]:
    """
    Perform similarity search for compounds similar to the query structure.

    Args:
        smiles: Query SMILES string
        threshold: Similarity threshold (0-100)
        max_records: Maximum number of results

    Returns:
        List of similar Compound objects
    """
    try:
        compounds = pcp.get_compounds(
            smiles,
            'smiles',
            searchtype='similarity',
            Threshold=threshold,
            MaxRecords=max_records
        )
        return compounds
    except Exception as e:
        print(f"Error in similarity search: {e}")
        return []
def substructure_search(
    smiles: str,
    max_records: int = 100
) -> List[pcp.Compound]:
    """
    Perform substructure search for compounds containing the query structure.

    Args:
        smiles: Query SMILES string (substructure)
        max_records: Maximum number of results

    Returns:
        List of Compound objects containing the substructure
    """
    try:
        compounds = pcp.get_compounds(
            smiles,
            'smiles',
            searchtype='substructure',
            MaxRecords=max_records
        )
        return compounds
    except Exception as e:
        print(f"Error in substructure search: {e}")
        return []


def get_synonyms(identifier: Union[str, int], namespace: str = 'name') -> List[str]:
    """
    Get all synonyms for a compound.

    Args:
        identifier: Compound identifier
        namespace: Type of identifier

    Returns:
        List of synonym strings
    """
    try:
        results = pcp.get_synonyms(identifier, namespace)
        if results:
            return results[0].get('Synonym', [])
        return []
    except Exception as e:
        print(f"Error getting synonyms: {e}")
        return []


def batch_search(
    identifiers: List[str],
    namespace: str = 'name',
    properties: Optional[List[str]] = None
) -> List[Dict]:
    """
    Batch search for multiple compounds.

    Args:
        identifiers: List of compound identifiers
        namespace: Type of identifiers
        properties: List of properties to retrieve

    Returns:
        List of dictionaries containing properties for each compound
    """
    results = []
    for identifier in identifiers:
        props = get_compound_properties(identifier, namespace, properties)
        if props:
            props['query'] = identifier
            results.append(props)
    return results
def download_structure(
    identifier: Union[str, int],
    namespace: str = 'name',
    format: str = 'SDF',
    filename: Optional[str] = None
) -> Optional[str]:
    """
    Download compound structure in specified format.

    Args:
        identifier: Compound identifier
        namespace: Type of identifier
        format: Output format ('SDF', 'JSON', 'PNG', etc.)
        filename: Output filename (if None, returns data as string; text formats only)

    Returns:
        Data string if filename is None, else None
    """
    try:
        if filename:
            # pcp.download takes the output path as its second argument
            pcp.download(format, filename, identifier, namespace, overwrite=True)
            return None
        else:
            # pcp.get returns the raw response body as bytes
            return pcp.get(identifier, namespace, output=format).decode()
    except Exception as e:
        print(f"Error downloading structure: {e}")
        return None
def print_compound_info(compound: pcp.Compound) -> None:
    """
    Print formatted compound information.

    Args:
        compound: PubChemPy Compound object
    """
    print(f"\n{'='*60}")
    print(f"Compound CID: {compound.cid}")
    print(f"{'='*60}")
    print(f"IUPAC Name: {compound.iupac_name or 'N/A'}")
    print(f"Molecular Formula: {compound.molecular_formula or 'N/A'}")
    print(f"Molecular Weight: {compound.molecular_weight or 'N/A'} g/mol")
    print(f"Canonical SMILES: {compound.canonical_smiles or 'N/A'}")
    print(f"InChI: {compound.inchi or 'N/A'}")
    print(f"InChI Key: {compound.inchikey or 'N/A'}")
    print(f"XLogP: {compound.xlogp or 'N/A'}")
    print(f"TPSA: {compound.tpsa or 'N/A'} Ų")
    print(f"H-Bond Donors: {compound.h_bond_donor_count or 'N/A'}")
    print(f"H-Bond Acceptors: {compound.h_bond_acceptor_count or 'N/A'}")
    print(f"{'='*60}\n")


def main():
    """Example usage of PubChem search functions."""
    # Example 1: Search by name
    print("Example 1: Searching for 'aspirin'...")
    compounds = search_by_name('aspirin', max_results=1)
    if compounds:
        print_compound_info(compounds[0])

    # Example 2: Get properties
    print("\nExample 2: Getting properties for caffeine...")
    props = get_compound_properties('caffeine', 'name')
    print(json.dumps(props, indent=2))

    # Example 3: Similarity search
    print("\nExample 3: Finding compounds similar to benzene...")
    benzene_smiles = 'c1ccccc1'
    similar = similarity_search(benzene_smiles, threshold=95, max_records=5)
    print(f"Found {len(similar)} similar compounds:")
    for comp in similar:
        print(f"  CID {comp.cid}: {comp.iupac_name or 'N/A'}")

    # Example 4: Batch search
    print("\nExample 4: Batch search for multiple compounds...")
    names = ['aspirin', 'ibuprofen', 'paracetamol']
    results = batch_search(names, properties=['MolecularFormula', 'MolecularWeight'])
    for result in results:
        print(f"  {result.get('query')}: {result.get('MolecularFormula')} "
              f"({result.get('MolecularWeight')} g/mol)")


if __name__ == '__main__':
    main()