Files
claude-scientific-skills/scientific-packages/biopython/references/database_tools.md
2025-10-19 14:12:02 -07:00

307 lines
8.0 KiB
Markdown

# BioPython Database Access and Search Tools
This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.
## NCBI Database Access
### Bio.Entrez - NCBI E-utilities Interface
Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.
**Important:** Always set your email before using Entrez:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com"
```
#### Core Query Functions
**esearch** - Search databases and retrieve IDs:
```python
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
```
Parameters:
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
- `term`: Search query
- `retmax`: Maximum number of IDs to return
- `sort`: Sort order (relevance, pub_date, etc.)
- `usehistory`: Store results on server (useful for large queries)
**efetch** - Retrieve full records:
```python
handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```
Parameters:
- `db`: Database name
- `id`: Single ID or comma-separated list
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
- `retmode`: Return mode (text, xml, asn.1)
- Automatically uses POST for >200 IDs
**elink** - Find related records across databases:
```python
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
```
Parameters:
- `dbfrom`: Source database
- `db`: Target database
- `id`: ID(s) to link from
- Returns LinkOut providers and relevancy scores
**esummary** - Get document summaries:
```python
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
```
Returns quick overviews without full records.
**einfo** - Get database statistics:
```python
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
```
Provides field indices, term counts, update dates, and available links.
**epost** - Upload ID lists to server:
```python
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
```
Useful for large queries split across multiple requests.
**espell** - Get spelling suggestions:
```python
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"]) # "breast cancer"
```
**ecitmatch** - Convert citations to PubMed IDs:
```python
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```
#### Data Processing Functions
**Entrez.read()** - Parse XML to Python dictionary:
```python
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
```
**Entrez.parse()** - Generator for large XML results:
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
process(record)
```
#### Common Workflows
**Download sequences by accession:**
```python
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
```
**Search and download multiple sequences:**
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100")
search_results = Entrez.read(search_handle)
# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
print(record.id)
```
**Use WebEnv for large queries:**
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)
# Fetch in batches
batch_size = 500
for start in range(0, count, batch_size):
fetch_handle = Entrez.efetch(
db="nucleotide",
rettype="fasta",
retmode="text",
retstart=start,
retmax=batch_size,
webenv=post_result["WebEnv"],
query_key=post_result["QueryKey"]
)
# Process batch
```
### Bio.GenBank - GenBank Format Parsing
Low-level GenBank file parser (SeqIO is usually preferred).
### Bio.SwissProt - Swiss-Prot/UniProt Parsing
Parse Swiss-Prot and UniProtKB flat file format:
```python
from Bio import SwissProt
with open("uniprot.dat") as handle:
for record in SwissProt.parse(handle):
print(record.entry_name, record.organism)
```
## Sequence Similarity Searches
### Bio.Blast - BLAST Interface
Tools for running BLAST searches and parsing results.
#### Running BLAST
**NCBI QBLAST (online):**
```python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
```
Parameters:
- Program: blastn, blastp, blastx, tblastn, tblastx
- Database: nt, nr, refseq_rna, pdb, etc.
- Sequence: string or Seq object
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`
**Local BLAST:**
Run standalone BLAST from command line, then parse results.
#### Parsing BLAST Results
**XML format (recommended):**
```python
from Bio.Blast import NCBIXML
result_handle = open("blast_results.xml")
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.001:
print(f"Hit: {alignment.title}")
print(f"Length: {alignment.length}")
print(f"E-value: {hsp.expect}")
print(f"Identities: {hsp.identities}/{hsp.align_length}")
```
**Functions:**
- `NCBIXML.read()`: Single query
- `NCBIXML.parse()`: Multiple queries (generator)
**Key Record Attributes:**
- `alignments`: List of matching sequences
- `query`: Query sequence ID
- `query_length`: Length of query
**Alignment Attributes:**
- `title`: Description of hit
- `length`: Length of hit sequence
- `hsps`: High-scoring segment pairs
**HSP Attributes:**
- `expect`: E-value
- `score`: Bit score
- `identities`: Number of identical residues
- `positives`: Number of positive scoring matches
- `gaps`: Number of gaps
- `align_length`: Length of alignment
- `query`: Aligned query sequence
- `match`: Match indicators
- `sbjct`: Aligned subject sequence
- `query_start`, `query_end`: Query coordinates
- `sbjct_start`, `sbjct_end`: Subject coordinates
#### Common BLAST Workflows
**Find homologs:**
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
out.write(result.read())
```
**Filter results by criteria:**
```python
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-5 and hsp.identities/hsp.align_length > 0.5:
# Process high-quality hits
pass
```
### Bio.SearchIO - Unified Search Results Parser
Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).
**Key Functions:**
- `read()`: Parse single query
- `parse()`: Parse multiple queries (generator)
- `write()`: Write results to file
- `convert()`: Convert between formats
**Supported Tools:**
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate
**Example:**
```python
from Bio import SearchIO
results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
for hit in result:
if hit.hsps[0].evalue < 0.001:
print(hit.id, hit.hsps[0].evalue)
```
## Local Database Management
### BioSQL - SQL Database Interface
Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).
**Features:**
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences
**Example:**
```python
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]
# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))
# Query
seq = db.lookup(accession="NC_005816")
```