mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
307 lines
8.0 KiB
Markdown
# BioPython Database Access and Search Tools

This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.

## NCBI Database Access

### Bio.Entrez - NCBI E-utilities Interface

Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.

**Important:** Always set your email before using Entrez:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com"
```

#### Core Query Functions

**esearch** - Search databases and retrieve IDs:
```python
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
```

Parameters:
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
- `term`: Search query
- `retmax`: Maximum number of IDs to return
- `sort`: Sort order (relevance, pub_date, etc.)
- `usehistory`: Store results on server (useful for large queries)

**efetch** - Retrieve full records:
```python
from Bio import SeqIO

handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```

Parameters:
- `db`: Database name
- `id`: Single ID or comma-separated list
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
- `retmode`: Return mode (text, xml, asn.1)
- Requests with more than 200 IDs are sent automatically via HTTP POST
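
Since `id` accepts a comma-separated list, a long ID list can be split into fixed-size batches before fetching. The following is a minimal sketch rather than a prescribed pattern: `chunked` and `fetch_fasta_batches` are hypothetical helper names, and the batch size of 200 is an illustrative choice.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_fasta_batches(ids: Sequence[str], email: str, size: int = 200) -> Iterator[str]:
    """Yield one FASTA text blob per batch of IDs (hypothetical helper)."""
    # Imported here so that chunked() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    for batch in chunked(ids, size):
        # efetch takes a comma-separated ID string, as described above.
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        try:
            yield handle.read()
        finally:
            handle.close()
```

Only `chunked` runs offline. `fetch_fasta_batches` contacts NCBI, so it needs network access and a real contact email.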

**elink** - Find related records across databases:
```python
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
```

Parameters:
- `dbfrom`: Source database
- `db`: Target database
- `id`: ID(s) to link from
- Can also return LinkOut providers and relevancy scores (selected with the `cmd` parameter)
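
The parsed elink result is a nested list-and-dictionary structure, so pulling out the linked IDs takes a little unpacking. The sketch below assumes the usual layout (`LinkSetDb` entries, each holding a `LinkName` and a list of `Link` items with an `Id`). `linked_ids` is a hypothetical helper, and the sample data is hand-written with made-up IDs instead of a live response.

```python
from typing import Any, Dict, List, Sequence


def linked_ids(link_sets: Sequence[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each link name (e.g. 'protein_gene') to the target-database IDs."""
    found: Dict[str, List[str]] = {}
    for link_set in link_sets:  # one entry per submitted ID set
        for link_db in link_set.get("LinkSetDb", []):
            ids = [link["Id"] for link in link_db.get("Link", [])]
            found.setdefault(link_db["LinkName"], []).extend(ids)
    return found


# Hand-written stand-in for Entrez.read(Entrez.elink(...)); the IDs are made up.
example = [{
    "DbFrom": "protein",
    "IdList": ["15718680"],
    "LinkSetDb": [{"DbTo": "gene", "LinkName": "protein_gene",
                   "Link": [{"Id": "1111"}, {"Id": "2222"}]}],
}]
print(linked_ids(example))
```

With a real query, pass the `result` from the elink example above in place of `example`.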

**esummary** - Get document summaries:
```python
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
```

Returns quick overviews without full records.

**einfo** - Get database statistics:
```python
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
```

Provides field indices, term counts, update dates, and available links.
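
One common use of the einfo result is listing the field tags that can appear in a search term. This sketch assumes the parsed result exposes them under `info["DbInfo"]["FieldList"]` with `Name` and `FullName` keys. `searchable_fields` is a hypothetical helper, and the sample dictionary is a hand-written, heavily trimmed stand-in for a real response.

```python
from typing import Any, Dict, List, Tuple


def searchable_fields(info: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (tag, full name) pairs for every searchable field in an einfo result."""
    fields = info["DbInfo"]["FieldList"]
    return [(field["Name"], field["FullName"]) for field in fields]


# Trimmed, hand-written stand-in for Entrez.read(Entrez.einfo(db="nucleotide")).
example_info = {
    "DbInfo": {
        "DbName": "nucleotide",
        "FieldList": [
            {"Name": "ALL", "FullName": "All Fields"},
            {"Name": "ORGN", "FullName": "Organism"},
        ],
    }
}
for tag, full_name in searchable_fields(example_info):
    print(f"{tag}: {full_name}")
```

The tags returned this way are the ones used in square brackets in a query, as in `Homo sapiens[Organism]`.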

**epost** - Upload ID lists to server:
```python
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
```

Useful for large queries split across multiple requests.

**espell** - Get spelling suggestions:
```python
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"])  # "breast cancer"
```

**ecitmatch** - Convert citations to PubMed IDs:
```python
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```
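
Unlike most E-utilities, ecitmatch returns plain pipe-delimited text, not XML, so the handle is read directly rather than passed to `Entrez.read()`. The sketch below assumes each response line echoes the submitted fields and ends with the matched PMID, and it treats any non-numeric final field as "no match". `parse_ecitmatch` and `CitationMatch` are hypothetical names, and the sample line and its key `Art1` are illustrative.

```python
from typing import List, NamedTuple, Optional


class CitationMatch(NamedTuple):
    key: str             # caller-supplied citation key (second-to-last field)
    pmid: Optional[str]  # matched PubMed ID, or None when no numeric ID came back


def parse_ecitmatch(text: str) -> List[CitationMatch]:
    """Parse pipe-delimited ecitmatch output, one citation per line."""
    matches = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        last = fields[-1].strip()
        key = fields[-2].strip() if len(fields) >= 2 else ""
        matches.append(CitationMatch(key, last if last.isdigit() else None))
    return matches


# Illustrative response text; with a live query this would be handle.read().
sample = "proc natl acad sci u s a|1991|88|3248|mann bj|Art1|2014248\n"
print(parse_ecitmatch(sample))
```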

#### Data Processing Functions

**Entrez.read()** - Parse XML into nested Python dictionaries and lists:
```python
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
```

**Entrez.parse()** - Generator for large XML results:
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
    process(record)  # process() is a placeholder for your own handler
```

#### Common Workflows

**Download sequences by accession:**
```python
from Bio import SeqIO

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
```

**Search and download multiple sequences:**
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
search_results = Entrez.read(search_handle)

# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
    print(record.id)
```

**Use WebEnv for large queries:**
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)

# Fetch in batches
batch_size = 500
count = len(large_id_list)  # total number of posted IDs
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=post_result["WebEnv"],
        query_key=post_result["QueryKey"]
    )
    # Process batch
```

### Bio.GenBank - GenBank Format Parsing

Low-level GenBank file parser (SeqIO is usually preferred).

### Bio.SwissProt - Swiss-Prot/UniProt Parsing

Parse Swiss-Prot and UniProtKB flat file format:
```python
from Bio import SwissProt
with open("uniprot.dat") as handle:
    for record in SwissProt.parse(handle):
        print(record.entry_name, record.organism)
```

## Sequence Similarity Searches

### Bio.Blast - BLAST Interface

Tools for running BLAST searches and parsing results.

#### Running BLAST

**NCBI QBLAST (online):**
```python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
```

Parameters:
- Program: blastn, blastp, blastx, tblastn, tblastx
- Database: nt, nr, refseq_rna, pdb, etc.
- Sequence: string or Seq object
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`

**Local BLAST:**
Run standalone BLAST from the command line, then parse the results.
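
One way to script the local route is to build the BLAST+ command as an argument list and run it with `subprocess`, writing XML (`-outfmt 5`) so the parsers described in this document can read it. This is a sketch under assumptions: the BLAST+ executables must be on `PATH`, a database must already have been built (for example with `makeblastdb`), and the helper names and file names are hypothetical.

```python
import subprocess
from typing import List


def blast_command(program: str, query: str, db: str, out: str,
                  evalue: float = 1e-5) -> List[str]:
    """Build a BLAST+ argument list that writes XML output (-outfmt 5)."""
    return [
        program,
        "-query", query,      # FASTA file holding the query sequence(s)
        "-db", db,            # name or path of a preformatted BLAST database
        "-out", out,          # where the XML report is written
        "-outfmt", "5",       # 5 selects BLAST XML
        "-evalue", str(evalue),
    ]


def run_blast(program: str, query: str, db: str, out: str,
              evalue: float = 1e-5) -> str:
    """Run BLAST+ and return the output path; raises CalledProcessError on failure."""
    subprocess.run(blast_command(program, query, db, out, evalue), check=True)
    return out


# Inspect the command without running it (file names are placeholders).
print(" ".join(blast_command("blastp", "query.fasta", "my_proteins", "results.xml")))
```

Passing the arguments as a list instead of a shell string avoids quoting problems with paths that contain spaces.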

#### Parsing BLAST Results

**XML format (recommended):**
```python
from Bio.Blast import NCBIXML

with open("blast_results.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)

    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:
                    print(f"Hit: {alignment.title}")
                    print(f"Length: {alignment.length}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identities: {hsp.identities}/{hsp.align_length}")
```

**Functions:**
- `NCBIXML.read()`: Single query
- `NCBIXML.parse()`: Multiple queries (generator)

**Key Record Attributes:**
- `alignments`: List of matching sequences
- `query`: Query definition line (the identifier is in `query_id`)
- `query_length`: Length of query

**Alignment Attributes:**
- `title`: Description of hit
- `length`: Length of hit sequence
- `hsps`: High-scoring segment pairs

**HSP Attributes:**
- `expect`: E-value
- `score`: Raw alignment score (the bit score is in `bits`)
- `identities`: Number of identical residues
- `positives`: Number of positive scoring matches
- `gaps`: Number of gaps
- `align_length`: Length of alignment
- `query`: Aligned query sequence
- `match`: Match indicators
- `sbjct`: Aligned subject sequence
- `query_start`, `query_end`: Query coordinates
- `sbjct_start`, `sbjct_end`: Subject coordinates
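
The HSP attributes above are enough to derive two commonly reported figures, percent identity and query coverage. The sketch below is illustrative only: the function names are hypothetical, the coverage formula assumes query coordinates and query length are in the same units (so it does not apply as-is to translated searches such as blastx), and a `SimpleNamespace` stands in for a real HSP object.

```python
from types import SimpleNamespace


def percent_identity(hsp) -> float:
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * hsp.identities / hsp.align_length


def query_coverage(hsp, query_length: int) -> float:
    """Percentage of the query spanned by this HSP (coordinates are inclusive)."""
    aligned = abs(hsp.query_end - hsp.query_start) + 1
    return 100.0 * aligned / query_length


# Stand-in carrying only the attributes used above; values are made up.
hsp = SimpleNamespace(identities=90, align_length=100, query_start=11, query_end=110)
print(percent_identity(hsp))     # 90.0
print(query_coverage(hsp, 200))  # 50.0
```

With parsed results, pass a real `hsp` and the record's `query_length` in place of the stand-in values.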

#### Common BLAST Workflows

**Find homologs:**
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
    out.write(result.read())
```

**Filter results by criteria:**
```python
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities / hsp.align_length > 0.5:
            # Process high-quality hits
            pass
```

### Bio.SearchIO - Unified Search Results Parser

Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).

**Key Functions:**
- `read()`: Parse single query
- `parse()`: Parse multiple queries (generator)
- `write()`: Write results to file
- `convert()`: Convert between formats

**Supported Tools:**
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate

**Example:**
```python
from Bio import SearchIO
results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
    for hit in result:
        if hit.hsps[0].evalue < 0.001:
            print(hit.id, hit.hsps[0].evalue)
```

## Local Database Management

### BioSQL - SQL Database Interface

Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).

**Features:**
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences

**Example:**
```python
from Bio import SeqIO
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]

# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))
server.commit()  # persist the loaded records

# Query
seq = db.lookup(accession="NC_005816")
```