BioPython Database Access and Search Tools
This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.
NCBI Database Access
Bio.Entrez - NCBI E-utilities Interface
Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.
Important: Always set your email before using Entrez:
from Bio import Entrez
Entrez.email = "your.email@example.com"
Core Query Functions
esearch - Search databases and retrieve IDs:
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
Parameters:
- db: Database to search (nucleotide, protein, pubmed, etc.)
- term: Search query
- retmax: Maximum number of IDs to return
- sort: Sort order (relevance, pub_date, etc.)
- usehistory: Store results on server (useful for large queries)
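The term parameter combines values with Entrez field tags such as [Organism]. As a minimal sketch, a small helper can assemble such a term from field/value pairs. Note that build_term is a hypothetical helper and not part of Biopython, and the [Gene] tag is used here only for illustration.

```python
def build_term(fields, operator="AND"):
    """Join {field_tag: value} pairs into an Entrez query string.

    Hypothetical helper for illustration; not part of Biopython.
    """
    parts = [f"{value}[{tag}]" for tag, value in fields.items()]
    return f" {operator} ".join(parts)


term = build_term({"Organism": "Homo sapiens", "Gene": "COX1"})
print(term)  # Homo sapiens[Organism] AND COX1[Gene]

# The term can then be passed to esearch (needs network access and Entrez.email):
# from Bio import Entrez
# handle = Entrez.esearch(db="nucleotide", term=term, retmax=20)
# id_list = Entrez.read(handle)["IdList"]
```

Keeping the query construction separate from the network call makes the search string easy to inspect before it is sent.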
efetch - Retrieve full records:
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
Parameters:
- db: Database name
- id: Single ID or comma-separated list
- rettype: Return type (gb, fasta, gp, xml, etc.)
- retmode: Return mode (text, xml, asn.1)
- Automatically uses POST for >200 IDs
elink - Find related records across databases:
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
Parameters:
- dbfrom: Source database
- db: Target database
- id: ID(s) to link from
- Can also return LinkOut providers and relevancy scores, depending on the cmd value
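The parsed elink result is a nested structure, so pulling out the linked IDs takes a few steps. The sketch below assumes the shape returned for the default link command (a list of link sets, each with a "LinkSetDb" list whose entries carry "LinkName" and "Link"). linked_ids is a hypothetical helper, and the example data is a hand-written stand-in with a made-up target ID.

```python
def linked_ids(elink_result):
    """Collect target IDs from a parsed elink result, grouped by link name."""
    links = {}
    for linkset in elink_result:
        # "LinkSetDb" can be missing or empty when no links were found
        for linkset_db in linkset.get("LinkSetDb", []):
            ids = [link["Id"] for link in linkset_db["Link"]]
            links.setdefault(linkset_db["LinkName"], []).extend(ids)
    return links


# Stand-in for Entrez.read(Entrez.elink(...)); the linked ID "1111" is made up
example = [
    {
        "DbFrom": "protein",
        "IdList": ["15718680"],
        "LinkSetDb": [
            {"DbTo": "gene", "LinkName": "protein_gene", "Link": [{"Id": "1111"}]},
        ],
    }
]
print(linked_ids(example))  # {'protein_gene': ['1111']}
```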
esummary - Get document summaries:
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
Returns quick overviews without full records.
einfo - Get database statistics:
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
Provides field indices, term counts, update dates, and available links.
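As an illustration of how the einfo result can be navigated, the sketch below lists the searchable fields of a database. It assumes the parsed record exposes "DbInfo" and "FieldList" keys with "Name" and "FullName" entries. list_search_fields is a hypothetical helper, and the example record is a trimmed, hand-written stand-in rather than real server output.

```python
def list_search_fields(einfo_record):
    """Return (Name, FullName) pairs for each searchable field of a database."""
    field_list = einfo_record["DbInfo"]["FieldList"]
    return [(field["Name"], field["FullName"]) for field in field_list]


# Trimmed stand-in for Entrez.read(Entrez.einfo(db="nucleotide"))
example = {
    "DbInfo": {
        "DbName": "nucleotide",
        "FieldList": [
            {"Name": "ORGN", "FullName": "Organism"},
            {"Name": "TITL", "FullName": "Title"},
        ],
    }
}
for name, full_name in list_search_fields(example):
    print(f"{name}: {full_name}")
```

The short names returned here are the tags that can be placed in square brackets inside an esearch term.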
epost - Upload ID lists to server:
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
Useful for large queries split across multiple requests.
espell - Get spelling suggestions:
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"]) # "breast cancer"
ecitmatch - Convert citations to PubMed IDs:
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
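The citation string is a pipe-delimited record. The sketch below assumes the E-utilities layout of journal, year, volume, first page, author name and a caller-chosen key, each followed by a pipe. format_citation is a hypothetical helper, and "Art1" is an arbitrary label used to match results back to inputs.

```python
def format_citation(journal, year, volume, first_page, author, key):
    """Build one pipe-delimited citation string for ecitmatch."""
    fields = (journal, year, volume, first_page, author, key)
    return "|".join(str(field) for field in fields) + "|"


citation = format_citation(
    "proc natl acad sci u s a", 1991, 88, 3248, "mann bj", "Art1"
)
print(citation)  # proc natl acad sci u s a|1991|88|3248|mann bj|Art1|

# Passing it on (needs network access and Entrez.email):
# from Bio import Entrez
# handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```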
Data Processing Functions
Entrez.read() - Parse XML to Python dictionary:
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
Entrez.parse() - Generator for large XML results:
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
    process(record)  # process() is a placeholder for your own function
Common Workflows
Download sequences by accession:
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
Search and download multiple sequences:
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
search_results = Entrez.read(search_handle)
# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
    print(record.id)
Use WebEnv for large queries:
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)
# Fetch in batches
batch_size = 500
count = len(large_id_list)  # total number of posted IDs
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=post_result["WebEnv"],
        query_key=post_result["QueryKey"],
    )
    # Process batch
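The same batching pattern also works when the IDs come from a search rather than from epost. The following is a minimal sketch, assuming esearch is called with usehistory="y" so that the result carries "Count", "WebEnv" and "QueryKey". batch_ranges and fetch_with_history are hypothetical helper names, and Entrez.email is assumed to be set already.

```python
def batch_ranges(count, batch_size):
    """Yield (retstart, retmax) pairs that together cover `count` records."""
    for start in range(0, count, batch_size):
        yield start, min(batch_size, count - start)


def fetch_with_history(term, db="nucleotide", batch_size=500):
    """Search with usehistory="y", then yield FASTA text one batch at a time."""
    # Imported here so that batch_ranges can be used without Biopython
    from Bio import Entrez

    search = Entrez.read(Entrez.esearch(db=db, term=term, usehistory="y"))
    count = int(search["Count"])
    for start, size in batch_ranges(count, batch_size):
        handle = Entrez.efetch(
            db=db,
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=size,
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        yield handle.read()
        handle.close()


print(list(batch_ranges(1200, 500)))  # [(0, 500), (500, 500), (1000, 200)]
```

Because the result set stays on the NCBI server, no ID list has to be sent back with each request.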
Bio.GenBank - GenBank Format Parsing
Low-level GenBank file parser (SeqIO is usually preferred).
Bio.SwissProt - Swiss-Prot/UniProt Parsing
Parse Swiss-Prot and UniProtKB flat file format:
from Bio import SwissProt
with open("uniprot.dat") as handle:
    for record in SwissProt.parse(handle):
        print(record.entry_name, record.organism)
Sequence Similarity Searches
Bio.Blast - BLAST Interface
Tools for running BLAST searches and parsing results.
Running BLAST
NCBI QBLAST (online):
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
Parameters:
- Program: blastn, blastp, blastx, tblastn, tblastx
- Database: nt, nr, refseq_rna, pdb, etc.
- Sequence: string or Seq object
- Additional parameters: expect, word_size, hitlist_size, format_type
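As a sketch of how these parameters fit together, the helper below checks the program name before any network request is made and collects the keyword arguments for qblast. qblast_arguments and run_qblast are hypothetical helpers, and the query sequence is a made-up placeholder.

```python
QBLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def qblast_arguments(program, database, sequence, **options):
    """Validate the program name and collect keyword arguments for qblast."""
    if program not in QBLAST_PROGRAMS:
        raise ValueError(f"Unknown BLAST program: {program!r}")
    arguments = {
        "program": program,
        "database": database,
        "sequence": str(sequence),  # accepts a plain string or a Seq object
    }
    arguments.update(options)
    return arguments


def run_qblast(out_path, **arguments):
    """Submit the search to NCBI and save the raw result to out_path."""
    # Imported here so that the validation step works without Biopython
    from Bio.Blast import NCBIWWW

    result_handle = NCBIWWW.qblast(**arguments)
    with open(out_path, "w") as out:
        out.write(result_handle.read())


args = qblast_arguments("blastn", "nt", "ACGTACGTACGT", expect=1e-5, hitlist_size=10)
print(args)
# run_qblast("blast_results.xml", **args)  # needs network access; can be slow
```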
Local BLAST: Run standalone BLAST from command line, then parse results.
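A minimal sketch of the local route, assuming the BLAST+ command-line tools are installed and on the PATH. The file and database names are placeholders, and build_blast_command and run_local_blast are hypothetical helpers. Output format 5 is BLAST XML, which the XML parsers described in this document can read.

```python
import subprocess


def build_blast_command(program, query, db, out, evalue=1e-5):
    """Return the BLAST+ argument list for an XML-output search."""
    return [
        program,
        "-query", query,
        "-db", db,
        "-outfmt", "5",  # 5 = BLAST XML
        "-out", out,
        "-evalue", str(evalue),
    ]


def run_local_blast(program, query, db, out, evalue=1e-5):
    """Run BLAST+ and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_blast_command(program, query, db, out, evalue), check=True)


command = build_blast_command("blastn", "query.fasta", "my_db", "results.xml")
print(" ".join(command))
# run_local_blast("blastn", "query.fasta", "my_db", "results.xml")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.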
Parsing BLAST Results
XML format (recommended):
from Bio.Blast import NCBIXML
with open("blast_results.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:
                    print(f"Hit: {alignment.title}")
                    print(f"Length: {alignment.length}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identities: {hsp.identities}/{hsp.align_length}")
Functions:
- NCBIXML.read(): Single query
- NCBIXML.parse(): Multiple queries (generator)
Key Record Attributes:
- alignments: List of matching sequences
- query: Query sequence name (definition line)
- query_length: Length of query
Alignment Attributes:
- title: Description of hit
- length: Length of hit sequence
- hsps: High-scoring segment pairs
HSP Attributes:
- expect: E-value
- score: Raw alignment score
- bits: Bit score
- identities: Number of identical residues
- positives: Number of positive scoring matches
- gaps: Number of gaps
- align_length: Length of alignment
- query: Aligned query sequence
- match: Match indicators
- sbjct: Aligned subject sequence
- query_start, query_end: Query coordinates
- sbjct_start, sbjct_end: Subject coordinates
Common BLAST Workflows
Find homologs:
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
    out.write(result.read())
Filter results by criteria:
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities / hsp.align_length > 0.5:
            # Process high-quality hits
            pass
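The filter above can be extended with query coverage, which is derived from the HSP coordinates and the query length. The sketch below uses SimpleNamespace objects as stand-ins for parsed HSPs so that it runs without a BLAST result file. The thresholds are arbitrary examples, the helper names are hypothetical, and coverage is computed per HSP rather than merged across HSPs.

```python
from types import SimpleNamespace


def percent_identity(hsp):
    """Identical residues as a percentage of the alignment length."""
    return 100.0 * hsp.identities / hsp.align_length


def query_coverage(hsp, query_length):
    """Percentage of the query spanned by this single HSP."""
    aligned = abs(hsp.query_end - hsp.query_start) + 1
    return 100.0 * aligned / query_length


def passes(hsp, query_length, max_evalue=1e-5, min_identity=50.0, min_coverage=70.0):
    """Apply E-value, identity and coverage thresholds to one HSP."""
    return (
        hsp.expect < max_evalue
        and percent_identity(hsp) > min_identity
        and query_coverage(hsp, query_length) >= min_coverage
    )


# Stand-in for an HSP from a parsed BLAST record; the values are made up
hsp = SimpleNamespace(
    expect=1e-20, identities=90, align_length=100, query_start=11, query_end=110
)
print(percent_identity(hsp), query_coverage(hsp, 200))  # 90.0 50.0
print(passes(hsp, query_length=200))  # False: coverage is below 70%
```

With real data, query_length would come from blast_record.query_length.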
Bio.SearchIO - Unified Search Results Parser
Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).
Key Functions:
- read(): Parse single query
- parse(): Parse multiple queries (generator)
- write(): Write results to file
- convert(): Convert between formats
Supported Tools:
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate
Example:
from Bio import SearchIO
results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
    for hit in result:
        if hit.hsps[0].evalue < 0.001:
            print(hit.id, hit.hsps[0].evalue)
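Building on the loop above, a common follow-up is to keep only the best hit per query. The sketch below relies only on the attributes already used in the example (id, hsps, evalue), so it can be exercised with small stand-in objects. best_hits, best_hits_from_file and the _Stub class are hypothetical names, and the demo data is made up.

```python
from types import SimpleNamespace


def best_hits(qresults, max_evalue=0.001):
    """Map each query ID to its lowest-E-value hit as (hit_id, evalue)."""
    best = {}
    for qresult in qresults:
        candidates = [
            (min(hsp.evalue for hsp in hit.hsps), hit.id) for hit in qresult
        ]
        if not candidates:
            continue  # this query returned no hits
        evalue, hit_id = min(candidates)
        if evalue < max_evalue:
            best[qresult.id] = (hit_id, evalue)
    return best


def best_hits_from_file(path, fmt="blast-xml"):
    """Parse a search output file with SearchIO and return the best hits."""
    from Bio import SearchIO  # imported here so best_hits works without Biopython

    return best_hits(SearchIO.parse(path, fmt))


class _Stub(list):
    """List with an id attribute, standing in for a query result in the demo."""

    def __init__(self, id, items=()):
        super().__init__(items)
        self.id = id


demo = [
    _Stub("q1", [
        SimpleNamespace(id="hitA", hsps=[SimpleNamespace(evalue=1e-30)]),
        SimpleNamespace(id="hitB", hsps=[SimpleNamespace(evalue=1e-5)]),
    ]),
    _Stub("q2"),  # query without hits
]
print(best_hits(demo))  # {'q1': ('hitA', 1e-30)}
```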
Local Database Management
BioSQL - SQL Database Interface
Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).
Features:
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences
Example:
from Bio import SeqIO
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]  # must already exist; create it with server.new_database("my_db")
# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))
server.commit()  # persist the loaded records
# Query
seq = db.lookup(accession="NC_005816")