# BioPython Database Access and Search Tools This document covers BioPython's capabilities for accessing biological databases and performing sequence searches. ## NCBI Database Access ### Bio.Entrez - NCBI E-utilities Interface Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more. **Important:** Always set your email before using Entrez: ```python from Bio import Entrez Entrez.email = "your.email@example.com" ``` #### Core Query Functions **esearch** - Search databases and retrieve IDs: ```python handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1") record = Entrez.read(handle) id_list = record["IdList"] ``` Parameters: - `db`: Database to search (nucleotide, protein, pubmed, etc.) - `term`: Search query - `retmax`: Maximum number of IDs to return - `sort`: Sort order (relevance, pub_date, etc.) - `usehistory`: Store results on server (useful for large queries) **efetch** - Retrieve full records: ```python handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text") record = SeqIO.read(handle, "genbank") ``` Parameters: - `db`: Database name - `id`: Single ID or comma-separated list - `rettype`: Return type (gb, fasta, gp, xml, etc.) - `retmode`: Return mode (text, xml, asn.1) - Automatically uses POST for >200 IDs **elink** - Find related records across databases: ```python handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680") result = Entrez.read(handle) ``` Parameters: - `dbfrom`: Source database - `db`: Target database - `id`: ID(s) to link from - Returns LinkOut providers and relevancy scores **esummary** - Get document summaries: ```python handle = Entrez.esummary(db="protein", id="15718680") summary = Entrez.read(handle) print(summary[0]['Title']) ``` Returns quick overviews without full records. **einfo** - Get database statistics: ```python handle = Entrez.einfo(db="nucleotide") info = Entrez.read(handle) ``` Provides field indices, term counts, update dates, and available links. **epost** - Upload ID lists to server: ```python handle = Entrez.epost("nucleotide", id="123456,789012") result = Entrez.read(handle) webenv = result["WebEnv"] query_key = result["QueryKey"] ``` Useful for large queries split across multiple requests. **espell** - Get spelling suggestions: ```python handle = Entrez.espell(term="brest cancer") result = Entrez.read(handle) print(result["CorrectedQuery"]) # "breast cancer" ``` **ecitmatch** - Convert citations to PubMed IDs: ```python citation = "proc natl acad sci u s a|1991|88|3248|mann bj|" handle = Entrez.ecitmatch(db="pubmed", bdata=citation) ``` #### Data Processing Functions **Entrez.read()** - Parse XML to Python dictionary: ```python handle = Entrez.esearch(db="protein", term="insulin") record = Entrez.read(handle) ``` **Entrez.parse()** - Generator for large XML results: ```python handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml") for record in Entrez.parse(handle): process(record) ``` #### Common Workflows **Download sequences by accession:** ```python handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text") record = SeqIO.read(handle, "fasta") ``` **Search and download multiple sequences:** ```python # Search search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100") search_results = Entrez.read(search_handle) # Download fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text") for record in SeqIO.parse(fetch_handle, "genbank"): print(record.id) ``` **Use WebEnv for large queries:** ```python # Post IDs post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list)) post_result = Entrez.read(post_handle) # Fetch in batches batch_size = 500 for start in range(0, count, batch_size): fetch_handle = Entrez.efetch( db="nucleotide", rettype="fasta", retmode="text", retstart=start, retmax=batch_size, webenv=post_result["WebEnv"], query_key=post_result["QueryKey"] ) # Process batch ``` ### Bio.GenBank - GenBank Format Parsing Low-level GenBank file parser (SeqIO is usually preferred). ### Bio.SwissProt - Swiss-Prot/UniProt Parsing Parse Swiss-Prot and UniProtKB flat file format: ```python from Bio import SwissProt with open("uniprot.dat") as handle: for record in SwissProt.parse(handle): print(record.entry_name, record.organism) ``` ## Sequence Similarity Searches ### Bio.Blast - BLAST Interface Tools for running BLAST searches and parsing results. #### Running BLAST **NCBI QBLAST (online):** ```python from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "nt", sequence) ``` Parameters: - Program: blastn, blastp, blastx, tblastn, tblastx - Database: nt, nr, refseq_rna, pdb, etc. - Sequence: string or Seq object - Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type` **Local BLAST:** Run standalone BLAST from command line, then parse results. #### Parsing BLAST Results **XML format (recommended):** ```python from Bio.Blast import NCBIXML result_handle = open("blast_results.xml") blast_records = NCBIXML.parse(result_handle) for blast_record in blast_records: for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < 0.001: print(f"Hit: {alignment.title}") print(f"Length: {alignment.length}") print(f"E-value: {hsp.expect}") print(f"Identities: {hsp.identities}/{hsp.align_length}") ``` **Functions:** - `NCBIXML.read()`: Single query - `NCBIXML.parse()`: Multiple queries (generator) **Key Record Attributes:** - `alignments`: List of matching sequences - `query`: Query sequence ID - `query_length`: Length of query **Alignment Attributes:** - `title`: Description of hit - `length`: Length of hit sequence - `hsps`: High-scoring segment pairs **HSP Attributes:** - `expect`: E-value - `score`: Bit score - `identities`: Number of identical residues - `positives`: Number of positive scoring matches - `gaps`: Number of gaps - `align_length`: Length of alignment - `query`: Aligned query sequence - `match`: Match indicators - `sbjct`: Aligned subject sequence - `query_start`, `query_end`: Query coordinates - `sbjct_start`, `sbjct_end`: Subject coordinates #### Common BLAST Workflows **Find homologs:** ```python result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10) with open("results.xml", "w") as out: out.write(result.read()) ``` **Filter results by criteria:** ```python for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < 1e-5 and hsp.identities/hsp.align_length > 0.5: # Process high-quality hits pass ``` ### Bio.SearchIO - Unified Search Results Parser Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.). **Key Functions:** - `read()`: Parse single query - `parse()`: Parse multiple queries (generator) - `write()`: Write results to file - `convert()`: Convert between formats **Supported Tools:** - BLAST (XML, tabular, plain text) - HMMER (hmmscan, hmmsearch, phmmer) - BLAT - FASTA - InterProScan - Exonerate **Example:** ```python from Bio import SearchIO results = SearchIO.parse("blast_output.xml", "blast-xml") for result in results: for hit in result: if hit.hsps[0].evalue < 0.001: print(hit.id, hit.hsps[0].evalue) ``` ## Local Database Management ### BioSQL - SQL Database Interface Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite). **Features:** - Store SeqRecord objects with annotations - Efficient querying and retrieval - Cross-reference sequences - Track relationships between sequences **Example:** ```python from BioSQL import BioSeqDatabase server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb") db = server["my_db"] # Store sequences db.load(SeqIO.parse("sequences.gb", "genbank")) # Query seq = db.lookup(accession="NC_005816") ```