BioPython Core Modules Reference
This document provides detailed information about BioPython's core modules and their capabilities.
Sequence Handling
Bio.Seq - Sequence Objects
Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior.
Creation:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
Key Operations:
- String methods: find(), count(), count_overlap() (for overlapping patterns)
- Complement/Reverse complement: Returns complementary sequences
- Transcription: DNA → RNA (T → U)
- Back transcription: RNA → DNA
- Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling
Use Cases:
- DNA/RNA sequence manipulation
- Converting between nucleic acid types
- Protein translation from coding sequences
- Sequence searching and pattern counting
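The operations listed above can be sketched as follows. This is a minimal illustration, not a complete tour of the Seq API; the coding sequence is an arbitrary example chosen for this sketch.

```python
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")

# String-like searching and counting
first_hit = my_seq.find("ACT")                    # index of first match, -1 if absent
plain_count = Seq("AAAA").count("AA")             # non-overlapping matches
overlap_count = Seq("AAAA").count_overlap("AA")   # overlapping matches

# Complement and reverse complement
comp = my_seq.complement()
rev_comp = my_seq.reverse_complement()

# Transcription, back transcription, and translation
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")  # example sequence
mrna = coding_dna.transcribe()                    # T -> U
back = mrna.back_transcribe()                     # U -> T
protein_full = coding_dna.translate()             # stop codons shown as "*"
protein = coding_dna.translate(to_stop=True)      # stop at first stop codon

print(comp, rev_comp, mrna, protein_full, protein)
```

Passing to_stop=True truncates the protein at the first in-frame stop codon, which is usually what is wanted for a single coding sequence.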
Bio.SeqRecord - Sequence Metadata
SeqRecord wraps Seq objects with metadata like ID, description, and features.
Attributes:
- seq: The sequence itself (Seq object)
- id: Unique identifier
- name: Short name
- description: Longer description
- features: List of SeqFeature objects
- annotations: Dictionary of additional information
- letter_annotations: Per-letter annotations (e.g., quality scores)
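As a rough sketch of how these attributes fit together, the snippet below builds a SeqRecord by hand. The ID, name, description, and quality values are hypothetical placeholders invented for this example.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="example_1",                          # hypothetical identifier
    name="example",
    description="hypothetical coding sequence",
    annotations={"molecule_type": "DNA"},
)

# Per-letter annotations must have the same length as the sequence
record.letter_annotations["phred_quality"] = [30] * len(record.seq)

# Slicing a record slices the sequence and its per-letter annotations together
first_codon = record[:3]
print(first_codon.seq, first_codon.letter_annotations["phred_quality"])
```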
Bio.SeqFeature - Sequence Annotations
Manages sequence annotations and features such as genes, promoters, and coding regions.
Common Features:
- Gene locations
- CDS (coding sequences)
- Promoters and regulatory elements
- Exons and introns
- Protein domains
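A feature is essentially a location plus a type and qualifiers. The sketch below marks a CDS on a short example sequence and extracts it. The gene name and coordinates are made up for illustration, and FeatureLocation is used here because it is available across Biopython versions (newer releases also offer it under the name SimpleLocation).

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

genome = Seq("GGGATGGCCATTGTAATGGGCCGCTGACCC")  # example sequence

# Coordinates are 0-based and end-exclusive, like Python slices
cds = SeqFeature(
    FeatureLocation(3, 27, strand=1),
    type="CDS",
    qualifiers={"gene": ["exampleGene"]},       # hypothetical gene name
)

cds_seq = cds.extract(genome)                   # subsequence covered by the feature
cds_protein = cds_seq.translate(to_stop=True)

# A feature on the reverse strand is extracted as the reverse complement
minus_feature = SeqFeature(FeatureLocation(0, 3, strand=-1), type="misc_feature")
minus_seq = minus_feature.extract(genome)

print(cds_seq, cds_protein, minus_seq)
```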
File Input/Output
Bio.SeqIO - Sequence File I/O
Unified interface for reading and writing sequence files in multiple formats.
Supported Formats:
- FASTA/FASTQ: Standard sequence formats
- GenBank/EMBL: Feature-rich annotation formats
- Clustal/Stockholm/PHYLIP: Alignment formats
- ABI/SFF: Trace and flowgram data
- Swiss-Prot/PIR: Protein databases
- PDB: Protein structure files
Key Functions:
SeqIO.parse() - Iterator for reading multiple records:
from Bio import SeqIO
for record in SeqIO.parse("file.fasta", "fasta"):
    print(record.id, len(record.seq))
SeqIO.read() - Read single record:
record = SeqIO.read("file.fasta", "fasta")
SeqIO.write() - Write sequences:
SeqIO.write(sequences, "output.fasta", "fasta")
SeqIO.convert() - Direct format conversion:
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
SeqIO.index() - Memory-efficient random access for large files:
record_dict = SeqIO.index("large_file.fasta", "fasta")
record = record_dict["seq_id"]  # a SeqRecord, parsed from disk on access
SeqIO.to_dict() - Load all records into dictionary (memory-based):
record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta"))
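To make the difference between index() and to_dict() concrete, here is a small self-contained sketch. It writes a throwaway FASTA file to a temporary directory (the file name and records are invented for the example), since SeqIO.index() needs a real file name rather than an open handle.

```python
import os
import tempfile

from Bio import SeqIO

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")   # hypothetical file
    with open(path, "w") as handle:
        handle.write(">seq1\nATGC\n>seq2\nGGCCTT\n")

    # to_dict(): every record is parsed and held in memory
    in_memory = SeqIO.to_dict(SeqIO.parse(path, "fasta"))

    # index(): only file offsets are stored; records are parsed on access
    indexed = SeqIO.index(path, "fasta")
    indexed_ids = sorted(indexed)
    indexed_seq = str(indexed["seq2"].seq)
    indexed.close()                             # release the file handle

print(sorted(in_memory), indexed_ids, indexed_seq)
```

For files that fit comfortably in memory, to_dict() is simpler; index() trades some access speed for a much smaller memory footprint.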
Common Patterns:
- Format conversion between FASTA, GenBank, FASTQ
- Filtering sequences by length, ID, or content
- Extracting subsequences
- Batch processing large files with iterators
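Several of these patterns combine naturally. The sketch below reads FASTQ records lazily, keeps only reads above a length cutoff, and writes the survivors as FASTA. The reads and the cutoff of 8 bases are arbitrary example values, and in-memory StringIO handles stand in for real files.

```python
from io import StringIO

from Bio import SeqIO

fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTACGTAC\n+\nIIIIIIIIII\n"
)

# parse() returns an iterator, so records are processed one at a time
reads = SeqIO.parse(StringIO(fastq_text), "fastq")

# Generator expression: filtering stays lazy as well
kept = (rec for rec in reads if len(rec.seq) >= 8)

out = StringIO()
n_written = SeqIO.write(kept, out, "fasta")     # returns number of records written
fasta_text = out.getvalue()
print(n_written)
print(fasta_text)
```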
Bio.AlignIO - Multiple Sequence Alignment I/O
Handles multiple sequence alignment files.
Key Functions:
- write(): Save alignments
- parse(): Read multiple alignments
- read(): Read single alignment
- convert(): Convert between formats
Supported Formats:
- Clustal
- PHYLIP (sequential and interleaved)
- Stockholm
- NEXUS
- FASTA (aligned)
- MAF (Multiple Alignment Format)
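A minimal sketch of the AlignIO workflow, assuming a tiny two-sequence aligned FASTA input invented for this example: read one alignment, inspect its shape, and write it back out in Clustal format.

```python
from io import StringIO

from Bio import AlignIO

fasta_alignment = ">seqA\nACGT-\n>seqB\nAC-TT\n"  # example aligned FASTA

# read() expects exactly one alignment in the input
alignment = AlignIO.read(StringIO(fasta_alignment), "fasta")

n_rows = len(alignment)                           # number of sequences
n_cols = alignment.get_alignment_length()         # number of columns
third_column = alignment[:, 2]                    # one column as a string

out = StringIO()
n_written = AlignIO.write(alignment, out, "clustal")
clustal_text = out.getvalue()
print(n_rows, n_cols, third_column)
print(clustal_text)
```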
Sequence Alignment
Bio.Align - Alignment Tools
PairwiseAligner - High-performance pairwise alignment:
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5
alignments = aligner.align(seq1, seq2)
CodonAligner - Codon-aware alignment
MultipleSeqAlignment - Container for MSA with column access
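The PairwiseAligner configuration above leaves seq1 and seq2 undefined. A runnable version, using two short example sequences chosen for this sketch, might look like this:

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5          # applied to every gap position, including end gaps

seq1 = "GAACT"                    # example sequences
seq2 = "GAT"

best_score = aligner.score(seq1, seq2)    # score only, without building alignments
alignments = aligner.align(seq1, seq2)    # lazy collection of optimal alignments
best = alignments[0]

print(best_score)
print(best)
```

With these parameters, the optimal global alignment has three matches and two gap positions, giving 3 × 2 + 2 × (−2.5) = 1.0. Calling score() is cheaper than align() when only the number is needed.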
Bio.pairwise2 (Legacy)
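Legacy pairwise alignment module with functions like align.globalxx() and align.localxx(). It is deprecated in recent Biopython releases; new code should use Bio.Align.PairwiseAligner instead.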
Sequence Analysis Utilities
Bio.SeqUtils - Sequence Analysis
Collection of utility functions:
CheckSum - Calculate sequence checksums (CRC32, CRC64, GCG)
MeltingTemp - DNA melting temperature calculations:
- Nearest-neighbor method
- Wallace rule
- GC content method
IsoelectricPoint - Protein pI calculation
ProtParam - Protein analysis:
- Molecular weight
- Aromaticity
- Instability index
- Secondary structure fractions
gc_fraction/GC_skew - Calculate GC content and GC skew for sequence windows (gc_fraction replaces the older GC function in Biopython 1.80 and later)
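A short sketch of a few of these utilities, assuming Biopython 1.80 or later (where gc_fraction is available). The primer and peptide are arbitrary example inputs.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis

primer = Seq("ACGTACGT")                  # example oligo

gc = gc_fraction(primer)                  # fraction between 0 and 1
tm_wallace = mt.Tm_Wallace(primer)        # 2*(A+T) + 4*(G+C) rule of thumb
tm_nn = mt.Tm_NN(primer)                  # nearest-neighbor thermodynamics

analysis = ProteinAnalysis("MAIVMGR")     # example peptide
weight = analysis.molecular_weight()
aromaticity = analysis.aromaticity()      # fraction of F, W, and Y residues
pi = analysis.isoelectric_point()

print(gc, tm_wallace, tm_nn)
print(weight, aromaticity, pi)
```

The Wallace rule is only a rough estimate for short oligos; the nearest-neighbor method is generally preferred when salt and primer concentrations are known.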
Bio.Data.CodonTable - Genetic Codes
Access to NCBI genetic code tables:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table.forward_table) # codon to amino acid
print(standard_table.back_table) # amino acid to one representative codon
print(standard_table.start_codons)
print(standard_table.stop_codons)
Available codes:
- Standard code (1)
- Vertebrate mitochondrial (2)
- Yeast mitochondrial (3)
- And many more organism-specific codes
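The choice of table changes the translation. As an illustration, the sketch below translates the same short example sequence with the standard code and the vertebrate mitochondrial code, in which TGA encodes tryptophan instead of a stop.

```python
from Bio.Data import CodonTable
from Bio.Seq import Seq

standard = CodonTable.unambiguous_dna_by_id[1]
vert_mito = CodonTable.unambiguous_dna_by_id[2]

# TGA is a stop codon in the standard code but codes for W in table 2
tga_is_standard_stop = "TGA" in standard.stop_codons
tga_in_mito = vert_mito.forward_table["TGA"]

coding = Seq("ATGTGGTGA")                       # example sequence
standard_protein = coding.translate(table=1)
mito_protein = coding.translate(table=2)

print(standard_protein, mito_protein)
```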
Sequence Motifs and Patterns
Bio.motifs - Sequence Motif Analysis
Tools for working with sequence motifs:
Position Weight Matrices (PWM):
- Create PWM from aligned sequences
- Calculate information content
- Search sequences for motif matches
- Generate consensus sequences
Position Specific Scoring Matrices (PSSM):
- Convert PWM to PSSM
- Score sequences against motifs
- Determine significance thresholds
Supported Formats:
- JASPAR
- TRANSFAC
- MEME
- AlignAce
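The PWM and PSSM steps above can be sketched end to end. The aligned instances and the test sequence are small example inputs, and the pseudocount of 0.5 and threshold of 3.0 are illustrative choices, not recommended defaults.

```python
from Bio import motifs
from Bio.Seq import Seq

instances = [
    Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"),
    Seq("AACCC"), Seq("AATGC"), Seq("AATGC"),
]
motif = motifs.create(instances)

consensus = motif.consensus                     # most frequent letter per column
a_counts = list(motif.counts["A"])              # raw counts of A per position

# Counts -> position weight matrix -> log-odds scoring matrix
pwm = motif.counts.normalize(pseudocounts=0.5)
pssm = pwm.log_odds()

# Scan a sequence; negative positions denote hits on the reverse strand
test_seq = Seq("TACACTGCATTACAACCCAAGCATTA")
hits = list(pssm.search(test_seq, threshold=3.0))

print(consensus, a_counts)
for position, score in hits:
    print(position, score)
```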
Bio.Restriction - Restriction Enzymes
Comprehensive restriction enzyme database and analysis:
Capabilities:
- Search for restriction sites
- Predict digestion products
- Analyze restriction maps
- Access enzyme properties (recognition site, cut positions, isoschizomers)
Example usage:
from Bio import Restriction
from Bio.Seq import Seq
seq = Seq("GAATTC...")
enzyme = Restriction.EcoRI
results = enzyme.search(seq)
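For a fully runnable variant of the example above, the sketch below uses a complete (made-up) sequence containing a single EcoRI site and also predicts the digestion fragments.

```python
from Bio.Restriction import EcoRI
from Bio.Seq import Seq

seq = Seq("AAAGAATTCTTTT")            # example sequence with one GAATTC site

site = EcoRI.site                     # recognition sequence

# search() returns 1-based positions of the first base after each cut
cut_positions = EcoRI.search(seq)

# catalyse() returns the fragments produced by digesting a linear sequence
fragments = EcoRI.catalyse(seq)

print(site, cut_positions)
print([str(fragment) for fragment in fragments])
```

EcoRI cuts between G and AATTC, so the reported position points at the A that follows the cut.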