# BioPython Core Modules Reference This document provides detailed information about BioPython's core modules and their capabilities. ## Sequence Handling ### Bio.Seq - Sequence Objects Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior. **Creation:** ```python from Bio.Seq import Seq my_seq = Seq("AGTACACTGGT") ``` **Key Operations:** - String methods: `find()`, `count()`, `count_overlap()` (for overlapping patterns) - Complement/Reverse complement: Returns complementary sequences - Transcription: DNA → RNA (T → U) - Back transcription: RNA → DNA - Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling **Use Cases:** - DNA/RNA sequence manipulation - Converting between nucleic acid types - Protein translation from coding sequences - Sequence searching and pattern counting ### Bio.SeqRecord - Sequence Metadata SeqRecord wraps Seq objects with metadata like ID, description, and features. **Attributes:** - `seq`: The sequence itself (Seq object) - `id`: Unique identifier - `name`: Short name - `description`: Longer description - `features`: List of SeqFeature objects - `annotations`: Dictionary of additional information - `letter_annotations`: Per-letter annotations (e.g., quality scores) ### Bio.SeqFeature - Sequence Annotations Manages sequence annotations and features such as genes, promoters, and coding regions. **Common Features:** - Gene locations - CDS (coding sequences) - Promoters and regulatory elements - Exons and introns - Protein domains ## File Input/Output ### Bio.SeqIO - Sequence File I/O Unified interface for reading and writing sequence files in multiple formats. **Supported Formats:** - FASTA/FASTQ: Standard sequence formats - GenBank/EMBL: Feature-rich annotation formats - Clustal/Stockholm/PHYLIP: Alignment formats - ABI/SFF: Trace and flowgram data - Swiss-Prot/PIR: Protein databases - PDB: Protein structure files **Key Functions:** **SeqIO.parse()** - Iterator for reading multiple records: ```python from Bio import SeqIO for record in SeqIO.parse("file.fasta", "fasta"): print(record.id, len(record.seq)) ``` **SeqIO.read()** - Read single record: ```python record = SeqIO.read("file.fasta", "fasta") ``` **SeqIO.write()** - Write sequences: ```python SeqIO.write(sequences, "output.fasta", "fasta") ``` **SeqIO.convert()** - Direct format conversion: ```python count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta") ``` **SeqIO.index()** - Memory-efficient random access for large files: ```python record_dict = SeqIO.index("large_file.fasta", "fasta") sequence = record_dict["seq_id"] ``` **SeqIO.to_dict()** - Load all records into dictionary (memory-based): ```python record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta")) ``` **Common Patterns:** - Format conversion between FASTA, GenBank, FASTQ - Filtering sequences by length, ID, or content - Extracting subsequences - Batch processing large files with iterators ### Bio.AlignIO - Multiple Sequence Alignment I/O Handles multiple sequence alignment files. **Key Functions:** - `write()`: Save alignments - `parse()`: Read multiple alignments - `read()`: Read single alignment - `convert()`: Convert between formats **Supported Formats:** - Clustal - PHYLIP (sequential and interleaved) - Stockholm - NEXUS - FASTA (aligned) - MAF (Multiple Alignment Format) ## Sequence Alignment ### Bio.Align - Alignment Tools **PairwiseAligner** - High-performance pairwise alignment: ```python from Bio import Align aligner = Align.PairwiseAligner() aligner.mode = 'global' # or 'local' aligner.match_score = 2 aligner.mismatch_score = -1 aligner.gap_score = -2.5 alignments = aligner.align(seq1, seq2) ``` **CodonAligner** - Codon-aware alignment **MultipleSeqAlignment** - Container for MSA with column access ### Bio.pairwise2 (Legacy) Legacy pairwise alignment module with functions like `align.globalxx()`, `align.localxx()`. ## Sequence Analysis Utilities ### Bio.SeqUtils - Sequence Analysis Collection of utility functions: **CheckSum** - Calculate sequence checksums (CRC32, CRC64, GCG) **MeltingTemp** - DNA melting temperature calculations: - Nearest-neighbor method - Wallace rule - GC content method **IsoelectricPoint** - Protein pI calculation **ProtParam** - Protein analysis: - Molecular weight - Aromaticity - Instability index - Secondary structure fractions **GC/GC_skew** - Calculate GC content and GC skew for sequence windows ### Bio.Data.CodonTable - Genetic Codes Access to NCBI genetic code tables: ```python from Bio.Data import CodonTable standard_table = CodonTable.unambiguous_dna_by_id[1] print(standard_table.forward_table) # codon to amino acid print(standard_table.back_table) # amino acid to codons print(standard_table.start_codons) print(standard_table.stop_codons) ``` **Available codes:** - Standard code (1) - Vertebrate mitochondrial (2) - Yeast mitochondrial (3) - And many more organism-specific codes ## Sequence Motifs and Patterns ### Bio.motifs - Sequence Motif Analysis Tools for working with sequence motifs: **Position Weight Matrices (PWM):** - Create PWM from aligned sequences - Calculate information content - Search sequences for motif matches - Generate consensus sequences **Position Specific Scoring Matrices (PSSM):** - Convert PWM to PSSM - Score sequences against motifs - Determine significance thresholds **Supported Formats:** - JASPAR - TRANSFAC - MEME - AlignAce ### Bio.Restriction - Restriction Enzymes Comprehensive restriction enzyme database and analysis: **Capabilities:** - Search for restriction sites - Predict digestion products - Analyze restriction maps - Access enzyme properties (recognition site, cut positions, isoschizomers) **Example usage:** ```python from Bio import Restriction from Bio.Seq import Seq seq = Seq("GAATTC...") enzyme = Restriction.EcoRI results = enzyme.search(seq) ```