Mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git
Synced 2026-03-27 07:09:27 +08:00
Commit: Add more scientific skills

File: scientific-packages/biopython/SKILL.md
---
name: biopython
description: Comprehensive toolkit for computational molecular biology using BioPython. Use this skill when working with biological sequences (DNA, RNA, protein), parsing sequence files (FASTA, GenBank, FASTQ), accessing NCBI databases (Entrez, BLAST), performing sequence alignments, building phylogenetic trees, analyzing protein structures (PDB), or any bioinformatics task requiring BioPython modules.
---

# BioPython

## Overview

BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.

## When to Use This Skill

Use this skill when:
- Working with biological sequences (DNA, RNA, protein)
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
- Running or parsing BLAST searches
- Performing sequence alignments (pairwise or multiple)
- Building or analyzing phylogenetic trees
- Analyzing protein structures (PDB files)
- Calculating sequence properties (GC content, melting temperature, molecular weight)
- Converting between sequence file formats
- Performing population genetics analysis
- Any bioinformatics task requiring BioPython

## Core Capabilities

### 1. Sequence Manipulation

Create and manipulate biological sequences using `Bio.Seq`:

```python
from Bio.Seq import Seq

dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe()           # DNA → RNA
protein = dna_seq.translate()            # DNA → Protein
rev_comp = dna_seq.reverse_complement()  # Reverse complement
```

**Common operations:**
- Transcription and back-transcription
- Translation with custom genetic codes
- Complement and reverse complement
- Sequence slicing and concatenation
- Pattern searching and counting
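Translation honors NCBI genetic-code tables, so the same codons can decode differently depending on the organism; a minimal sketch contrasting the standard and vertebrate mitochondrial tables:

```python
from Bio.Seq import Seq

dna = Seq("ATGTGA")
print(dna.translate())               # M* (TGA is a stop in the standard code)
print(dna.translate(table=2))        # MW (TGA codes tryptophan in vertebrate mitochondria)
print(dna.translate(to_stop=True))   # M  (stop translating at the first stop codon)
```

The `table` argument accepts either an NCBI table number or a name such as `"Vertebrate Mitochondrial"`.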
**Reference:** See `references/core_modules.md` (section: Bio.Seq) for detailed operations and examples.

### 2. File Input/Output

Read and write sequence files in multiple formats using `Bio.SeqIO`:

```python
from Bio import SeqIO

# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))

# Write sequences
SeqIO.write(records, "output.gb", "genbank")

# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
```

**Supported formats:** FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.

**Common workflows:**
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
- Filtering sequences by length, ID, or content
- Batch processing large files with iterators
- Random access with `SeqIO.index()` for large files
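Filtering by length is a one-liner over the `SeqIO.parse()` iterator; a small sketch using a hypothetical in-memory FASTA in place of a real file:

```python
import io
from Bio import SeqIO

# Hypothetical in-memory FASTA standing in for a real file on disk
fasta = io.StringIO(">seq1\nATGC\n>seq2\nATGCATGCAT\n>seq3\nAT\n")

# Keep only records at least 5 bp long
long_records = [r for r in SeqIO.parse(fasta, "fasta") if len(r.seq) >= 5]
print([r.id for r in long_records])  # ['seq2']
```

The same pattern works for filtering by ID or content; pass the surviving records straight to `SeqIO.write()`.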
**Script:** Use `scripts/file_io.py` for file I/O examples and patterns.

**Reference:** See `references/core_modules.md` (section: Bio.SeqIO) for comprehensive format details and workflows.

### 3. NCBI Database Access

Access NCBI databases (GenBank, PubMed, Protein, etc.) using `Bio.Entrez`:

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"  # Required!

# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]

# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
```

**Key Entrez functions:**
- `esearch()`: Search databases, retrieve IDs
- `efetch()`: Download full records
- `esummary()`: Get document summaries
- `elink()`: Find related records across databases
- `einfo()`: Get database information
- `epost()`: Upload ID lists for large queries

**Important:** Always set `Entrez.email` before using Entrez functions.

**Script:** Use `scripts/ncbi_entrez.py` for complete Entrez workflows including batch downloads and WebEnv usage.

**Reference:** See `references/database_tools.md` (section: Bio.Entrez) for detailed function documentation and parameters.

### 4. BLAST Searches

Run BLAST searches and parse results using `Bio.Blast`:

```python
from Bio.Blast import NCBIWWW, NCBIXML

# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)

# Save results
with open("blast_results.xml", "w") as out:
    out.write(result_handle.read())

# Parse results
with open("blast_results.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 0.001:
            print(f"Hit: {alignment.title}")
            print(f"E-value: {hsp.expect}")
            print(f"Identity: {hsp.identities}/{hsp.align_length}")
```

**BLAST programs:** blastn, blastp, blastx, tblastn, tblastx

**Key result attributes:**
- `alignment.title`: Hit description
- `hsp.expect`: E-value
- `hsp.identities`: Number of identical residues
- `hsp.query`, `hsp.match`, `hsp.sbjct`: Aligned sequences

**Script:** Use `scripts/blast_search.py` for complete BLAST workflows including result filtering and extraction.

**Reference:** See `references/database_tools.md` (section: Bio.Blast) for detailed parsing and filtering strategies.

### 5. Sequence Alignment

Perform pairwise and multiple sequence alignments using `Bio.Align`:

**Pairwise alignment:**
```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = 'global'  # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2

alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")
```
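For protein alignments, a substitution matrix usually replaces the flat match/mismatch scores; a short sketch using the bundled BLOSUM62 matrix (loaded via `Bio.Align.substitution_matrices` in modern Biopython):

```python
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "global"
# Score residue pairs with BLOSUM62 instead of flat match/mismatch scores
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

# Identical sequences: score is the sum of BLOSUM62 diagonal entries
alignments = aligner.align("PAWHEAE", "PAWHEAE")
print(alignments.score)
```

Other bundled matrices (e.g. `"BLOSUM80"`, `"PAM250"`) load the same way; `substitution_matrices.load()` with no argument lists what is available.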
**Multiple sequence alignment I/O:**
```python
from Bio import AlignIO

# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")

# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")

# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
```

**Supported formats:** Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF

**Script:** Use `scripts/alignment_phylogeny.py` for alignment examples and workflows.

**Reference:** See `references/core_modules.md` (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.

### 6. Phylogenetic Analysis

Build and analyze phylogenetic trees using `Bio.Phylo`:

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read alignment (aligned FASTA)
alignment = AlignIO.read("sequences.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm)  # or constructor.nj(dm)

# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree)  # matplotlib visualization

# Save tree
Phylo.write(tree, "tree.nwk", "newick")
```

**Tree manipulation:**
- `tree.ladderize()`: Sort branches
- `tree.root_at_midpoint()`: Root at midpoint
- `tree.prune()`: Remove taxa
- `tree.collapse_all()`: Collapse short branches
- `tree.distance()`: Calculate distances between clades

**Supported formats:** Newick, NEXUS, PhyloXML, NeXML

**Script:** Use `scripts/alignment_phylogeny.py` for tree construction and manipulation examples.

**Reference:** See `references/specialized_modules.md` (section: Bio.Phylo) for comprehensive tree analysis capabilities.
### 7. Structural Bioinformatics

Analyze protein structures using `Bio.PDB`:

```python
from Bio.PDB import PDBParser, PDBList

# Download structure (saved as pdb1abc.ent by default for the "pdb" format)
pdbl = PDBList()
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")

# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", "pdb1abc.ent")

# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.name, atom.coord)

# Secondary structure with DSSP (requires the dssp executable)
from Bio.PDB import DSSP
model = structure[0]
dssp = DSSP(model, "structure.pdb")

# Structural alignment
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)  # matched atom lists of equal length
print(f"RMSD: {sup.rms}")
```

**Key capabilities:**
- Parse PDB, mmCIF, MMTF formats
- Secondary structure analysis (DSSP)
- Solvent accessibility calculations
- Structural superimposition
- Distance and angle calculations
- Structure quality validation

**Reference:** See `references/specialized_modules.md` (section: Bio.PDB) for complete structural analysis capabilities.

### 8. Sequence Analysis Utilities

Calculate sequence properties using `Bio.SeqUtils`:

```python
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq)  # Melting temperature (nearest-neighbor)

# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()
```

**Available analyses:**
- GC content and GC skew
- Melting temperature (multiple methods)
- Molecular weight
- Isoelectric point
- Aromaticity
- Instability index
- Secondary structure prediction
- Sequence checksums

**Script:** Use `scripts/sequence_operations.py` for sequence analysis examples.

**Reference:** See `references/core_modules.md` (section: Bio.SeqUtils) for all available utilities.

### 9. Specialized Modules

**Restriction enzymes:**
```python
from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)
```

**Motif analysis:**
```python
from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)
```

**Population genetics:**
Use `Bio.PopGen` for allele frequencies, Hardy-Weinberg equilibrium, and FST calculations.

**Clustering:**
Use `Bio.Cluster` for hierarchical clustering, k-means, and PCA on biological data.

**Reference:** See `references/core_modules.md` and `references/specialized_modules.md` for specialized module documentation.

## Common Workflows

### Workflow 1: Download and Analyze NCBI Sequences

1. Search NCBI database with `Entrez.esearch()`
2. Fetch sequences with `Entrez.efetch()`
3. Parse with `SeqIO.parse()`
4. Analyze sequences (GC content, translation, etc.)
5. Save results to file

**Script:** Use `scripts/ncbi_entrez.py` for complete implementation.

### Workflow 2: Sequence Similarity Search

1. Run BLAST with `NCBIWWW.qblast()` or parse existing results
2. Parse XML results with `NCBIXML.read()`
3. Filter hits by E-value, identity, coverage
4. Extract and save significant hits
5. Perform downstream analysis

**Script:** Use `scripts/blast_search.py` for complete implementation.

### Workflow 3: Phylogenetic Tree Construction

1. Read multiple sequence alignment with `AlignIO.read()`
2. Calculate distance matrix with `DistanceCalculator`
3. Build tree with `DistanceTreeConstructor` (UPGMA or NJ)
4. Manipulate tree (ladderize, root, prune)
5. Visualize with `Phylo.draw()` or `Phylo.draw_ascii()`
6. Save tree with `Phylo.write()`

**Script:** Use `scripts/alignment_phylogeny.py` for complete implementation.

### Workflow 4: Format Conversion Pipeline

1. Read sequences in original format with `SeqIO.parse()`
2. Filter or modify sequences as needed
3. Write to new format with `SeqIO.write()`
4. Or use `SeqIO.convert()` for direct conversion

**Script:** Use `scripts/file_io.py` for format conversion examples.

## Best Practices

### Email Configuration
Always set `Entrez.email` before using NCBI services:
```python
Entrez.email = "your.email@example.com"
```

### Rate Limiting
Be polite to NCBI servers:
- Use `time.sleep()` between requests
- Use WebEnv for large queries
- Batch downloads in reasonable chunks (100-500 sequences)
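The guidance above can be sketched as a small batching helper; `fetch_one_batch` is a hypothetical callable you supply (e.g. a wrapper around `Entrez.efetch`) — the chunking and pacing logic is the point:

```python
import time

def fetch_in_batches(ids, fetch_one_batch, batch_size=200, delay=0.4):
    """Call fetch_one_batch on successive chunks of ids, pausing between requests.

    fetch_one_batch is a hypothetical callable supplied by the caller;
    delay keeps the request rate under NCBI's limits.
    """
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results.append(fetch_one_batch(batch))
        time.sleep(delay)  # be polite between consecutive requests
    return results
```

Combine this with `epost`/WebEnv (see `scripts/ncbi_entrez.py`) so the server keeps the full ID list while you fetch slice by slice.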
### Memory Management
For large files:
- Use iterators (`SeqIO.parse()`) instead of lists
- Use `SeqIO.index()` for random access without loading the entire file
- Process in batches when possible

### Error Handling
Always handle potential errors:
```python
try:
    record = SeqIO.read(handle, format)
except Exception as e:
    print(f"Error: {e}")
```

### File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data

## Resources

### scripts/
Executable Python scripts demonstrating common BioPython workflows:

- `sequence_operations.py`: Basic sequence manipulation (transcription, translation, complement, GC content, melting temperature)
- `file_io.py`: Reading, writing, and converting sequence files; filtering; indexing large files
- `ncbi_entrez.py`: Searching and downloading from NCBI databases; batch processing with WebEnv
- `blast_search.py`: Running BLAST searches online; parsing and filtering results
- `alignment_phylogeny.py`: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation

Run any script with `python3 scripts/<script_name>.py` to see examples.

### references/
Comprehensive reference documentation for BioPython modules:

- `core_modules.md`: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- `database_tools.md`: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- `specialized_modules.md`: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)

Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns

Use `grep` to search references for specific topics:
```bash
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
```

## Additional Resources

**Official Documentation:** https://biopython.org/docs/latest/

**Tutorial:** https://biopython.org/docs/latest/Tutorial/index.html

**API Reference:** https://biopython.org/docs/latest/api/index.html

**Cookbook:** https://biopython.org/wiki/Category:Cookbook
File: scientific-packages/biopython/references/core_modules.md
# BioPython Core Modules Reference

This document provides detailed information about BioPython's core modules and their capabilities.

## Sequence Handling

### Bio.Seq - Sequence Objects

Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior.

**Creation:**
```python
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
```

**Key Operations:**
- String methods: `find()`, `count()`, `count_overlap()` (for overlapping patterns)
- Complement/reverse complement: returns complementary sequences
- Transcription: DNA → RNA (T → U)
- Back-transcription: RNA → DNA
- Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling
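The difference between `count()` and `count_overlap()` matters for self-overlapping patterns:

```python
from Bio.Seq import Seq

seq = Seq("AAAA")
print(seq.count("AA"))          # 2 (non-overlapping, like str.count)
print(seq.count_overlap("AA"))  # 3 (counts overlapping matches)
```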
**Use Cases:**
- DNA/RNA sequence manipulation
- Converting between nucleic acid types
- Protein translation from coding sequences
- Sequence searching and pattern counting

### Bio.SeqRecord - Sequence Metadata

SeqRecord wraps Seq objects with metadata like ID, description, and features.

**Attributes:**
- `seq`: The sequence itself (Seq object)
- `id`: Unique identifier
- `name`: Short name
- `description`: Longer description
- `features`: List of SeqFeature objects
- `annotations`: Dictionary of additional information
- `letter_annotations`: Per-letter annotations (e.g., quality scores)
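A record is typically built by wrapping a `Seq`; the identifier and description below are hypothetical:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("ATGGTGCAT"),
    id="demo1",                             # hypothetical identifier
    description="demo coding region",
    annotations={"molecule_type": "DNA"},   # expected by some writers (e.g. GenBank)
)
print(record.id, len(record.seq))  # demo1 9
```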
### Bio.SeqFeature - Sequence Annotations

Manages sequence annotations and features such as genes, promoters, and coding regions.

**Common Features:**
- Gene locations
- CDS (coding sequences)
- Promoters and regulatory elements
- Exons and introns
- Protein domains

## File Input/Output

### Bio.SeqIO - Sequence File I/O

Unified interface for reading and writing sequence files in multiple formats.

**Supported Formats:**
- FASTA/FASTQ: Standard sequence formats
- GenBank/EMBL: Feature-rich annotation formats
- Clustal/Stockholm/PHYLIP: Alignment formats
- ABI/SFF: Trace and flowgram data
- Swiss-Prot/PIR: Protein databases
- PDB: Protein structure files

**Key Functions:**

**SeqIO.parse()** - Iterator for reading multiple records:
```python
from Bio import SeqIO
for record in SeqIO.parse("file.fasta", "fasta"):
    print(record.id, len(record.seq))
```

**SeqIO.read()** - Read a single record:
```python
record = SeqIO.read("file.fasta", "fasta")
```

**SeqIO.write()** - Write sequences:
```python
SeqIO.write(sequences, "output.fasta", "fasta")
```

**SeqIO.convert()** - Direct format conversion:
```python
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

**SeqIO.index()** - Memory-efficient random access for large files:
```python
record_dict = SeqIO.index("large_file.fasta", "fasta")
sequence = record_dict["seq_id"]
```

**SeqIO.to_dict()** - Load all records into a dictionary (in memory):
```python
record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta"))
```

**Common Patterns:**
- Format conversion between FASTA, GenBank, FASTQ
- Filtering sequences by length, ID, or content
- Extracting subsequences
- Batch processing large files with iterators

### Bio.AlignIO - Multiple Sequence Alignment I/O

Handles multiple sequence alignment files.

**Key Functions:**
- `read()`: Read a single alignment
- `parse()`: Read multiple alignments
- `write()`: Save alignments
- `convert()`: Convert between formats

**Supported Formats:**
- Clustal
- PHYLIP (sequential and interleaved)
- Stockholm
- NEXUS
- FASTA (aligned)
- MAF (Multiple Alignment Format)

## Sequence Alignment

### Bio.Align - Alignment Tools

**PairwiseAligner** - High-performance pairwise alignment:
```python
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global'  # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5
alignments = aligner.align(seq1, seq2)
```

**CodonAligner** - Codon-aware alignment

**MultipleSeqAlignment** - Container for MSA with column access

### Bio.pairwise2 (Legacy)

Legacy pairwise alignment module (deprecated in favor of `Bio.Align.PairwiseAligner`) with functions like `align.globalxx()` and `align.localxx()`.
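A minimal `pairwise2` sketch (expect a deprecation warning on recent Biopython releases):

```python
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# globalxx: global alignment, match = 1, no mismatch or gap penalties
alignments = pairwise2.align.globalxx("ACCGT", "ACG")
print(format_alignment(*alignments[0]))

# score_only=True returns just the best score
score = pairwise2.align.globalxx("ACCGT", "ACG", score_only=True)
print(score)
```

The two-letter suffix encodes the scoring scheme (`xx` = match-only; e.g. `ms` takes explicit match/mismatch and gap parameters).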
## Sequence Analysis Utilities

### Bio.SeqUtils - Sequence Analysis

Collection of utility functions:

**CheckSum** - Calculate sequence checksums (CRC32, CRC64, GCG)

**MeltingTemp** - DNA melting temperature calculations:
- Nearest-neighbor method (`Tm_NN`)
- Wallace rule (`Tm_Wallace`)
- GC content method (`Tm_GC`)

**IsoelectricPoint** - Protein pI calculation

**ProtParam** - Protein analysis:
- Molecular weight
- Aromaticity
- Instability index
- Secondary structure fractions

**GC/GC_skew** - Calculate GC content and GC skew for sequence windows

### Bio.Data.CodonTable - Genetic Codes

Access to NCBI genetic code tables:
```python
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table.forward_table)  # codon → amino acid
print(standard_table.back_table)     # amino acid → codon
print(standard_table.start_codons)
print(standard_table.stop_codons)
```

**Available codes:**
- Standard code (1)
- Vertebrate mitochondrial (2)
- Yeast mitochondrial (3)
- And many more organism-specific codes

## Sequence Motifs and Patterns

### Bio.motifs - Sequence Motif Analysis

Tools for working with sequence motifs:

**Position Weight Matrices (PWM):**
- Create a PWM from aligned sequences
- Calculate information content
- Search sequences for motif matches
- Generate consensus sequences

**Position-Specific Scoring Matrices (PSSM):**
- Convert PWM to PSSM
- Score sequences against motifs
- Determine significance thresholds
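A short sketch of the PWM/PSSM pipeline with three toy instances:

```python
from Bio.Seq import Seq
from Bio import motifs

instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC")]
m = motifs.create(instances)
print(m.consensus)  # TACAC

pwm = m.counts.normalize(pseudocounts=0.5)  # position weight matrix
pssm = pwm.log_odds()                       # position-specific scoring matrix
```

The resulting PSSM can score candidate sites against the background model implied by the pseudocounts.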
**Supported Formats:**
- JASPAR
- TRANSFAC
- MEME
- AlignAce

### Bio.Restriction - Restriction Enzymes

Comprehensive restriction enzyme database and analysis:

**Capabilities:**
- Search for restriction sites
- Predict digestion products
- Analyze restriction maps
- Access enzyme properties (recognition site, cut positions, isoschizomers)

**Example usage:**
```python
from Bio import Restriction
from Bio.Seq import Seq

seq = Seq("GAATTC...")
enzyme = Restriction.EcoRI
results = enzyme.search(seq)
```
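Digestion products can be predicted directly from the enzyme object; a sketch with a toy sequence containing one EcoRI site:

```python
from Bio import Restriction
from Bio.Seq import Seq

seq = Seq("AAGAATTCAA")  # contains one EcoRI site (G^AATTC)

sites = Restriction.EcoRI.search(seq)        # 1-based cut positions
fragments = Restriction.EcoRI.catalyse(seq)  # predicted digestion products
print(sites)
print([str(f) for f in fragments])  # ['AAG', 'AATTCAA']
```

`search()` treats the sequence as linear by default; pass `linear=False` for circular templates such as plasmids.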
File: scientific-packages/biopython/references/database_tools.md
# BioPython Database Access and Search Tools

This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.

## NCBI Database Access

### Bio.Entrez - NCBI E-utilities Interface

Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.

**Important:** Always set your email before using Entrez:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com"
```

#### Core Query Functions

**esearch** - Search databases and retrieve IDs:
```python
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
```

Parameters:
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
- `term`: Search query
- `retmax`: Maximum number of IDs to return
- `sort`: Sort order (relevance, pub_date, etc.)
- `usehistory`: Store results on the server (useful for large queries)

**efetch** - Retrieve full records:
```python
handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```

Parameters:
- `db`: Database name
- `id`: Single ID or comma-separated list
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
- `retmode`: Return mode (text, xml, asn.1)
- Automatically uses POST for >200 IDs

**elink** - Find related records across databases:
```python
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
```

Parameters:
- `dbfrom`: Source database
- `db`: Target database
- `id`: ID(s) to link from
- Returns LinkOut providers and relevancy scores

**esummary** - Get document summaries:
```python
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
```

Returns quick overviews without full records.

**einfo** - Get database statistics:
```python
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
```

Provides field indices, term counts, update dates, and available links.

**epost** - Upload ID lists to the server:
```python
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
```

Useful for large queries split across multiple requests.

**espell** - Get spelling suggestions:
```python
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"])  # "breast cancer"
```

**ecitmatch** - Convert citations to PubMed IDs:
```python
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```
#### Data Processing Functions

**Entrez.read()** - Parse XML to a Python dictionary:
```python
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
```

**Entrez.parse()** - Generator for large XML results:
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
    process(record)
```

#### Common Workflows

**Download sequences by accession:**
```python
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
```

**Search and download multiple sequences:**
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100")
search_results = Entrez.read(search_handle)

# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
    print(record.id)
```

**Use WebEnv for large queries:**
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)

# Fetch in batches
count = len(large_id_list)
batch_size = 500
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=post_result["WebEnv"],
        query_key=post_result["QueryKey"]
    )
    # Process batch
```

### Bio.GenBank - GenBank Format Parsing

Low-level GenBank file parser (SeqIO is usually preferred).

### Bio.SwissProt - Swiss-Prot/UniProt Parsing

Parse Swiss-Prot and UniProtKB flat file format:
```python
from Bio import SwissProt
with open("uniprot.dat") as handle:
    for record in SwissProt.parse(handle):
        print(record.entry_name, record.organism)
```
|
||||
|
||||
## Sequence Similarity Searches
|
||||
|
||||
### Bio.Blast - BLAST Interface
|
||||
|
||||
Tools for running BLAST searches and parsing results.
|
||||
|
||||
#### Running BLAST
|
||||
|
||||
**NCBI QBLAST (online):**
|
||||
```python
|
||||
from Bio.Blast import NCBIWWW
|
||||
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
|
||||
```
|
||||
|
||||
Parameters:
|
||||
- Program: blastn, blastp, blastx, tblastn, tblastx
|
||||
- Database: nt, nr, refseq_rna, pdb, etc.
|
||||
- Sequence: string or Seq object
|
||||
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`
|
||||
|
||||
**Local BLAST:**
|
||||
Run standalone BLAST from command line, then parse results.
|
#### Parsing BLAST Results

**XML format (recommended):**
```python
from Bio.Blast import NCBIXML

result_handle = open("blast_results.xml")
blast_records = NCBIXML.parse(result_handle)

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.001:
                print(f"Hit: {alignment.title}")
                print(f"Length: {alignment.length}")
                print(f"E-value: {hsp.expect}")
                print(f"Identities: {hsp.identities}/{hsp.align_length}")
```

**Functions:**
- `NCBIXML.read()`: Single query
- `NCBIXML.parse()`: Multiple queries (generator)

**Key Record Attributes:**
- `alignments`: List of matching sequences
- `query`: Query sequence ID
- `query_length`: Length of query

**Alignment Attributes:**
- `title`: Description of hit
- `length`: Length of hit sequence
- `hsps`: High-scoring segment pairs

**HSP Attributes:**
- `expect`: E-value
- `score`: Bit score
- `identities`: Number of identical residues
- `positives`: Number of positive-scoring matches
- `gaps`: Number of gaps
- `align_length`: Length of alignment
- `query`: Aligned query sequence
- `match`: Match indicators
- `sbjct`: Aligned subject sequence
- `query_start`, `query_end`: Query coordinates
- `sbjct_start`, `sbjct_end`: Subject coordinates

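The attributes above are enough to compute the usual summary numbers; a small duck-typed helper (our own convenience function, not part of Bio.Blast) illustrating percent identity and query coverage:

```python
from types import SimpleNamespace

def hsp_summary(hsp, query_length):
    """Summarize any HSP-like object exposing the attributes listed above."""
    percent_identity = 100.0 * hsp.identities / hsp.align_length
    # Fraction of the query spanned by this HSP (1-based inclusive coordinates)
    query_coverage = 100.0 * (hsp.query_end - hsp.query_start + 1) / query_length
    return percent_identity, query_coverage

# Stand-in object for illustration; real HSPs come from NCBIXML.parse()
fake = SimpleNamespace(identities=45, align_length=50, query_start=1, query_end=50)
pid, cov = hsp_summary(fake, query_length=100)
print(f"{pid:.1f}% identity, {cov:.1f}% query coverage")  # 90.0% identity, 50.0% query coverage
```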
#### Common BLAST Workflows

**Find homologs:**
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
    out.write(result.read())
```

**Filter results by criteria:**
```python
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities / hsp.align_length > 0.5:
            # Process high-quality hits
            pass
```

### Bio.SearchIO - Unified Search Results Parser

Modern interface for parsing output from various search tools (BLAST, HMMER, BLAT, etc.).

**Key Functions:**
- `read()`: Parse single query
- `parse()`: Parse multiple queries (generator)
- `write()`: Write results to file
- `convert()`: Convert between formats

**Supported Tools:**
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate

**Example:**
```python
from Bio import SearchIO

results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
    for hit in result:
        if hit.hsps[0].evalue < 0.001:
            print(hit.id, hit.hsps[0].evalue)
```

## Local Database Management

### BioSQL - SQL Database Interface

Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).

**Features:**
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences

**Example:**
```python
from Bio import SeqIO
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(
    driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb"
)
db = server["my_db"]

# Store sequences (commit makes the load permanent)
db.load(SeqIO.parse("sequences.gb", "genbank"))
server.commit()

# Query
seq = db.lookup(accession="NC_005816")
```

scientific-packages/biopython/references/specialized_modules.md (new file, 612 lines)

# BioPython Specialized Analysis Modules

This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.

## Structural Bioinformatics

### Bio.PDB - Protein Structure Analysis

Comprehensive tools for handling macromolecular crystal structures.

#### Structure Hierarchy

PDB structures are organized hierarchically:
- **Structure** → Models → Chains → Residues → Atoms

```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.coord)  # xyz coordinates
```

#### Parsing Structure Files

**PDB format:**
```python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.pdb")
```

**mmCIF format:**
```python
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**Fast mmCIF parser:**
```python
from Bio.PDB import FastMMCIFParser
parser = FastMMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**MMTF format:**
```python
from Bio.PDB.mmtf import MMTFParser  # requires the mmtf-python package

parser = MMTFParser()
structure = parser.get_structure("structure.mmtf")
```

**Binary CIF:**
```python
from Bio.PDB.binary_cif import BinaryCIFParser
parser = BinaryCIFParser()
structure = parser.get_structure("structure.bcif")
```

#### Downloading Structures

```python
from Bio.PDB import PDBList
pdbl = PDBList()

# Download a specific structure
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")

# Download all obsolete entries
pdbl.download_obsolete_entries(pdir="obsolete/")

# Update local PDB mirror
pdbl.update_pdb()
```

#### Structure Selection and Filtering

```python
# Select a specific chain
chain_A = structure[0]['A']

# Select a specific residue
residue_10 = chain_A[10]

# Select a specific atom
ca_atom = residue_10['CA']

# Iterate over specific atom types
for atom in structure.get_atoms():
    if atom.name == 'CA':  # Alpha carbons only
        print(atom.coord)
```

**Structure selectors:**
```python
from Bio.PDB.Polypeptide import is_aa

# Filter by residue type
for residue in structure.get_residues():
    if is_aa(residue):
        print(f"Amino acid: {residue.resname}")
```

#### Secondary Structure Analysis

**DSSP integration:**
```python
from Bio.PDB import DSSP

# Requires the DSSP program to be installed
model = structure[0]
dssp = DSSP(model, "structure.pdb")

# Access secondary structure
for key in dssp:
    secondary_structure = dssp[key][2]
    accessibility = dssp[key][3]
    print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
```

DSSP codes:
- H: Alpha helix
- B: Beta bridge
- E: Extended strand (beta sheet)
- G: 3-10 helix
- I: Pi helix
- T: Turn
- S: Bend
- -: Coil

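The eight codes above collapse naturally into three broad classes; a small helper (our own convenience function, not part of Bio.PDB) for that mapping:

```python
# Collapse the 8-state DSSP alphabet into 3 broad secondary-structure classes
DSSP_TO_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "strand", "B": "strand",
    "T": "coil", "S": "coil", "-": "coil",
}

def dssp_class(code):
    """Return 'helix', 'strand', or 'coil' for an 8-state DSSP code."""
    return DSSP_TO_CLASS.get(code, "coil")

print(dssp_class("H"))  # helix
print(dssp_class("E"))  # strand
```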
#### Solvent Accessibility

**Shrake-Rupley algorithm:**
```python
from Bio.PDB import ShrakeRupley

sr = ShrakeRupley()
sr.compute(structure, level="R")  # R=residue, A=atom, C=chain, M=model, S=structure

for residue in structure.get_residues():
    print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Å²")
```

**NACCESS wrapper:**
```python
from Bio.PDB.NACCESS import NACCESS

# Requires the NACCESS program installed
naccess = NACCESS(model, pdb_file="structure.pdb")
for residue_id, data in naccess.items():
    print(f"Residue {residue_id}: {data['all_atoms_abs']} Å²")
```

**Half-sphere exposure:**
```python
from Bio.PDB import HSExposureCA

model = structure[0]
HSExposureCA(model)  # computes HSE and stores values in residue.xtra

for chain in model:
    for residue in chain:
        if 'EXP_HSE_A_U' in residue.xtra:
            hse_up = residue.xtra['EXP_HSE_A_U']
            hse_down = residue.xtra['EXP_HSE_A_D']
```

#### Structural Alignment and Superimposition

**Standard superimposition:**
```python
from Bio.PDB import Superimposer

sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)  # Lists of equivalent atoms to align
sup.apply(structure2.get_atoms())    # Apply rotation/translation to the moving structure

print(f"RMSD: {sup.rms}")
print(f"Rotation matrix: {sup.rotran[0]}")
print(f"Translation vector: {sup.rotran[1]}")
```

**QCP (Quaternion Characteristic Polynomial) method:**
```python
from Bio.PDB.qcprot import QCPSuperimposer  # older releases: Bio.PDB.QCPSuperimposer

qcp = QCPSuperimposer()
qcp.set(ref_coords, alt_coords)
qcp.run()
print(f"RMSD: {qcp.get_rms()}")
```

#### Geometric Calculations

**Distances and angles:**
```python
from Bio.PDB.vectors import calc_angle, calc_dihedral

# Distance between atoms (subtracting two Atom objects returns the distance)
dist = atom1 - atom2

# Angle between three atoms (calc_angle takes Vector objects)
angle = calc_angle(atom1.get_vector(), atom2.get_vector(), atom3.get_vector())

# Dihedral angle
dihedral = calc_dihedral(
    atom1.get_vector(), atom2.get_vector(), atom3.get_vector(), atom4.get_vector()
)
```

**Vector operations:**
```python
from Bio.PDB.vectors import Vector

v1 = Vector(atom1.coord)
v2 = Vector(atom2.coord)

# Vector arithmetic
v3 = v1 + v2
v4 = v1 - v2
dot_product = v1 * v2
cross_product = v1 ** v2
magnitude = v1.norm()
normalized = v1.normalized()
```

#### Internal Coordinates

Advanced residue geometry representation:
```python
# Enable internal coordinates (phi/psi/chi angles, bond lengths and angles)
structure.atom_to_internal_coordinates()

# Access phi, psi angles
for residue in structure.get_residues():
    if residue.internal_coord:
        print(f"Phi: {residue.internal_coord.get_angle('phi')}")
        print(f"Psi: {residue.internal_coord.get_angle('psi')}")
```

#### Writing Structures

```python
from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Save a specific selection: `select` takes a Bio.PDB.Select subclass
# (ChainSelector here stands for a user-defined selector that keeps chain A)
io.save("chain_A.pdb", select=ChainSelector("A"))
```

### Bio.SCOP - SCOP Database

Access to the Structural Classification of Proteins database.

### Bio.KEGG - Pathway Analysis

Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases.

**Capabilities:**
- Access pathway maps
- Retrieve enzyme data
- Get compound information
- Query orthology relationships

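A sketch of querying KEGG through `Bio.KEGG.REST`. Since every call hits the network, the examples are wrapped in functions rather than executed; the identifiers (`hsa00010`, `hsa`) are examples:

```python
from Bio.KEGG import REST

def fetch_glycolysis_entry():
    """Fetch the human glycolysis pathway entry from the KEGG REST API."""
    handle = REST.kegg_get("hsa00010")  # returns a handle; read() gives flat-file text
    return handle.read()

def list_human_pathways():
    """List all human pathways, one 'id<TAB>description' line per pathway."""
    return REST.kegg_list("pathway", "hsa").read()

# Example (requires internet access):
# print(fetch_glycolysis_entry()[:200])
```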
## Phylogenetics

### Bio.Phylo - Phylogenetic Tree Analysis

Comprehensive phylogenetic tree manipulation and analysis.

#### Reading and Writing Trees

**Supported formats:**
- Newick: Simple, widely used format
- NEXUS: Rich metadata format
- PhyloXML: XML-based with extensive annotations
- NeXML: Modern XML standard

```python
from Bio import Phylo

# Read tree
tree = Phylo.read("tree.nwk", "newick")

# Read multiple trees
trees = list(Phylo.parse("trees.nex", "nexus"))

# Write tree
Phylo.write(tree, "output.nwk", "newick")
```

#### Tree Visualization

**ASCII visualization:**
```python
Phylo.draw_ascii(tree)
```

**Matplotlib plotting:**
```python
import matplotlib.pyplot as plt

Phylo.draw(tree)
plt.show()

# With customization
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("My Phylogenetic Tree")
plt.show()
```

#### Tree Navigation and Manipulation

**Find clades:**
```python
# Get all terminal nodes (leaves)
terminals = tree.get_terminals()

# Get all nonterminal (internal) nodes
nonterminals = tree.get_nonterminals()

# Find a specific clade
target = tree.find_any(name="Species_A")

# Find all matching clades
matches = tree.find_clades(terminal=True)
```

**Tree properties:**
```python
# Count terminals
num_species = tree.count_terminals()

# Get total branch length
total_length = tree.total_branch_length()

# Check if tree is bifurcating
is_bifurcating = tree.is_bifurcating()

# Get maximum depth (distance from root to the farthest leaf)
max_dist = max(tree.depths().values())
```

**Tree modification:**
```python
# Prune the tree down to specific taxa (prune() removes one leaf at a time)
keep_taxa = {"Species_A", "Species_B", "Species_C"}
for leaf in tree.get_terminals():
    if leaf.name not in keep_taxa:
        tree.prune(leaf)

# Collapse short branches
tree.collapse_all(lambda c: c.branch_length < 0.01)

# Ladderize (sort branches)
tree.ladderize()

# Root tree at midpoint
tree.root_at_midpoint()

# Root at specific clade
outgroup = tree.find_any(name="Outgroup_species")
tree.root_with_outgroup(outgroup)
```

**Calculate distances:**
```python
# Distance between two clades
dist = tree.distance(clade1, clade2)

# Distance from root
root_dist = tree.distance(tree.root, terminal_clade)
```

#### Tree Construction

**Distance-based methods:**
```python
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
from Bio import AlignIO

# Load alignment
aln = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)

# Construct tree using UPGMA
constructor = DistanceTreeConstructor()
tree_upgma = constructor.upgma(dm)

# Or using Neighbor-Joining
tree_nj = constructor.nj(dm)
```

**Parsimony method:**
```python
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
tree = searcher.search(starting_tree, alignment)
```

**Distance calculators:**
- 'identity': Simple identity scoring
- 'blastn': BLAST nucleotide scoring
- 'blastp': BLAST protein scoring
- 'dnafull': EMBOSS DNA scoring matrix
- 'blosum62': BLOSUM62 protein matrix
- 'pam250': PAM250 protein matrix

#### Consensus Trees

```python
from Bio.Phylo.Consensus import majority_consensus, strict_consensus

# Strict consensus
consensus_strict = strict_consensus(trees)

# Majority rule consensus
consensus_majority = majority_consensus(trees, cutoff=0.5)

# Bootstrap consensus: resample the alignment, build a tree per replicate
# with the given tree constructor, then take the consensus
from Bio.Phylo.Consensus import bootstrap_consensus
bootstrap_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
```

#### External Tool Wrappers

**PhyML:**
```python
from Bio.Phylo.Applications import PhymlCommandline

cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
stdout, stderr = cmd()
tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
```

**RAxML:**
```python
from Bio.Phylo.Applications import RaxmlCommandline

cmd = RaxmlCommandline(
    sequences="alignment.phy",
    model="GTRGAMMA",
    name="mytree",
    parsimony_seed=12345
)
stdout, stderr = cmd()
```

**FastTree:**
```python
from Bio.Phylo.Applications import FastTreeCommandline

cmd = FastTreeCommandline(input="alignment.fasta", out="tree.nwk", gtr=True, gamma=True)
stdout, stderr = cmd()
```

### Bio.Phylo.PAML - Evolutionary Analysis

Interface to PAML (Phylogenetic Analysis by Maximum Likelihood).

**CODEML - Codon-based analysis:**
```python
from Bio.Phylo.PAML import codeml

cml = codeml.Codeml()
cml.alignment = "alignment.phy"
cml.tree = "tree.nwk"
cml.out_file = "results.out"
cml.working_dir = "./paml_wd"

# Set parameters
cml.set_options(
    seqtype=1,          # Codon sequences
    model=0,            # One omega ratio
    NSsites=[0, 1, 2],  # Test different site models
    CodonFreq=2,        # F3x4 codon frequencies
)

results = cml.run()
```

**BaseML - Nucleotide-based analysis:**
```python
from Bio.Phylo.PAML import baseml

bml = baseml.Baseml()
bml.alignment = "alignment.phy"
bml.tree = "tree.nwk"
results = bml.run()
```

**YN00 - Yang-Nielsen method:**
```python
from Bio.Phylo.PAML import yn00

yn = yn00.Yn00()
yn.alignment = "alignment.phy"
results = yn.run()
```

## Population Genetics

### Bio.PopGen - Population Genetics Analysis

Tools for population-level genetic analysis.

**Capabilities:**
- Allele frequency calculations
- Hardy-Weinberg equilibrium testing
- Linkage disequilibrium analysis
- F-statistics (FST, FIS, FIT)
- Tajima's D
- Population structure analysis

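A small sketch using the GenePop parser from `Bio.PopGen`; the dataset is a tiny invented two-locus example in GenePop format:

```python
from io import StringIO
from Bio.PopGen.GenePop import read

# Invented dataset: 2 loci, one population of 2 diploid individuals
GENEPOP_TEXT = """\
Demo dataset
Locus1
Locus2
Pop
ind1, 0101 0202
ind2, 0102 0201
"""

record = read(StringIO(GENEPOP_TEXT))
print(record.loci_list)         # locus names
print(len(record.populations))  # number of populations
```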
## Clustering and Machine Learning

### Bio.Cluster - Clustering Algorithms

Statistical clustering for gene expression and other biological data.

**Hierarchical clustering:**
```python
from Bio.Cluster import treecluster

tree = treecluster(data, method='a', dist='e')
# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
```

**k-means clustering:**
```python
from Bio.Cluster import kcluster

clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
```

**Self-Organizing Maps (SOM):**
```python
from Bio.Cluster import somcluster

clusterid, celldata = somcluster(data, nx=3, ny=3)
```

**Principal Component Analysis:**
```python
from Bio.Cluster import pca

columnmean, coordinates, components, eigenvalues = pca(data)
```

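A runnable sketch of `kcluster` on a toy matrix; the data values are invented, forming two obvious groups of rows:

```python
import numpy as np
from Bio.Cluster import kcluster

# Two well-separated groups of rows ("genes"), three "conditions" each
data = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [5.0, 5.1, 5.0],
    [5.1, 5.0, 4.9],
])

clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)
print(clusterid)  # one cluster label per row
```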
## Visualization

### Bio.Graphics - Genomic Visualization

Tools for creating publication-quality biological graphics (requires ReportLab).

**GenomeDiagram - Circular and linear genome maps:**
```python
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO

record = SeqIO.read("genome.gb", "genbank")

gd_diagram = GenomeDiagram.Diagram("Genome Map")
gd_track = gd_diagram.new_track(1, greytrack=True)
gd_feature_set = gd_track.new_set()

# Add features
for feature in record.features:
    if feature.type == "gene":
        gd_feature_set.add_feature(feature, color="blue", label=True)

gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
gd_diagram.write("genome_map.pdf", "PDF")
```

**Chromosomes - Chromosome visualization:**
```python
from reportlab.lib import colors
from Bio.Graphics.BasicChromosome import Chromosome, ChromosomeSegment

chrom = Chromosome("Chromosome 1")

segment = ChromosomeSegment()
segment.label = "gene1"
segment.fill_color = colors.red
chrom.add(segment)
# Chromosomes are rendered by adding them to a BasicChromosome.Organism
# and calling its draw() method
```

## Phenotype Analysis

### Bio.phenotype - Phenotypic Microarray Analysis

Tools for analyzing phenotypic microarray data (e.g., Biolog plates).

**Capabilities:**
- Parse PM plate data
- Growth curve analysis
- Compare phenotypic profiles
- Calculate similarity metrics

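A sketch of reading plate data with `Bio.phenotype`. It needs a real machine CSV export, so the call is wrapped in a function; the file name is an assumption, while `"pm-csv"` is the module's format name for CSV exports:

```python
from Bio import phenotype

def load_plates(csv_file):
    """Parse phenotype microarray plates from a machine CSV export."""
    # phenotype.parse yields one PlateRecord per plate in the file
    return list(phenotype.parse(csv_file, "pm-csv"))

# Example (requires an actual PM CSV export):
# plates = load_plates("plates.csv")
# print(plates[0].id, len(plates[0]))
```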
scientific-packages/biopython/scripts/alignment_phylogeny.py (new file, 370 lines)

#!/usr/bin/env python3
"""
Sequence alignment and phylogenetic analysis using BioPython.

This script demonstrates:
- Pairwise sequence alignment
- Multiple sequence alignment I/O
- Distance matrix calculation
- Phylogenetic tree construction
- Tree manipulation and visualization
"""

from Bio import Align, AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
from Bio.Seq import Seq
import matplotlib.pyplot as plt


def pairwise_alignment_example():
    """Demonstrate pairwise sequence alignment."""

    print("Pairwise Sequence Alignment")
    print("=" * 60)

    # Create aligner
    aligner = Align.PairwiseAligner()

    # Set parameters
    aligner.mode = "global"  # or 'local' for local alignment
    aligner.match_score = 2
    aligner.mismatch_score = -1
    aligner.open_gap_score = -2
    aligner.extend_gap_score = -0.5

    # Sequences to align
    seq1 = "ACGTACGTACGT"
    seq2 = "ACGTTACGTGT"

    print(f"Sequence 1: {seq1}")
    print(f"Sequence 2: {seq2}")
    print()

    # Perform alignment
    alignments = aligner.align(seq1, seq2)

    # Show results
    print(f"Number of optimal alignments: {len(alignments)}")
    print(f"Best alignment score: {alignments.score:.1f}")
    print()

    # Display best alignment
    print("Best alignment:")
    print(alignments[0])
    print()


def local_alignment_example():
    """Demonstrate local alignment (Smith-Waterman)."""

    print("Local Sequence Alignment")
    print("=" * 60)

    aligner = Align.PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2
    aligner.mismatch_score = -1
    aligner.open_gap_score = -2
    aligner.extend_gap_score = -0.5

    seq1 = "AAAAACGTACGTACGTAAAAA"
    seq2 = "TTTTTTACGTACGTTTTTTT"

    print(f"Sequence 1: {seq1}")
    print(f"Sequence 2: {seq2}")
    print()

    alignments = aligner.align(seq1, seq2)

    print(f"Best local alignment score: {alignments.score:.1f}")
    print()
    print("Best local alignment:")
    print(alignments[0])
    print()


def read_and_analyze_alignment(alignment_file, format="fasta"):
    """Read and analyze a multiple sequence alignment."""

    print(f"Reading alignment from: {alignment_file}")
    print("-" * 60)

    # Read alignment
    alignment = AlignIO.read(alignment_file, format)

    print(f"Number of sequences: {len(alignment)}")
    print(f"Alignment length: {alignment.get_alignment_length()}")
    print()

    # Display alignment
    print("Alignment preview:")
    for record in alignment[:5]:  # Show first 5 sequences
        print(f"{record.id[:15]:15s} {record.seq[:50]}...")

    print()

    # Calculate some statistics
    analyze_alignment_statistics(alignment)

    return alignment


def analyze_alignment_statistics(alignment):
    """Calculate statistics for an alignment."""

    print("Alignment Statistics:")
    print("-" * 60)

    # Get alignment length
    length = alignment.get_alignment_length()

    # Count gaps
    total_gaps = sum(str(record.seq).count("-") for record in alignment)
    gap_percentage = (total_gaps / (length * len(alignment))) * 100

    print(f"Total positions: {length}")
    print(f"Number of sequences: {len(alignment)}")
    print(f"Total gaps: {total_gaps} ({gap_percentage:.1f}%)")
    print()

    # Calculate conservation at each position
    conserved_positions = 0
    for i in range(length):
        column = alignment[:, i]
        # A column is fully conserved if its most common residue fills every row
        if column.count(max(set(column), key=column.count)) == len(alignment):
            conserved_positions += 1

    conservation = (conserved_positions / length) * 100
    print(f"Fully conserved positions: {conserved_positions} ({conservation:.1f}%)")
    print()


def calculate_distance_matrix(alignment):
    """Calculate distance matrix from alignment."""

    print("Calculating Distance Matrix")
    print("-" * 60)

    calculator = DistanceCalculator("identity")
    dm = calculator.get_distance(alignment)

    print("Distance matrix:")
    print(dm)
    print()

    return dm


def build_upgma_tree(alignment):
    """Build phylogenetic tree using UPGMA."""

    print("Building UPGMA Tree")
    print("=" * 60)

    # Calculate distance matrix
    calculator = DistanceCalculator("identity")
    dm = calculator.get_distance(alignment)

    # Construct tree
    constructor = DistanceTreeConstructor(calculator)
    tree = constructor.upgma(dm)

    print("UPGMA tree constructed")
    print(f"Number of terminals: {tree.count_terminals()}")
    print()

    return tree


def build_nj_tree(alignment):
    """Build phylogenetic tree using Neighbor-Joining."""

    print("Building Neighbor-Joining Tree")
    print("=" * 60)

    # Calculate distance matrix
    calculator = DistanceCalculator("identity")
    dm = calculator.get_distance(alignment)

    # Construct tree
    constructor = DistanceTreeConstructor(calculator)
    tree = constructor.nj(dm)

    print("Neighbor-Joining tree constructed")
    print(f"Number of terminals: {tree.count_terminals()}")
    print()

    return tree


def visualize_tree(tree, title="Phylogenetic Tree"):
    """Visualize phylogenetic tree."""

    print("Visualizing tree...")
    print()

    # ASCII visualization
    print("ASCII tree:")
    Phylo.draw_ascii(tree)
    print()

    # Matplotlib visualization
    fig, ax = plt.subplots(figsize=(10, 8))
    Phylo.draw(tree, axes=ax, do_show=False)
    ax.set_title(title)
    plt.tight_layout()
    plt.savefig("tree_visualization.png", dpi=300, bbox_inches="tight")
    print("Tree saved to tree_visualization.png")
    print()


def manipulate_tree(tree):
    """Demonstrate tree manipulation operations."""

    print("Tree Manipulation")
    print("=" * 60)

    # Get terminals
    terminals = tree.get_terminals()
    print(f"Terminal nodes: {[t.name for t in terminals]}")
    print()

    # Get nonterminals
    nonterminals = tree.get_nonterminals()
    print(f"Number of internal nodes: {len(nonterminals)}")
    print()

    # Calculate total branch length
    total_length = tree.total_branch_length()
    print(f"Total branch length: {total_length:.4f}")
    print()

    # Find specific clade
    if len(terminals) > 0:
        target_name = terminals[0].name
        found = tree.find_any(name=target_name)
        print(f"Found clade: {found.name}")
        print()

    # Ladderize tree (sort branches)
    tree.ladderize()
    print("Tree ladderized (branches sorted)")
    print()

    # Root at midpoint
    tree.root_at_midpoint()
    print("Tree rooted at midpoint")
    print()

    return tree


def read_and_analyze_tree(tree_file, format="newick"):
    """Read and analyze a phylogenetic tree."""

    print(f"Reading tree from: {tree_file}")
    print("-" * 60)

    tree = Phylo.read(tree_file, format)

    print(f"Tree format: {format}")
    print(f"Number of terminals: {tree.count_terminals()}")
    print(f"Is bifurcating: {tree.is_bifurcating()}")
    print(f"Total branch length: {tree.total_branch_length():.4f}")
    print()

    # Show tree structure
    print("Tree structure:")
    Phylo.draw_ascii(tree)
    print()

    return tree


def compare_trees(tree1, tree2):
    """Compare two phylogenetic trees."""

    print("Comparing Trees")
    print("=" * 60)

    # Get terminal names
    terminals1 = {t.name for t in tree1.get_terminals()}
    terminals2 = {t.name for t in tree2.get_terminals()}

    print(f"Tree 1 terminals: {len(terminals1)}")
    print(f"Tree 2 terminals: {len(terminals2)}")
    print(f"Shared terminals: {len(terminals1 & terminals2)}")
    print(f"Unique to tree 1: {len(terminals1 - terminals2)}")
    print(f"Unique to tree 2: {len(terminals2 - terminals1)}")
    print()


def create_example_alignment():
    """Create an example alignment for demonstration."""

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment

    sequences = [
        SeqRecord(Seq("ACTGCTAGCTAGCTAG"), id="seq1"),
        SeqRecord(Seq("ACTGCTAGCT-GCTAG"), id="seq2"),
        SeqRecord(Seq("ACTGCTAGCTAGCTGG"), id="seq3"),
        SeqRecord(Seq("ACTGCT-GCTAGCTAG"), id="seq4"),
    ]

    alignment = MultipleSeqAlignment(sequences)

    # Save alignment
    AlignIO.write(alignment, "example_alignment.fasta", "fasta")
    print("Created example alignment: example_alignment.fasta")
    print()

    return alignment


def example_workflow():
    """Demonstrate complete alignment and phylogeny workflow."""

    print("=" * 60)
    print("BioPython Alignment & Phylogeny Workflow")
    print("=" * 60)
    print()

    # Pairwise alignment examples
    pairwise_alignment_example()
    print()
    local_alignment_example()
    print()

    # Create example data
    alignment = create_example_alignment()

    # Analyze alignment
    analyze_alignment_statistics(alignment)

    # Calculate distance matrix
    dm = calculate_distance_matrix(alignment)

    # Build trees
    upgma_tree = build_upgma_tree(alignment)
    nj_tree = build_nj_tree(alignment)

    # Manipulate tree
    manipulate_tree(upgma_tree)

    # Visualize
    visualize_tree(upgma_tree, "UPGMA Tree")

    print("Workflow completed!")
    print()


if __name__ == "__main__":
    example_workflow()

    print("Note: For real analyses, use actual alignment files.")
    print("Supported alignment formats: clustal, phylip, stockholm, nexus, fasta")
    print("Supported tree formats: newick, nexus, phyloxml, nexml")

272
scientific-packages/biopython/scripts/blast_search.py
Normal file
@@ -0,0 +1,272 @@
#!/usr/bin/env python3
"""
BLAST searches and result parsing using BioPython.

This script demonstrates:
- Running BLAST searches via NCBI (qblast)
- Parsing BLAST XML output
- Filtering and analyzing results
- Working with alignments and HSPs
"""

from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO


def run_blast_online(sequence, program="blastn", database="nt", expect=0.001):
    """
    Run BLAST search via NCBI's qblast.

    Parameters:
    - sequence: Sequence string or Seq object
    - program: blastn, blastp, blastx, tblastn, tblastx
    - database: nt (nucleotide), nr (protein), refseq_rna, etc.
    - expect: E-value threshold
    """

    print(f"Running {program} search against {database} database...")
    print(f"E-value threshold: {expect}")
    print("-" * 60)

    # Run BLAST
    result_handle = NCBIWWW.qblast(
        program=program,
        database=database,
        sequence=sequence,
        expect=expect,
        hitlist_size=50,  # Number of sequences to show alignments for
    )

    # Save results
    output_file = "blast_results.xml"
    with open(output_file, "w") as out:
        out.write(result_handle.read())

    result_handle.close()

    print(f"BLAST search complete. Results saved to {output_file}")
    print()

    return output_file


def parse_blast_results(xml_file, max_hits=10, evalue_threshold=0.001):
    """Parse BLAST XML results."""

    print(f"Parsing BLAST results from: {xml_file}")
    print(f"E-value threshold: {evalue_threshold}")
    print("=" * 60)

    with open(xml_file) as result_handle:
        blast_record = NCBIXML.read(result_handle)

    print(f"Query: {blast_record.query}")
    print(f"Query length: {blast_record.query_length} residues")
    print(f"Database: {blast_record.database}")
    print(f"Number of alignments: {len(blast_record.alignments)}")
    print()

    hit_count = 0

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect <= evalue_threshold:
                hit_count += 1

                if hit_count <= max_hits:
                    print(f"Hit {hit_count}:")
                    print(f"  Sequence: {alignment.title}")
                    print(f"  Length: {alignment.length}")
                    print(f"  E-value: {hsp.expect:.2e}")
                    print(f"  Score: {hsp.score}")
                    print(f"  Identities: {hsp.identities}/{hsp.align_length} ({hsp.identities / hsp.align_length * 100:.1f}%)")
                    print(f"  Positives: {hsp.positives}/{hsp.align_length} ({hsp.positives / hsp.align_length * 100:.1f}%)")
                    print(f"  Gaps: {hsp.gaps}/{hsp.align_length}")
                    print(f"  Query range: {hsp.query_start} - {hsp.query_end}")
                    print(f"  Subject range: {hsp.sbjct_start} - {hsp.sbjct_end}")
                    print()

                    # Show alignment (first 100 characters)
                    print("  Alignment preview:")
                    print(f"  Query: {hsp.query[:100]}")
                    print(f"  Match: {hsp.match[:100]}")
                    print(f"  Sbjct: {hsp.sbjct[:100]}")
                    print()

    print(f"Total significant hits (E-value <= {evalue_threshold}): {hit_count}")
    print()

    return blast_record


def parse_multiple_queries(xml_file):
    """Parse BLAST results with multiple queries."""

    print(f"Parsing multiple queries from: {xml_file}")
    print("=" * 60)

    with open(xml_file) as result_handle:
        # NCBIXML.parse is a lazy generator, so iterate inside the with block
        blast_records = NCBIXML.parse(result_handle)

        for i, blast_record in enumerate(blast_records, 1):
            print(f"\nQuery {i}: {blast_record.query}")
            print(f"  Number of hits: {len(blast_record.alignments)}")

            if blast_record.alignments:
                best_hit = blast_record.alignments[0]
                best_hsp = best_hit.hsps[0]
                print(f"  Best hit: {best_hit.title[:80]}...")
                print(f"  Best E-value: {best_hsp.expect:.2e}")


def filter_blast_results(blast_record, min_identity=0.7, min_coverage=0.5):
    """Filter BLAST results by identity and coverage."""

    print("Filtering results:")
    print(f"  Minimum identity: {min_identity * 100}%")
    print(f"  Minimum coverage: {min_coverage * 100}%")
    print("-" * 60)

    filtered_hits = []

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            identity_fraction = hsp.identities / hsp.align_length
            coverage = hsp.align_length / blast_record.query_length

            if identity_fraction >= min_identity and coverage >= min_coverage:
                filtered_hits.append(
                    {
                        "title": alignment.title,
                        "length": alignment.length,
                        "evalue": hsp.expect,
                        "identity": identity_fraction,
                        "coverage": coverage,
                        "alignment": alignment,
                        "hsp": hsp,
                    }
                )

    print(f"Found {len(filtered_hits)} hits matching criteria")
    print()

    # Sort by E-value
    filtered_hits.sort(key=lambda x: x["evalue"])

    # Display top hits
    for i, hit in enumerate(filtered_hits[:5], 1):
        print(f"{i}. {hit['title'][:80]}")
        print(f"   Identity: {hit['identity']*100:.1f}%, Coverage: {hit['coverage']*100:.1f}%, E-value: {hit['evalue']:.2e}")
    print()

    return filtered_hits


def extract_hit_sequences(blast_record, output_file="blast_hits.fasta"):
    """Extract aligned sequences from BLAST results."""

    print(f"Extracting hit sequences to {output_file}...")

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    records = []

    for i, alignment in enumerate(blast_record.alignments[:10]):  # Top 10 hits
        hsp = alignment.hsps[0]  # Best HSP for this alignment

        # Extract accession from title
        accession = alignment.title.split()[0]

        # Create SeqRecord from aligned subject sequence
        record = SeqRecord(
            Seq(hsp.sbjct.replace("-", "")),  # Remove gaps
            id=accession,
            description=f"E-value: {hsp.expect:.2e}, Identity: {hsp.identities}/{hsp.align_length}",
        )

        records.append(record)

    # Write to FASTA
    SeqIO.write(records, output_file, "fasta")

    print(f"Extracted {len(records)} sequences")
    print()


def analyze_blast_statistics(blast_record):
    """Compute statistics from BLAST results."""

    print("BLAST Result Statistics:")
    print("-" * 60)

    if not blast_record.alignments:
        print("No hits found")
        return

    evalues = []
    identities = []
    scores = []

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            evalues.append(hsp.expect)
            identities.append(hsp.identities / hsp.align_length)
            scores.append(hsp.score)

    import statistics

    print(f"Total HSPs: {len(evalues)}")
    print("\nE-values:")
    print(f"  Min: {min(evalues):.2e}")
    print(f"  Max: {max(evalues):.2e}")
    print(f"  Median: {statistics.median(evalues):.2e}")
    print("\nIdentity percentages:")
    print(f"  Min: {min(identities)*100:.1f}%")
    print(f"  Max: {max(identities)*100:.1f}%")
    print(f"  Mean: {statistics.mean(identities)*100:.1f}%")
    print("\nRaw scores:")  # hsp.score is the raw score; use hsp.bits for bit scores
    print(f"  Min: {min(scores):.1f}")
    print(f"  Max: {max(scores):.1f}")
    print(f"  Mean: {statistics.mean(scores):.1f}")
    print()


def example_workflow():
    """Demonstrate BLAST workflow."""

    print("=" * 60)
    print("BioPython BLAST Example Workflow")
    print("=" * 60)
    print()

    # Example sequence (human beta-globin)
    example_sequence = """
    ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
    """.replace("\n", "").replace(" ", "")

    print("Example: Human beta-globin sequence")
    print(f"Length: {len(example_sequence)} bp")
    print()

    # Note: Uncomment to run actual BLAST search (takes time)
    # xml_file = run_blast_online(example_sequence, program="blastn", database="nt", expect=0.001)

    # For demonstration, use a pre-existing results file
    print("To run a real BLAST search, uncomment the run_blast_online() line")
    print("For now, demonstrating parsing with example results file")
    print()

    # If you have results, parse them:
    # blast_record = parse_blast_results("blast_results.xml", max_hits=5)
    # filtered = filter_blast_results(blast_record, min_identity=0.9)
    # analyze_blast_statistics(blast_record)
    # extract_hit_sequences(blast_record)


if __name__ == "__main__":
    example_workflow()

    print()
    print("Note: BLAST searches can take several minutes.")
    print("For production use, consider running local BLAST instead.")
215
scientific-packages/biopython/scripts/file_io.py
Normal file
@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
File I/O operations using BioPython SeqIO.

This script demonstrates:
- Reading sequences from various formats
- Writing sequences to files
- Converting between formats
- Filtering and processing sequences
- Working with large files efficiently
"""

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def read_sequences(filename, format_type):
    """Read and display sequences from a file."""

    print(f"Reading {format_type} file: {filename}")
    print("-" * 60)

    count = 0
    for record in SeqIO.parse(filename, format_type):
        count += 1
        print(f"ID: {record.id}")
        print(f"Name: {record.name}")
        print(f"Description: {record.description}")
        print(f"Sequence length: {len(record.seq)}")
        print(f"Sequence: {record.seq[:50]}...")
        print()

        # Only show first 3 sequences
        if count >= 3:
            break

    # Count total sequences (re-parses the file)
    total = len(list(SeqIO.parse(filename, format_type)))
    print(f"Total sequences in file: {total}")
    print()


def read_single_sequence(filename, format_type):
    """Read a single sequence from a file."""

    record = SeqIO.read(filename, format_type)

    print("Single sequence record:")
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print()


def write_sequences(records, output_filename, format_type):
    """Write sequences to a file."""

    count = SeqIO.write(records, output_filename, format_type)
    print(f"Wrote {count} sequences to {output_filename} in {format_type} format")
    print()


def convert_format(input_file, input_format, output_file, output_format):
    """Convert sequences from one format to another."""

    count = SeqIO.convert(input_file, input_format, output_file, output_format)
    print(f"Converted {count} sequences from {input_format} to {output_format}")
    print()


def filter_sequences(input_file, format_type, min_length=100, max_length=1000):
    """Filter sequences by length."""

    filtered = []

    for record in SeqIO.parse(input_file, format_type):
        if min_length <= len(record.seq) <= max_length:
            filtered.append(record)

    print(f"Found {len(filtered)} sequences between {min_length} and {max_length} bp")
    return filtered


def extract_subsequence(input_file, format_type, seq_id, start, end):
    """Extract a subsequence from a specific record."""

    # Index for efficient access
    record_dict = SeqIO.index(input_file, format_type)

    if seq_id in record_dict:
        record = record_dict[seq_id]
        subseq = record.seq[start:end]
        print(f"Extracted subsequence from {seq_id} ({start}:{end}):")
        print(subseq)
        return subseq
    else:
        print(f"Sequence {seq_id} not found")
        return None


def create_sequence_records():
    """Create SeqRecord objects from scratch."""

    # Simple record
    simple_record = SeqRecord(
        Seq("ATGCATGCATGC"),
        id="seq001",
        name="MySequence",
        description="Example sequence",
    )
    # molecule_type annotation is required for writing GenBank output
    simple_record.annotations["molecule_type"] = "DNA"

    # Record with annotations
    annotated_record = SeqRecord(
        Seq("ATGGTGCATCTGACTCCTGAGGAG"),
        id="seq002",
        name="GeneX",
        description="Important gene",
    )
    annotated_record.annotations["molecule_type"] = "DNA"
    annotated_record.annotations["organism"] = "Homo sapiens"

    return [simple_record, annotated_record]


def index_large_file(filename, format_type):
    """Index a large file for random access without loading into memory."""

    # Create index
    record_index = SeqIO.index(filename, format_type)

    print(f"Indexed {len(record_index)} sequences")
    print(f"Available IDs: {list(record_index.keys())[:10]}...")
    print()

    # Access specific record by ID
    if len(record_index) > 0:
        first_id = list(record_index.keys())[0]
        record = record_index[first_id]
        print(f"Accessed record: {record.id}")
        print()

    # Close index
    record_index.close()


def parse_with_quality_scores(fastq_file):
    """Parse FASTQ files with quality scores."""

    print("Parsing FASTQ with quality scores:")
    print("-" * 60)

    for record in SeqIO.parse(fastq_file, "fastq"):
        print(f"ID: {record.id}")
        print(f"Sequence: {record.seq[:50]}...")
        print(f"Quality scores (first 10): {record.letter_annotations['phred_quality'][:10]}")

        # Calculate average quality
        avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
        print(f"Average quality: {avg_quality:.2f}")
        print()
        break  # Just show first record


def batch_process_large_file(input_file, format_type, batch_size=100):
    """Process large files in batches to manage memory."""

    batch = []
    count = 0

    for record in SeqIO.parse(input_file, format_type):
        batch.append(record)
        count += 1

        if len(batch) == batch_size:
            # Process batch
            print(f"Processing batch of {len(batch)} sequences...")
            # Do something with batch
            batch = []  # Clear for next batch

    # Process remaining records
    if batch:
        print(f"Processing final batch of {len(batch)} sequences...")

    print(f"Total sequences processed: {count}")


def example_workflow():
    """Demonstrate a complete workflow."""

    print("=" * 60)
    print("BioPython SeqIO Workflow Example")
    print("=" * 60)
    print()

    # Create example sequences
    records = create_sequence_records()

    # Write as FASTA
    write_sequences(records, "example_output.fasta", "fasta")

    # Write as GenBank
    write_sequences(records, "example_output.gb", "genbank")

    # Convert FASTA to GenBank (would work if file exists)
    # convert_format("input.fasta", "fasta", "output.gb", "genbank")

    print("Example workflow completed!")


if __name__ == "__main__":
    example_workflow()

    print()
    print("Note: This script demonstrates BioPython SeqIO operations.")
    print("Uncomment and adapt the functions for your specific files.")
293
scientific-packages/biopython/scripts/ncbi_entrez.py
Normal file
@@ -0,0 +1,293 @@
#!/usr/bin/env python3
"""
NCBI Entrez database access using BioPython.

This script demonstrates:
- Searching NCBI databases
- Downloading sequences by accession
- Retrieving PubMed articles
- Batch downloading with WebEnv
- Proper error handling and rate limiting
"""

import time
from Bio import Entrez, SeqIO

# IMPORTANT: Always set your email
Entrez.email = "your.email@example.com"  # Change this!


def search_nucleotide(query, max_results=10):
    """Search NCBI nucleotide database."""

    print(f"Searching nucleotide database for: {query}")
    print("-" * 60)

    handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()

    print(f"Found {record['Count']} total matches")
    print(f"Returning top {len(record['IdList'])} IDs:")
    print(record["IdList"])
    print()

    return record["IdList"]


def fetch_sequence_by_accession(accession):
    """Download a sequence by accession number."""

    print(f"Fetching sequence: {accession}")

    try:
        handle = Entrez.efetch(
            db="nucleotide", id=accession, rettype="gb", retmode="text"
        )
        record = SeqIO.read(handle, "genbank")
        handle.close()

        print(f"Successfully retrieved: {record.id}")
        print(f"Description: {record.description}")
        print(f"Length: {len(record.seq)} bp")
        print(f"Organism: {record.annotations.get('organism', 'Unknown')}")
        print()

        return record

    except Exception as e:
        print(f"Error fetching {accession}: {e}")
        return None


def fetch_multiple_sequences(id_list, output_file="downloaded_sequences.fasta"):
    """Download multiple sequences and save to file."""

    print(f"Fetching {len(id_list)} sequences...")

    try:
        # For >200 IDs, efetch automatically uses POST
        handle = Entrez.efetch(
            db="nucleotide", id=id_list, rettype="fasta", retmode="text"
        )

        # Parse and save
        records = list(SeqIO.parse(handle, "fasta"))
        handle.close()

        SeqIO.write(records, output_file, "fasta")

        print(f"Successfully downloaded {len(records)} sequences to {output_file}")
        print()

        return records

    except Exception as e:
        print(f"Error fetching sequences: {e}")
        return []


def search_and_download(query, output_file, max_results=100):
    """Complete workflow: search and download sequences."""

    print(f"Searching and downloading: {query}")
    print("=" * 60)

    # Search
    handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()

    id_list = record["IdList"]
    print(f"Found {len(id_list)} sequences")

    if not id_list:
        print("No results found")
        return

    # Download in batches to be polite
    batch_size = 100
    all_records = []

    for start in range(0, len(id_list), batch_size):
        end = min(start + batch_size, len(id_list))
        batch_ids = id_list[start:end]

        print(f"Downloading batch {start // batch_size + 1} ({len(batch_ids)} sequences)...")

        handle = Entrez.efetch(
            db="nucleotide", id=batch_ids, rettype="fasta", retmode="text"
        )
        batch_records = list(SeqIO.parse(handle, "fasta"))
        handle.close()

        all_records.extend(batch_records)

        # Be polite - wait between requests
        time.sleep(0.5)

    # Save all records
    SeqIO.write(all_records, output_file, "fasta")
    print(f"Downloaded {len(all_records)} sequences to {output_file}")
    print()


def use_history_for_large_queries(query, max_results=1000):
    """Use NCBI History server for large queries."""

    print("Using NCBI History server for large query")
    print("-" * 60)

    # Search with history
    search_handle = Entrez.esearch(
        db="nucleotide", term=query, retmax=max_results, usehistory="y"
    )
    search_results = Entrez.read(search_handle)
    search_handle.close()

    count = int(search_results["Count"])
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    print(f"Found {count} total sequences")
    print(f"WebEnv: {webenv[:20]}...")
    print(f"QueryKey: {query_key}")
    print()

    # Fetch in batches using history
    batch_size = 500
    all_records = []

    for start in range(0, min(count, max_results), batch_size):
        end = min(start + batch_size, max_results)

        print(f"Downloading records {start + 1} to {end}...")

        fetch_handle = Entrez.efetch(
            db="nucleotide",
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=batch_size,
            webenv=webenv,
            query_key=query_key,
        )

        batch_records = list(SeqIO.parse(fetch_handle, "fasta"))
        fetch_handle.close()

        all_records.extend(batch_records)

        # Be polite
        time.sleep(0.5)

    print(f"Downloaded {len(all_records)} sequences total")
    return all_records


def search_pubmed(query, max_results=10):
    """Search PubMed for articles."""

    print(f"Searching PubMed for: {query}")
    print("-" * 60)

    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()

    id_list = record["IdList"]
    print(f"Found {record['Count']} total articles")
    print(f"Returning {len(id_list)} PMIDs:")
    print(id_list)
    print()

    return id_list


def fetch_pubmed_abstracts(pmid_list):
    """Fetch PubMed article abstracts."""

    print(f"Fetching abstracts for {len(pmid_list)} articles...")

    handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="abstract", retmode="text")
    abstracts = handle.read()
    handle.close()

    print(abstracts[:500])  # Show first 500 characters
    print("...")
    print()


def get_database_info(database="nucleotide"):
    """Get information about an NCBI database."""

    print(f"Getting info for database: {database}")
    print("-" * 60)

    handle = Entrez.einfo(db=database)
    record = Entrez.read(handle)
    handle.close()

    db_info = record["DbInfo"]
    print(f"Name: {db_info['DbName']}")
    print(f"Description: {db_info['Description']}")
    print(f"Record count: {db_info['Count']}")
    print(f"Last update: {db_info['LastUpdate']}")
    print()


def link_databases(db_from, db_to, id_):
    """Find related records in other databases."""

    print(f"Finding links from {db_from} ID {id_} to {db_to}")
    print("-" * 60)

    handle = Entrez.elink(dbfrom=db_from, db=db_to, id=id_)
    record = Entrez.read(handle)
    handle.close()

    if record[0]["LinkSetDb"]:
        linked_ids = [link["Id"] for link in record[0]["LinkSetDb"][0]["Link"]]
        print(f"Found {len(linked_ids)} linked records")
        print(f"IDs: {linked_ids[:10]}")
    else:
        print("No linked records found")

    print()


def example_workflow():
    """Demonstrate complete Entrez workflow."""

    print("=" * 60)
    print("BioPython Entrez Example Workflow")
    print("=" * 60)
    print()

    # Note: These are examples - uncomment to run with your email set

    # # Example 1: Search and get IDs
    # ids = search_nucleotide("Homo sapiens[Organism] AND COX1[Gene]", max_results=5)
    #
    # # Example 2: Fetch a specific sequence
    # fetch_sequence_by_accession("NM_001301717")
    #
    # # Example 3: Complete search and download
    # search_and_download("Escherichia coli[Organism] AND 16S", "ecoli_16s.fasta", max_results=50)
    #
    # # Example 4: PubMed search
    # pmids = search_pubmed("CRISPR[Title] AND 2023[PDAT]", max_results=5)
    # fetch_pubmed_abstracts(pmids[:2])
    #
    # # Example 5: Get database info
    # get_database_info("nucleotide")

    print("Examples are commented out. Uncomment and set your email to run.")


if __name__ == "__main__":
    example_workflow()

    print()
    print("IMPORTANT: Always set Entrez.email before using these functions!")
    print("NCBI requires an email address for their E-utilities.")
125
scientific-packages/biopython/scripts/sequence_operations.py
Normal file
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""
Common sequence operations using BioPython.

This script demonstrates basic sequence manipulation tasks like:
- Creating and manipulating Seq objects
- Transcription and translation
- Complement and reverse complement
- Calculating GC content and melting temperature
"""

from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt


def demonstrate_seq_operations():
    """Show common Seq object operations."""

    # Create DNA sequence
    dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTG")

    print("Original DNA sequence:")
    print(dna_seq)
    print()

    # Transcription (DNA -> RNA)
    rna_seq = dna_seq.transcribe()
    print("Transcribed to RNA:")
    print(rna_seq)
    print()

    # Translation (DNA -> Protein)
    protein_seq = dna_seq.translate()
    print("Translated to protein:")
    print(protein_seq)
    print()

    # Translation with stop codon handling
    protein_to_stop = dna_seq.translate(to_stop=True)
    print("Translated to first stop codon:")
    print(protein_to_stop)
    print()

    # Complement
    complement = dna_seq.complement()
    print("Complement:")
    print(complement)
    print()

    # Reverse complement
    reverse_complement = dna_seq.reverse_complement()
    print("Reverse complement:")
    print(reverse_complement)
    print()

    # GC content
    gc = gc_fraction(dna_seq) * 100
    print(f"GC content: {gc:.2f}%")
    print()

    # Melting temperature
    tm = mt.Tm_NN(dna_seq)
    print(f"Melting temperature (nearest-neighbor): {tm:.2f}°C")
    print()

    # Sequence searching
    codon_start = dna_seq.find("ATG")
    print(f"Start codon (ATG) position: {codon_start}")

    # Count occurrences
    g_count = dna_seq.count("G")
    print(f"Number of G nucleotides: {g_count}")
    print()


def translate_with_genetic_code():
    """Demonstrate translation with different genetic codes."""

    dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCT")

    # Standard genetic code (table 1)
    standard = dna_seq.translate(table=1)
    print("Standard genetic code translation:")
    print(standard)

    # Vertebrate mitochondrial code (table 2)
    mito = dna_seq.translate(table=2)
    print("Vertebrate mitochondrial code translation:")
    print(mito)
    print()


def working_with_codons():
    """Access genetic code tables."""
    from Bio.Data import CodonTable

    # Get standard genetic code
    standard_table = CodonTable.unambiguous_dna_by_id[1]

    print("Standard genetic code:")
    print(f"Start codons: {standard_table.start_codons}")
    print(f"Stop codons: {standard_table.stop_codons}")
    print()

    # Show some codon translations
    print("Example codons:")
    for codon in ["ATG", "TGG", "TAA", "TAG", "TGA"]:
        if codon in standard_table.stop_codons:
            print(f"{codon} -> STOP")
        else:
            aa = standard_table.forward_table.get(codon, "Unknown")
            print(f"{codon} -> {aa}")


if __name__ == "__main__":
    print("=" * 60)
    print("BioPython Sequence Operations Demo")
    print("=" * 60)
    print()

    demonstrate_seq_operations()
    print("-" * 60)
    translate_with_genetic_code()
    print("-" * 60)
    working_with_codons()