mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-28 07:33:45 +08:00
Add more scientific skills
This commit is contained in:
232
scientific-packages/biopython/references/core_modules.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# BioPython Core Modules Reference
|
||||
|
||||
This document provides detailed information about BioPython's core modules and their capabilities.
|
||||
|
||||
## Sequence Handling
|
||||
|
||||
### Bio.Seq - Sequence Objects
|
||||
|
||||
Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior.
|
||||
|
||||
**Creation:**
|
||||
```python
|
||||
from Bio.Seq import Seq
|
||||
my_seq = Seq("AGTACACTGGT")
|
||||
```
|
||||
|
||||
**Key Operations:**
|
||||
- String methods: `find()`, `count()`, `count_overlap()` (for overlapping patterns)
|
||||
- Complement/Reverse complement: Returns complementary sequences
|
||||
- Transcription: DNA → RNA (T → U)
|
||||
- Back transcription: RNA → DNA
|
||||
- Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling
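As a minimal sketch of the operations listed above, the following applies them to a short coding sequence. The sequence is an arbitrary illustration rather than data from this reference, and `to_stop=True` is one of several ways to handle stop codons.

```python
from Bio.Seq import Seq

# Illustrative coding sequence (an assumption for this sketch)
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

rev_comp = coding_dna.reverse_complement()    # reverse complement strand
mrna = coding_dna.transcribe()                # DNA -> RNA (T -> U)
dna_again = mrna.back_transcribe()            # RNA -> DNA
protein = coding_dna.translate(to_stop=True)  # stop at the first stop codon

# count() skips overlapping matches; count_overlap() includes them
plain_count = Seq("AAAA").count("AA")
overlap_count = Seq("AAAA").count_overlap("AA")

print(protein, plain_count, overlap_count)
```

Passing `table=` to `translate()` selects a different NCBI genetic code.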
|
||||
|
||||
**Use Cases:**
|
||||
- DNA/RNA sequence manipulation
|
||||
- Converting between nucleic acid types
|
||||
- Protein translation from coding sequences
|
||||
- Sequence searching and pattern counting
|
||||
|
||||
### Bio.SeqRecord - Sequence Metadata
|
||||
|
||||
SeqRecord wraps Seq objects with metadata like ID, description, and features.
|
||||
|
||||
**Attributes:**
|
||||
- `seq`: The sequence itself (Seq object)
|
||||
- `id`: Unique identifier
|
||||
- `name`: Short name
|
||||
- `description`: Longer description
|
||||
- `features`: List of SeqFeature objects
|
||||
- `annotations`: Dictionary of additional information
|
||||
- `letter_annotations`: Per-letter annotations (e.g., quality scores)
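The attributes above can be sketched on a hand-built record. The ID, name, description, and quality values here are hypothetical placeholders chosen only for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; all metadata values are placeholders
record = SeqRecord(
    Seq("ACGT"),
    id="example_1",
    name="example",
    description="hypothetical record",
)
record.annotations["molecule_type"] = "DNA"
record.letter_annotations["phred_quality"] = [30, 31, 32, 33]

# Slicing a SeqRecord slices the per-letter annotations along with the sequence
sub_record = record[1:3]
print(sub_record.seq, sub_record.letter_annotations["phred_quality"])
```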
|
||||
|
||||
### Bio.SeqFeature - Sequence Annotations
|
||||
|
||||
Manages sequence annotations and features such as genes, promoters, and coding regions.
|
||||
|
||||
**Common Features:**
|
||||
- Gene locations
|
||||
- CDS (coding sequences)
|
||||
- Promoters and regulatory elements
|
||||
- Exons and introns
|
||||
- Protein domains
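A feature pairs a location with a type, and the location can be used to pull the matching subsequence out of a parent sequence. The sketch below assumes a made-up sequence and made-up coordinates.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

# Made-up parent sequence with a short open reading frame at positions 2..11
parent = Seq("AAATGCCCTAAGG")

# Locations use Python-style, zero-based, end-exclusive coordinates
cds = SeqFeature(FeatureLocation(2, 11, strand=1), type="CDS")
cds_seq = cds.extract(parent)

# The same span on the reverse strand yields the reverse complement
reverse_feature = SeqFeature(FeatureLocation(2, 11, strand=-1), type="misc_feature")
reverse_seq = reverse_feature.extract(parent)

print(cds_seq, cds_seq.translate(), reverse_seq)
```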
|
||||
|
||||
## File Input/Output
|
||||
|
||||
### Bio.SeqIO - Sequence File I/O
|
||||
|
||||
Unified interface for reading and writing sequence files in multiple formats.
|
||||
|
||||
**Supported Formats:**
|
||||
- FASTA/FASTQ: Standard sequence formats
|
||||
- GenBank/EMBL: Feature-rich annotation formats
|
||||
- Clustal/Stockholm/PHYLIP: Alignment formats
|
||||
- ABI/SFF: Trace and flowgram data
|
||||
- Swiss-Prot/PIR: Protein databases
|
||||
- PDB: Protein structure files
|
||||
|
||||
**Key Functions:**
|
||||
|
||||
**SeqIO.parse()** - Iterator for reading multiple records:
|
||||
```python
from Bio import SeqIO

for record in SeqIO.parse("file.fasta", "fasta"):
    print(record.id, len(record.seq))
```
|
||||
|
||||
**SeqIO.read()** - Read single record:
|
||||
```python
|
||||
record = SeqIO.read("file.fasta", "fasta")
|
||||
```
|
||||
|
||||
**SeqIO.write()** - Write sequences:
|
||||
```python
|
||||
SeqIO.write(sequences, "output.fasta", "fasta")
|
||||
```
|
||||
|
||||
**SeqIO.convert()** - Direct format conversion:
|
||||
```python
|
||||
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
|
||||
```
|
||||
|
||||
**SeqIO.index()** - Memory-efficient random access for large files:
|
||||
```python
|
||||
record_dict = SeqIO.index("large_file.fasta", "fasta")
|
||||
sequence = record_dict["seq_id"]
|
||||
```
|
||||
|
||||
**SeqIO.to_dict()** - Load all records into dictionary (memory-based):
|
||||
```python
|
||||
record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta"))
|
||||
```
|
||||
|
||||
**Common Patterns:**
|
||||
- Format conversion between FASTA, GenBank, FASTQ
|
||||
- Filtering sequences by length, ID, or content
|
||||
- Extracting subsequences
|
||||
- Batch processing large files with iterators
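The filtering pattern can be sketched without touching the file system by parsing from and writing to in-memory handles. The two records and the length threshold of 10 are assumptions made for the example.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records held in memory instead of a file
fasta_text = ">short\nACGT\n>long\nACGTACGTACGT\n"

# parse() is lazy, so records are filtered one at a time
records = SeqIO.parse(StringIO(fasta_text), "fasta")
long_records = [rec for rec in records if len(rec.seq) >= 10]

# write() returns the number of records written
output = StringIO()
written = SeqIO.write(long_records, output, "fasta")
print(written, output.getvalue())
```

With real data, file paths can be passed in place of the `StringIO` handles.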
|
||||
|
||||
### Bio.AlignIO - Multiple Sequence Alignment I/O
|
||||
|
||||
Handles multiple sequence alignment files.
|
||||
|
||||
**Key Functions:**
|
||||
- `write()`: Save alignments
|
||||
- `parse()`: Read multiple alignments
|
||||
- `read()`: Read single alignment
|
||||
- `convert()`: Convert between formats
|
||||
|
||||
**Supported Formats:**
|
||||
- Clustal
|
||||
- PHYLIP (sequential and interleaved)
|
||||
- Stockholm
|
||||
- NEXUS
|
||||
- FASTA (aligned)
|
||||
- MAF (Multiple Alignment Format)
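A short sketch of reading an alignment in one format and writing it in another, using in-memory handles. The two aligned sequences are invented for the example.

```python
from io import StringIO

from Bio import AlignIO

# Invented two-sequence alignment in aligned-FASTA format
aligned_fasta = ">seq1\nACGT-A\n>seq2\nACGTTA\n"

alignment = AlignIO.read(StringIO(aligned_fasta), "fasta")
print(len(alignment), alignment.get_alignment_length())

# Write the same alignment out in Clustal format
clustal_out = StringIO()
AlignIO.write(alignment, clustal_out, "clustal")
print(clustal_out.getvalue())
```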
|
||||
|
||||
## Sequence Alignment
|
||||
|
||||
### Bio.Align - Alignment Tools
|
||||
|
||||
**PairwiseAligner** - High-performance pairwise alignment:
|
||||
```python
|
||||
from Bio import Align
|
||||
aligner = Align.PairwiseAligner()
|
||||
aligner.mode = 'global' # or 'local'
|
||||
aligner.match_score = 2
|
||||
aligner.mismatch_score = -1
|
||||
aligner.gap_score = -2.5
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
```
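To show how the result of `align()` can be inspected, here is a small sketch that uses the same scoring settings on two short, made-up sequences.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5

# Made-up sequences; plain strings are accepted as well as Seq objects
alignments = aligner.align("GATTACA", "GATCA")

best = alignments[0]   # alignments are sorted so that index 0 is optimal
print(best.score)      # five matches (5 * 2) plus two gap positions (2 * -2.5)
print(best)            # human-readable rendering of the alignment
```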
|
||||
|
||||
**CodonAligner** - Codon-aware alignment
|
||||
|
||||
**MultipleSeqAlignment** - Container for MSA with column access
|
||||
|
||||
### Bio.pairwise2 (Legacy)
|
||||
|
||||
Legacy pairwise alignment module with functions like `align.globalxx()`, `align.localxx()`.
|
||||
|
||||
## Sequence Analysis Utilities
|
||||
|
||||
### Bio.SeqUtils - Sequence Analysis
|
||||
|
||||
Collection of utility functions:
|
||||
|
||||
**CheckSum** - Calculate sequence checksums (CRC32, CRC64, GCG)
|
||||
|
||||
**MeltingTemp** - DNA melting temperature calculations:
|
||||
- Nearest-neighbor method
|
||||
- Wallace rule
|
||||
- GC content method
|
||||
|
||||
**IsoelectricPoint** - Protein pI calculation
|
||||
|
||||
**ProtParam** - Protein analysis:
|
||||
- Molecular weight
|
||||
- Aromaticity
|
||||
- Instability index
|
||||
- Secondary structure fractions
|
||||
|
||||
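**gc_fraction/GC_skew** - Calculate GC content with `gc_fraction()` (which replaces the older `GC()` function in recent Biopython releases) and GC skew over sequence windows with `GC_skew()`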
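A brief sketch of a few of these utilities. It assumes a Biopython release that provides `gc_fraction` (1.80 or later), and the primer and peptide are arbitrary examples.

```python
from Bio.SeqUtils import MeltingTemp, gc_fraction
from Bio.SeqUtils.ProtParam import ProteinAnalysis

primer = "ACGTACGT"                           # arbitrary example primer
gc = gc_fraction(primer)                      # fraction between 0 and 1
tm_wallace = MeltingTemp.Tm_Wallace(primer)   # 4*(G+C) + 2*(A+T)

analysis = ProteinAnalysis("MAIVMGRKGAR")     # arbitrary example peptide
weight = analysis.molecular_weight()
pi = analysis.isoelectric_point()
aa_counts = analysis.count_amino_acids()

print(gc, tm_wallace, round(weight, 1), round(pi, 2), aa_counts["M"])
```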
|
||||
|
||||
### Bio.Data.CodonTable - Genetic Codes
|
||||
|
||||
Access to NCBI genetic code tables:
|
||||
```python
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table.forward_table)  # codon -> amino acid
print(standard_table.back_table)     # amino acid -> one representative codon
print(standard_table.start_codons)
print(standard_table.stop_codons)
```
|
||||
|
||||
**Available codes:**
|
||||
- Standard code (1)
|
||||
- Vertebrate mitochondrial (2)
|
||||
- Yeast mitochondrial (3)
|
||||
- And many more organism-specific codes
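The practical effect of choosing a table can be sketched with a single codon: `TGA` is a stop codon in the standard code but encodes tryptophan in the vertebrate mitochondrial code.

```python
from Bio.Data import CodonTable
from Bio.Seq import Seq

standard = CodonTable.unambiguous_dna_by_id[1]
vert_mito = CodonTable.unambiguous_dna_by_id[2]

codon = Seq("TGA")
standard_aa = str(codon.translate(table=1))  # stop in the standard code
mito_aa = str(codon.translate(table=2))      # tryptophan in vertebrate mitochondria

print(standard_aa, mito_aa)
print("TGA" in standard.stop_codons, vert_mito.forward_table["TGA"])
```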
|
||||
|
||||
## Sequence Motifs and Patterns
|
||||
|
||||
### Bio.motifs - Sequence Motif Analysis
|
||||
|
||||
Tools for working with sequence motifs:
|
||||
|
||||
**Position Weight Matrices (PWM):**
|
||||
- Create PWM from aligned sequences
|
||||
- Calculate information content
|
||||
- Search sequences for motif matches
|
||||
- Generate consensus sequences
|
||||
|
||||
**Position Specific Scoring Matrices (PSSM):**
|
||||
- Convert PWM to PSSM
|
||||
- Score sequences against motifs
|
||||
- Determine significance thresholds
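The PWM and PSSM steps above can be sketched as follows. The seven motif instances, the pseudocount of 0.5, the target sequence, and the score threshold of 3.0 are all illustrative assumptions.

```python
from Bio import motifs
from Bio.Seq import Seq

# Illustrative aligned motif instances (all the same length)
instances = [
    Seq(s)
    for s in ("TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC")
]
motif = motifs.create(instances)
print(motif.counts)      # per-position letter counts
print(motif.consensus)   # most frequent letter at each position

# Counts -> position weight matrix -> log-odds scoring matrix
pwm = motif.counts.normalize(pseudocounts=0.5)
pssm = pwm.log_odds()

# Scan a target sequence; negative positions indicate reverse-strand hits
target = Seq("TACACTGCATTACAACCCAAGCATTA")
hits = [(pos, score) for pos, score in pssm.search(target, threshold=3.0)]
print(hits)
```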
|
||||
|
||||
**Supported Formats:**
|
||||
- JASPAR
|
||||
- TRANSFAC
|
||||
- MEME
|
||||
- AlignAce
|
||||
|
||||
### Bio.Restriction - Restriction Enzymes
|
||||
|
||||
Comprehensive restriction enzyme database and analysis:
|
||||
|
||||
**Capabilities:**
|
||||
- Search for restriction sites
|
||||
- Predict digestion products
|
||||
- Analyze restriction maps
|
||||
- Access enzyme properties (recognition site, cut positions, isoschizomers)
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
from Bio import Restriction
|
||||
from Bio.Seq import Seq
|
||||
|
||||
seq = Seq("GAATTC...")
|
||||
enzyme = Restriction.EcoRI
|
||||
results = enzyme.search(seq)
|
||||
```
|
||||
306
scientific-packages/biopython/references/database_tools.md
Normal file
@@ -0,0 +1,306 @@
|
||||
# BioPython Database Access and Search Tools
|
||||
|
||||
This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.
|
||||
|
||||
## NCBI Database Access
|
||||
|
||||
### Bio.Entrez - NCBI E-utilities Interface
|
||||
|
||||
Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.
|
||||
|
||||
**Important:** Always set your email before using Entrez:
|
||||
```python
|
||||
from Bio import Entrez
|
||||
Entrez.email = "your.email@example.com"
|
||||
```
|
||||
|
||||
#### Core Query Functions
|
||||
|
||||
**esearch** - Search databases and retrieve IDs:
|
||||
```python
|
||||
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
|
||||
record = Entrez.read(handle)
|
||||
id_list = record["IdList"]
|
||||
```
|
||||
|
||||
Parameters:
|
||||
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
|
||||
- `term`: Search query
|
||||
- `retmax`: Maximum number of IDs to return
|
||||
- `sort`: Sort order (relevance, pub_date, etc.)
|
||||
- `usehistory`: Store results on server (useful for large queries)
|
||||
|
||||
**efetch** - Retrieve full records:
|
||||
```python
from Bio import SeqIO

handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```
|
||||
|
||||
Parameters:
|
||||
- `db`: Database name
|
||||
- `id`: Single ID or comma-separated list
|
||||
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
|
||||
- `retmode`: Return mode (text, xml, asn.1)
|
||||
- Automatically uses POST for >200 IDs
|
||||
|
||||
**elink** - Find related records across databases:
|
||||
```python
|
||||
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
|
||||
result = Entrez.read(handle)
|
||||
```
|
||||
|
||||
Parameters:
|
||||
- `dbfrom`: Source database
|
||||
- `db`: Target database
|
||||
- `id`: ID(s) to link from
|
||||
- Returns LinkOut providers and relevancy scores
|
||||
|
||||
**esummary** - Get document summaries:
|
||||
```python
|
||||
handle = Entrez.esummary(db="protein", id="15718680")
|
||||
summary = Entrez.read(handle)
|
||||
print(summary[0]['Title'])
|
||||
```
|
||||
|
||||
Returns quick overviews without full records.
|
||||
|
||||
**einfo** - Get database statistics:
|
||||
```python
|
||||
handle = Entrez.einfo(db="nucleotide")
|
||||
info = Entrez.read(handle)
|
||||
```
|
||||
|
||||
Provides field indices, term counts, update dates, and available links.
|
||||
|
||||
**epost** - Upload ID lists to server:
|
||||
```python
|
||||
handle = Entrez.epost("nucleotide", id="123456,789012")
|
||||
result = Entrez.read(handle)
|
||||
webenv = result["WebEnv"]
|
||||
query_key = result["QueryKey"]
|
||||
```
|
||||
|
||||
Useful for large queries split across multiple requests.
|
||||
|
||||
**espell** - Get spelling suggestions:
|
||||
```python
|
||||
handle = Entrez.espell(term="brest cancer")
|
||||
result = Entrez.read(handle)
|
||||
print(result["CorrectedQuery"]) # "breast cancer"
|
||||
```
|
||||
|
||||
**ecitmatch** - Convert citations to PubMed IDs:
|
||||
```python
|
||||
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
|
||||
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
|
||||
```
|
||||
|
||||
#### Data Processing Functions
|
||||
|
||||
**Entrez.read()** - Parse XML to Python dictionary:
|
||||
```python
|
||||
handle = Entrez.esearch(db="protein", term="insulin")
|
||||
record = Entrez.read(handle)
|
||||
```
|
||||
|
||||
**Entrez.parse()** - Generator for large XML results:
|
||||
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
    process(record)
```
|
||||
|
||||
#### Common Workflows
|
||||
|
||||
**Download sequences by accession:**
|
||||
```python
|
||||
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
|
||||
record = SeqIO.read(handle, "fasta")
|
||||
```
|
||||
|
||||
**Search and download multiple sequences:**
|
||||
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100")
search_results = Entrez.read(search_handle)

# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
    print(record.id)
```
|
||||
|
||||
**Use WebEnv for large queries:**
|
||||
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)

# Fetch in batches
batch_size = 500
count = len(large_id_list)  # total number of posted IDs
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=post_result["WebEnv"],
        query_key=post_result["QueryKey"]
    )
    # Process batch
```
|
||||
|
||||
### Bio.GenBank - GenBank Format Parsing
|
||||
|
||||
Low-level GenBank file parser (SeqIO is usually preferred).
|
||||
|
||||
### Bio.SwissProt - Swiss-Prot/UniProt Parsing
|
||||
|
||||
Parse Swiss-Prot and UniProtKB flat file format:
|
||||
```python
from Bio import SwissProt

with open("uniprot.dat") as handle:
    for record in SwissProt.parse(handle):
        print(record.entry_name, record.organism)
```
|
||||
|
||||
## Sequence Similarity Searches
|
||||
|
||||
### Bio.Blast - BLAST Interface
|
||||
|
||||
Tools for running BLAST searches and parsing results.
|
||||
|
||||
#### Running BLAST
|
||||
|
||||
**NCBI QBLAST (online):**
|
||||
```python
|
||||
from Bio.Blast import NCBIWWW
|
||||
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
|
||||
```
|
||||
|
||||
Parameters:
|
||||
- Program: blastn, blastp, blastx, tblastn, tblastx
|
||||
- Database: nt, nr, refseq_rna, pdb, etc.
|
||||
- Sequence: string or Seq object
|
||||
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`
|
||||
|
||||
**Local BLAST:**
|
||||
Run standalone BLAST from command line, then parse results.
|
||||
|
||||
#### Parsing BLAST Results
|
||||
|
||||
**XML format (recommended):**
|
||||
```python
from Bio.Blast import NCBIXML

with open("blast_results.xml") as result_handle:
    for blast_record in NCBIXML.parse(result_handle):
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:
                    print(f"Hit: {alignment.title}")
                    print(f"Length: {alignment.length}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identities: {hsp.identities}/{hsp.align_length}")
```
|
||||
|
||||
**Functions:**
|
||||
- `NCBIXML.read()`: Single query
|
||||
- `NCBIXML.parse()`: Multiple queries (generator)
|
||||
|
||||
**Key Record Attributes:**
|
||||
- `alignments`: List of matching sequences
|
||||
- `query`: Query sequence ID
|
||||
- `query_length`: Length of query
|
||||
|
||||
**Alignment Attributes:**
|
||||
- `title`: Description of hit
|
||||
- `length`: Length of hit sequence
|
||||
- `hsps`: High-scoring segment pairs
|
||||
|
||||
**HSP Attributes:**
|
||||
- `expect`: E-value
|
||||
- `score`: Raw alignment score
- `bits`: Bit score
|
||||
- `identities`: Number of identical residues
|
||||
- `positives`: Number of positive scoring matches
|
||||
- `gaps`: Number of gaps
|
||||
- `align_length`: Length of alignment
|
||||
- `query`: Aligned query sequence
|
||||
- `match`: Match indicators
|
||||
- `sbjct`: Aligned subject sequence
|
||||
- `query_start`, `query_end`: Query coordinates
|
||||
- `sbjct_start`, `sbjct_end`: Subject coordinates
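The HSP attributes above are enough to derive common filtering metrics. The sketch below defines two hypothetical helpers, `percent_identity` and `query_coverage` (neither is part of Biopython), and exercises them on a stand-in object that carries the same attribute names as a parsed HSP, so it runs without a BLAST result file.

```python
from types import SimpleNamespace


def percent_identity(hsp):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * hsp.identities / hsp.align_length


def query_coverage(hsp, query_length):
    """Percentage of the query spanned by this HSP (1-based, inclusive coordinates)."""
    return 100.0 * (hsp.query_end - hsp.query_start + 1) / query_length


# Stand-in for an HSP parsed by NCBIXML; the values are invented
fake_hsp = SimpleNamespace(
    identities=45, align_length=50, query_start=11, query_end=60, expect=1e-20
)

identity = percent_identity(fake_hsp)
coverage = query_coverage(fake_hsp, query_length=100)
print(identity, coverage)
```

With real results, the same helpers can be called on each `hsp` in `alignment.hsps`, passing `blast_record.query_length` as the query length.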
|
||||
|
||||
#### Common BLAST Workflows
|
||||
|
||||
**Find homologs:**
|
||||
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
    out.write(result.read())
```
|
||||
|
||||
**Filter results by criteria:**
|
||||
```python
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities/hsp.align_length > 0.5:
            # Process high-quality hits
            pass
```
|
||||
|
||||
### Bio.SearchIO - Unified Search Results Parser
|
||||
|
||||
Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).
|
||||
|
||||
**Key Functions:**
|
||||
- `read()`: Parse single query
|
||||
- `parse()`: Parse multiple queries (generator)
|
||||
- `write()`: Write results to file
|
||||
- `convert()`: Convert between formats
|
||||
|
||||
**Supported Tools:**
|
||||
- BLAST (XML, tabular, plain text)
|
||||
- HMMER (hmmscan, hmmsearch, phmmer)
|
||||
- BLAT
|
||||
- FASTA
|
||||
- InterProScan
|
||||
- Exonerate
|
||||
|
||||
**Example:**
|
||||
```python
from Bio import SearchIO

results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
    for hit in result:
        if hit.hsps[0].evalue < 0.001:
            print(hit.id, hit.hsps[0].evalue)
```
|
||||
|
||||
## Local Database Management
|
||||
|
||||
### BioSQL - SQL Database Interface
|
||||
|
||||
Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).
|
||||
|
||||
**Features:**
|
||||
- Store SeqRecord objects with annotations
|
||||
- Efficient querying and retrieval
|
||||
- Cross-reference sequences
|
||||
- Track relationships between sequences
|
||||
|
||||
**Example:**
|
||||
```python
from Bio import SeqIO
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]

# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))
server.commit()  # persist the loaded records

# Query
seq = db.lookup(accession="NC_005816")
```
|
||||
612
scientific-packages/biopython/references/specialized_modules.md
Normal file
@@ -0,0 +1,612 @@
|
||||
# BioPython Specialized Analysis Modules
|
||||
|
||||
This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.
|
||||
|
||||
## Structural Bioinformatics
|
||||
|
||||
### Bio.PDB - Protein Structure Analysis
|
||||
|
||||
Comprehensive tools for handling macromolecular crystal structures.
|
||||
|
||||
#### Structure Hierarchy
|
||||
|
||||
PDB structures are organized hierarchically:
|
||||
- **Structure** → Models → Chains → Residues → Atoms
|
||||
|
||||
```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.coord)  # xyz coordinates
```
|
||||
|
||||
#### Parsing Structure Files
|
||||
|
||||
**PDB format:**
|
||||
```python
|
||||
from Bio.PDB import PDBParser
|
||||
parser = PDBParser(QUIET=True)
|
||||
structure = parser.get_structure("myprotein", "structure.pdb")
|
||||
```
|
||||
|
||||
**mmCIF format:**
|
||||
```python
|
||||
from Bio.PDB import MMCIFParser
|
||||
parser = MMCIFParser(QUIET=True)
|
||||
structure = parser.get_structure("myprotein", "structure.cif")
|
||||
```
|
||||
|
||||
**Fast mmCIF parser:**
|
||||
```python
|
||||
from Bio.PDB import FastMMCIFParser
|
||||
parser = FastMMCIFParser(QUIET=True)
|
||||
structure = parser.get_structure("myprotein", "structure.cif")
|
||||
```
|
||||
|
||||
**MMTF format:**
|
||||
```python
# Requires the optional mmtf-python package
from Bio.PDB.mmtf import MMTFParser

structure = MMTFParser.get_structure("structure.mmtf")
```
|
||||
|
||||
**Binary CIF:**
|
||||
```python
from Bio.PDB.binary_cif import BinaryCIFParser

parser = BinaryCIFParser()
# get_structure() takes a structure ID followed by the file path
structure = parser.get_structure("myprotein", "structure.bcif")
```
|
||||
|
||||
#### Downloading Structures
|
||||
|
||||
```python
from Bio.PDB import PDBList

pdbl = PDBList()

# Download specific structure
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")

# Download all obsolete entries (a large transfer); the target directory
# is set with the obsolete_pdb argument of PDBList()
pdbl.download_obsolete_entries()

# Update local PDB mirror
pdbl.update_pdb()
```
|
||||
|
||||
#### Structure Selection and Filtering
|
||||
|
||||
```python
# Select specific chains
chain_A = structure[0]['A']

# Select specific residues
residue_10 = chain_A[10]

# Select specific atoms
ca_atom = residue_10['CA']

# Iterate over specific atom types
for atom in structure.get_atoms():
    if atom.name == 'CA':  # Alpha carbons only
        print(atom.coord)
```
|
||||
|
||||
**Structure selectors:**
|
||||
```python
from Bio.PDB.Polypeptide import is_aa

# Filter by residue type
for residue in structure.get_residues():
    if is_aa(residue):
        print(f"Amino acid: {residue.resname}")
```
|
||||
|
||||
#### Secondary Structure Analysis
|
||||
|
||||
**DSSP integration:**
|
||||
```python
from Bio.PDB import DSSP

# Requires DSSP program installed
model = structure[0]
dssp = DSSP(model, "structure.pdb")

# Access secondary structure
for key in dssp.keys():
    secondary_structure = dssp[key][2]
    accessibility = dssp[key][3]  # relative solvent accessibility
    print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
```
|
||||
|
||||
DSSP codes:
|
||||
- H: Alpha helix
|
||||
- B: Beta bridge
|
||||
- E: Extended strand (beta sheet)
|
||||
- G: 3-10 helix
|
||||
- I: Pi helix
|
||||
- T: Turn
|
||||
- S: Bend
|
||||
- -: Coil
|
||||
|
||||
#### Solvent Accessibility
|
||||
|
||||
**Shrake-Rupley algorithm:**
|
||||
```python
from Bio.PDB.SASA import ShrakeRupley

sr = ShrakeRupley()
sr.compute(structure, level="R")  # R=residue, A=atom, C=chain, M=model, S=structure

for residue in structure.get_residues():
    print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Ų")
```
|
||||
|
||||
**NACCESS wrapper:**
|
||||
```python
from Bio.PDB.NACCESS import NACCESS

# Requires NACCESS program; the first argument is a model, not a file name
model = structure[0]
naccess = NACCESS(model, pdb_file="structure.pdb")
for residue, data in naccess:
    print(f"Residue {residue.id[1]}: {data['all_atoms_abs']} Ų")
```
|
||||
|
||||
**Half-sphere exposure:**
|
||||
```python
from Bio.PDB.HSExposure import HSExposureCA

# Computed from the model's CA coordinates; no external program is needed
model = structure[0]
hse = HSExposureCA(model)

for chain in model:
    for residue in chain:
        if 'EXP_HSE_A_U' in residue.xtra:
            hse_up = residue.xtra['EXP_HSE_A_U']
            hse_down = residue.xtra['EXP_HSE_A_D']
```
|
||||
|
||||
#### Structural Alignment and Superimposition
|
||||
|
||||
**Standard superimposition:**
|
||||
```python
|
||||
from Bio.PDB import Superimposer
|
||||
|
||||
sup = Superimposer()
|
||||
sup.set_atoms(ref_atoms, alt_atoms) # Lists of atoms to align
|
||||
sup.apply(structure2.get_atoms()) # Apply transformation
|
||||
|
||||
print(f"RMSD: {sup.rms}")
|
||||
print(f"Rotation matrix: {sup.rotran[0]}")
|
||||
print(f"Translation vector: {sup.rotran[1]}")
|
||||
```
|
||||
|
||||
**QCP (Quaternion Characteristic Polynomial) method:**
|
||||
```python
|
||||
from Bio.PDB.qcprot import QCPSuperimposer
|
||||
|
||||
qcp = QCPSuperimposer()
|
||||
qcp.set(ref_coords, alt_coords)
|
||||
qcp.run()
|
||||
print(f"RMSD: {qcp.get_rms()}")
|
||||
```
|
||||
|
||||
#### Geometric Calculations
|
||||
|
||||
**Distances and angles:**
|
||||
```python
# Distance between atoms
dist = atom1 - atom2  # Subtracting two Atom objects returns their distance

# Angle between three atoms (radians); calc_angle expects Vector objects
from Bio.PDB import calc_angle
angle = calc_angle(atom1.get_vector(), atom2.get_vector(), atom3.get_vector())

# Dihedral angle (radians)
from Bio.PDB import calc_dihedral
dihedral = calc_dihedral(
    atom1.get_vector(), atom2.get_vector(), atom3.get_vector(), atom4.get_vector()
)
```
|
||||
|
||||
**Vector operations:**
|
||||
```python
|
||||
from Bio.PDB.vectors import Vector
|
||||
|
||||
v1 = Vector(atom1.coord)
|
||||
v2 = Vector(atom2.coord)
|
||||
|
||||
# Vector operations
|
||||
v3 = v1 + v2
|
||||
v4 = v1 - v2
|
||||
dot_product = v1 * v2
|
||||
cross_product = v1 ** v2
|
||||
magnitude = v1.norm()
|
||||
normalized = v1.normalized()
|
||||
```
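The geometry helpers can be checked on coordinates whose answers are known in advance. The four points below are invented so that the bond angle is a right angle and the dihedral has a magnitude of 90 degrees; both functions return radians.

```python
import math

from Bio.PDB.vectors import Vector, calc_angle, calc_dihedral

# Invented points standing in for four consecutive atoms
a = Vector(1.0, 0.0, 0.0)
b = Vector(0.0, 0.0, 0.0)
c = Vector(0.0, 1.0, 0.0)
d = Vector(0.0, 1.0, 1.0)

angle_deg = math.degrees(calc_angle(a, b, c))           # angle at b
dihedral_deg = math.degrees(calc_dihedral(a, b, c, d))  # torsion about b-c
distance = (a - c).norm()                               # length of the a-c vector

print(angle_deg, dihedral_deg, distance)
```

For real atoms, `atom.get_vector()` yields the `Vector` that these functions expect.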
|
||||
|
||||
#### Internal Coordinates
|
||||
|
||||
Advanced residue geometry representation:
|
||||
```python
from Bio.PDB import internal_coords

# Enable internal coordinates
structure.atom_to_internal_coordinates()

# Access phi, psi angles
for residue in structure.get_residues():
    if residue.internal_coord:
        print(f"Phi: {residue.internal_coord.get_angle('phi')}")
        print(f"Psi: {residue.internal_coord.get_angle('psi')}")
```
|
||||
|
||||
#### Writing Structures
|
||||
|
||||
```python
from Bio.PDB import PDBIO, Select

io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Save specific selection via a Select subclass
class ChainSelector(Select):
    def __init__(self, chain_id):
        self.chain_id = chain_id

    def accept_chain(self, chain):
        return chain.id == self.chain_id

io.save("chain_A.pdb", select=ChainSelector("A"))
```
|
||||
|
||||
### Bio.SCOP - SCOP Database
|
||||
|
||||
Access to Structural Classification of Proteins database.
|
||||
|
||||
### Bio.KEGG - Pathway Analysis
|
||||
|
||||
Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases:
|
||||
|
||||
**Capabilities:**
|
||||
- Access pathway maps
|
||||
- Retrieve enzyme data
|
||||
- Get compound information
|
||||
- Query orthology relationships
|
||||
|
||||
## Phylogenetics
|
||||
|
||||
### Bio.Phylo - Phylogenetic Tree Analysis
|
||||
|
||||
Comprehensive phylogenetic tree manipulation and analysis.
|
||||
|
||||
#### Reading and Writing Trees
|
||||
|
||||
**Supported formats:**
|
||||
- Newick: Simple, widely-used format
|
||||
- NEXUS: Rich metadata format
|
||||
- PhyloXML: XML-based with extensive annotations
|
||||
- NeXML: Modern XML standard
|
||||
|
||||
```python
|
||||
from Bio import Phylo
|
||||
|
||||
# Read tree
|
||||
tree = Phylo.read("tree.nwk", "newick")
|
||||
|
||||
# Read multiple trees
|
||||
trees = list(Phylo.parse("trees.nex", "nexus"))
|
||||
|
||||
# Write tree
|
||||
Phylo.write(tree, "output.nwk", "newick")
|
||||
```
|
||||
|
||||
#### Tree Visualization
|
||||
|
||||
**ASCII visualization:**
|
||||
```python
|
||||
Phylo.draw_ascii(tree)
|
||||
```
|
||||
|
||||
**Matplotlib plotting:**
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
Phylo.draw(tree)
|
||||
plt.show()
|
||||
|
||||
# With customization
|
||||
fig, ax = plt.subplots(figsize=(10, 8))
|
||||
Phylo.draw(tree, axes=ax, do_show=False)
|
||||
ax.set_title("My Phylogenetic Tree")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
#### Tree Navigation and Manipulation
|
||||
|
||||
**Find clades:**
|
||||
```python
|
||||
# Get all terminal nodes (leaves)
|
||||
terminals = tree.get_terminals()
|
||||
|
||||
# Get all nonterminal nodes
|
||||
nonterminals = tree.get_nonterminals()
|
||||
|
||||
# Find specific clade
|
||||
target = tree.find_any(name="Species_A")
|
||||
|
||||
# Find all matching clades
|
||||
matches = tree.find_clades(terminal=True)
|
||||
```
|
||||
|
||||
**Tree properties:**
|
||||
```python
|
||||
# Count terminals
|
||||
num_species = tree.count_terminals()
|
||||
|
||||
# Get total branch length
|
||||
total_length = tree.total_branch_length()
|
||||
|
||||
# Check if tree is bifurcating
|
||||
is_bifurcating = tree.is_bifurcating()
|
||||
|
||||
# Get maximum distance from root
|
||||
max_dist = max(tree.depths().values())
|
||||
```
|
||||
|
||||
**Tree modification:**
|
||||
```python
|
||||
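# Prune tree to specific taxa (prune() removes one terminal at a time)
keep_taxa = {"Species_A", "Species_B", "Species_C"}
for leaf in tree.get_terminals():
    if leaf.name not in keep_taxa:
        tree.prune(leaf)

# Collapse short branches (skip clades without a branch length)
tree.collapse_all(lambda c: c.branch_length is not None and c.branch_length < 0.01)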
|
||||
|
||||
# Ladderize (sort branches)
|
||||
tree.ladderize()
|
||||
|
||||
# Root tree at midpoint
|
||||
tree.root_at_midpoint()
|
||||
|
||||
# Root at specific clade
|
||||
outgroup = tree.find_any(name="Outgroup_species")
|
||||
tree.root_with_outgroup(outgroup)
|
||||
```
|
||||
|
||||
**Calculate distances:**
|
||||
```python
|
||||
# Distance between two clades
|
||||
dist = tree.distance(clade1, clade2)
|
||||
|
||||
# Distance from root
|
||||
root_dist = tree.distance(tree.root, terminal_clade)
|
||||
```
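The navigation, property, and distance calls above can be tried on a small tree parsed from an in-memory Newick string. The topology and branch lengths are invented so the expected values are easy to verify by hand.

```python
from io import StringIO

from Bio import Phylo

# Invented three-taxon tree: A and B are sisters, C is the outgroup
tree = Phylo.read(StringIO("((A:1,B:2):1,C:3);"), "newick")

leaf_a = tree.find_any(name="A")
leaf_b = tree.find_any(name="B")

n_leaves = tree.count_terminals()
total_length = tree.total_branch_length()        # 1 + 2 + 1 + 3
a_to_b = tree.distance(leaf_a, leaf_b)           # 1 + 2
root_to_b = tree.distance(tree.root, leaf_b)     # 1 + 2
deepest = max(tree.depths().values())            # farthest clade from the root

print(n_leaves, total_length, a_to_b, root_to_b, deepest)
```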
|
||||
|
||||
#### Tree Construction
|
||||
|
||||
**Distance-based methods:**
|
||||
```python
|
||||
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
|
||||
from Bio import AlignIO
|
||||
|
||||
# Load alignment
|
||||
aln = AlignIO.read("alignment.fasta", "fasta")
|
||||
|
||||
# Calculate distance matrix
|
||||
calculator = DistanceCalculator('identity')
|
||||
dm = calculator.get_distance(aln)
|
||||
|
||||
# Construct tree using UPGMA
|
||||
constructor = DistanceTreeConstructor()
|
||||
tree_upgma = constructor.upgma(dm)
|
||||
|
||||
# Or using Neighbor-Joining
|
||||
tree_nj = constructor.nj(dm)
|
||||
```
|
||||
|
||||
**Parsimony method:**
|
||||
```python
|
||||
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
|
||||
|
||||
scorer = ParsimonyScorer()
|
||||
searcher = NNITreeSearcher(scorer)
|
||||
tree = searcher.search(starting_tree, alignment)
|
||||
```
|
||||
|
||||
**Distance calculators:**
|
||||
- 'identity': Simple identity scoring
|
||||
- 'blastn': BLAST nucleotide scoring
|
||||
- 'blastp': BLAST protein scoring
|
||||
- 'dnafull': EMBOSS DNA scoring matrix
|
||||
- 'blosum62': BLOSUM62 protein matrix
|
||||
- 'pam250': PAM250 protein matrix
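An end-to-end sketch of the distance-based workflow on a tiny in-memory alignment. The four sequences are invented, and the `identity` model is used because it needs no substitution matrix.

```python
from io import StringIO

from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Invented alignment: a/b differ at one site, as do c/d
fasta = ">a\nACGTACGT\n>b\nACGTACGA\n>c\nACGTTTTT\n>d\nACGTTTTA\n"
aln = AlignIO.read(StringIO(fasta), "fasta")

# Identity distance = fraction of mismatching columns
dm = DistanceCalculator("identity").get_distance(aln)
print(dm)

constructor = DistanceTreeConstructor()
tree_nj = constructor.nj(dm)
tree_upgma = constructor.upgma(dm)

leaf_names = sorted(leaf.name for leaf in tree_nj.get_terminals())
print(leaf_names)
```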
|
||||
|
||||
#### Consensus Trees
|
||||
|
||||
```python
from Bio.Phylo.Consensus import majority_consensus, strict_consensus

# Strict consensus
consensus_strict = strict_consensus(trees)

# Majority rule consensus
consensus_majority = majority_consensus(trees, cutoff=0.5)

# Bootstrap consensus: resample the alignment, build one tree per replicate,
# then summarize the replicates with a consensus method
from Bio.Phylo.Consensus import bootstrap_consensus
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
bootstrap_tree = bootstrap_consensus(aln, 100, constructor, majority_consensus)
```
|
||||
|
||||
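#### External Tool Wrappers

These wrappers build on `Bio.Application`, which is deprecated in recent Biopython releases in favor of constructing the command line directly and running it with `subprocess`.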
|
||||
|
||||
**PhyML:**
|
||||
```python
|
||||
from Bio.Phylo.Applications import PhymlCommandline
|
||||
|
||||
cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
|
||||
stdout, stderr = cmd()
|
||||
tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
|
||||
```
|
||||
|
||||
**RAxML:**
|
||||
```python
from Bio.Phylo.Applications import RaxmlCommandline

cmd = RaxmlCommandline(
    sequences="alignment.phy",
    model="GTRGAMMA",
    name="mytree",
    parsimony_seed=12345
)
stdout, stderr = cmd()
```
|
||||
|
||||
**FastTree:**
|
||||
```python
|
||||
from Bio.Phylo.Applications import FastTreeCommandline
|
||||
|
||||
cmd = FastTreeCommandline(input="alignment.fasta", out="tree.nwk", gtr=True, gamma=True)
|
||||
stdout, stderr = cmd()
|
||||
```
|
||||
|
||||
### Bio.Phylo.PAML - Evolutionary Analysis
|
||||
|
||||
Interface to PAML (Phylogenetic Analysis by Maximum Likelihood):
|
||||
|
||||
**CODEML - Codon-based analysis:**
|
||||
```python
from Bio.Phylo.PAML import codeml

cml = codeml.Codeml()
cml.alignment = "alignment.phy"
cml.tree = "tree.nwk"
cml.out_file = "results.out"
cml.working_dir = "./paml_wd"

# Set parameters
cml.set_options(
    seqtype=1,          # Codon sequences
    model=0,            # One omega ratio
    NSsites=[0, 1, 2],  # Test different models
    CodonFreq=2,        # F3x4 codon frequencies
)

results = cml.run()
```
|
||||
|
||||
**BaseML - Nucleotide-based analysis:**
|
||||
```python
|
||||
from Bio.Phylo.PAML import baseml
|
||||
|
||||
bml = baseml.Baseml()
|
||||
bml.alignment = "alignment.phy"
|
||||
bml.tree = "tree.nwk"
|
||||
results = bml.run()
|
||||
```
|
||||
|
||||
**YN00 - Yang-Nielsen method:**
|
||||
```python
|
||||
from Bio.Phylo.PAML import yn00
|
||||
|
||||
yn = yn00.Yn00()
|
||||
yn.alignment = "alignment.phy"
|
||||
results = yn.run()
|
||||
```
|
||||
|
||||
## Population Genetics
|
||||
|
||||
### Bio.PopGen - Population Genetics Analysis
|
||||
|
||||
Tools for population-level genetic analysis.
|
||||
|
||||
**Capabilities:**
|
||||
- Allele frequency calculations
|
||||
- Hardy-Weinberg equilibrium testing
|
||||
- Linkage disequilibrium analysis
|
||||
- F-statistics (FST, FIS, FIT)
|
||||
- Tajima's D
|
||||
- Population structure analysis
|
||||
|
||||
## Clustering and Machine Learning
|
||||
|
||||
### Bio.Cluster - Clustering Algorithms
|
||||
|
||||
Statistical clustering for gene expression and other biological data:
|
||||
|
||||
**Hierarchical clustering:**
|
||||
```python
|
||||
from Bio.Cluster import treecluster
|
||||
|
||||
tree = treecluster(data, method='a', dist='e')
|
||||
# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
|
||||
# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
|
||||
```
|
||||
|
||||
**k-means clustering:**
|
||||
```python
|
||||
from Bio.Cluster import kcluster
|
||||
|
||||
clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
|
||||
```
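A small worked example of `kcluster` on data with an obvious two-group structure. The four points are invented; rows are the items being clustered and columns are measurements.

```python
import numpy as np

from Bio.Cluster import kcluster

# Invented data: two tight groups far apart from each other
data = np.array(
    [
        [0.0, 0.1],
        [0.1, 0.0],
        [5.0, 5.1],
        [5.1, 5.0],
    ]
)

# npass repeats the algorithm from different random starts and keeps the best run
clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)

print(clusterid)  # cluster label per row; label numbering is arbitrary
print(error)      # within-cluster sum of distances for the best solution
print(nfound)     # how many passes reached that best solution
```

Because label numbering is arbitrary, compare labels between rows rather than relying on specific values.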
|
||||
|
||||
**Self-Organizing Maps (SOM):**
|
||||
```python
|
||||
from Bio.Cluster import somcluster
|
||||
|
||||
clusterid, celldata = somcluster(data, nxgrid=3, nygrid=3)
|
||||
```
|
||||
|
||||
**Principal Component Analysis:**
|
||||
```python
|
||||
from Bio.Cluster import pca
|
||||
|
||||
columnmean, coordinates, components, eigenvalues = pca(data)
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
### Bio.Graphics - Genomic Visualization
|
||||
|
||||
Tools for creating publication-quality biological graphics.
|
||||
|
||||
**GenomeDiagram - Circular and linear genome maps:**
|
||||
```python
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO

record = SeqIO.read("genome.gb", "genbank")

gd_diagram = GenomeDiagram.Diagram("Genome Map")
gd_track = gd_diagram.new_track(1, greytrack=True)
gd_feature_set = gd_track.new_set()

# Add features
for feature in record.features:
    if feature.type == "gene":
        gd_feature_set.add_feature(feature, color="blue", label=True)

gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
gd_diagram.write("genome_map.pdf", "PDF")
```
|
||||
|
||||
**Chromosomes - Chromosome visualization:**
|
||||
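```python
from reportlab.lib import colors
from Bio.Graphics.BasicChromosome import Chromosome, ChromosomeSegment

chromosome = Chromosome("Chromosome 1")
# Chromosomes are assembled from segment objects rather than gene coordinates
for length, color in [(1000, colors.red), (1500, colors.blue)]:
    segment = ChromosomeSegment()
    segment.scale = length       # relative length of this segment
    segment.fill_color = color
    chromosome.add(segment)
```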
|
||||
|
||||
## Phenotype Analysis
|
||||
|
||||
### Bio.phenotype - Phenotypic Microarray Analysis
|
||||
|
||||
Tools for analyzing phenotypic microarray data (e.g., Biolog plates):
|
||||
|
||||
**Capabilities:**
|
||||
- Parse PM plate data
|
||||
- Growth curve analysis
|
||||
- Compare phenotypic profiles
|
||||
- Calculate similarity metrics
|
||||