Add PyOpenms

This commit is contained in:
Timothy Kassis
2025-10-20 17:26:45 -07:00
parent df81c94a7a
commit c7296b2661
6 changed files with 2445 additions and 2 deletions

View File

@@ -7,7 +7,7 @@
},
"metadata": {
"description": "Claude scientific skills from K-Dense Inc",
"version": "1.18.3"
"version": "1.19.0"
},
"plugins": [
{
@@ -39,6 +39,7 @@
"./scientific-packages/pymatgen",
"./scientific-packages/pymc",
"./scientific-packages/pymoo",
"./scientific-packages/pyopenms",
"./scientific-packages/pytdc",
"./scientific-packages/pytorch-lightning",
"./scientific-packages/rdkit",

View File

@@ -66,6 +66,9 @@ After installing the plugin, you can use the skill by just mentioning it. Additi
- **PyTDC** - Therapeutics Data Commons for drug discovery datasets and benchmarks
- **RDKit** - Cheminformatics toolkit for molecular I/O, descriptors, fingerprints, and SMARTS
**Proteomics & Mass Spectrometry:**
- **pyOpenMS** - Comprehensive mass spectrometry data analysis for proteomics and metabolomics (LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines like Comet, Mascot, MSGF+)
**Machine Learning & Deep Learning:**
- **PyMC** - Bayesian statistical modeling and probabilistic programming
- **PyMOO** - Multi-objective optimization with evolutionary algorithms
@@ -157,7 +160,6 @@ After installing the plugin, you can use the skill by just mentioning it. Additi
### Proteomics & Mass Spectrometry
- **pyteomics** - Mass spectrometry data analysis
- **pyOpenMS** - OpenMS Python bindings for proteomics
- **matchms** - Processing and similarity matching of mass spectrometry data
- **MSstats** - Statistical analysis of quantitative proteomics

View File

@@ -0,0 +1,522 @@
---
name: pyopenms
description: Toolkit for mass spectrometry data analysis with pyOpenMS, supporting proteomics and metabolomics workflows including LC-MS/MS data processing, peptide identification, feature detection, quantification, and chemical calculations. Use this skill when: (1) Working with mass spectrometry file formats (mzML, mzXML, FASTA, mzTab, mzIdentML, TraML, pepXML/protXML) and need to read, write, or convert between formats; (2) Processing raw LC-MS/MS data including spectral smoothing, peak picking, noise filtering, and signal processing; (3) Performing proteomics workflows such as peptide digestion simulation, theoretical fragmentation, modification analysis, and protein identification post-processing; (4) Conducting metabolomics analysis including feature detection, adduct annotation, isotope pattern matching, and small molecule identification; (5) Implementing quantitative proteomics pipelines with feature detection, alignment across samples, and statistical analysis; (6) Calculating chemical properties including molecular formulas, isotopic distributions, amino acid properties, and peptide masses; (7) Integrating with search engines (Comet, Mascot, MSGF+) and post-processing tools (Percolator, MSstats); (8) Building custom MS data analysis workflows that require low-level access to spectra, chromatograms, and peak data; (9) Performing quality control on MS data including TIC/BPC calculation, retention time analysis, and data validation; (10) When you need Python-based alternatives to vendor software for MS data processing and analysis.
---
# pyOpenMS
## Overview
pyOpenMS is an open-source Python library providing comprehensive tools for mass spectrometry data analysis in proteomics and metabolomics research. It offers Python bindings to the OpenMS C++ library, enabling efficient processing of LC-MS/MS data, peptide identification, feature detection, quantification, and integration with common proteomics tools like Comet, Mascot, MSGF+, Percolator, and MSstats.
Use this skill when working with mass spectrometry data analysis tasks, processing proteomics or metabolomics datasets, or implementing computational workflows for biomolecular identification and quantification.
## Core Capabilities
### 1. File I/O and Data Import/Export
Handle diverse mass spectrometry file formats efficiently:
**Supported Formats:**
- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
- **FASTA**: Protein/peptide sequence databases
- **mzTab**: Standardized reporting format for identification and quantification
- **mzIdentML**: Peptide and protein identification data
- **TraML**: Transition lists for targeted experiments
- **pepXML/protXML**: Search engine results
**Reading mzML Files:**
```python
import pyopenms as oms
# Load MS data
exp = oms.MSExperiment()
oms.MzMLFile().load("input_data.mzML", exp)
# Access basic information
print(f"Number of spectra: {exp.getNrSpectra()}")
print(f"Number of chromatograms: {exp.getNrChromatograms()}")
```
**Writing mzML Files:**
```python
# Save processed data
oms.MzMLFile().store("output_data.mzML", exp)
```
**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
### 2. MS Data Structures and Manipulation
Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
**MSSpectrum** - Individual mass spectrum:
```python
# Create spectrum with metadata
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2) # Retention time in seconds
spectrum.setMSLevel(2) # MS2 spectrum
# Set peak data (m/z, intensity arrays)
mz_array = [100.5, 200.3, 300.7, 400.2]
intensity_array = [1000, 5000, 3000, 2000]
spectrum.set_peaks((mz_array, intensity_array))
# Add precursor information for MS2
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])
```
**MSExperiment** - Complete LC-MS/MS run:
```python
# Create experiment and add spectra
exp = oms.MSExperiment()
exp.addSpectrum(spectrum)
# Access spectra
first_spectrum = exp.getSpectrum(0)
for spec in exp:
print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
```
**MSChromatogram** - Extracted ion chromatogram:
```python
# Create chromatogram
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000])) # RT, intensity
exp.addChromatogram(chrom)
```
**Efficient Peak Access:**
```python
# Get peaks as numpy arrays for fast processing
mz_array, intensity_array = spectrum.get_peaks()
# Modify and set back
intensity_array *= 2 # Double all intensities
spectrum.set_peaks((mz_array, intensity_array))
```
### 3. Chemistry and Peptide Handling
Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
**Molecular Formulas and Mass Calculations:**
```python
# Create empirical formula
formula = oms.EmpiricalFormula("C6H12O6") # Glucose
print(f"Monoisotopic mass: {formula.getMonoWeight()}")
print(f"Average mass: {formula.getAverageWeight()}")
# Formula arithmetic
water = oms.EmpiricalFormula("H2O")
dehydrated = formula - water
# Isotope-specific formulas
heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
```
**Isotopic Distributions:**
```python
# Generate coarse isotope pattern (unit mass resolution)
coarse_gen = oms.CoarseIsotopePatternGenerator()
pattern = coarse_gen.run(formula)
# Generate fine structure (high resolution)
fine_gen = oms.FineIsotopePatternGenerator(0.01) # 0.01 Da resolution
fine_pattern = fine_gen.run(formula)
```
**Amino Acids and Residues:**
```python
# Access residue information
res_db = oms.ResidueDB()
leucine = res_db.getResidue("Leucine")
print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
print(f"L formula: {leucine.getFormula()}")
print(f"L pKa: {leucine.getPka()}")
```
**Peptide Sequences:**
```python
# Create peptide sequence
peptide = oms.AASequence.fromString("PEPTIDE")
print(f"Peptide mass: {peptide.getMonoWeight()}")
print(f"Formula: {peptide.getFormula()}")
# Add modifications
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
print(f"Modified mass: {modified.getMonoWeight()}")
# Theoretical fragmentation
ions = []
for i in range(1, peptide.size()):
b_ion = peptide.getPrefix(i)
y_ion = peptide.getSuffix(i)
ions.append(('b', i, b_ion.getMonoWeight()))
ions.append(('y', i, y_ion.getMonoWeight()))
```
**Protein Digestion:**
```python
# Enzymatic digestion
dig = oms.ProteaseDigestion()
dig.setEnzyme("Trypsin")
dig.setMissedCleavages(2)
protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein_seq, peptides)
for pep in peptides:
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
```
**Modifications:**
```python
# Access modification database
mod_db = oms.ModificationsDB()
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
print(f"Residues: {oxidation.getResidues()}")
```
### 4. Signal Processing and Filtering
Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
**Spectral Smoothing:**
```python
# Gaussian smoothing
gauss_filter = oms.GaussFilter()
params = gauss_filter.getParameters()
params.setValue("gaussian_width", 0.2)
gauss_filter.setParameters(params)
gauss_filter.filterExperiment(exp)
# Savitzky-Golay filter
sg_filter = oms.SavitzkyGolayFilter()
sg_filter.filterExperiment(exp)
```
**Peak Filtering:**
```python
# Keep only N largest peaks per spectrum
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100) # Keep top 100 peaks
n_largest.setParameters(params)
n_largest.filterExperiment(exp)
# Threshold filtering
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0) # Remove peaks below 1000 intensity
threshold_filter.setParameters(params)
threshold_filter.filterExperiment(exp)
# Window-based filtering
window_filter = oms.WindowMower()
params = window_filter.getParameters()
params.setValue("windowsize", 50.0) # 50 m/z windows
params.setValue("peakcount", 10) # Keep 10 highest per window
window_filter.setParameters(params)
window_filter.filterExperiment(exp)
```
**Spectrum Normalization:**
```python
normalizer = oms.Normalizer()
normalizer.filterExperiment(exp)
```
**MS Level Filtering:**
```python
# Keep only MS2 spectra
exp.filterMSLevel(2)
# Filter by retention time range
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
# Filter by m/z range
exp.filterMZ(400.0, 1500.0) # Keep m/z between 400-1500
```
### 5. Feature Detection and Quantification
Detect and quantify features in LC-MS data:
**Peak Picking (Centroiding):**
```python
# Convert profile data to centroid
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0)
picker.setParameters(params)
exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)
```
**Feature Detection:**
```python
# Detect features across LC-MS runs
feature_finder = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
feature_finder.run(exp, features, params)
print(f"Found {features.size()} features")
for feature in features:
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
f"Intensity: {feature.getIntensity():.0f}")
```
**Feature Linking (Map Alignment):**
```python
# Link features across multiple samples
feature_grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()
# Provide multiple feature maps from different samples
feature_maps = [features1, features2, features3]
feature_grouper.group(feature_maps, consensus_map)
```
### 6. Peptide Identification Workflows
Integrate with search engines and process identification results:
**Database Searching:**
```python
# Prepare parameters for search engine
params = oms.Param()
params.setValue("database", "uniprot_human.fasta")
params.setValue("precursor_mass_tolerance", 10.0) # ppm
params.setValue("fragment_mass_tolerance", 0.5) # Da
params.setValue("enzyme", "Trypsin")
params.setValue("missed_cleavages", 2)
# Variable modifications
params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
# Fixed modifications
params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
```
**FDR Control:**
```python
# False discovery rate estimation
fdr = oms.FalseDiscoveryRate()
fdr_threshold = 0.01 # 1% FDR
# Apply to peptide identifications
protein_ids = []
peptide_ids = []
oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
fdr.apply(protein_ids, peptide_ids)
```
### 7. Metabolomics Workflows
Analyze small molecule data:
**Adduct Detection:**
```python
# Common metabolite adducts
adducts = ["[M+H]+", "[M+Na]+", "[M+K]+", "[M-H]-", "[M+Cl]-"]
# Feature annotation with adducts
for feature in features:
mz = feature.getMZ()
# Calculate neutral mass for each adduct hypothesis
for adduct in adducts:
# Annotation logic
pass
```
**Isotope Pattern Matching:**
```python
# Compare experimental to theoretical isotope patterns
experimental_pattern = [] # Extract from feature
theoretical = coarse_gen.run(formula)
# Calculate similarity score
similarity = compare_isotope_patterns(experimental_pattern, theoretical)
```
### 8. Quality Control and Visualization
Monitor data quality and visualize results:
**Basic Statistics:**
```python
# Calculate TIC (Total Ion Current)
tic_values = []
rt_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
tic = sum(spectrum.get_peaks()[1]) # Sum intensities
tic_values.append(tic)
rt_values.append(spectrum.getRT())
# Base peak chromatogram
bpc_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
bpc_values.append(max_intensity)
```
**Plotting (with pyopenms.plotting or matplotlib):**
```python
import matplotlib.pyplot as plt
# Plot TIC
plt.figure(figsize=(10, 4))
plt.plot(rt_values, tic_values)
plt.xlabel('Retention Time (s)')
plt.ylabel('Total Ion Current')
plt.title('TIC')
plt.show()
# Plot single spectrum
spectrum = exp.getSpectrum(0)
mz, intensity = spectrum.get_peaks()
plt.stem(mz, intensity, basefmt=' ')
plt.xlabel('m/z')
plt.ylabel('Intensity')
plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
plt.show()
```
## Common Workflows
### Complete LC-MS/MS Processing Pipeline
```python
import pyopenms as oms
# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw_data.mzML", exp)
# 2. Filter and smooth
exp.filterMSLevel(1) # Keep only MS1 for feature detection
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)
# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)
# 4. Feature detection
ff = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
ff.run(exp_centroid, features, oms.Param())
# 5. Export results
oms.FeatureXMLFile().store("features.featureXML", features)
print(f"Detected {features.size()} features")
```
### Theoretical Peptide Mass Calculation
```python
# Calculate masses for peptide with modifications
peptide = oms.AASequence.fromString("PEPTIDEK")
print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
# With modification
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
# Calculate for different charge states
for z in [1, 2, 3]:
mz = (peptide.getMonoWeight() + z * 1.007276) / z
print(f"[M+{z}H]^{z}+: {mz:.4f}")
```
## Installation
Ensure pyOpenMS is installed before using this skill:
```bash
# Via conda (recommended)
conda install -c bioconda pyopenms
# Via pip
pip install pyopenms
```
## Integration with Other Tools
pyOpenMS integrates seamlessly with:
- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
- **Post-processing**: Percolator, MSstats, Epiphany
- **Metabolomics**: SIRIUS, CSI:FingerID
- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
- **Visualization**: Matplotlib, Seaborn for plotting
## Resources
### references/
Detailed documentation on core concepts:
- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
Load these references when needing detailed information about specific pyOpenMS capabilities.
## Best Practices
1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
4. **Memory**: For large datasets, process spectra iteratively rather than loading entire experiments
5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
6. **Modifications**: Use standard modification names from UniMod database
7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
## Common Patterns
**Algorithm Application Pattern:**
```python
# 1. Instantiate algorithm
algorithm = oms.SomeAlgorithm()
# 2. Get and configure parameters
params = algorithm.getParameters()
params.setValue("parameter_name", value)
algorithm.setParameters(params)
# 3. Apply to data
algorithm.filterExperiment(exp) # or .process(), .run(), depending on algorithm
```
**File I/O Pattern:**
```python
# Read
data_container = oms.DataContainer() # MSExperiment, FeatureMap, etc.
oms.FileHandler().load("input.format", data_container)
# Process
# ... manipulate data_container ...
# Write
oms.FileHandler().store("output.format", data_container)
```
## Getting Help
- **Documentation**: https://pyopenms.readthedocs.io/
- **API Reference**: Browse class documentation for detailed method signatures
- **OpenMS Website**: https://www.openms.org/
- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues

View File

@@ -0,0 +1,643 @@
# pyOpenMS Algorithms Reference
This document provides comprehensive coverage of algorithms available in pyOpenMS for signal processing, feature detection, and quantification.
## Algorithm Usage Pattern
Most pyOpenMS algorithms follow a consistent pattern:
```python
import pyopenms as oms
# 1. Instantiate algorithm
algorithm = oms.AlgorithmName()
# 2. Get parameters
params = algorithm.getParameters()
# 3. Modify parameters
params.setValue("parameter_name", value)
# 4. Set parameters back
algorithm.setParameters(params)
# 5. Apply to data
algorithm.filterExperiment(exp) # or .process(), .run(), etc.
```
## Signal Processing Algorithms
### Smoothing Filters
#### GaussFilter - Gaussian Smoothing
Applies Gaussian smoothing to reduce noise.
```python
gauss = oms.GaussFilter()
# Configure parameters
params = gauss.getParameters()
params.setValue("gaussian_width", 0.2) # Gaussian width (larger = more smoothing)
params.setValue("ppm_tolerance", 10.0) # PPM tolerance for spacing
params.setValue("use_ppm_tolerance", "true")
gauss.setParameters(params)
# Apply to experiment
gauss.filterExperiment(exp)
# Or apply to single spectrum
spectrum_smoothed = oms.MSSpectrum()
gauss.filter(spectrum, spectrum_smoothed)
```
**Key Parameters:**
- `gaussian_width`: Width of Gaussian kernel (default: 0.2 Da)
- `ppm_tolerance`: Tolerance in ppm for spacing
- `use_ppm_tolerance`: Whether to use ppm instead of absolute spacing
#### SavitzkyGolayFilter
Applies Savitzky-Golay smoothing (polynomial fitting).
```python
sg_filter = oms.SavitzkyGolayFilter()
params = sg_filter.getParameters()
params.setValue("frame_length", 11) # Window size (must be odd)
params.setValue("polynomial_order", 3) # Polynomial degree
sg_filter.setParameters(params)
sg_filter.filterExperiment(exp)
```
**Key Parameters:**
- `frame_length`: Size of smoothing window (must be odd)
- `polynomial_order`: Degree of polynomial (typically 2-4)
### Peak Filtering
#### NLargest - Keep Top N Peaks
Retains only the N most intense peaks per spectrum.
```python
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100) # Keep top 100 peaks
params.setValue("threshold", 0.0) # Optional minimum intensity
n_largest.setParameters(params)
n_largest.filterExperiment(exp)
```
**Key Parameters:**
- `n`: Number of peaks to keep per spectrum
- `threshold`: Minimum absolute intensity threshold
#### ThresholdMower - Intensity Threshold Filtering
Removes peaks below a specified intensity threshold.
```python
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0) # Absolute intensity threshold
threshold_filter.setParameters(params)
threshold_filter.filterExperiment(exp)
```
**Key Parameters:**
- `threshold`: Absolute intensity cutoff
#### WindowMower - Window-Based Peak Selection
Divides m/z range into windows and keeps top N peaks per window.
```python
window_mower = oms.WindowMower()
params = window_mower.getParameters()
params.setValue("windowsize", 50.0) # Window size in Da (or Thomson)
params.setValue("peakcount", 10) # Peaks to keep per window
params.setValue("movetype", "jump") # "jump" or "slide"
window_mower.setParameters(params)
window_mower.filterExperiment(exp)
```
**Key Parameters:**
- `windowsize`: Size of m/z window (Da)
- `peakcount`: Number of peaks to retain per window
- `movetype`: "jump" (non-overlapping) or "slide" (overlapping windows)
#### BernNorm - Bernoulli Normalization
Statistical normalization based on Bernoulli distribution.
```python
bern_norm = oms.BernNorm()
params = bern_norm.getParameters()
params.setValue("threshold", 0.7) # Threshold for normalization
bern_norm.setParameters(params)
bern_norm.filterExperiment(exp)
```
### Spectrum Normalization
#### Normalizer
Normalizes spectrum intensities to unit total intensity or maximum intensity.
```python
normalizer = oms.Normalizer()
params = normalizer.getParameters()
params.setValue("method", "to_one") # "to_one" or "to_TIC"
normalizer.setParameters(params)
normalizer.filterExperiment(exp)
```
**Methods:**
- `to_one`: Normalize max peak to 1.0
- `to_TIC`: Normalize to total ion current = 1.0
#### Scaler
Scales intensities by a constant factor.
```python
scaler = oms.Scaler()
params = scaler.getParameters()
params.setValue("scaling", 1000.0) # Scaling factor
scaler.setParameters(params)
scaler.filterExperiment(exp)
```
## Centroiding and Peak Picking
### PeakPickerHiRes - High-Resolution Peak Picking
Converts profile spectra to centroid mode for high-resolution data.
```python
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0) # S/N threshold
params.setValue("spacing_difference", 1.5) # Peak spacing factor
params.setValue("sn_win_len", 20.0) # S/N window length
params.setValue("sn_bin_count", 30) # Bins for S/N estimation
params.setValue("ms1_only", "false") # Process only MS1
params.setValue("ms_levels", [1, 2]) # MS levels to process
picker.setParameters(params)
# Pick peaks
exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)
```
**Key Parameters:**
- `signal_to_noise`: Minimum signal-to-noise ratio
- `spacing_difference`: Minimum spacing between peaks
- `ms_levels`: List of MS levels to process
### PeakPickerWavelet - Wavelet-Based Peak Picking
Uses continuous wavelet transform for peak detection.
```python
wavelet_picker = oms.PeakPickerWavelet()
params = wavelet_picker.getParameters()
params.setValue("signal_to_noise", 1.0)
params.setValue("peak_width", 0.15) # Expected peak width (Da)
wavelet_picker.setParameters(params)
wavelet_picker.pickExperiment(exp, exp_centroided)
```
## Feature Detection
### FeatureFinder Algorithms
Feature finders detect 2D features (m/z and RT) in LC-MS data.
#### FeatureFinderMultiplex
For multiplex labeling experiments (SILAC, dimethyl labeling).
```python
ff = oms.FeatureFinderMultiplex()
params = ff.getParameters()
params.setValue("algorithm:labels", "[]") # Empty for label-free
params.setValue("algorithm:charge", "2:4") # Charge range
params.setValue("algorithm:rt_typical", 40.0) # Expected feature RT width
params.setValue("algorithm:rt_min", 2.0) # Minimum RT width
params.setValue("algorithm:mz_tolerance", 10.0) # m/z tolerance (ppm)
params.setValue("algorithm:intensity_cutoff", 1000.0) # Minimum intensity
ff.setParameters(params)
# Run feature detection
features = oms.FeatureMap()
ff.run(exp, features, oms.Param())
print(f"Found {features.size()} features")
```
**Key Parameters:**
- `algorithm:charge`: Charge state range to consider
- `algorithm:rt_typical`: Expected peak width in RT dimension
- `algorithm:mz_tolerance`: Mass tolerance in ppm
- `algorithm:intensity_cutoff`: Minimum intensity threshold
#### FeatureFinderCentroided
For centroided data, identifies isotope patterns and traces over RT.
```python
ff_centroided = oms.FeatureFinderCentroided()
params = ff_centroided.getParameters()
params.setValue("mass_trace:mz_tolerance", 10.0) # ppm
params.setValue("mass_trace:min_spectra", 5) # Min consecutive spectra
params.setValue("isotopic_pattern:charge_low", 1)
params.setValue("isotopic_pattern:charge_high", 4)
params.setValue("seed:min_score", 0.5)
ff_centroided.setParameters(params)
features = oms.FeatureMap()
seeds = oms.FeatureMap() # Optional seed features
ff_centroided.run(exp, features, params, seeds)
```
#### FeatureFinderIdentification
Uses peptide identifications to guide feature detection.
```python
ff_id = oms.FeatureFinderIdentification()
params = ff_id.getParameters()
params.setValue("extract:mz_window", 10.0) # ppm
params.setValue("extract:rt_window", 60.0) # seconds
params.setValue("detect:peak_width", 30.0) # Expected peak width
ff_id.setParameters(params)
# Requires peptide identifications
protein_ids = []
peptide_ids = []
features = oms.FeatureMap()
ff_id.run(exp, protein_ids, peptide_ids, features)
```
## Charge and Isotope Deconvolution
### Decharging and Charge State Deconvolution
#### FeatureDeconvolution
Resolves charge states and combines features.
```python
deconv = oms.FeatureDeconvolution()
params = deconv.getParameters()
params.setValue("charge_min", 1)
params.setValue("charge_max", 4)
params.setValue("q_value", 0.01) # FDR threshold
deconv.setParameters(params)
features_deconv = oms.FeatureMap()
consensus_map = oms.ConsensusMap()
deconv.compute(features, features_deconv, consensus_map)
```
## Map Alignment
### MapAlignmentAlgorithm
Aligns retention times across multiple LC-MS runs.
#### MapAlignmentAlgorithmPoseClustering
Pose clustering-based RT alignment.
```python
aligner = oms.MapAlignmentAlgorithmPoseClustering()
params = aligner.getParameters()
params.setValue("max_num_peaks_considered", 1000)
params.setValue("pairfinder:distance_MZ:max_difference", 0.3) # Da
params.setValue("pairfinder:distance_RT:max_difference", 60.0) # seconds
aligner.setParameters(params)
# Align multiple feature maps
feature_maps = [features1, features2, features3]
transformations = []
# Create reference (e.g., use first map)
reference = oms.FeatureMap(feature_maps[0])
# Align others to reference
for fm in feature_maps[1:]:
transformation = oms.TransformationDescription()
aligner.align(fm, reference, transformation)
transformations.append(transformation)
# Apply transformation
transformer = oms.MapAlignmentTransformer()
transformer.transformRetentionTimes(fm, transformation)
```
## Feature Linking
### FeatureGroupingAlgorithm
Links features across samples to create consensus features.
#### FeatureGroupingAlgorithmQT
Quality threshold-based feature linking.
```python
grouper = oms.FeatureGroupingAlgorithmQT()
params = grouper.getParameters()
params.setValue("distance_RT:max_difference", 60.0) # seconds
params.setValue("distance_MZ:max_difference", 10.0) # ppm
params.setValue("distance_MZ:unit", "ppm")
grouper.setParameters(params)
# Create consensus map
consensus_map = oms.ConsensusMap()
# Group features from multiple samples
feature_maps = [features1, features2, features3]
grouper.group(feature_maps, consensus_map)
print(f"Created {consensus_map.size()} consensus features")
```
#### FeatureGroupingAlgorithmKD
KD-tree based linking (faster for large datasets).
```python
grouper_kd = oms.FeatureGroupingAlgorithmKD()
params = grouper_kd.getParameters()
params.setValue("mz_unit", "ppm")
params.setValue("mz_tolerance", 10.0)
params.setValue("rt_tolerance", 30.0)
grouper_kd.setParameters(params)
consensus_map = oms.ConsensusMap()
grouper_kd.group(feature_maps, consensus_map)
```
## Chromatographic Analysis
### ElutionPeakDetection
Detects elution peaks in chromatograms.
```python
epd = oms.ElutionPeakDetection()
params = epd.getParameters()
params.setValue("chrom_peak_snr", 3.0) # Signal-to-noise threshold
params.setValue("chrom_fwhm", 5.0) # Expected FWHM (seconds)
epd.setParameters(params)
# Apply to chromatograms
for chrom in exp.getChromatograms():
peaks = epd.detectPeaks(chrom)
```
### MRMFeatureFinderScoring
Scoring and peak picking for targeted (MRM/SRM) experiments.
```python
mrm_finder = oms.MRMFeatureFinderScoring()
params = mrm_finder.getParameters()
params.setValue("TransitionGroupPicker:min_peak_width", 2.0)
params.setValue("TransitionGroupPicker:recalculate_peaks", "true")
params.setValue("TransitionGroupPicker:PeakPickerMRM:signal_to_noise", 1.0)
mrm_finder.setParameters(params)
# Requires chromatograms
features = oms.FeatureMap()
mrm_finder.pickExperiment(chrom_exp, features, targets, transformation, swath_maps)
```
## Quantification
### ProteinInference
Infers proteins from peptide identifications.
```python
protein_inference = oms.BasicProteinInferenceAlgorithm()
# Apply to identification results
protein_inference.run(peptide_ids, protein_ids)
```
### IsobaricQuantification
Quantification for isobaric labeling (TMT, iTRAQ).
```python
# For TMT/iTRAQ quantification
iso_quant = oms.IsobaricQuantification()
params = iso_quant.getParameters()
params.setValue("channel_116_description", "Sample1")
params.setValue("channel_117_description", "Sample2")
# ... configure all channels
iso_quant.setParameters(params)
# Run quantification
quant_method = oms.IsobaricQuantitationMethod.TMT_10PLEX
quant_info = oms.IsobaricQuantifierStatistics()
iso_quant.quantify(exp, quant_info)
```
## Data Processing
### BaselineFiltering
Removes baseline from spectra.
```python
baseline = oms.TopHatFilter()
params = baseline.getParameters()
params.setValue("struc_elem_length", 3.0) # Structuring element size
params.setValue("struc_elem_unit", "Thomson")
baseline.setParameters(params)
baseline.filterExperiment(exp)
```
### SpectraMerger
Merges consecutive similar spectra.
```python
merger = oms.SpectraMerger()
params = merger.getParameters()
params.setValue("mz_binning_width", 0.05) # Binning width (Da)
params.setValue("sort_blocks", "RT_ascending")
merger.setParameters(params)
merger.mergeSpectra(exp)
```
## Quality Control
### MzMLFileQuality
Analyzes mzML file quality.
```python
# Calculate basic QC metrics
def calculate_qc_metrics(exp):
metrics = {
'n_spectra': exp.getNrSpectra(),
'n_ms1': sum(1 for s in exp if s.getMSLevel() == 1),
'n_ms2': sum(1 for s in exp if s.getMSLevel() == 2),
'rt_range': (exp.getMinRT(), exp.getMaxRT()),
'mz_range': (exp.getMinMZ(), exp.getMaxMZ()),
}
# Calculate TIC
tics = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
mz, intensity = spectrum.get_peaks()
tics.append(sum(intensity))
metrics['median_tic'] = np.median(tics)
metrics['mean_tic'] = np.mean(tics)
return metrics
```
## FDR Control
### FalseDiscoveryRate
Estimates and controls false discovery rate.
```python
fdr = oms.FalseDiscoveryRate()
params = fdr.getParameters()
params.setValue("add_decoy_peptides", "false")
params.setValue("add_decoy_proteins", "false")
fdr.setParameters(params)
# Apply to identifications
fdr.apply(protein_ids, peptide_ids)
# Filter by FDR threshold
fdr_threshold = 0.01
filtered_peptides = [p for p in peptide_ids if p.getMetaValue("q-value") <= fdr_threshold]
```
## Algorithm Selection Guide
### When to Use Which Algorithm
**For Smoothing:**
- Use `GaussFilter` for general-purpose smoothing
- Use `SavitzkyGolayFilter` for preserving peak shapes
**For Peak Picking:**
- Use `PeakPickerHiRes` for high-resolution Orbitrap/FT-ICR data
- Use `PeakPickerWavelet` for lower-resolution TOF data
**For Feature Detection:**
- Use `FeatureFinderCentroided` for label-free proteomics (DDA)
- Use `FeatureFinderMultiplex` for SILAC/dimethyl labeling
- Use `FeatureFinderIdentification` when you have ID information
- Use `MRMFeatureFinderScoring` for targeted (MRM/SRM) experiments
**For Feature Linking:**
- Use `FeatureGroupingAlgorithmQT` for small-medium datasets (<10 samples)
- Use `FeatureGroupingAlgorithmKD` for large datasets (>10 samples)
## Parameter Tuning Tips
1. **S/N Thresholds**: Start with 1-3 for clean data, increase for noisy data
2. **m/z Tolerance**: Use 5-10 ppm for high-resolution instruments, 0.5-1 Da for low-res
3. **RT Tolerance**: Typically 30-60 seconds depending on chromatographic stability
4. **Peak Width**: Measure from real data - varies by instrument and gradient length
5. **Charge States**: Set based on expected analytes (1-2 for metabolites, 2-4 for peptides)
## Common Algorithm Workflows
### Complete Proteomics Workflow
```python
# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw.mzML", exp)
# 2. Smooth
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)
# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)
# 4. Feature detection
ff = oms.FeatureFinderCentroided()
features = oms.FeatureMap()
ff.run(exp_centroid, features, oms.Param(), oms.FeatureMap())
# 5. Save results
oms.FeatureXMLFile().store("features.featureXML", features)
```
### Multi-Sample Quantification
```python
# Load multiple samples
feature_maps = []
for filename in ["sample1.mzML", "sample2.mzML", "sample3.mzML"]:
exp = oms.MSExperiment()
oms.MzMLFile().load(filename, exp)
# Process and detect features
features = detect_features(exp) # Your processing function
feature_maps.append(features)
# Align retention times
align_feature_maps(feature_maps) # Implement alignment
# Link features
grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()
grouper.group(feature_maps, consensus_map)
# Export quantification matrix
export_quant_matrix(consensus_map)
```

View File

@@ -0,0 +1,715 @@
# pyOpenMS Chemistry Reference
This document provides comprehensive coverage of chemistry-related functionality in pyOpenMS, including elements, isotopes, molecular formulas, amino acids, peptides, proteins, and modifications.
## Elements and Isotopes
### ElementDB - Element Database
Access atomic and isotopic data for all elements.
```python
import pyopenms as oms
# Get element database instance
element_db = oms.ElementDB()
# Get element by symbol
carbon = element_db.getElement("C")
nitrogen = element_db.getElement("N")
oxygen = element_db.getElement("O")
# Element properties
print(f"Carbon monoisotopic weight: {carbon.getMonoWeight()}")
print(f"Carbon average weight: {carbon.getAverageWeight()}")
print(f"Atomic number: {carbon.getAtomicNumber()}")
print(f"Symbol: {carbon.getSymbol()}")
print(f"Name: {carbon.getName()}")
```
### Isotope Information
```python
# Get isotope distribution for an element
isotopes = carbon.getIsotopeDistribution()
# Access specific isotope
c12 = element_db.getElement("C", 12) # Carbon-12
c13 = element_db.getElement("C", 13) # Carbon-13
print(f"C-12 abundance: {isotopes.getContainer()[0].getIntensity()}")
print(f"C-13 abundance: {isotopes.getContainer()[1].getIntensity()}")
# Isotope mass
print(f"C-12 mass: {c12.getMonoWeight()}")
print(f"C-13 mass: {c13.getMonoWeight()}")
```
### Constants
```python
# Physical constants
avogadro = oms.Constants.AVOGADRO
electron_mass = oms.Constants.ELECTRON_MASS_U
proton_mass = oms.Constants.PROTON_MASS_U
print(f"Avogadro's number: {avogadro}")
print(f"Electron mass: {electron_mass} u")
print(f"Proton mass: {proton_mass} u")
```
## Empirical Formulas
### EmpiricalFormula - Molecular Formulas
Represent and manipulate molecular formulas.
#### Creating Formulas
```python
# From string
glucose = oms.EmpiricalFormula("C6H12O6")
water = oms.EmpiricalFormula("H2O")
ammonia = oms.EmpiricalFormula("NH3")
# From element composition
formula = oms.EmpiricalFormula()
formula.setCharge(1) # Set charge state
```
#### Formula Arithmetic
```python
# Addition
sucrose = oms.EmpiricalFormula("C12H22O11")
hydrolyzed = sucrose + water # Hydrolysis adds water
# Subtraction
dehydrated = glucose - water # Dehydration removes water
# Multiplication
three_waters = water * 3 # 3 H2O = H6O3
# Division
formula_half = sucrose / 2 # Half the formula
```
#### Mass Calculations
```python
# Monoisotopic mass
mono_mass = glucose.getMonoWeight()
print(f"Glucose monoisotopic mass: {mono_mass:.6f} Da")
# Average mass
avg_mass = glucose.getAverageWeight()
print(f"Glucose average mass: {avg_mass:.6f} Da")
# Mass difference
mass_diff = (glucose - water).getMonoWeight()
```
#### Elemental Composition
```python
# Get element counts
formula = oms.EmpiricalFormula("C6H12O6")
# Access individual elements
n_carbon = formula.getNumberOf(element_db.getElement("C"))
n_hydrogen = formula.getNumberOf(element_db.getElement("H"))
n_oxygen = formula.getNumberOf(element_db.getElement("O"))
print(f"C: {n_carbon}, H: {n_hydrogen}, O: {n_oxygen}")
# String representation
print(f"Formula: {formula.toString()}")
```
#### Isotope-Specific Formulas
```python
# Specify specific isotopes using parentheses
labeled_glucose = oms.EmpiricalFormula("(13)C6H12O6") # All carbons are C-13
partially_labeled = oms.EmpiricalFormula("C5(13)CH12O6") # One C-13
# Deuterium labeling
deuterated = oms.EmpiricalFormula("C6D12O6") # D2O instead of H2O
```
#### Charge States
```python
# Set charge
formula = oms.EmpiricalFormula("C6H12O6")
formula.setCharge(1) # Positive charge
# Get charge
charge = formula.getCharge()
# Calculate m/z for charged molecule
mz = formula.getMonoWeight() / abs(charge) if charge != 0 else formula.getMonoWeight()
```
### Isotope Pattern Generation
Generate theoretical isotope patterns for formulas.
#### CoarseIsotopePatternGenerator
For unit mass resolution (low-resolution instruments).
```python
# Create generator
coarse_gen = oms.CoarseIsotopePatternGenerator()
# Generate pattern
formula = oms.EmpiricalFormula("C6H12O6")
pattern = coarse_gen.run(formula)
# Access isotope peaks
iso_dist = pattern.getContainer()
for peak in iso_dist:
mass = peak.getMZ()
abundance = peak.getIntensity()
print(f"m/z: {mass:.4f}, Abundance: {abundance:.4f}")
```
#### FineIsotopePatternGenerator
For high-resolution instruments (hyperfine structure).
```python
# Create generator with resolution
fine_gen = oms.FineIsotopePatternGenerator(0.01) # 0.01 Da resolution
# Generate fine pattern
fine_pattern = fine_gen.run(formula)
# Access fine isotope structure
for peak in fine_pattern.getContainer():
print(f"m/z: {peak.getMZ():.6f}, Abundance: {peak.getIntensity():.6f}")
```
#### Isotope Pattern Matching
```python
# Compare experimental to theoretical
def compare_isotope_patterns(experimental_mz, experimental_int, formula):
# Generate theoretical
coarse_gen = oms.CoarseIsotopePatternGenerator()
theoretical = coarse_gen.run(formula)
# Extract theoretical peaks
theo_peaks = theoretical.getContainer()
theo_mz = [p.getMZ() for p in theo_peaks]
theo_int = [p.getIntensity() for p in theo_peaks]
# Normalize both patterns
exp_int_norm = [i / max(experimental_int) for i in experimental_int]
theo_int_norm = [i / max(theo_int) for i in theo_int]
# Calculate similarity (e.g., cosine similarity)
# ... implement similarity calculation
return similarity_score
```
## Amino Acids and Residues
### Residue - Amino Acid Representation
Access properties of amino acids.
```python
# Get residue database
res_db = oms.ResidueDB()
# Get specific residue
leucine = res_db.getResidue("Leucine")
# Or by one-letter code
leu = res_db.getResidue("L")
# Residue properties
print(f"Name: {leucine.getName()}")
print(f"Three-letter code: {leucine.getThreeLetterCode()}")
print(f"One-letter code: {leucine.getOneLetterCode()}")
print(f"Monoisotopic mass: {leucine.getMonoWeight():.6f}")
print(f"Average mass: {leucine.getAverageWeight():.6f}")
# Chemical formula
formula = leucine.getFormula()
print(f"Formula: {formula.toString()}")
# pKa values
print(f"pKa (N-term): {leucine.getPka()}")
print(f"pKa (C-term): {leucine.getPkb()}")
print(f"pKa (side chain): {leucine.getPkc()}")
# Side chain basicity/acidity
print(f"Basicity: {leucine.getBasicity()}")
print(f"Hydrophobicity: {leucine.getHydrophobicity()}")
```
### All Standard Amino Acids
```python
# Iterate over all residues
for residue_name in ["Alanine", "Cysteine", "Aspartic acid", "Glutamic acid",
"Phenylalanine", "Glycine", "Histidine", "Isoleucine",
"Lysine", "Leucine", "Methionine", "Asparagine",
"Proline", "Glutamine", "Arginine", "Serine",
"Threonine", "Valine", "Tryptophan", "Tyrosine"]:
res = res_db.getResidue(residue_name)
print(f"{res.getOneLetterCode()}: {res.getMonoWeight():.4f} Da")
```
### Internal Residues vs. Termini
```python
# Get internal residue mass (no terminal groups)
internal_mass = leucine.getInternalToFull()
# Get residue with N-terminal modification
n_terminal = res_db.getResidue("L[1]") # With NH2
# Get residue with C-terminal modification
c_terminal = res_db.getResidue("L[2]") # With COOH
```
## Peptide Sequences
### AASequence - Amino Acid Sequences
Represent and manipulate peptide sequences.
#### Creating Sequences
```python
# From string
peptide = oms.AASequence.fromString("PEPTIDE")
longer = oms.AASequence.fromString("MKTAYIAKQRQISFVK")
# Empty sequence
empty_seq = oms.AASequence()
```
#### Sequence Properties
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# Length
length = peptide.size()
print(f"Length: {length} residues")
# Mass
mono_mass = peptide.getMonoWeight()
avg_mass = peptide.getAverageWeight()
print(f"Monoisotopic mass: {mono_mass:.6f} Da")
print(f"Average mass: {avg_mass:.6f} Da")
# Formula
formula = peptide.getFormula()
print(f"Formula: {formula.toString()}")
# String representation
seq_str = peptide.toString()
print(f"Sequence: {seq_str}")
```
#### Accessing Individual Residues
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# Access by index
first_aa = peptide[0] # Returns Residue object
print(f"First amino acid: {first_aa.getOneLetterCode()}")
# Iterate
for i in range(peptide.size()):
residue = peptide[i]
print(f"Position {i}: {residue.getOneLetterCode()}")
```
#### Modifications
Add post-translational modifications (PTMs) to sequences.
```python
# Modifications in sequence string
# Format: AA(ModificationName)
oxidized_met = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
phospho = oms.AASequence.fromString("PEPTIDES(Phospho)T(Phospho)")
# Multiple modifications
multi_mod = oms.AASequence.fromString("M(Oxidation)PEPTIDEK(Acetyl)")
# N-terminal modifications
n_term_acetyl = oms.AASequence.fromString("(Acetyl)PEPTIDE")
# C-terminal modifications
c_term_amide = oms.AASequence.fromString("PEPTIDE(Amidated)")
# Check mass change
unmodified = oms.AASequence.fromString("PEPTIDE")
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
mass_diff = modified.getMonoWeight() - unmodified.getMonoWeight()
print(f"Mass shift from oxidation: {mass_diff:.6f} Da")
```
#### Sequence Manipulation
```python
# Prefix (N-terminal fragment)
prefix = peptide.getPrefix(3) # First 3 residues
print(f"Prefix: {prefix.toString()}")
# Suffix (C-terminal fragment)
suffix = peptide.getSuffix(3) # Last 3 residues
print(f"Suffix: {suffix.toString()}")
# Subsequence
subseq = peptide.getSubsequence(2, 4) # Residues 2-4
print(f"Subsequence: {subseq.toString()}")
```
#### Theoretical Fragmentation
Generate theoretical fragment ions for MS/MS.
```python
peptide = oms.AASequence.fromString("PEPTIDE")
# b-ions (N-terminal fragments)
b_ions = []
for i in range(1, peptide.size()):
b_fragment = peptide.getPrefix(i)
b_mass = b_fragment.getMonoWeight()
b_ions.append(('b', i, b_mass))
print(f"b{i}: {b_mass:.4f}")
# y-ions (C-terminal fragments)
y_ions = []
for i in range(1, peptide.size()):
y_fragment = peptide.getSuffix(i)
y_mass = y_fragment.getMonoWeight()
y_ions.append(('y', i, y_mass))
print(f"y{i}: {y_mass:.4f}")
# a-ions (b - CO)
a_ions = []
CO_mass = 27.994915 # CO loss
for ion_type, position, mass in b_ions:
a_mass = mass - CO_mass
a_ions.append(('a', position, a_mass))
# c-ions (b + NH3)
NH3_mass = 17.026549 # NH3 gain
c_ions = []
for ion_type, position, mass in b_ions:
c_mass = mass + NH3_mass
c_ions.append(('c', position, c_mass))
# z-ions (y - NH3)
z_ions = []
for ion_type, position, mass in y_ions:
z_mass = mass - NH3_mass
z_ions.append(('z', position, z_mass))
```
#### Calculate m/z for Charge States
```python
peptide = oms.AASequence.fromString("PEPTIDE")
proton_mass = 1.007276
# [M+H]+
mz_1 = peptide.getMonoWeight() + proton_mass
print(f"[M+H]+: {mz_1:.4f}")
# [M+2H]2+
mz_2 = (peptide.getMonoWeight() + 2 * proton_mass) / 2
print(f"[M+2H]2+: {mz_2:.4f}")
# [M+3H]3+
mz_3 = (peptide.getMonoWeight() + 3 * proton_mass) / 3
print(f"[M+3H]3+: {mz_3:.4f}")
# General formula for any charge
def calculate_mz(sequence, charge):
proton_mass = 1.007276
return (sequence.getMonoWeight() + charge * proton_mass) / charge
for z in range(1, 5):
print(f"[M+{z}H]{z}+: {calculate_mz(peptide, z):.4f}")
```
## Protein Digestion
### ProteaseDigestion - Enzymatic Cleavage
Simulate enzymatic protein digestion.
#### Basic Digestion
```python
# Create digestion object
dig = oms.ProteaseDigestion()
# Set enzyme
dig.setEnzyme("Trypsin") # Cleaves after K, R
# Other common enzymes:
# - "Trypsin" (K, R)
# - "Lys-C" (K)
# - "Arg-C" (R)
# - "Asp-N" (D)
# - "Glu-C" (E, D)
# - "Chymotrypsin" (F, Y, W, L)
# Set missed cleavages
dig.setMissedCleavages(0) # No missed cleavages
dig.setMissedCleavages(2) # Allow up to 2 missed cleavages
# Perform digestion
protein = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein, peptides)
# Print results
for pep in peptides:
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
```
#### Advanced Digestion Options
```python
# Get enzyme specificity
specificity = dig.getSpecificity()
# oms.EnzymaticDigestion.SPEC_FULL (both termini)
# oms.EnzymaticDigestion.SPEC_SEMI (one terminus)
# oms.EnzymaticDigestion.SPEC_NONE (no specificity)
# Set specificity for semi-tryptic search
dig.setSpecificity(oms.EnzymaticDigestion.SPEC_SEMI)
# Get cleavage sites
cleavage_residues = dig.getEnzyme().getCutAfterResidues()
restriction_residues = dig.getEnzyme().getRestriction()
```
#### Filter Peptides by Properties
```python
# Filter by mass range
min_mass = 600.0
max_mass = 4000.0
filtered = [p for p in peptides if min_mass <= p.getMonoWeight() <= max_mass]
# Filter by length
min_length = 6
max_length = 30
length_filtered = [p for p in peptides if min_length <= p.size() <= max_length]
# Combine filters
valid_peptides = [p for p in peptides
if min_mass <= p.getMonoWeight() <= max_mass
and min_length <= p.size() <= max_length]
```
## Modifications
### ModificationsDB - Modification Database
Access and apply post-translational modifications.
#### Accessing Modifications
```python
# Get modifications database
mod_db = oms.ModificationsDB()
# Get specific modification
oxidation = mod_db.getModification("Oxidation")
phospho = mod_db.getModification("Phospho")
acetyl = mod_db.getModification("Acetyl")
# Modification properties
print(f"Name: {oxidation.getFullName()}")
print(f"Mass difference: {oxidation.getDiffMonoMass():.6f} Da")
print(f"Formula: {oxidation.getDiffFormula().toString()}")
# Affected residues
print(f"Residues: {oxidation.getResidues()}") # e.g., ['M']
# Specificity (N-term, C-term, anywhere)
print(f"Term specificity: {oxidation.getTermSpecificity()}")
```
#### Common Modifications
```python
# Oxidation (M)
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation: +{oxidation.getDiffMonoMass():.4f} Da")
# Phosphorylation (S, T, Y)
phospho = mod_db.getModification("Phospho")
print(f"Phospho: +{phospho.getDiffMonoMass():.4f} Da")
# Carbamidomethylation (C) - common alkylation
carbamido = mod_db.getModification("Carbamidomethyl")
print(f"Carbamidomethyl: +{carbamido.getDiffMonoMass():.4f} Da")
# Acetylation (K, N-term)
acetyl = mod_db.getModification("Acetyl")
print(f"Acetyl: +{acetyl.getDiffMonoMass():.4f} Da")
# Deamidation (N, Q)
deamid = mod_db.getModification("Deamidated")
print(f"Deamidation: +{deamid.getDiffMonoMass():.4f} Da")
```
#### Searching Modifications
```python
# Search modifications by mass
mass_tolerance = 0.01 # Da
target_mass = 15.9949 # Oxidation
# Get all modifications
all_mods = []
mod_db.getAllSearchModifications(all_mods)
# Find matching modifications
matching = []
for mod_name in all_mods:
mod = mod_db.getModification(mod_name)
if abs(mod.getDiffMonoMass() - target_mass) < mass_tolerance:
matching.append(mod)
print(f"Match: {mod.getFullName()} ({mod.getDiffMonoMass():.4f} Da)")
```
#### Variable vs. Fixed Modifications
```python
# In search engines, specify:
# Fixed modifications: applied to all occurrences
fixed_mods = ["Carbamidomethyl (C)"]
# Variable modifications: optionally present
variable_mods = ["Oxidation (M)", "Phospho (S)", "Phospho (T)", "Phospho (Y)"]
```
## Ribonucleotides (RNA)
### Ribonucleotide - RNA Building Blocks
```python
# Get ribonucleotide database
ribo_db = oms.RibonucleotideDB()
# Get specific ribonucleotide
adenine = ribo_db.getRibonucleotide("A")
uracil = ribo_db.getRibonucleotide("U")
guanine = ribo_db.getRibonucleotide("G")
cytosine = ribo_db.getRibonucleotide("C")
# Properties
print(f"Adenine mono mass: {adenine.getMonoWeight()}")
print(f"Formula: {adenine.getFormula().toString()}")
# Modified ribonucleotides
modified_ribo = ribo_db.getRibonucleotide("m6A") # N6-methyladenosine
```
## Practical Examples
### Calculate Peptide Mass with Modifications
```python
def calculate_peptide_mz(sequence_str, charge):
"""Calculate m/z for a peptide sequence string with modifications."""
peptide = oms.AASequence.fromString(sequence_str)
proton_mass = 1.007276
mz = (peptide.getMonoWeight() + charge * proton_mass) / charge
return mz
# Examples
print(calculate_peptide_mz("PEPTIDE", 2)) # Unmodified [M+2H]2+
print(calculate_peptide_mz("PEPTIDEM(Oxidation)", 2)) # With oxidation
print(calculate_peptide_mz("(Acetyl)PEPTIDEK(Acetyl)", 2)) # Acetylated
```
### Generate Complete Fragment Ion Series
```python
def generate_fragment_ions(sequence_str, charge_states=[1, 2]):
"""Generate comprehensive fragment ion list."""
peptide = oms.AASequence.fromString(sequence_str)
proton_mass = 1.007276
fragments = []
for i in range(1, peptide.size()):
# b and y ions
b_frag = peptide.getPrefix(i)
y_frag = peptide.getSuffix(i)
for z in charge_states:
b_mz = (b_frag.getMonoWeight() + z * proton_mass) / z
y_mz = (y_frag.getMonoWeight() + z * proton_mass) / z
fragments.append({
'type': 'b',
'position': i,
'charge': z,
'mz': b_mz
})
fragments.append({
'type': 'y',
'position': i,
'charge': z,
'mz': y_mz
})
return fragments
# Usage
ions = generate_fragment_ions("PEPTIDE", charge_states=[1, 2])
for ion in ions:
print(f"{ion['type']}{ion['position']}^{ion['charge']}+: {ion['mz']:.4f}")
```
### Digest Protein and Calculate Peptide Masses
```python
def digest_and_calculate(protein_seq_str, enzyme="Trypsin", missed_cleavages=2,
min_mass=600, max_mass=4000):
"""Digest protein and return valid peptides with masses."""
dig = oms.ProteaseDigestion()
dig.setEnzyme(enzyme)
dig.setMissedCleavages(missed_cleavages)
protein = oms.AASequence.fromString(protein_seq_str)
peptides = []
dig.digest(protein, peptides)
results = []
for pep in peptides:
mass = pep.getMonoWeight()
if min_mass <= mass <= max_mass:
results.append({
'sequence': pep.toString(),
'mass': mass,
'length': pep.size()
})
return results
# Usage
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
peptides = digest_and_calculate(protein)
for pep in peptides:
print(f"{pep['sequence']}: {pep['mass']:.2f} Da ({pep['length']} aa)")
```

View File

@@ -0,0 +1,560 @@
# pyOpenMS Data Structures Reference
This document provides comprehensive coverage of core data structures in pyOpenMS for representing mass spectrometry data.
## Core Hierarchy
```
MSExperiment # Top-level: Complete LC-MS/MS run
├── MSSpectrum[] # Collection of mass spectra
│ ├── Peak1D[] # Individual m/z, intensity pairs
│ └── SpectrumSettings # Metadata (RT, MS level, precursor)
└── MSChromatogram[] # Collection of chromatograms
├── ChromatogramPeak[] # RT, intensity pairs
└── ChromatogramSettings # Metadata
```
## MSSpectrum
Represents a single mass spectrum (1-dimensional peak data).
### Creation and Basic Properties
```python
import pyopenms as oms
# Create empty spectrum
spectrum = oms.MSSpectrum()
# Set metadata
spectrum.setRT(123.45) # Retention time in seconds
spectrum.setMSLevel(1) # MS level (1 for MS1, 2 for MS2, etc.)
spectrum.setNativeID("scan=1234") # Native ID from file
# Additional metadata
spectrum.setDriftTime(15.2) # Ion mobility drift time
spectrum.setName("MyScan") # Optional name
```
### Peak Data Management
**Setting Peaks (Method 1 - Lists):**
```python
mz_values = [100.5, 200.3, 300.7, 400.2, 500.1]
intensity_values = [1000, 5000, 3000, 2000, 1500]
spectrum.set_peaks((mz_values, intensity_values))
```
**Setting Peaks (Method 2 - NumPy arrays):**
```python
import numpy as np
mz_array = np.array([100.5, 200.3, 300.7, 400.2, 500.1])
intensity_array = np.array([1000, 5000, 3000, 2000, 1500])
spectrum.set_peaks((mz_array, intensity_array))
```
**Retrieving Peaks:**
```python
# Get as numpy arrays (efficient)
mz_array, intensity_array = spectrum.get_peaks()
# Check number of peaks
n_peaks = spectrum.size()
# Get individual peak (slower)
for i in range(spectrum.size()):
peak = spectrum[i]
mz = peak.getMZ()
intensity = peak.getIntensity()
```
### Precursor Information (for MS2/MSn spectra)
```python
# Create precursor
precursor = oms.Precursor()
precursor.setMZ(456.789) # Precursor m/z
precursor.setCharge(2) # Precursor charge
precursor.setIntensity(50000) # Precursor intensity
precursor.setIsolationWindowLowerOffset(1.5) # Lower isolation window
precursor.setIsolationWindowUpperOffset(1.5) # Upper isolation window
# Set activation method
activation = oms.Activation()
activation.setActivationEnergy(35.0) # Collision energy
activation.setMethod(oms.Activation.ActivationMethod.CID)
precursor.setActivation(activation)
# Assign to spectrum
spectrum.setPrecursors([precursor])
# Retrieve precursor information
precursors = spectrum.getPrecursors()
if len(precursors) > 0:
prec = precursors[0]
print(f"Precursor m/z: {prec.getMZ()}")
print(f"Precursor charge: {prec.getCharge()}")
```
### Spectrum Metadata Access
```python
# Check if spectrum is sorted by m/z
is_sorted = spectrum.isSorted()
# Sort spectrum by m/z
spectrum.sortByPosition()
# Sort by intensity
spectrum.sortByIntensity()
# Clear all peaks
spectrum.clear(False) # False = keep metadata, True = clear everything
# Get retention time
rt = spectrum.getRT()
# Get MS level
ms_level = spectrum.getMSLevel()
```
### Spectrum Types and Modes
```python
# Set spectrum type
spectrum.setType(oms.SpectrumSettings.SpectrumType.CENTROID) # or PROFILE
# Get spectrum type
spec_type = spectrum.getType()
if spec_type == oms.SpectrumSettings.SpectrumType.CENTROID:
print("Centroid spectrum")
elif spec_type == oms.SpectrumSettings.SpectrumType.PROFILE:
print("Profile spectrum")
```
### Data Processing Annotations
```python
# Add processing information
processing = oms.DataProcessing()
processing.setMetaValue("smoothing", "gaussian")
spectrum.setDataProcessing([processing])
```
## MSExperiment
Represents a complete LC-MS/MS experiment containing multiple spectra and chromatograms.
### Creation and Population
```python
# Create empty experiment
exp = oms.MSExperiment()
# Add spectra
spectrum1 = oms.MSSpectrum()
spectrum1.setRT(100.0)
spectrum1.set_peaks(([100, 200], [1000, 2000]))
spectrum2 = oms.MSSpectrum()
spectrum2.setRT(200.0)
spectrum2.set_peaks(([100, 200], [1500, 2500]))
exp.addSpectrum(spectrum1)
exp.addSpectrum(spectrum2)
# Add chromatograms
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.0, 11.5], [1000, 5000, 3000]))
exp.addChromatogram(chrom)
```
### Accessing Spectra and Chromatograms
```python
# Get number of spectra and chromatograms
n_spectra = exp.getNrSpectra()
n_chroms = exp.getNrChromatograms()
# Access by index
first_spectrum = exp.getSpectrum(0)
last_spectrum = exp.getSpectrum(exp.getNrSpectra() - 1)
# Iterate over all spectra
for spectrum in exp:
rt = spectrum.getRT()
ms_level = spectrum.getMSLevel()
n_peaks = spectrum.size()
print(f"RT: {rt:.2f}s, MS{ms_level}, Peaks: {n_peaks}")
# Get all spectra as list
spectra = exp.getSpectra()
# Access chromatograms
chrom = exp.getChromatogram(0)
```
### Filtering Operations
```python
# Filter by MS level
exp.filterMSLevel(1) # Keep only MS1 spectra
exp.filterMSLevel(2) # Keep only MS2 spectra
# Filter by retention time range
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
# Filter by m/z range (all spectra)
exp.filterMZ(300.0, 1500.0) # Keep m/z between 300-1500
# Filter by scan number
exp.filterScanNumber(100, 200) # Keep scans 100-200
```
### Metadata and Properties
```python
# Set experiment metadata
exp.setMetaValue("operator", "John Doe")
exp.setMetaValue("instrument", "Q Exactive HF")
# Get metadata
operator = exp.getMetaValue("operator")
# Get RT range
rt_range = exp.getMinRT(), exp.getMaxRT()
# Get m/z range
mz_range = exp.getMinMZ(), exp.getMaxMZ()
# Clear all data
exp.clear(False) # False = keep metadata
```
### Sorting and Organization
```python
# Sort spectra by retention time
exp.sortSpectra()
# Update ranges (call after modifications)
exp.updateRanges()
# Check if experiment is empty
is_empty = exp.empty()
# Reset (clear everything)
exp.reset()
```
## MSChromatogram
Represents an extracted or reconstructed chromatogram (retention time vs. intensity).
### Creation and Basic Usage
```python
# Create chromatogram
chrom = oms.MSChromatogram()
# Set peaks (RT, intensity pairs)
rt_values = [10.0, 10.5, 11.0, 11.5, 12.0]
intensity_values = [1000, 5000, 8000, 6000, 2000]
chrom.set_peaks((rt_values, intensity_values))
# Get peaks
rt_array, int_array = chrom.get_peaks()
# Get size
n_points = chrom.size()
```
### Chromatogram Types
```python
# Set chromatogram type
chrom.setChromatogramType(oms.ChromatogramSettings.ChromatogramType.SELECTED_ION_CURRENT_CHROMATOGRAM)
# Other types:
# - TOTAL_ION_CURRENT_CHROMATOGRAM
# - BASEPEAK_CHROMATOGRAM
# - SELECTED_ION_CURRENT_CHROMATOGRAM
# - SELECTED_REACTION_MONITORING_CHROMATOGRAM
```
### Metadata
```python
# Set native ID
chrom.setNativeID("TIC")
# Set name
chrom.setName("Total Ion Current")
# Access
native_id = chrom.getNativeID()
name = chrom.getName()
```
### Precursor and Product Information (for SRM/MRM)
```python
# For targeted experiments
precursor = oms.Precursor()
precursor.setMZ(456.7)
chrom.setPrecursor(precursor)
product = oms.Product()
product.setMZ(789.4)
chrom.setProduct(product)
```
## Peak1D and ChromatogramPeak
Individual peak data points.
### Peak1D (for mass spectra)
```python
# Create individual peak
peak = oms.Peak1D()
peak.setMZ(456.789)
peak.setIntensity(10000)
# Access
mz = peak.getMZ()
intensity = peak.getIntensity()
# Set position and intensity
peak.setPosition([456.789])
peak.setIntensity(10000)
```
### ChromatogramPeak (for chromatograms)
```python
# Create chromatogram peak
chrom_peak = oms.ChromatogramPeak()
chrom_peak.setRT(125.5)
chrom_peak.setIntensity(5000)
# Access
rt = chrom_peak.getRT()
intensity = chrom_peak.getIntensity()
```
## FeatureMap and Feature
For quantification results.
### Feature
Represents a detected LC-MS feature (peptide or metabolite signal).
```python
# Create feature
feature = oms.Feature()
# Set properties
feature.setMZ(456.789)
feature.setRT(123.45)
feature.setIntensity(1000000)
feature.setCharge(2)
feature.setWidth(15.0) # RT width in seconds
# Set quality score
feature.setOverallQuality(0.95)
# Access
mz = feature.getMZ()
rt = feature.getRT()
intensity = feature.getIntensity()
charge = feature.getCharge()
```
### FeatureMap
Collection of features.
```python
# Create feature map
feature_map = oms.FeatureMap()
# Add features
feature1 = oms.Feature()
feature1.setMZ(456.789)
feature1.setRT(123.45)
feature1.setIntensity(1000000)
feature_map.push_back(feature1)
# Get size
n_features = feature_map.size()
# Iterate
for feature in feature_map:
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}")
# Access by index
first_feature = feature_map[0]
# Clear
feature_map.clear()
```
## PeptideIdentification and ProteinIdentification
For identification results.
### PeptideIdentification
```python
# Create peptide identification
pep_id = oms.PeptideIdentification()
pep_id.setRT(123.45)
pep_id.setMZ(456.789)
# Create peptide hit
hit = oms.PeptideHit()
hit.setSequence(oms.AASequence.fromString("PEPTIDE"))
hit.setCharge(2)
hit.setScore(25.5)
hit.setRank(1)
# Add to identification
pep_id.setHits([hit])
pep_id.setHigherScoreBetter(True)
pep_id.setScoreType("XCorr")
# Access
hits = pep_id.getHits()
for hit in hits:
seq = hit.getSequence().toString()
score = hit.getScore()
print(f"Sequence: {seq}, Score: {score}")
```
### ProteinIdentification
```python
# Create protein identification
prot_id = oms.ProteinIdentification()
# Create protein hit
prot_hit = oms.ProteinHit()
prot_hit.setAccession("P12345")
prot_hit.setSequence("MKTAYIAKQRQISFVK...")
prot_hit.setScore(100.5)
# Add to identification
prot_id.setHits([prot_hit])
prot_id.setScoreType("Mascot Score")
prot_id.setHigherScoreBetter(True)
# Search parameters
search_params = oms.ProteinIdentification.SearchParameters()
search_params.db = "uniprot_human.fasta"
search_params.enzyme = "Trypsin"
prot_id.setSearchParameters(search_params)
```
## ConsensusMap and ConsensusFeature
For linking features across multiple samples.
### ConsensusFeature
```python
# Create consensus feature
cons_feature = oms.ConsensusFeature()
cons_feature.setMZ(456.789)
cons_feature.setRT(123.45)
cons_feature.setIntensity(5000000) # Combined intensity
# Access linked features
for handle in cons_feature.getFeatureList():
map_index = handle.getMapIndex()
feature_index = handle.getIndex()
intensity = handle.getIntensity()
```
### ConsensusMap
```python
# Create consensus map
consensus_map = oms.ConsensusMap()
# Add consensus features
consensus_map.push_back(cons_feature)
# Iterate
for cons_feat in consensus_map:
mz = cons_feat.getMZ()
rt = cons_feat.getRT()
n_features = cons_feat.size() # Number of linked features
```
## Best Practices
1. **Use numpy arrays** for peak data when possible - much faster than individual peak access
2. **Sort spectra** by position (m/z) before searching or filtering
3. **Update ranges** after modifying MSExperiment: `exp.updateRanges()`
4. **Check MS level** before processing - different algorithms for MS1 vs MS2
5. **Validate precursor info** for MS2 spectra - ensure charge and m/z are set
6. **Use appropriate containers** - MSExperiment for raw data, FeatureMap for quantification
7. **Clear metadata carefully** - use `clear(False)` to preserve metadata when clearing peaks
## Common Patterns
### Create MS2 Spectrum with Precursor
```python
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2)
spectrum.setMSLevel(2)
spectrum.set_peaks(([100, 200, 300], [1000, 5000, 3000]))
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])
```
### Extract MS1 Spectra from Experiment
```python
ms1_exp = oms.MSExperiment()
for spectrum in exp:
if spectrum.getMSLevel() == 1:
ms1_exp.addSpectrum(spectrum)
```
### Calculate Total Ion Current (TIC)
```python
tic_values = []
rt_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
mz, intensity = spectrum.get_peaks()
tic = np.sum(intensity)
tic_values.append(tic)
rt_values.append(spectrum.getRT())
```
### Find Spectrum Closest to RT
```python
target_rt = 125.5
closest_spectrum = None
min_diff = float('inf')
for spectrum in exp:
diff = abs(spectrum.getRT() - target_rt)
if diff < min_diff:
min_diff = diff
closest_spectrum = spectrum
```