Update the PyOpenMS skill

This commit is contained in:
Timothy Kassis
2025-11-03 16:55:01 -08:00
parent c56fa43747
commit f124e28509
9 changed files with 2699 additions and 2343 deletions

View File

@@ -1,530 +1,211 @@
---
name: pyopenms
description: "Mass spectrometry toolkit (OpenMS Python). Process mzML/mzXML, peak picking, feature detection, peptide ID, proteomics/metabolomics workflows, for LC-MS/MS analysis."
description: Python interface to OpenMS for mass spectrometry data analysis. Use for LC-MS/MS proteomics and metabolomics workflows including file handling (mzML, mzXML, mzTab, FASTA, pepXML, protXML, mzIdentML), signal processing, feature detection, peptide identification, and quantitative analysis. Apply when working with mass spectrometry data, analyzing proteomics experiments, or processing metabolomics datasets.
---
# pyOpenMS
# PyOpenMS
## Overview
pyOpenMS is an open-source Python library for mass spectrometry data analysis in proteomics and metabolomics. Process LC-MS/MS data, perform peptide identification, detect and quantify features, and integrate with common proteomics tools (Comet, Mascot, MSGF+, Percolator, MSstats) using Python bindings to the OpenMS C++ library.
## When to Use This Skill
This skill should be used when:
- Processing mass spectrometry data (mzML, mzXML files)
- Performing peak picking and feature detection in LC-MS data
- Conducting peptide and protein identification workflows
- Quantifying metabolites or proteins
- Integrating proteomics or metabolomics tools into Python pipelines
- Working with OpenMS tools and file formats
## Core Capabilities
### 1. File I/O and Data Import/Export
Handle diverse mass spectrometry file formats efficiently:
**Supported Formats:**
- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
- **FASTA**: Protein/peptide sequence databases
- **mzTab**: Standardized reporting format for identification and quantification
- **mzIdentML**: Peptide and protein identification data
- **TraML**: Transition lists for targeted experiments
- **pepXML/protXML**: Search engine results
**Reading mzML Files:**
```python
import pyopenms as oms
# Load MS data
exp = oms.MSExperiment()
oms.MzMLFile().load("input_data.mzML", exp)
# Access basic information
print(f"Number of spectra: {exp.getNrSpectra()}")
print(f"Number of chromatograms: {exp.getNrChromatograms()}")
```
**Writing mzML Files:**
```python
# Save processed data
oms.MzMLFile().store("output_data.mzML", exp)
```
**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
### 2. MS Data Structures and Manipulation
Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
**MSSpectrum** - Individual mass spectrum:
```python
# Create spectrum with metadata
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2) # Retention time in seconds
spectrum.setMSLevel(2) # MS2 spectrum
# Set peak data (m/z, intensity arrays)
mz_array = [100.5, 200.3, 300.7, 400.2]
intensity_array = [1000, 5000, 3000, 2000]
spectrum.set_peaks((mz_array, intensity_array))
# Add precursor information for MS2
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])
```
**MSExperiment** - Complete LC-MS/MS run:
```python
# Create experiment and add spectra
exp = oms.MSExperiment()
exp.addSpectrum(spectrum)
# Access spectra
first_spectrum = exp.getSpectrum(0)
for spec in exp:
print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
```
**MSChromatogram** - Extracted ion chromatogram:
```python
# Create chromatogram
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000])) # RT, intensity
exp.addChromatogram(chrom)
```
**Efficient Peak Access:**
```python
# Get peaks as numpy arrays for fast processing
mz_array, intensity_array = spectrum.get_peaks()
# Modify and set back
intensity_array *= 2 # Double all intensities
spectrum.set_peaks((mz_array, intensity_array))
```
### 3. Chemistry and Peptide Handling
Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
**Molecular Formulas and Mass Calculations:**
```python
# Create empirical formula
formula = oms.EmpiricalFormula("C6H12O6") # Glucose
print(f"Monoisotopic mass: {formula.getMonoWeight()}")
print(f"Average mass: {formula.getAverageWeight()}")
# Formula arithmetic
water = oms.EmpiricalFormula("H2O")
dehydrated = formula - water
# Isotope-specific formulas
heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
```
**Isotopic Distributions:**
```python
# Generate coarse isotope pattern (unit mass resolution)
coarse_gen = oms.CoarseIsotopePatternGenerator()
pattern = coarse_gen.run(formula)
# Generate fine structure (high resolution)
fine_gen = oms.FineIsotopePatternGenerator(0.01) # 0.01 Da resolution
fine_pattern = fine_gen.run(formula)
```
**Amino Acids and Residues:**
```python
# Access residue information
res_db = oms.ResidueDB()
leucine = res_db.getResidue("Leucine")
print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
print(f"L formula: {leucine.getFormula()}")
print(f"L pKa: {leucine.getPka()}")
```
**Peptide Sequences:**
```python
# Create peptide sequence
peptide = oms.AASequence.fromString("PEPTIDE")
print(f"Peptide mass: {peptide.getMonoWeight()}")
print(f"Formula: {peptide.getFormula()}")
# Add modifications
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
print(f"Modified mass: {modified.getMonoWeight()}")
# Theoretical fragmentation
ions = []
for i in range(1, peptide.size()):
b_ion = peptide.getPrefix(i)
y_ion = peptide.getSuffix(i)
ions.append(('b', i, b_ion.getMonoWeight()))
ions.append(('y', i, y_ion.getMonoWeight()))
```
**Protein Digestion:**
```python
# Enzymatic digestion
dig = oms.ProteaseDigestion()
dig.setEnzyme("Trypsin")
dig.setMissedCleavages(2)
protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein_seq, peptides)
for pep in peptides:
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
```
**Modifications:**
```python
# Access modification database
mod_db = oms.ModificationsDB()
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
print(f"Residues: {oxidation.getResidues()}")
```
### 4. Signal Processing and Filtering
Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
**Spectral Smoothing:**
```python
# Gaussian smoothing
gauss_filter = oms.GaussFilter()
params = gauss_filter.getParameters()
params.setValue("gaussian_width", 0.2)
gauss_filter.setParameters(params)
gauss_filter.filterExperiment(exp)
# Savitzky-Golay filter
sg_filter = oms.SavitzkyGolayFilter()
sg_filter.filterExperiment(exp)
```
**Peak Filtering:**
```python
# Keep only N largest peaks per spectrum
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100) # Keep top 100 peaks
n_largest.setParameters(params)
n_largest.filterExperiment(exp)
# Threshold filtering
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0) # Remove peaks below 1000 intensity
threshold_filter.setParameters(params)
threshold_filter.filterExperiment(exp)
# Window-based filtering
window_filter = oms.WindowMower()
params = window_filter.getParameters()
params.setValue("windowsize", 50.0) # 50 m/z windows
params.setValue("peakcount", 10) # Keep 10 highest per window
window_filter.setParameters(params)
window_filter.filterExperiment(exp)
```
**Spectrum Normalization:**
```python
normalizer = oms.Normalizer()
normalizer.filterExperiment(exp)
```
**MS Level Filtering:**
```python
# Keep only MS2 spectra
exp.filterMSLevel(2)
# Filter by retention time range
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
# Filter by m/z range
exp.filterMZ(400.0, 1500.0) # Keep m/z between 400-1500
```
### 5. Feature Detection and Quantification
Detect and quantify features in LC-MS data:
**Peak Picking (Centroiding):**
```python
# Convert profile data to centroid
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0)
picker.setParameters(params)
exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)
```
**Feature Detection:**
```python
# Detect features across LC-MS runs
feature_finder = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
feature_finder.run(exp, features, params)
print(f"Found {features.size()} features")
for feature in features:
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
f"Intensity: {feature.getIntensity():.0f}")
```
**Feature Linking (Map Alignment):**
```python
# Link features across multiple samples
feature_grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()
# Provide multiple feature maps from different samples
feature_maps = [features1, features2, features3]
feature_grouper.group(feature_maps, consensus_map)
```
### 6. Peptide Identification Workflows
Integrate with search engines and process identification results:
**Database Searching:**
```python
# Prepare parameters for search engine
params = oms.Param()
params.setValue("database", "uniprot_human.fasta")
params.setValue("precursor_mass_tolerance", 10.0) # ppm
params.setValue("fragment_mass_tolerance", 0.5) # Da
params.setValue("enzyme", "Trypsin")
params.setValue("missed_cleavages", 2)
# Variable modifications
params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
# Fixed modifications
params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
```
**FDR Control:**
```python
# False discovery rate estimation
fdr = oms.FalseDiscoveryRate()
fdr_threshold = 0.01 # 1% FDR
# Apply to peptide identifications
protein_ids = []
peptide_ids = []
oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
fdr.apply(protein_ids, peptide_ids)
```
### 7. Metabolomics Workflows
Analyze small molecule data:
**Adduct Detection:**
```python
# Common metabolite adducts
adducts = ["[M+H]+", "[M+Na]+", "[M+K]+", "[M-H]-", "[M+Cl]-"]
# Feature annotation with adducts
for feature in features:
mz = feature.getMZ()
# Calculate neutral mass for each adduct hypothesis
for adduct in adducts:
# Annotation logic
pass
```
**Isotope Pattern Matching:**
```python
# Compare experimental to theoretical isotope patterns
experimental_pattern = [] # Extract from feature
theoretical = coarse_gen.run(formula)
# Calculate similarity score
similarity = compare_isotope_patterns(experimental_pattern, theoretical)
```
### 8. Quality Control and Visualization
Monitor data quality and visualize results:
**Basic Statistics:**
```python
# Calculate TIC (Total Ion Current)
tic_values = []
rt_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
tic = sum(spectrum.get_peaks()[1]) # Sum intensities
tic_values.append(tic)
rt_values.append(spectrum.getRT())
# Base peak chromatogram
bpc_values = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
bpc_values.append(max_intensity)
```
**Plotting (with pyopenms.plotting or matplotlib):**
```python
import matplotlib.pyplot as plt
# Plot TIC
plt.figure(figsize=(10, 4))
plt.plot(rt_values, tic_values)
plt.xlabel('Retention Time (s)')
plt.ylabel('Total Ion Current')
plt.title('TIC')
plt.show()
# Plot single spectrum
spectrum = exp.getSpectrum(0)
mz, intensity = spectrum.get_peaks()
plt.stem(mz, intensity, basefmt=' ')
plt.xlabel('m/z')
plt.ylabel('Intensity')
plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
plt.show()
```
## Common Workflows
### Complete LC-MS/MS Processing Pipeline
```python
import pyopenms as oms
# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw_data.mzML", exp)
# 2. Filter and smooth
exp.filterMSLevel(1) # Keep only MS1 for feature detection
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)
# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)
# 4. Feature detection
ff = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
ff.run(exp_centroid, features, oms.Param())
# 5. Export results
oms.FeatureXMLFile().store("features.featureXML", features)
print(f"Detected {features.size()} features")
```
### Theoretical Peptide Mass Calculation
```python
# Calculate masses for peptide with modifications
peptide = oms.AASequence.fromString("PEPTIDEK")
print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
# With modification
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
# Calculate for different charge states
for z in [1, 2, 3]:
mz = (peptide.getMonoWeight() + z * 1.007276) / z
print(f"[M+{z}H]^{z}+: {mz:.4f}")
```
PyOpenMS provides Python bindings to the OpenMS library for computational mass spectrometry, enabling analysis of proteomics and metabolomics data. Use for handling mass spectrometry file formats, processing spectral data, detecting features, identifying peptides/proteins, and performing quantitative analysis.
## Installation
Ensure pyOpenMS is installed before using this skill:
Install using uv:
```bash
# Via conda (recommended)
conda install -c bioconda pyopenms
uv pip install pyopenms
```
# Via pip
pip install pyopenms
Verify installation:
```python
import pyopenms
print(pyopenms.__version__)
```
## Core Capabilities
PyOpenMS organizes functionality into these domains:
### 1. File I/O and Data Formats
Handle mass spectrometry file formats and convert between representations.
**Supported formats**: mzML, mzXML, TraML, mzTab, FASTA, pepXML, protXML, mzIdentML, featureXML, consensusXML, idXML
Basic file reading:
```python
import pyopenms as ms
# Read mzML file
exp = ms.MSExperiment()
ms.MzMLFile().load("data.mzML", exp)
# Access spectra
for spectrum in exp:
mz, intensity = spectrum.get_peaks()
print(f"Spectrum: {len(mz)} peaks")
```
**For detailed file handling**: See `references/file_io.md`
### 2. Signal Processing
Process raw spectral data with smoothing, filtering, centroiding, and normalization.
Basic spectrum processing:
```python
# Smooth spectrum with Gaussian filter
gaussian = ms.GaussFilter()
params = gaussian.getParameters()
params.setValue("gaussian_width", 0.1)
gaussian.setParameters(params)
gaussian.filterExperiment(exp)
```
**For algorithm details**: See `references/signal_processing.md`
### 3. Feature Detection
Detect and link features across spectra and samples for quantitative analysis.
```python
# Detect features
ff = ms.FeatureFinder()
ff.run("centroided", exp, features, params, ms.FeatureMap())
```
**For complete workflows**: See `references/feature_detection.md`
### 4. Peptide and Protein Identification
Integrate with search engines and process identification results.
**Supported engines**: Comet, Mascot, MSGFPlus, XTandem, OMSSA, Myrimatch
Basic identification workflow:
```python
# Load identification data
protein_ids = []
peptide_ids = []
ms.IdXMLFile().load("identifications.idXML", protein_ids, peptide_ids)
# Apply FDR filtering
fdr = ms.FalseDiscoveryRate()
fdr.apply(peptide_ids)
```
**For detailed workflows**: See `references/identification.md`
### 5. Metabolomics Analysis
Perform untargeted metabolomics preprocessing and analysis.
Typical workflow:
1. Load and process raw data
2. Detect features
3. Align retention times across samples
4. Link features to consensus map
5. Annotate with compound databases
**For complete metabolomics workflows**: See `references/metabolomics.md`
## Data Structures
PyOpenMS uses these primary objects:
- **MSExperiment**: Collection of spectra and chromatograms
- **MSSpectrum**: Single mass spectrum with m/z and intensity pairs
- **MSChromatogram**: Chromatographic trace
- **Feature**: Detected chromatographic peak with quality metrics
- **FeatureMap**: Collection of features
- **PeptideIdentification**: Search results for peptides
- **ProteinIdentification**: Search results for proteins
**For detailed documentation**: See `references/data_structures.md`
## Common Workflows
### Quick Start: Load and Explore Data
```python
import pyopenms as ms
# Load mzML file
exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)
# Get basic statistics
print(f"Number of spectra: {exp.getNrSpectra()}")
print(f"Number of chromatograms: {exp.getNrChromatograms()}")
# Examine first spectrum
spec = exp.getSpectrum(0)
print(f"MS level: {spec.getMSLevel()}")
print(f"Retention time: {spec.getRT()}")
mz, intensity = spec.get_peaks()
print(f"Peaks: {len(mz)}")
```
### Parameter Management
Most algorithms use a parameter system:
```python
# Get algorithm parameters
algo = ms.GaussFilter()
params = algo.getParameters()
# View available parameters
for param in params.keys():
print(f"{param}: {params.getValue(param)}")
# Modify parameters
params.setValue("gaussian_width", 0.2)
algo.setParameters(params)
```
### Export to Pandas
Convert data to pandas DataFrames for analysis:
```python
import pyopenms as ms
import pandas as pd
# Load feature map
fm = ms.FeatureMap()
ms.FeatureXMLFile().load("features.featureXML", fm)
# Convert to DataFrame
df = fm.get_df()
print(df.head())
```
## Integration with Other Tools
pyOpenMS integrates seamlessly with:
- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
- **Post-processing**: Percolator, MSstats, Epiphany
- **Metabolomics**: SIRIUS, CSI:FingerID
- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
- **Visualization**: Matplotlib, Seaborn for plotting
PyOpenMS integrates with:
- **Pandas**: Export data to DataFrames
- **NumPy**: Work with peak arrays
- **Scikit-learn**: Machine learning on MS data
- **Matplotlib/Seaborn**: Visualization
- **R**: Via rpy2 bridge
## Resources
### references/
- **Official documentation**: https://pyopenms.readthedocs.io
- **OpenMS documentation**: https://www.openms.org
- **GitHub**: https://github.com/OpenMS/OpenMS
Detailed documentation on core concepts:
## References
- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
Load these references when needing detailed information about specific pyOpenMS capabilities.
## Best Practices
1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
4. **Memory**: For large datasets, process spectra iteratively rather than loading entire experiments
5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
6. **Modifications**: Use standard modification names from UniMod database
7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
## Common Patterns
**Algorithm Application Pattern:**
```python
# 1. Instantiate algorithm
algorithm = oms.SomeAlgorithm()
# 2. Get and configure parameters
params = algorithm.getParameters()
params.setValue("parameter_name", value)
algorithm.setParameters(params)
# 3. Apply to data
algorithm.filterExperiment(exp) # or .process(), .run(), depending on algorithm
```
**File I/O Pattern:**
```python
# Read
data_container = oms.DataContainer() # MSExperiment, FeatureMap, etc.
oms.FileHandler().load("input.format", data_container)
# Process
# ... manipulate data_container ...
# Write
oms.FileHandler().store("output.format", data_container)
```
## Getting Help
- **Documentation**: https://pyopenms.readthedocs.io/
- **API Reference**: Browse class documentation for detailed method signatures
- **OpenMS Website**: https://www.openms.org/
- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues
- `references/file_io.md` - Comprehensive file format handling
- `references/signal_processing.md` - Signal processing algorithms
- `references/feature_detection.md` - Feature detection and linking
- `references/identification.md` - Peptide and protein identification
- `references/metabolomics.md` - Metabolomics-specific workflows
- `references/data_structures.md` - Core objects and data structures