name: pyopenms
description: Toolkit for mass spectrometry data analysis with pyOpenMS, supporting proteomics and metabolomics workflows including LC-MS/MS data processing, peptide identification, feature detection, quantification, and chemical calculations. Use this skill when: (1) Working with mass spectrometry file formats (mzML, mzXML, FASTA, mzTab, mzIdentML, TraML, pepXML/protXML) and need to read, write, or convert between formats; (2) Processing raw LC-MS/MS data including spectral smoothing, peak picking, noise filtering, and signal processing; (3) Performing proteomics workflows such as peptide digestion simulation, theoretical fragmentation, modification analysis, and protein identification post-processing; (4) Conducting metabolomics analysis including feature detection, adduct annotation, isotope pattern matching, and small molecule identification; (5) Implementing quantitative proteomics pipelines with feature detection, alignment across samples, and statistical analysis; (6) Calculating chemical properties including molecular formulas, isotopic distributions, amino acid properties, and peptide masses; (7) Integrating with search engines (Comet, Mascot, MSGF+) and post-processing tools (Percolator, MSstats); (8) Building custom MS data analysis workflows that require low-level access to spectra, chromatograms, and peak data; (9) Performing quality control on MS data including TIC/BPC calculation, retention time analysis, and data validation; (10) When you need Python-based alternatives to vendor software for MS data processing and analysis.

pyOpenMS

Overview

pyOpenMS is an open-source Python library providing comprehensive tools for mass spectrometry data analysis in proteomics and metabolomics research. It offers Python bindings to the OpenMS C++ library, enabling efficient processing of LC-MS/MS data, peptide identification, feature detection, quantification, and integration with common proteomics tools like Comet, Mascot, MSGF+, Percolator, and MSstats.

Use this skill when working with mass spectrometry data analysis tasks, processing proteomics or metabolomics datasets, or implementing computational workflows for biomolecular identification and quantification.

Core Capabilities

1. File I/O and Data Import/Export

Handle diverse mass spectrometry file formats efficiently:

Supported Formats:

  • mzML/mzXML: Primary raw MS data formats (profile or centroid)
  • FASTA: Protein/peptide sequence databases
  • mzTab: Standardized reporting format for identification and quantification
  • mzIdentML: Peptide and protein identification data
  • TraML: Transition lists for targeted experiments
  • pepXML/protXML: Search engine results

Reading mzML Files:

import pyopenms as oms

# Load MS data
exp = oms.MSExperiment()
oms.MzMLFile().load("input_data.mzML", exp)

# Access basic information
print(f"Number of spectra: {exp.getNrSpectra()}")
print(f"Number of chromatograms: {exp.getNrChromatograms()}")

Writing mzML Files:

# Save processed data
oms.MzMLFile().store("output_data.mzML", exp)

File Encoding: pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
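
Because every raw-data reader fills the same MSExperiment container, converting between the formats listed above is a load/store round trip. A minimal sketch (file names are placeholders):

import pyopenms as oms

# Convert mzXML to mzML
exp = oms.MSExperiment()
oms.MzXMLFile().load("legacy_run.mzXML", exp)
oms.MzMLFile().store("legacy_run.mzML", exp)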

2. MS Data Structures and Manipulation

Work with core mass spectrometry data structures. See references/data_structures.md for comprehensive details.

MSSpectrum - Individual mass spectrum:

# Create spectrum with metadata
spectrum = oms.MSSpectrum()
spectrum.setRT(205.2)  # Retention time in seconds
spectrum.setMSLevel(2)  # MS2 spectrum

# Set peak data (m/z, intensity arrays)
mz_array = [100.5, 200.3, 300.7, 400.2]
intensity_array = [1000, 5000, 3000, 2000]
spectrum.set_peaks((mz_array, intensity_array))

# Add precursor information for MS2
precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])

MSExperiment - Complete LC-MS/MS run:

# Create experiment and add spectra
exp = oms.MSExperiment()
exp.addSpectrum(spectrum)

# Access spectra
first_spectrum = exp.getSpectrum(0)
for spec in exp:
    print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")

MSChromatogram - Extracted ion chromatogram:

# Create chromatogram
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000]))  # RT, intensity
exp.addChromatogram(chrom)

Efficient Peak Access:

# Get peaks as numpy arrays for fast processing
mz_array, intensity_array = spectrum.get_peaks()

# Modify and set back
intensity_array *= 2  # Double all intensities
spectrum.set_peaks((mz_array, intensity_array))
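
Because get_peaks() returns NumPy arrays, standard array operations such as boolean masks work directly on peak data; a small sketch (the intensity cutoff of 500 is arbitrary):

# Drop low-intensity peaks with a boolean mask
mz_array, intensity_array = spectrum.get_peaks()
keep = intensity_array > 500
spectrum.set_peaks((mz_array[keep], intensity_array[keep]))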

3. Chemistry and Peptide Handling

Perform chemical calculations for proteomics and metabolomics. See references/chemistry.md for detailed examples.

Molecular Formulas and Mass Calculations:

# Create empirical formula
formula = oms.EmpiricalFormula("C6H12O6")  # Glucose
print(f"Monoisotopic mass: {formula.getMonoWeight()}")
print(f"Average mass: {formula.getAverageWeight()}")

# Formula arithmetic
water = oms.EmpiricalFormula("H2O")
dehydrated = formula - water

# Isotope-specific formulas
heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")

Isotopic Distributions:

# Generate coarse isotope pattern (unit mass resolution)
coarse_gen = oms.CoarseIsotopePatternGenerator()
pattern = coarse_gen.run(formula)

# Generate fine isotope structure (resolved isotopologues)
fine_gen = oms.FineIsotopePatternGenerator(0.01)  # 0.01 = probability cutoff, not a mass resolution
fine_pattern = fine_gen.run(formula)

Amino Acids and Residues:

# Access residue information
res_db = oms.ResidueDB()
leucine = res_db.getResidue("Leucine")
print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
print(f"L formula: {leucine.getFormula()}")
print(f"L pKa: {leucine.getPka()}")

Peptide Sequences:

# Create peptide sequence
peptide = oms.AASequence.fromString("PEPTIDE")
print(f"Peptide mass: {peptide.getMonoWeight()}")
print(f"Formula: {peptide.getFormula()}")

# Add modifications
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
print(f"Modified mass: {modified.getMonoWeight()}")

# Theoretical fragmentation: singly charged b/y ion masses
ions = []
for i in range(1, peptide.size()):
    b_ion = peptide.getPrefix(i)
    y_ion = peptide.getSuffix(i)
    ions.append(('b', i, b_ion.getMonoWeight(oms.Residue.ResidueType.BIon, 1)))
    ions.append(('y', i, y_ion.getMonoWeight(oms.Residue.ResidueType.YIon, 1)))
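
For complete theoretical MS2 spectra, pyOpenMS also provides TheoreticalSpectrumGenerator; a minimal sketch generating singly charged b/y ions for the peptide above:

tsg = oms.TheoreticalSpectrumGenerator()
theo_spectrum = oms.MSSpectrum()

params = tsg.getParameters()
params.setValue("add_b_ions", "true")
params.setValue("add_y_ions", "true")
params.setValue("add_metainfo", "true")  # annotate peaks with ion names
tsg.setParameters(params)

# Generate fragment ions for charge 1 only
tsg.getSpectrum(theo_spectrum, peptide, 1, 1)
for ion_name, peak in zip(theo_spectrum.getStringDataArrays()[0], theo_spectrum):
    print(f"{ion_name.decode()}: {peak.getMZ():.4f}")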

Protein Digestion:

# Enzymatic digestion
dig = oms.ProteaseDigestion()
dig.setEnzyme("Trypsin")
dig.setMissedCleavages(2)

protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
peptides = []
dig.digest(protein_seq, peptides)

for pep in peptides:
    print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")

Modifications:

# Access modification database
mod_db = oms.ModificationsDB()
oxidation = mod_db.getModification("Oxidation")
print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
print(f"Target residue: {oxidation.getOrigin()}")

4. Signal Processing and Filtering

Apply algorithms to process and filter MS data. See references/algorithms.md for comprehensive coverage.

Spectral Smoothing:

# Gaussian smoothing
gauss_filter = oms.GaussFilter()
params = gauss_filter.getParameters()
params.setValue("gaussian_width", 0.2)
gauss_filter.setParameters(params)
gauss_filter.filterExperiment(exp)

# Savitzky-Golay filter
sg_filter = oms.SavitzkyGolayFilter()
sg_filter.filterExperiment(exp)

Peak Filtering:

# Keep only N largest peaks per spectrum
n_largest = oms.NLargest()
params = n_largest.getParameters()
params.setValue("n", 100)  # Keep top 100 peaks
n_largest.setParameters(params)
n_largest.filterPeakMap(exp)

# Threshold filtering
threshold_filter = oms.ThresholdMower()
params = threshold_filter.getParameters()
params.setValue("threshold", 1000.0)  # Remove peaks below 1000 intensity
threshold_filter.setParameters(params)
threshold_filter.filterPeakMap(exp)

# Window-based filtering
window_filter = oms.WindowMower()
params = window_filter.getParameters()
params.setValue("windowsize", 50.0)  # 50 m/z windows
params.setValue("peakcount", 10)     # Keep 10 highest per window
window_filter.setParameters(params)
window_filter.filterPeakMap(exp)

Spectrum Normalization:

normalizer = oms.Normalizer()
normalizer.filterPeakMap(exp)

Filtering by MS Level, RT, and m/z:

# Keep only MS2 spectra
exp.setSpectra([s for s in exp if s.getMSLevel() == 2])

# Keep spectra with retention times between 100-500 seconds
exp.setSpectra([s for s in exp if 100.0 <= s.getRT() <= 500.0])

# Restrict every spectrum to m/z 400-1500 using NumPy masks
filtered = []
for s in exp:
    mz, intensity = s.get_peaks()
    keep = (mz >= 400.0) & (mz <= 1500.0)
    s.set_peaks((mz[keep], intensity[keep]))
    filtered.append(s)
exp.setSpectra(filtered)

5. Feature Detection and Quantification

Detect and quantify features in LC-MS data:

Peak Picking (Centroiding):

# Convert profile data to centroid
picker = oms.PeakPickerHiRes()
params = picker.getParameters()
params.setValue("signal_to_noise", 1.0)
picker.setParameters(params)

exp_centroided = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroided)

Feature Detection:

# Detect features across LC-MS runs
feature_finder = oms.FeatureFinderMultiplex()
params = feature_finder.getParameters()  # algorithm defaults; adjust as needed

features = oms.FeatureMap()
feature_finder.run(exp, features, params)

print(f"Found {features.size()} features")
for feature in features:
    print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
          f"Intensity: {feature.getIntensity():.0f}")

Feature Linking and Map Alignment:

# Link features across multiple samples
feature_grouper = oms.FeatureGroupingAlgorithmQT()
consensus_map = oms.ConsensusMap()

# Provide multiple feature maps from different samples
feature_maps = [features1, features2, features3]
feature_grouper.group(feature_maps, consensus_map)
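
Linking is most reliable after retention times have been aligned across runs. A minimal alignment sketch using MapAlignmentAlgorithmPoseClustering with the first map as reference (feature_maps as above):

aligner = oms.MapAlignmentAlgorithmPoseClustering()
aligner.setReference(feature_maps[0])

for fmap in feature_maps[1:]:
    trafo = oms.TransformationDescription()
    aligner.align(fmap, trafo)
    # Apply the RT transformation; True keeps the original RTs as meta data
    oms.MapAlignmentTransformer().transformRetentionTimes(fmap, trafo, True)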

6. Peptide Identification Workflows

Integrate with search engines and process identification results:

Database Searching:

# Prepare parameters for search engine
params = oms.Param()
params.setValue("database", "uniprot_human.fasta")
params.setValue("precursor_mass_tolerance", 10.0)  # ppm
params.setValue("fragment_mass_tolerance", 0.5)     # Da
params.setValue("enzyme", "Trypsin")
params.setValue("missed_cleavages", 2)

# Variable modifications
params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])

# Fixed modifications
params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])

FDR Control:

# False discovery rate estimation (requires target/decoy annotated results)
protein_ids = []
peptide_ids = []
oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)

# Convert search engine scores to q-values
fdr = oms.FalseDiscoveryRate()
fdr.apply(peptide_ids)

# Identifications can then be filtered at the desired threshold, e.g. 1% FDR
fdr_threshold = 0.01

7. Metabolomics Workflows

Analyze small molecule data:

Adduct Detection:

# Common metabolite adducts and their monoisotopic mass shifts (ion minus neutral)
adduct_shifts = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
    "[M+Cl]-": 34.969402,
}

# Feature annotation with adducts
for feature in features:
    mz = feature.getMZ()
    # Calculate the neutral mass implied by each adduct hypothesis (assumes charge 1)
    for adduct, shift in adduct_shifts.items():
        neutral_mass = mz - shift
        # ...match neutral_mass against a compound database here...
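
For systematic adduct grouping on a FeatureMap, OpenMS provides MetaboliteFeatureDeconvolution; a rough sketch, assuming the compute() call below matches your pyOpenMS version:

mfd = oms.MetaboliteFeatureDeconvolution()
# Parameters (e.g. "potential_adducts") can be adjusted via getParameters()/setParameters()

annotated_features = oms.FeatureMap()
feature_groups = oms.ConsensusMap()
feature_edges = oms.ConsensusMap()
mfd.compute(features, annotated_features, feature_groups, feature_edges)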

Isotope Pattern Matching:

# Compare experimental to theoretical isotope patterns
experimental_pattern = []  # (m/z, intensity) pairs extracted from the feature
theoretical = coarse_gen.run(formula)
theoretical_pattern = [(iso.getMZ(), iso.getIntensity()) for iso in theoretical.getContainer()]

# Calculate a similarity score (compare_isotope_patterns is a user-defined helper,
# e.g. a cosine similarity; it is not part of pyOpenMS)
similarity = compare_isotope_patterns(experimental_pattern, theoretical_pattern)

8. Quality Control and Visualization

Monitor data quality and visualize results:

Basic Statistics:

# Calculate TIC (Total Ion Current)
tic_values = []
rt_values = []
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        tic = sum(spectrum.get_peaks()[1])  # Sum intensities
        tic_values.append(tic)
        rt_values.append(spectrum.getRT())

# Base peak chromatogram
bpc_values = []
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
        bpc_values.append(max_intensity)
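
Depending on the pyOpenMS version, MSExperiment also exposes a convenience method that returns the TIC directly as a chromatogram; a sketch assuming calculateTIC() is available:

# Assumes MSExperiment.calculateTIC() is available in your pyOpenMS version
tic_chrom = exp.calculateTIC()
rt_values, tic_values = tic_chrom.get_peaks()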

Plotting (with pyopenms.plotting or matplotlib):

import matplotlib.pyplot as plt

# Plot TIC
plt.figure(figsize=(10, 4))
plt.plot(rt_values, tic_values)
plt.xlabel('Retention Time (s)')
plt.ylabel('Total Ion Current')
plt.title('TIC')
plt.show()

# Plot single spectrum
spectrum = exp.getSpectrum(0)
mz, intensity = spectrum.get_peaks()
plt.stem(mz, intensity, basefmt=' ')
plt.xlabel('m/z')
plt.ylabel('Intensity')
plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
plt.show()
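
The bundled pyopenms.plotting helpers cover the same cases with less code; a short sketch (assumes matplotlib is installed):

from pyopenms.plotting import plot_spectrum

plot_spectrum(exp.getSpectrum(0))
plt.show()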

Common Workflows

Complete LC-MS/MS Processing Pipeline

import pyopenms as oms

# 1. Load data
exp = oms.MSExperiment()
oms.MzMLFile().load("raw_data.mzML", exp)

# 2. Filter and smooth
exp.setSpectra([s for s in exp if s.getMSLevel() == 1])  # keep only MS1 for feature detection
gauss = oms.GaussFilter()
gauss.filterExperiment(exp)

# 3. Peak picking
picker = oms.PeakPickerHiRes()
exp_centroid = oms.MSExperiment()
picker.pickExperiment(exp, exp_centroid)

# 4. Feature detection
ff = oms.FeatureFinderMultiplex()
features = oms.FeatureMap()
ff.run(exp_centroid, features, ff.getParameters())  # run with the algorithm's default parameters

# 5. Export results
oms.FeatureXMLFile().store("features.featureXML", features)
print(f"Detected {features.size()} features")

Theoretical Peptide Mass Calculation

# Calculate masses for peptide with modifications
peptide = oms.AASequence.fromString("PEPTIDEK")
print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")

# With modification
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")

# Calculate for different charge states
for z in [1, 2, 3]:
    mz = (peptide.getMonoWeight() + z * 1.007276) / z
    print(f"[M+{z}H]^{z}+: {mz:.4f}")

Installation

Ensure pyOpenMS is installed before using this skill:

# Via conda (recommended)
conda install -c bioconda pyopenms

# Via pip
pip install pyopenms

Integration with Other Tools

pyOpenMS integrates seamlessly with:

  • Search Engines: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
  • Post-processing: Percolator, MSstats, Epifany
  • Metabolomics: SIRIUS, CSI:FingerID
  • Data Analysis: Pandas, NumPy, SciPy for downstream analysis
  • Visualization: Matplotlib, Seaborn for plotting

Resources

references/

Detailed documentation on core concepts:

  • data_structures.md - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
  • algorithms.md - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
  • chemistry.md - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions

Load these references when needing detailed information about specific pyOpenMS capabilities.

Best Practices

  1. File Format: Always use mzML for raw MS data (standardized, well-supported)
  2. Peak Access: Use get_peaks() and set_peaks() with numpy arrays for efficient processing
  3. Parameters: Always check and configure algorithm parameters via getParameters() and setParameters()
  4. Memory: For large datasets, process spectra iteratively rather than loading entire experiments into memory (see the on-disc access sketch after this list)
  5. Validation: Check data integrity (MS levels, RT ordering, precursor information) after loading
  6. Modifications: Use standard modification names from UniMod database
  7. Units: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
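
For the memory recommendation above, indexed mzML files can be read lazily instead of loading the whole run; a minimal sketch with OnDiscMSExperiment (the path is a placeholder and the file must be an indexed mzML):

od_exp = oms.OnDiscMSExperiment()
od_exp.openFile("large_run.mzML")

for i in range(od_exp.getNrSpectra()):
    spec = od_exp.getSpectrum(i)  # loaded on demand, one spectrum at a time
    if spec.getMSLevel() == 1:
        pass  # process the spectrum here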

Common Patterns

Algorithm Application Pattern:

# 1. Instantiate algorithm
algorithm = oms.SomeAlgorithm()

# 2. Get and configure parameters
params = algorithm.getParameters()
params.setValue("parameter_name", value)
algorithm.setParameters(params)

# 3. Apply to data
algorithm.filterExperiment(exp)  # or .process(), .run(), depending on algorithm

File I/O Pattern:

# Read: each format has a dedicated file class (MzMLFile, FeatureXMLFile, IdXMLFile, ...)
# paired with a matching container (MSExperiment, FeatureMap, ...)
exp = oms.MSExperiment()
oms.MzMLFile().load("input.mzML", exp)

# Process
# ... manipulate the loaded container ...

# Write
oms.MzMLFile().store("output.mzML", exp)

Getting Help