Files
claude-scientific-skills/scientific-packages/pyopenms/references/data_structures.md
Timothy Kassis c7296b2661 Add PyOpenms
2025-10-20 17:26:45 -07:00

12 KiB

pyOpenMS Data Structures Reference

This document provides comprehensive coverage of core data structures in pyOpenMS for representing mass spectrometry data.

Core Hierarchy

MSExperiment                    # Top-level: Complete LC-MS/MS run
├── MSSpectrum[]               # Collection of mass spectra
│   ├── Peak1D[]              # Individual m/z, intensity pairs
│   └── SpectrumSettings      # Metadata (RT, MS level, precursor)
└── MSChromatogram[]          # Collection of chromatograms
    ├── ChromatogramPeak[]    # RT, intensity pairs
    └── ChromatogramSettings  # Metadata

MSSpectrum

Represents a single mass spectrum (1-dimensional peak data).

Creation and Basic Properties

import pyopenms as oms

# Create empty spectrum
spectrum = oms.MSSpectrum()

# Set metadata
spectrum.setRT(123.45)           # Retention time in seconds
spectrum.setMSLevel(1)           # MS level (1 for MS1, 2 for MS2, etc.)
spectrum.setNativeID("scan=1234") # Native ID from file

# Additional metadata
spectrum.setDriftTime(15.2)      # Ion mobility drift time
spectrum.setName("MyScan")       # Optional name

Peak Data Management

Setting Peaks (Method 1 - Lists):

mz_values = [100.5, 200.3, 300.7, 400.2, 500.1]
intensity_values = [1000, 5000, 3000, 2000, 1500]

spectrum.set_peaks((mz_values, intensity_values))

Setting Peaks (Method 2 - NumPy arrays):

import numpy as np

mz_array = np.array([100.5, 200.3, 300.7, 400.2, 500.1])
intensity_array = np.array([1000, 5000, 3000, 2000, 1500])

spectrum.set_peaks((mz_array, intensity_array))

Retrieving Peaks:

# Get as numpy arrays (efficient)
mz_array, intensity_array = spectrum.get_peaks()

# Check number of peaks
n_peaks = spectrum.size()

# Get individual peak (slower)
for i in range(spectrum.size()):
    peak = spectrum[i]
    mz = peak.getMZ()
    intensity = peak.getIntensity()

Precursor Information (for MS2/MSn spectra)

# Create precursor
precursor = oms.Precursor()
precursor.setMZ(456.789)           # Precursor m/z
precursor.setCharge(2)             # Precursor charge
precursor.setIntensity(50000)      # Precursor intensity
precursor.setIsolationWindowLowerOffset(1.5)  # Lower isolation window
precursor.setIsolationWindowUpperOffset(1.5)  # Upper isolation window

# Set activation method
activation = oms.Activation()
activation.setActivationEnergy(35.0)  # Collision energy
activation.setMethod(oms.Activation.ActivationMethod.CID)
precursor.setActivation(activation)

# Assign to spectrum
spectrum.setPrecursors([precursor])

# Retrieve precursor information
precursors = spectrum.getPrecursors()
if len(precursors) > 0:
    prec = precursors[0]
    print(f"Precursor m/z: {prec.getMZ()}")
    print(f"Precursor charge: {prec.getCharge()}")

Spectrum Metadata Access

# Check if spectrum is sorted by m/z
is_sorted = spectrum.isSorted()

# Sort spectrum by m/z
spectrum.sortByPosition()

# Sort by intensity
spectrum.sortByIntensity()

# Clear all peaks
spectrum.clear(False)  # False = keep metadata, True = clear everything

# Get retention time
rt = spectrum.getRT()

# Get MS level
ms_level = spectrum.getMSLevel()

Spectrum Types and Modes

# Set spectrum type
spectrum.setType(oms.SpectrumSettings.SpectrumType.CENTROID)  # or PROFILE

# Get spectrum type
spec_type = spectrum.getType()
if spec_type == oms.SpectrumSettings.SpectrumType.CENTROID:
    print("Centroid spectrum")
elif spec_type == oms.SpectrumSettings.SpectrumType.PROFILE:
    print("Profile spectrum")

Data Processing Annotations

# Add processing information
processing = oms.DataProcessing()
processing.setMetaValue("smoothing", "gaussian")
spectrum.setDataProcessing([processing])

MSExperiment

Represents a complete LC-MS/MS experiment containing multiple spectra and chromatograms.

Creation and Population

# Create empty experiment
exp = oms.MSExperiment()

# Add spectra
spectrum1 = oms.MSSpectrum()
spectrum1.setRT(100.0)
spectrum1.set_peaks(([100, 200], [1000, 2000]))

spectrum2 = oms.MSSpectrum()
spectrum2.setRT(200.0)
spectrum2.set_peaks(([100, 200], [1500, 2500]))

exp.addSpectrum(spectrum1)
exp.addSpectrum(spectrum2)

# Add chromatograms
chrom = oms.MSChromatogram()
chrom.set_peaks(([10.5, 11.0, 11.5], [1000, 5000, 3000]))
exp.addChromatogram(chrom)

Accessing Spectra and Chromatograms

# Get number of spectra and chromatograms
n_spectra = exp.getNrSpectra()
n_chroms = exp.getNrChromatograms()

# Access by index
first_spectrum = exp.getSpectrum(0)
last_spectrum = exp.getSpectrum(exp.getNrSpectra() - 1)

# Iterate over all spectra
for spectrum in exp:
    rt = spectrum.getRT()
    ms_level = spectrum.getMSLevel()
    n_peaks = spectrum.size()
    print(f"RT: {rt:.2f}s, MS{ms_level}, Peaks: {n_peaks}")

# Get all spectra as list
spectra = exp.getSpectra()

# Access chromatograms
chrom = exp.getChromatogram(0)

Filtering Operations

# Filter by MS level
exp.filterMSLevel(1)  # Keep only MS1 spectra
exp.filterMSLevel(2)  # Keep only MS2 spectra

# Filter by retention time range
exp.filterRT(100.0, 500.0)  # Keep RT between 100-500 seconds

# Filter by m/z range (all spectra)
exp.filterMZ(300.0, 1500.0)  # Keep m/z between 300-1500

# Filter by scan number
exp.filterScanNumber(100, 200)  # Keep scans 100-200

Metadata and Properties

# Set experiment metadata
exp.setMetaValue("operator", "John Doe")
exp.setMetaValue("instrument", "Q Exactive HF")

# Get metadata
operator = exp.getMetaValue("operator")

# Get RT range
rt_range = exp.getMinRT(), exp.getMaxRT()

# Get m/z range
mz_range = exp.getMinMZ(), exp.getMaxMZ()

# Clear all data
exp.clear(False)  # False = keep metadata

Sorting and Organization

# Sort spectra by retention time
exp.sortSpectra()

# Update ranges (call after modifications)
exp.updateRanges()

# Check if experiment is empty
is_empty = exp.empty()

# Reset (clear everything)
exp.reset()

MSChromatogram

Represents an extracted or reconstructed chromatogram (retention time vs. intensity).

Creation and Basic Usage

# Create chromatogram
chrom = oms.MSChromatogram()

# Set peaks (RT, intensity pairs)
rt_values = [10.0, 10.5, 11.0, 11.5, 12.0]
intensity_values = [1000, 5000, 8000, 6000, 2000]
chrom.set_peaks((rt_values, intensity_values))

# Get peaks
rt_array, int_array = chrom.get_peaks()

# Get size
n_points = chrom.size()

Chromatogram Types

# Set chromatogram type
chrom.setChromatogramType(oms.ChromatogramSettings.ChromatogramType.SELECTED_ION_CURRENT_CHROMATOGRAM)

# Other types:
# - TOTAL_ION_CURRENT_CHROMATOGRAM
# - BASEPEAK_CHROMATOGRAM
# - SELECTED_ION_CURRENT_CHROMATOGRAM
# - SELECTED_REACTION_MONITORING_CHROMATOGRAM

Metadata

# Set native ID
chrom.setNativeID("TIC")

# Set name
chrom.setName("Total Ion Current")

# Access
native_id = chrom.getNativeID()
name = chrom.getName()

Precursor and Product Information (for SRM/MRM)

# For targeted experiments
precursor = oms.Precursor()
precursor.setMZ(456.7)
chrom.setPrecursor(precursor)

product = oms.Product()
product.setMZ(789.4)
chrom.setProduct(product)

Peak1D and ChromatogramPeak

Individual peak data points.

Peak1D (for mass spectra)

# Create individual peak
peak = oms.Peak1D()
peak.setMZ(456.789)
peak.setIntensity(10000)

# Access
mz = peak.getMZ()
intensity = peak.getIntensity()

# Set position and intensity
peak.setPosition([456.789])
peak.setIntensity(10000)

ChromatogramPeak (for chromatograms)

# Create chromatogram peak
chrom_peak = oms.ChromatogramPeak()
chrom_peak.setRT(125.5)
chrom_peak.setIntensity(5000)

# Access
rt = chrom_peak.getRT()
intensity = chrom_peak.getIntensity()

FeatureMap and Feature

For quantification results.

Feature

Represents a detected LC-MS feature (peptide or metabolite signal).

# Create feature
feature = oms.Feature()

# Set properties
feature.setMZ(456.789)
feature.setRT(123.45)
feature.setIntensity(1000000)
feature.setCharge(2)
feature.setWidth(15.0)  # RT width in seconds

# Set quality score
feature.setOverallQuality(0.95)

# Access
mz = feature.getMZ()
rt = feature.getRT()
intensity = feature.getIntensity()
charge = feature.getCharge()

FeatureMap

Collection of features.

# Create feature map
feature_map = oms.FeatureMap()

# Add features
feature1 = oms.Feature()
feature1.setMZ(456.789)
feature1.setRT(123.45)
feature1.setIntensity(1000000)

feature_map.push_back(feature1)

# Get size
n_features = feature_map.size()

# Iterate
for feature in feature_map:
    print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}")

# Access by index
first_feature = feature_map[0]

# Clear
feature_map.clear()

PeptideIdentification and ProteinIdentification

For identification results.

PeptideIdentification

# Create peptide identification
pep_id = oms.PeptideIdentification()
pep_id.setRT(123.45)
pep_id.setMZ(456.789)

# Create peptide hit
hit = oms.PeptideHit()
hit.setSequence(oms.AASequence.fromString("PEPTIDE"))
hit.setCharge(2)
hit.setScore(25.5)
hit.setRank(1)

# Add to identification
pep_id.setHits([hit])
pep_id.setHigherScoreBetter(True)
pep_id.setScoreType("XCorr")

# Access
hits = pep_id.getHits()
for hit in hits:
    seq = hit.getSequence().toString()
    score = hit.getScore()
    print(f"Sequence: {seq}, Score: {score}")

ProteinIdentification

# Create protein identification
prot_id = oms.ProteinIdentification()

# Create protein hit
prot_hit = oms.ProteinHit()
prot_hit.setAccession("P12345")
prot_hit.setSequence("MKTAYIAKQRQISFVK...")
prot_hit.setScore(100.5)

# Add to identification
prot_id.setHits([prot_hit])
prot_id.setScoreType("Mascot Score")
prot_id.setHigherScoreBetter(True)

# Search parameters
search_params = oms.ProteinIdentification.SearchParameters()
search_params.db = "uniprot_human.fasta"
search_params.enzyme = "Trypsin"
prot_id.setSearchParameters(search_params)

ConsensusMap and ConsensusFeature

For linking features across multiple samples.

ConsensusFeature

# Create consensus feature
cons_feature = oms.ConsensusFeature()
cons_feature.setMZ(456.789)
cons_feature.setRT(123.45)
cons_feature.setIntensity(5000000)  # Combined intensity

# Access linked features
for handle in cons_feature.getFeatureList():
    map_index = handle.getMapIndex()
    feature_index = handle.getIndex()
    intensity = handle.getIntensity()

ConsensusMap

# Create consensus map
consensus_map = oms.ConsensusMap()

# Add consensus features
consensus_map.push_back(cons_feature)

# Iterate
for cons_feat in consensus_map:
    mz = cons_feat.getMZ()
    rt = cons_feat.getRT()
    n_features = cons_feat.size()  # Number of linked features

Best Practices

  1. Use numpy arrays for peak data when possible - much faster than individual peak access
  2. Sort spectra by position (m/z) before searching or filtering
  3. Update ranges after modifying MSExperiment: exp.updateRanges()
  4. Check MS level before processing - different algorithms for MS1 vs MS2
  5. Validate precursor info for MS2 spectra - ensure charge and m/z are set
  6. Use appropriate containers - MSExperiment for raw data, FeatureMap for quantification
  7. Clear metadata carefully - use clear(False) to preserve metadata when clearing peaks

Common Patterns

Create MS2 Spectrum with Precursor

spectrum = oms.MSSpectrum()
spectrum.setRT(205.2)
spectrum.setMSLevel(2)
spectrum.set_peaks(([100, 200, 300], [1000, 5000, 3000]))

precursor = oms.Precursor()
precursor.setMZ(450.5)
precursor.setCharge(2)
spectrum.setPrecursors([precursor])

Extract MS1 Spectra from Experiment

ms1_exp = oms.MSExperiment()
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        ms1_exp.addSpectrum(spectrum)

Calculate Total Ion Current (TIC)

tic_values = []
rt_values = []
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        mz, intensity = spectrum.get_peaks()
        tic = np.sum(intensity)
        tic_values.append(tic)
        rt_values.append(spectrum.getRT())

Find Spectrum Closest to RT

target_rt = 125.5
closest_spectrum = None
min_diff = float('inf')

for spectrum in exp:
    diff = abs(spectrum.getRT() - target_rt)
    if diff < min_diff:
        min_diff = diff
        closest_spectrum = spectrum