Update the PyOpenMS skill

2026-03-28 07:33:45 +08:00 · 2025-11-03 16:55:01 -08:00
parent c56fa43747
commit f124e28509
9 changed files with 2699 additions and 2343 deletions
--- a/scientific-packages/pyopenms/SKILL.md
+++ b/scientific-packages/pyopenms/SKILL.md
@@ -1,530 +1,211 @@
 ---
 name: pyopenms
-description: "Mass spectrometry toolkit (OpenMS Python). Process mzML/mzXML, peak picking, feature detection, peptide ID, proteomics/metabolomics workflows, for LC-MS/MS analysis."
+description: Python interface to OpenMS for mass spectrometry data analysis. Use for LC-MS/MS proteomics and metabolomics workflows including file handling (mzML, mzXML, mzTab, FASTA, pepXML, protXML, mzIdentML), signal processing, feature detection, peptide identification, and quantitative analysis. Apply when working with mass spectrometry data, analyzing proteomics experiments, or processing metabolomics datasets.
 ---

-# pyOpenMS
+# PyOpenMS

 ## Overview

-pyOpenMS is an open-source Python library for mass spectrometry data analysis in proteomics and metabolomics. Process LC-MS/MS data, perform peptide identification, detect and quantify features, and integrate with common proteomics tools (Comet, Mascot, MSGF+, Percolator, MSstats) using Python bindings to the OpenMS C++ library.
-
-## When to Use This Skill
-
-This skill should be used when:
- Processing mass spectrometry data (mzML, mzXML files)
- Performing peak picking and feature detection in LC-MS data
- Conducting peptide and protein identification workflows
- Quantifying metabolites or proteins
- Integrating proteomics or metabolomics tools into Python pipelines
- Working with OpenMS tools and file formats
-
-## Core Capabilities
-
-### 1. File I/O and Data Import/Export
-
-Handle diverse mass spectrometry file formats efficiently:
-
-**Supported Formats:**
- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
- **FASTA**: Protein/peptide sequence databases
- **mzTab**: Standardized reporting format for identification and quantification
- **mzIdentML**: Peptide and protein identification data
- **TraML**: Transition lists for targeted experiments
- **pepXML/protXML**: Search engine results
-
-**Reading mzML Files:**
-```python
-import pyopenms as oms
-
-# Load MS data
-exp = oms.MSExperiment()
-oms.MzMLFile().load("input_data.mzML", exp)
-
-# Access basic information
-print(f"Number of spectra: {exp.getNrSpectra()}")
-print(f"Number of chromatograms: {exp.getNrChromatograms()}")
-```
-
-**Writing mzML Files:**
-```python
-# Save processed data
-oms.MzMLFile().store("output_data.mzML", exp)
-```
-
-**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
-
-### 2. MS Data Structures and Manipulation
-
-Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
-
-**MSSpectrum** - Individual mass spectrum:
-```python
-# Create spectrum with metadata
-spectrum = oms.MSSpectrum()
-spectrum.setRT(205.2)  # Retention time in seconds
-spectrum.setMSLevel(2)  # MS2 spectrum
-
-# Set peak data (m/z, intensity arrays)
-mz_array = [100.5, 200.3, 300.7, 400.2]
-intensity_array = [1000, 5000, 3000, 2000]
-spectrum.set_peaks((mz_array, intensity_array))
-
-# Add precursor information for MS2
-precursor = oms.Precursor()
-precursor.setMZ(450.5)
-precursor.setCharge(2)
-spectrum.setPrecursors([precursor])
-```
-
-**MSExperiment** - Complete LC-MS/MS run:
-```python
-# Create experiment and add spectra
-exp = oms.MSExperiment()
-exp.addSpectrum(spectrum)
-
-# Access spectra
-first_spectrum = exp.getSpectrum(0)
-for spec in exp:
-    print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
-```
-
-**MSChromatogram** - Extracted ion chromatogram:
-```python
-# Create chromatogram
-chrom = oms.MSChromatogram()
-chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000]))  # RT, intensity
-exp.addChromatogram(chrom)
-```
-
-**Efficient Peak Access:**
-```python
-# Get peaks as numpy arrays for fast processing
-mz_array, intensity_array = spectrum.get_peaks()
-
-# Modify and set back
-intensity_array *= 2  # Double all intensities
-spectrum.set_peaks((mz_array, intensity_array))
-```
-
-### 3. Chemistry and Peptide Handling
-
-Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
-
-**Molecular Formulas and Mass Calculations:**
-```python
-# Create empirical formula
-formula = oms.EmpiricalFormula("C6H12O6")  # Glucose
-print(f"Monoisotopic mass: {formula.getMonoWeight()}")
-print(f"Average mass: {formula.getAverageWeight()}")
-
-# Formula arithmetic
-water = oms.EmpiricalFormula("H2O")
-dehydrated = formula - water
-
-# Isotope-specific formulas
-heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
-```
-
-**Isotopic Distributions:**
-```python
-# Generate coarse isotope pattern (unit mass resolution)
-coarse_gen = oms.CoarseIsotopePatternGenerator()
-pattern = coarse_gen.run(formula)
-
-# Generate fine structure (high resolution)
-fine_gen = oms.FineIsotopePatternGenerator(0.01)  # 0.01 Da resolution
-fine_pattern = fine_gen.run(formula)
-```
-
-**Amino Acids and Residues:**
-```python
-# Access residue information
-res_db = oms.ResidueDB()
-leucine = res_db.getResidue("Leucine")
-print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
-print(f"L formula: {leucine.getFormula()}")
-print(f"L pKa: {leucine.getPka()}")
-```
-
-**Peptide Sequences:**
-```python
-# Create peptide sequence
-peptide = oms.AASequence.fromString("PEPTIDE")
-print(f"Peptide mass: {peptide.getMonoWeight()}")
-print(f"Formula: {peptide.getFormula()}")
-
-# Add modifications
-modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
-print(f"Modified mass: {modified.getMonoWeight()}")
-
-# Theoretical fragmentation
-ions = []
-for i in range(1, peptide.size()):
-    b_ion = peptide.getPrefix(i)
-    y_ion = peptide.getSuffix(i)
-    ions.append(('b', i, b_ion.getMonoWeight()))
-    ions.append(('y', i, y_ion.getMonoWeight()))
-```
-
-**Protein Digestion:**
-```python
-# Enzymatic digestion
-dig = oms.ProteaseDigestion()
-dig.setEnzyme("Trypsin")
-dig.setMissedCleavages(2)
-
-protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
-peptides = []
-dig.digest(protein_seq, peptides)
-
-for pep in peptides:
-    print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
-```
-
-**Modifications:**
-```python
-# Access modification database
-mod_db = oms.ModificationsDB()
-oxidation = mod_db.getModification("Oxidation")
-print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
-print(f"Residues: {oxidation.getResidues()}")
-```
-
-### 4. Signal Processing and Filtering
-
-Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
-
-**Spectral Smoothing:**
-```python
-# Gaussian smoothing
-gauss_filter = oms.GaussFilter()
-params = gauss_filter.getParameters()
-params.setValue("gaussian_width", 0.2)
-gauss_filter.setParameters(params)
-gauss_filter.filterExperiment(exp)
-
-# Savitzky-Golay filter
-sg_filter = oms.SavitzkyGolayFilter()
-sg_filter.filterExperiment(exp)
-```
-
-**Peak Filtering:**
-```python
-# Keep only N largest peaks per spectrum
-n_largest = oms.NLargest()
-params = n_largest.getParameters()
-params.setValue("n", 100)  # Keep top 100 peaks
-n_largest.setParameters(params)
-n_largest.filterExperiment(exp)
-
-# Threshold filtering
-threshold_filter = oms.ThresholdMower()
-params = threshold_filter.getParameters()
-params.setValue("threshold", 1000.0)  # Remove peaks below 1000 intensity
-threshold_filter.setParameters(params)
-threshold_filter.filterExperiment(exp)
-
-# Window-based filtering
-window_filter = oms.WindowMower()
-params = window_filter.getParameters()
-params.setValue("windowsize", 50.0)  # 50 m/z windows
-params.setValue("peakcount", 10)     # Keep 10 highest per window
-window_filter.setParameters(params)
-window_filter.filterExperiment(exp)
-```
-
-**Spectrum Normalization:**
-```python
-normalizer = oms.Normalizer()
-normalizer.filterExperiment(exp)
-```
-
-**MS Level Filtering:**
-```python
-# Keep only MS2 spectra
-exp.filterMSLevel(2)
-
-# Filter by retention time range
-exp.filterRT(100.0, 500.0)  # Keep RT between 100-500 seconds
-
-# Filter by m/z range
-exp.filterMZ(400.0, 1500.0)  # Keep m/z between 400-1500
-```
-
-### 5. Feature Detection and Quantification
-
-Detect and quantify features in LC-MS data:
-
-**Peak Picking (Centroiding):**
-```python
-# Convert profile data to centroid
-picker = oms.PeakPickerHiRes()
-params = picker.getParameters()
-params.setValue("signal_to_noise", 1.0)
-picker.setParameters(params)
-
-exp_centroided = oms.MSExperiment()
-picker.pickExperiment(exp, exp_centroided)
-```
-
-**Feature Detection:**
-```python
-# Detect features across LC-MS runs
-feature_finder = oms.FeatureFinderMultiplex()
-
-features = oms.FeatureMap()
-feature_finder.run(exp, features, params)
-
-print(f"Found {features.size()} features")
-for feature in features:
-    print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
-          f"Intensity: {feature.getIntensity():.0f}")
-```
-
-**Feature Linking (Map Alignment):**
-```python
-# Link features across multiple samples
-feature_grouper = oms.FeatureGroupingAlgorithmQT()
-consensus_map = oms.ConsensusMap()
-
-# Provide multiple feature maps from different samples
-feature_maps = [features1, features2, features3]
-feature_grouper.group(feature_maps, consensus_map)
-```
-
-### 6. Peptide Identification Workflows
-
-Integrate with search engines and process identification results:
-
-**Database Searching:**
-```python
-# Prepare parameters for search engine
-params = oms.Param()
-params.setValue("database", "uniprot_human.fasta")
-params.setValue("precursor_mass_tolerance", 10.0)  # ppm
-params.setValue("fragment_mass_tolerance", 0.5)     # Da
-params.setValue("enzyme", "Trypsin")
-params.setValue("missed_cleavages", 2)
-
-# Variable modifications
-params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
-
-# Fixed modifications
-params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
-```
-
-**FDR Control:**
-```python
-# False discovery rate estimation
-fdr = oms.FalseDiscoveryRate()
-fdr_threshold = 0.01  # 1% FDR
-
-# Apply to peptide identifications
-protein_ids = []
-peptide_ids = []
-oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
-
-fdr.apply(protein_ids, peptide_ids)
-```
-
-### 7. Metabolomics Workflows
-
-Analyze small molecule data:
-
-**Adduct Detection:**
-```python
-# Common metabolite adducts
-adducts = ["[M+H]+", "[M+Na]+", "[M+K]+", "[M-H]-", "[M+Cl]-"]
-
-# Feature annotation with adducts
-for feature in features:
-    mz = feature.getMZ()
-    # Calculate neutral mass for each adduct hypothesis
-    for adduct in adducts:
-        # Annotation logic
-        pass
-```
-
-**Isotope Pattern Matching:**
-```python
-# Compare experimental to theoretical isotope patterns
-experimental_pattern = []  # Extract from feature
-theoretical = coarse_gen.run(formula)
-
-# Calculate similarity score
-similarity = compare_isotope_patterns(experimental_pattern, theoretical)
-```
-
-### 8. Quality Control and Visualization
-
-Monitor data quality and visualize results:
-
-**Basic Statistics:**
-```python
-# Calculate TIC (Total Ion Current)
-tic_values = []
-rt_values = []
-for spectrum in exp:
-    if spectrum.getMSLevel() == 1:
-        tic = sum(spectrum.get_peaks()[1])  # Sum intensities
-        tic_values.append(tic)
-        rt_values.append(spectrum.getRT())
-
-# Base peak chromatogram
-bpc_values = []
-for spectrum in exp:
-    if spectrum.getMSLevel() == 1:
-        max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
-        bpc_values.append(max_intensity)
-```
-
-**Plotting (with pyopenms.plotting or matplotlib):**
-```python
-import matplotlib.pyplot as plt
-
-# Plot TIC
-plt.figure(figsize=(10, 4))
-plt.plot(rt_values, tic_values)
-plt.xlabel('Retention Time (s)')
-plt.ylabel('Total Ion Current')
-plt.title('TIC')
-plt.show()
-
-# Plot single spectrum
-spectrum = exp.getSpectrum(0)
-mz, intensity = spectrum.get_peaks()
-plt.stem(mz, intensity, basefmt=' ')
-plt.xlabel('m/z')
-plt.ylabel('Intensity')
-plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
-plt.show()
-```
-
-## Common Workflows
-
-### Complete LC-MS/MS Processing Pipeline
-
-```python
-import pyopenms as oms
-
-# 1. Load data
-exp = oms.MSExperiment()
-oms.MzMLFile().load("raw_data.mzML", exp)
-
-# 2. Filter and smooth
-exp.filterMSLevel(1)  # Keep only MS1 for feature detection
-gauss = oms.GaussFilter()
-gauss.filterExperiment(exp)
-
-# 3. Peak picking
-picker = oms.PeakPickerHiRes()
-exp_centroid = oms.MSExperiment()
-picker.pickExperiment(exp, exp_centroid)
-
-# 4. Feature detection
-ff = oms.FeatureFinderMultiplex()
-features = oms.FeatureMap()
-ff.run(exp_centroid, features, oms.Param())
-
-# 5. Export results
-oms.FeatureXMLFile().store("features.featureXML", features)
-print(f"Detected {features.size()} features")
-```
-
-### Theoretical Peptide Mass Calculation
-
-```python
-# Calculate masses for peptide with modifications
-peptide = oms.AASequence.fromString("PEPTIDEK")
-print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
-
-# With modification
-modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
-print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
-
-# Calculate for different charge states
-for z in [1, 2, 3]:
-    mz = (peptide.getMonoWeight() + z * 1.007276) / z
-    print(f"[M+{z}H]^{z}+: {mz:.4f}")
-```
+PyOpenMS provides Python bindings to the OpenMS library for computational mass spectrometry, enabling analysis of proteomics and metabolomics data. Use for handling mass spectrometry file formats, processing spectral data, detecting features, identifying peptides/proteins, and performing quantitative analysis.

 ## Installation

-Ensure pyOpenMS is installed before using this skill:
+Install using uv:

 ```bash
-# Via conda (recommended)
-conda install -c bioconda pyopenms
+uv pip install pyopenms
+```

-# Via pip
-pip install pyopenms
+Verify installation:
+
+```python
+import pyopenms
+print(pyopenms.__version__)
+```
+
+## Core Capabilities
+
+PyOpenMS organizes functionality into these domains:
+
+### 1. File I/O and Data Formats
+
+Handle mass spectrometry file formats and convert between representations.
+
+**Supported formats**: mzML, mzXML, TraML, mzTab, FASTA, pepXML, protXML, mzIdentML, featureXML, consensusXML, idXML
+
+Basic file reading:
+
+```python
+import pyopenms as ms
+
+# Read mzML file
+exp = ms.MSExperiment()
+ms.MzMLFile().load("data.mzML", exp)
+
+# Access spectra
+for spectrum in exp:
+    mz, intensity = spectrum.get_peaks()
+    print(f"Spectrum: {len(mz)} peaks")
+```
+
+**For detailed file handling**: See `references/file_io.md`
+
+### 2. Signal Processing
+
+Process raw spectral data with smoothing, filtering, centroiding, and normalization.
+
+Basic spectrum processing:
+
+```python
+# Smooth spectrum with Gaussian filter
+gaussian = ms.GaussFilter()
+params = gaussian.getParameters()
+params.setValue("gaussian_width", 0.1)
+gaussian.setParameters(params)
+gaussian.filterExperiment(exp)
+```
+
+**For algorithm details**: See `references/signal_processing.md`
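To make the effect of Gaussian smoothing concrete, here is a minimal pure-Python sketch that convolves an intensity trace with a normalized Gaussian kernel. It is an illustration of the general technique, not the `GaussFilter` implementation: the function name `gaussian_smooth` and its `sigma_points` and `radius` arguments are hypothetical and measured in sample points, whereas `gaussian_width` in pyOpenMS is configured through the parameter object shown above.

```python
import math


def gaussian_smooth(intensities, sigma_points=1.0, radius=3):
    """Smooth intensities with a normalized Gaussian kernel.

    sigma_points and radius are in sample points (an assumption of this
    sketch, not a pyOpenMS parameter).
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma_points) ** 2) for k in offsets]
    n = len(intensities)
    smoothed = []
    for i in range(n):
        total = 0.0
        norm = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:  # renormalize at the edges instead of padding
                total += w * intensities[j]
                norm += w
        smoothed.append(total / norm)
    return smoothed


# A spike and a short plateau on a zero baseline
noisy = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0]
smoothed = gaussian_smooth(noisy, sigma_points=1.0, radius=2)
print([round(v, 2) for v in smoothed])
```

A wider kernel suppresses more noise but also lowers and broadens real peaks, which is why the width is usually chosen close to the expected peak width.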
+
+### 3. Feature Detection
+
+Detect and link features across spectra and samples for quantitative analysis.
+
+```python
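+# Detect features in centroided data.
+# Note: ms.FeatureFinder is the legacy pyOpenMS 2.x API; newer releases may not provide it.
+ff = ms.FeatureFinder()
+features = ms.FeatureMap()  # output container for detected features
+params = ff.getParameters("centroided")  # defaults for the "centroided" algorithm
+ff.run("centroided", exp, features, params, ms.FeatureMap())  # last argument: empty seed map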
+```
+
+**For complete workflows**: See `references/feature_detection.md`
+
+### 4. Peptide and Protein Identification
+
+Integrate with search engines and process identification results.
+
+**Supported engines**: Comet, Mascot, MSGFPlus, XTandem, OMSSA, MyriMatch
+
+Basic identification workflow:
+
+```python
+# Load identification data
+protein_ids = []
+peptide_ids = []
+ms.IdXMLFile().load("identifications.idXML", protein_ids, peptide_ids)
+
+# Estimate FDR: annotates hits with FDR/q-value scores; it does not remove hits by itself
+fdr = ms.FalseDiscoveryRate()
+fdr.apply(peptide_ids)
+```
+
+**For detailed workflows**: See `references/identification.md`
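The idea behind FDR estimation can be sketched without pyOpenMS using a generic target-decoy calculation. This is a simplified illustration under stated assumptions (higher scores are better, each hit carries a decoy flag, FDR is estimated as decoys divided by targets); the helper `q_values` and the example hits are hypothetical, and the exact formula used by `FalseDiscoveryRate` may differ.

```python
def q_values(hits):
    """Compute q-values for (score, is_decoy) hits; higher score = better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the smallest FDR at this score threshold or any looser one
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(s, d, qv) for (s, d), qv in zip(ranked, q)]


# Hypothetical search results: (score, is_decoy)
hits = [(9.1, False), (8.7, False), (8.2, True),
        (7.9, False), (7.5, False), (6.0, True)]
scored = q_values(hits)

# Keep target hits at or below a 25% q-value cutoff (loose, for illustration)
accepted = [s for s, is_decoy, qv in scored if not is_decoy and qv <= 0.25]
print(accepted)
```

Estimating q-values and removing hits are separate steps here, mirroring the distinction between scoring identifications and filtering them afterwards.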
+
+### 5. Metabolomics Analysis
+
+Perform untargeted metabolomics preprocessing and analysis.
+
+Typical workflow:
+1. Load and process raw data
+2. Detect features
+3. Align retention times across samples
+4. Link features to consensus map
+5. Annotate with compound databases
+
+**For complete metabolomics workflows**: See `references/metabolomics.md`
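Step 5 of the workflow above (annotation) usually comes down to matching a feature's m/z against theoretical ion masses within a ppm tolerance. The sketch below shows that matching in pure Python. The compound table, the 10 ppm tolerance, and the names `annotate` and `ppm_error` are hypothetical, and only singly protonated ions ([M+H]+) are considered; it does not use any pyOpenMS annotation API.

```python
PROTON_MASS = 1.007276  # Da

# Hypothetical mini database: name -> neutral monoisotopic mass (Da)
COMPOUNDS = {
    "glucose": 180.06339,   # C6H12O6
    "caffeine": 194.08038,  # C8H10N4O2
}


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def annotate(feature_mz, compounds=COMPOUNDS, tol_ppm=10.0):
    """Return (name, ppm_error) for compounds whose [M+H]+ matches feature_mz."""
    matches = []
    for name, neutral_mass in compounds.items():
        theoretical_mz = neutral_mass + PROTON_MASS  # singly protonated ion
        err = ppm_error(feature_mz, theoretical_mz)
        if abs(err) <= tol_ppm:
            matches.append((name, err))
    # Best match (smallest absolute error) first
    return sorted(matches, key=lambda m: abs(m[1]))


# Hypothetical feature m/z values, e.g. taken from a consensus map
feature_mzs = [181.0712, 195.0877, 250.0]
annotations = {mz: annotate(mz) for mz in feature_mzs}
for mz, found in annotations.items():
    print(mz, found)
```

A real annotation step would also consider other adducts, charge states, and retention time or isotope-pattern evidence before accepting a match.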
+
+## Data Structures
+
+PyOpenMS uses these primary objects:
+
+- **MSExperiment**: Collection of spectra and chromatograms
+- **MSSpectrum**: Single mass spectrum with m/z and intensity pairs
+- **MSChromatogram**: Chromatographic trace
+- **Feature**: Detected chromatographic peak with quality metrics
+- **FeatureMap**: Collection of features
+- **PeptideIdentification**: Search results for peptides
+- **ProteinIdentification**: Search results for proteins
+
+**For detailed documentation**: See `references/data_structures.md`
+
+## Common Workflows
+
+### Quick Start: Load and Explore Data
+
+```python
+import pyopenms as ms
+
+# Load mzML file
+exp = ms.MSExperiment()
+ms.MzMLFile().load("sample.mzML", exp)
+
+# Get basic statistics
+print(f"Number of spectra: {exp.getNrSpectra()}")
+print(f"Number of chromatograms: {exp.getNrChromatograms()}")
+
+# Examine first spectrum
+spec = exp.getSpectrum(0)
+print(f"MS level: {spec.getMSLevel()}")
+print(f"Retention time: {spec.getRT()}")
+mz, intensity = spec.get_peaks()
+print(f"Peaks: {len(mz)}")
+```
+
+### Parameter Management
+
+Most algorithms use a parameter system:
+
+```python
+# Get algorithm parameters
+algo = ms.GaussFilter()
+params = algo.getParameters()
+
+# View available parameters
+for param in params.keys():
+    print(f"{param}: {params.getValue(param)}")
+
+# Modify parameters
+params.setValue("gaussian_width", 0.2)
+algo.setParameters(params)
+```
+
+### Export to Pandas
+
+Convert data to pandas DataFrames for analysis:
+
+```python
+import pyopenms as ms
+import pandas as pd
+
+# Load feature map
+fm = ms.FeatureMap()
+ms.FeatureXMLFile().load("features.featureXML", fm)
+
+# Convert to DataFrame
+df = fm.get_df()
+print(df.head())
 ```

 ## Integration with Other Tools

-pyOpenMS integrates seamlessly with:
-
- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
- **Post-processing**: Percolator, MSstats, Epiphany
- **Metabolomics**: SIRIUS, CSI:FingerID
- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
- **Visualization**: Matplotlib, Seaborn for plotting
+PyOpenMS integrates with:
+- **Pandas**: Export data to DataFrames
+- **NumPy**: Work with peak arrays
+- **Scikit-learn**: Machine learning on MS data
+- **Matplotlib/Seaborn**: Visualization
+- **R**: Via rpy2 bridge

 ## Resources

-### references/
+- **Official documentation**: https://pyopenms.readthedocs.io
+- **OpenMS website**: https://www.openms.org
+- **GitHub**: https://github.com/OpenMS/OpenMS

-Detailed documentation on core concepts:
+## References

- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
-
-Load these references when needing detailed information about specific pyOpenMS capabilities.
-
-## Best Practices
-
-1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
-2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
-3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
-4. **Memory**: For large datasets, process spectra iteratively rather than loading entire experiments
-5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
-6. **Modifications**: Use standard modification names from UniMod database
-7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
-
-## Common Patterns
-
-**Algorithm Application Pattern:**
-```python
-# 1. Instantiate algorithm
-algorithm = oms.SomeAlgorithm()
-
-# 2. Get and configure parameters
-params = algorithm.getParameters()
-params.setValue("parameter_name", value)
-algorithm.setParameters(params)
-
-# 3. Apply to data
-algorithm.filterExperiment(exp)  # or .process(), .run(), depending on algorithm
-```
-
-**File I/O Pattern:**
-```python
-# Read
-data_container = oms.DataContainer()  # MSExperiment, FeatureMap, etc.
-oms.FileHandler().load("input.format", data_container)
-
-# Process
-# ... manipulate data_container ...
-
-# Write
-oms.FileHandler().store("output.format", data_container)
-```
-
-## Getting Help
-
- **Documentation**: https://pyopenms.readthedocs.io/
- **API Reference**: Browse class documentation for detailed method signatures
- **OpenMS Website**: https://www.openms.org/
- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues
+- `references/file_io.md` - Comprehensive file format handling
+- `references/signal_processing.md` - Signal processing algorithms
+- `references/feature_detection.md` - Feature detection and linking
+- `references/identification.md` - Peptide and protein identification
+- `references/metabolomics.md` - Metabolomics-specific workflows
+- `references/data_structures.md` - Core objects and data structures