Add PyOpenms

2026-03-28 07:33:45 +08:00 · 2025-10-20 17:26:45 -07:00
parent df81c94a7a
commit c7296b2661
6 changed files with 2445 additions and 2 deletions
--- a/scientific-packages/pyopenms/SKILL.md
+++ b/scientific-packages/pyopenms/SKILL.md
@@ -0,0 +1,522 @@
+---
+name: pyopenms
+description: Toolkit for mass spectrometry data analysis with pyOpenMS, supporting proteomics and metabolomics workflows including LC-MS/MS data processing, peptide identification, feature detection, quantification, and chemical calculations. Use this skill when: (1) Working with mass spectrometry file formats (mzML, mzXML, FASTA, mzTab, mzIdentML, TraML, pepXML/protXML) and need to read, write, or convert between formats; (2) Processing raw LC-MS/MS data including spectral smoothing, peak picking, noise filtering, and signal processing; (3) Performing proteomics workflows such as peptide digestion simulation, theoretical fragmentation, modification analysis, and protein identification post-processing; (4) Conducting metabolomics analysis including feature detection, adduct annotation, isotope pattern matching, and small molecule identification; (5) Implementing quantitative proteomics pipelines with feature detection, alignment across samples, and statistical analysis; (6) Calculating chemical properties including molecular formulas, isotopic distributions, amino acid properties, and peptide masses; (7) Integrating with search engines (Comet, Mascot, MSGF+) and post-processing tools (Percolator, MSstats); (8) Building custom MS data analysis workflows that require low-level access to spectra, chromatograms, and peak data; (9) Performing quality control on MS data including TIC/BPC calculation, retention time analysis, and data validation; (10) When you need Python-based alternatives to vendor software for MS data processing and analysis.
+---
+
+# pyOpenMS
+
+## Overview
+
+pyOpenMS is an open-source Python library providing comprehensive tools for mass spectrometry data analysis in proteomics and metabolomics research. It offers Python bindings to the OpenMS C++ library, enabling efficient processing of LC-MS/MS data, peptide identification, feature detection, quantification, and integration with common proteomics tools like Comet, Mascot, MSGF+, Percolator, and MSstats.
+
+Use this skill when working with mass spectrometry data analysis tasks, processing proteomics or metabolomics datasets, or implementing computational workflows for biomolecular identification and quantification.
+
+## Core Capabilities
+
+### 1. File I/O and Data Import/Export
+
+Handle diverse mass spectrometry file formats efficiently:
+
+**Supported Formats:**
+- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
+- **FASTA**: Protein/peptide sequence databases
+- **mzTab**: Standardized reporting format for identification and quantification
+- **mzIdentML**: Peptide and protein identification data
+- **TraML**: Transition lists for targeted experiments
+- **pepXML/protXML**: Search engine results
+
+**Reading mzML Files:**
+```python
+import pyopenms as oms
+
+# Load MS data
+exp = oms.MSExperiment()
+oms.MzMLFile().load("input_data.mzML", exp)
+
+# Access basic information
+print(f"Number of spectra: {exp.getNrSpectra()}")
+print(f"Number of chromatograms: {exp.getNrChromatograms()}")
+```
+
+**Writing mzML Files:**
+```python
+# Save processed data
+oms.MzMLFile().store("output_data.mzML", exp)
+```
+
+**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
+
+### 2. MS Data Structures and Manipulation
+
+Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
+
+**MSSpectrum** - Individual mass spectrum:
+```python
+# Create spectrum with metadata
+spectrum = oms.MSSpectrum()
+spectrum.setRT(205.2)  # Retention time in seconds
+spectrum.setMSLevel(2)  # MS2 spectrum
+
+# Set peak data (m/z, intensity arrays)
+mz_array = [100.5, 200.3, 300.7, 400.2]
+intensity_array = [1000, 5000, 3000, 2000]
+spectrum.set_peaks((mz_array, intensity_array))
+
+# Add precursor information for MS2
+precursor = oms.Precursor()
+precursor.setMZ(450.5)
+precursor.setCharge(2)
+spectrum.setPrecursors([precursor])
+```
+
+**MSExperiment** - Complete LC-MS/MS run:
+```python
+# Create experiment and add spectra
+exp = oms.MSExperiment()
+exp.addSpectrum(spectrum)
+
+# Access spectra
+first_spectrum = exp.getSpectrum(0)
+for spec in exp:
+    print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
+```
+
+**MSChromatogram** - Extracted ion chromatogram:
+```python
+# Create chromatogram
+chrom = oms.MSChromatogram()
+chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000]))  # RT, intensity
+exp.addChromatogram(chrom)
+```
+
+**Efficient Peak Access:**
+```python
+# Get peaks as numpy arrays for fast processing
+mz_array, intensity_array = spectrum.get_peaks()
+
+# Modify and set back
+intensity_array *= 2  # Double all intensities
+spectrum.set_peaks((mz_array, intensity_array))
+```
+
+### 3. Chemistry and Peptide Handling
+
+Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
+
+**Molecular Formulas and Mass Calculations:**
+```python
+# Create empirical formula
+formula = oms.EmpiricalFormula("C6H12O6")  # Glucose
+print(f"Monoisotopic mass: {formula.getMonoWeight()}")
+print(f"Average mass: {formula.getAverageWeight()}")
+
+# Formula arithmetic
+water = oms.EmpiricalFormula("H2O")
+dehydrated = formula - water
+
+# Isotope-specific formulas
+heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
+```
+
+**Isotopic Distributions:**
+```python
+# Generate coarse isotope pattern (unit mass resolution)
+coarse_gen = oms.CoarseIsotopePatternGenerator()
+pattern = coarse_gen.run(formula)
+
+# Generate fine structure (high resolution); the argument is a
+# probability cutoff for the isotopologues kept, not a resolution in Da
+fine_gen = oms.FineIsotopePatternGenerator(0.01)
+fine_pattern = fine_gen.run(formula)
+```
+
+**Amino Acids and Residues:**
+```python
+# Access residue information
+res_db = oms.ResidueDB()
+leucine = res_db.getResidue("Leucine")
+print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
+print(f"L formula: {leucine.getFormula()}")
+print(f"L pKa: {leucine.getPka()}")
+```
+
+**Peptide Sequences:**
+```python
+# Create peptide sequence
+peptide = oms.AASequence.fromString("PEPTIDE")
+print(f"Peptide mass: {peptide.getMonoWeight()}")
+print(f"Formula: {peptide.getFormula()}")
+
+# Add modifications
+modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
+print(f"Modified mass: {modified.getMonoWeight()}")
+
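+# Theoretical fragmentation (singly charged b- and y-ions)
+ions = []
+for i in range(1, peptide.size()):
+    b_ion = peptide.getPrefix(i)
+    y_ion = peptide.getSuffix(i)
+    # Pass ion type and charge; the default returns the neutral full-peptide mass
+    ions.append(('b', i, b_ion.getMonoWeight(oms.Residue.ResidueType.BIon, 1)))
+    ions.append(('y', i, y_ion.getMonoWeight(oms.Residue.ResidueType.YIon, 1)))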
+```
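+
+As a cross-check for theoretical fragmentation, singly charged b- and y-ion m/z values can be worked out by hand: a b-ion is the sum of its residue masses plus one proton, and a y-ion additionally carries one water. The sketch below uses plain Python with standard monoisotopic residue masses; `RESIDUE_MASS` and `fragment_mz` are illustrative names, not part of pyOpenMS, and only the residues occurring in "PEPTIDE" are tabulated.
+
+```python
+# Monoisotopic residue masses (Da) for the amino acids in "PEPTIDE"
+RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
+                "I": 113.08406, "D": 115.02694}
+WATER = 18.010565
+PROTON = 1.007276
+
+def fragment_mz(sequence):
+    """Return (label, m/z) pairs for singly charged b- and y-ions of an unmodified peptide."""
+    masses = [RESIDUE_MASS[aa] for aa in sequence]
+    ions = []
+    for i in range(1, len(sequence)):
+        b_mz = sum(masses[:i]) + PROTON           # b_i: first i residues + proton
+        y_mz = sum(masses[-i:]) + WATER + PROTON  # y_i: last i residues + water + proton
+        ions.append((f"b{i}", b_mz))
+        ions.append((f"y{i}", y_mz))
+    return ions
+
+ions = dict(fragment_mz("PEPTIDE"))
+for label, mz in ions.items():
+    print(f"{label}: {mz:.4f}")
+```
+
+Complementary ions satisfy b_i + y_(n-i) = M + 2 protons, which is a quick sanity check on any fragment table.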
+
+**Protein Digestion:**
+```python
+# Enzymatic digestion
+dig = oms.ProteaseDigestion()
+dig.setEnzyme("Trypsin")
+dig.setMissedCleavages(2)
+
+protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
+peptides = []
+dig.digest(protein_seq, peptides)
+
+for pep in peptides:
+    print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
+```
+
+**Modifications:**
+```python
+# Access modification database
+mod_db = oms.ModificationsDB()
+oxidation = mod_db.getModification("Oxidation (M)")  # Name includes the target residue
+print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
+print(f"Target residue: {oxidation.getOrigin()}")
+```
+
+### 4. Signal Processing and Filtering
+
+Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
+
+**Spectral Smoothing:**
+```python
+# Gaussian smoothing
+gauss_filter = oms.GaussFilter()
+params = gauss_filter.getParameters()
+params.setValue("gaussian_width", 0.2)
+gauss_filter.setParameters(params)
+gauss_filter.filterExperiment(exp)
+
+# Savitzky-Golay filter
+sg_filter = oms.SavitzkyGolayFilter()
+sg_filter.filterExperiment(exp)
+```
+
+**Peak Filtering:**
+```python
+# Keep only N largest peaks per spectrum
+n_largest = oms.NLargest()
+params = n_largest.getParameters()
+params.setValue("n", 100)  # Keep top 100 peaks
+n_largest.setParameters(params)
+n_largest.filterPeakMap(exp)
+
+# Threshold filtering
+threshold_filter = oms.ThresholdMower()
+params = threshold_filter.getParameters()
+params.setValue("threshold", 1000.0)  # Remove peaks below 1000 intensity
+threshold_filter.setParameters(params)
+threshold_filter.filterPeakMap(exp)
+
+# Window-based filtering
+window_filter = oms.WindowMower()
+params = window_filter.getParameters()
+params.setValue("windowsize", 50.0)  # 50 m/z windows
+params.setValue("peakcount", 10)     # Keep 10 highest per window
+window_filter.setParameters(params)
+window_filter.filterPeakMap(exp)
+```
+
+**Spectrum Normalization:**
+```python
+normalizer = oms.Normalizer()
+normalizer.filterPeakMap(exp)
+```
+
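+**MS Level, RT, and m/z Filtering:**
+```python
+# Build a new experiment from the spectra that pass each filter
+filtered = oms.MSExperiment()
+for spec in exp:
+    # Keep only MS2 spectra with RT between 100 and 500 seconds
+    if spec.getMSLevel() != 2 or not (100.0 <= spec.getRT() <= 500.0):
+        continue
+    # Keep only peaks with m/z between 400 and 1500
+    mz, intensity = spec.get_peaks()
+    keep = (mz >= 400.0) & (mz <= 1500.0)
+    spec.set_peaks((mz[keep], intensity[keep]))
+    filtered.addSpectrum(spec)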
+```
+
+### 5. Feature Detection and Quantification
+
+Detect and quantify features in LC-MS data:
+
+**Peak Picking (Centroiding):**
+```python
+# Convert profile data to centroid
+picker = oms.PeakPickerHiRes()
+params = picker.getParameters()
+params.setValue("signal_to_noise", 1.0)
+picker.setParameters(params)
+
+exp_centroided = oms.MSExperiment()
+picker.pickExperiment(exp, exp_centroided)
+```
+
+**Feature Detection:**
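+```python
+# Detect features in centroided MS1 data.
+# FeatureFinderAlgorithmPicked is the class name in recent pyOpenMS releases;
+# older releases expose the same algorithm through oms.FeatureFinder with the
+# "centroided" algorithm name, so check the installed version.
+exp_centroided.updateRanges()
+feature_finder = oms.FeatureFinderAlgorithmPicked()
+ff_params = feature_finder.getParameters()
+
+features = oms.FeatureMap()
+seeds = oms.FeatureMap()  # Optional seed positions; left empty here
+feature_finder.run(exp_centroided, features, ff_params, seeds)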
+
+print(f"Found {features.size()} features")
+for feature in features:
+    print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
+          f"Intensity: {feature.getIntensity():.0f}")
+```
+
+**Feature Linking (Map Alignment):**
+```python
+# Link features across multiple samples
+feature_grouper = oms.FeatureGroupingAlgorithmQT()
+consensus_map = oms.ConsensusMap()
+
+# Provide multiple feature maps from different samples
+feature_maps = [features1, features2, features3]
+feature_grouper.group(feature_maps, consensus_map)
+```
+
+### 6. Peptide Identification Workflows
+
+Integrate with search engines and process identification results:
+
+**Database Searching:**
+```python
+# Collect typical search settings in a Param object.
+# The key names below are illustrative; each search engine adapter defines its own.
+params = oms.Param()
+params.setValue("database", "uniprot_human.fasta")
+params.setValue("precursor_mass_tolerance", 10.0)  # ppm
+params.setValue("fragment_mass_tolerance", 0.5)     # Da
+params.setValue("enzyme", "Trypsin")
+params.setValue("missed_cleavages", 2)
+
+# Variable modifications
+params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
+
+# Fixed modifications
+params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
+```
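+
+A precursor tolerance given in ppm scales with m/z, whereas a fragment tolerance in Da is a fixed width. A minimal sketch of that arithmetic, in plain Python, is given here; `ppm_window` and `ppm_error` are illustrative helper names, not pyOpenMS functions.
+
+```python
+def ppm_window(mz, tol_ppm):
+    """Return the (low, high) m/z bounds for a symmetric ppm tolerance."""
+    delta = mz * tol_ppm * 1e-6
+    return mz - delta, mz + delta
+
+def ppm_error(observed_mz, theoretical_mz):
+    """Signed mass error of an observation in parts per million."""
+    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6
+
+# 10 ppm at m/z 500 corresponds to +/- 0.005 m/z units
+low, high = ppm_window(500.0, 10.0)
+print(f"Accepted precursor range: {low:.4f} to {high:.4f}")
+print(f"Error of 500.0025 vs 500.0: {ppm_error(500.0025, 500.0):.2f} ppm")
+```
+
+The same 10 ppm tolerance at m/z 1000 is twice as wide in absolute terms, which is why ppm is the usual choice for high-resolution precursor matching.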
+
+**FDR Control:**
+```python
+# False discovery rate estimation
+fdr = oms.FalseDiscoveryRate()
+fdr_threshold = 0.01  # 1% FDR
+
+# Load identifications (they must come from a target-decoy search)
+protein_ids = []
+peptide_ids = []
+oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
+
+# Annotate peptide hits with q-values; filter at fdr_threshold afterwards
+fdr.apply(peptide_ids)
+```
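+
+To make the idea behind FDR control concrete, here is a generic target-decoy q-value calculation in plain Python. It is a sketch of the textbook estimate (decoys divided by targets at each score cutoff), not a description of what `FalseDiscoveryRate` does internally; the function name and the toy scores are made up for illustration.
+
+```python
+def target_decoy_qvalues(psms):
+    """psms: list of (score, is_decoy), higher score = better.
+
+    Returns one q-value per PSM, in the same order as the input.
+    """
+    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)
+    fdrs = []
+    targets = decoys = 0
+    for i in order:
+        if psms[i][1]:
+            decoys += 1
+        else:
+            targets += 1
+        fdrs.append(decoys / max(targets, 1))  # Estimated FDR at this cutoff
+
+    # q-value: smallest FDR at this score or any worse score
+    qvalues = [0.0] * len(psms)
+    running_min = float("inf")
+    for rank in range(len(order) - 1, -1, -1):
+        running_min = min(running_min, fdrs[rank])
+        qvalues[order[rank]] = running_min
+    return qvalues
+
+psms = [(9.1, False), (8.7, False), (7.9, True), (7.5, False), (6.2, True)]
+qvalues = target_decoy_qvalues(psms)
+accepted = [p for p, q in zip(psms, qvalues) if q <= 0.01 and not p[1]]
+print(f"{len(accepted)} target PSMs pass 1% FDR")
+```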
+
+### 7. Metabolomics Workflows
+
+Analyze small molecule data:
+
+**Adduct Detection:**
+```python
+# Common metabolite adducts
+adducts = ["[M+H]+", "[M+Na]+", "[M+K]+", "[M-H]-", "[M+Cl]-"]
+
+# Feature annotation with adducts
+for feature in features:
+    mz = feature.getMZ()
+    # Calculate neutral mass for each adduct hypothesis
+    for adduct in adducts:
+        # Annotation logic
+        pass
+```
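+
+The annotation step left as a placeholder above can be sketched in plain Python. Each adduct shifts the observed m/z by a fixed amount relative to the neutral mass M, so the neutral mass is the observed m/z minus that shift. The shifts below are standard monoisotopic values for singly charged adducts; `ADDUCT_SHIFTS` and `neutral_mass_candidates` are illustrative names, not part of pyOpenMS.
+
+```python
+# Monoisotopic m/z shift of each singly charged adduct relative to neutral mass M
+ADDUCT_SHIFTS = {
+    "[M+H]+": 1.007276,
+    "[M+Na]+": 22.989218,
+    "[M+K]+": 38.963158,
+    "[M-H]-": -1.007276,
+    "[M+Cl]-": 34.969402,
+}
+
+def neutral_mass_candidates(mz, polarity):
+    """Return {adduct: neutral mass} for adducts of the given polarity ('+' or '-')."""
+    return {
+        adduct: mz - shift
+        for adduct, shift in ADDUCT_SHIFTS.items()
+        if adduct.endswith(polarity)
+    }
+
+# Glucose (monoisotopic mass 180.063388) observed as [M+H]+ at m/z 181.070664
+candidates = neutral_mass_candidates(181.070664, "+")
+for adduct, mass in candidates.items():
+    print(f"{adduct}: neutral mass {mass:.4f}")
+```
+
+Each candidate neutral mass can then be matched against a compound database within a ppm tolerance; only one hypothesis is usually chemically plausible.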
+
+**Isotope Pattern Matching:**
+```python
+# Compare experimental to theoretical isotope patterns
+experimental_pattern = []  # Extract from feature
+theoretical = coarse_gen.run(formula)
+
+# Calculate similarity score
+# (compare_isotope_patterns is a user-defined helper, not a pyOpenMS function)
+similarity = compare_isotope_patterns(experimental_pattern, theoretical)
+```
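+
+One possible way to write such a comparison helper is cosine similarity between the two intensity vectors, which ignores overall scale. The version below is a sketch in plain Python and assumes both patterns are already reduced to intensity lists ordered by isotope index; with pyOpenMS, the theoretical intensities could be taken from the distribution, for example via `[p.getIntensity() for p in theoretical.getContainer()]`.
+
+```python
+import math
+
+def compare_isotope_patterns(experimental, theoretical):
+    """Cosine similarity between two isotope intensity lists (1.0 = same shape)."""
+    n = max(len(experimental), len(theoretical))
+    # Pad the shorter pattern with zeros so both have one entry per isotope
+    a = list(experimental) + [0.0] * (n - len(experimental))
+    b = list(theoretical) + [0.0] * (n - len(theoretical))
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm > 0 else 0.0
+
+theoretical_intensities = [0.92, 0.07, 0.01]      # Relative abundances (made-up values)
+experimental_intensities = [9200.0, 700.0, 100.0]  # Raw intensities, same shape
+score = compare_isotope_patterns(experimental_intensities, theoretical_intensities)
+print(f"Isotope pattern similarity: {score:.3f}")
+```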
+
+### 8. Quality Control and Visualization
+
+Monitor data quality and visualize results:
+
+**Basic Statistics:**
+```python
+# Calculate TIC (Total Ion Current)
+tic_values = []
+rt_values = []
+for spectrum in exp:
+    if spectrum.getMSLevel() == 1:
+        tic = sum(spectrum.get_peaks()[1])  # Sum intensities
+        tic_values.append(tic)
+        rt_values.append(spectrum.getRT())
+
+# Base peak chromatogram
+bpc_values = []
+for spectrum in exp:
+    if spectrum.getMSLevel() == 1:
+        max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
+        bpc_values.append(max_intensity)
+```
+
+**Plotting (with pyopenms.plotting or matplotlib):**
+```python
+import matplotlib.pyplot as plt
+
+# Plot TIC
+plt.figure(figsize=(10, 4))
+plt.plot(rt_values, tic_values)
+plt.xlabel('Retention Time (s)')
+plt.ylabel('Total Ion Current')
+plt.title('TIC')
+plt.show()
+
+# Plot single spectrum
+spectrum = exp.getSpectrum(0)
+mz, intensity = spectrum.get_peaks()
+plt.stem(mz, intensity, basefmt=' ')
+plt.xlabel('m/z')
+plt.ylabel('Intensity')
+plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
+plt.show()
+```
+
+## Common Workflows
+
+### Complete LC-MS/MS Processing Pipeline
+
+```python
+import pyopenms as oms
+
+# 1. Load data
+exp = oms.MSExperiment()
+oms.MzMLFile().load("raw_data.mzML", exp)
+
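+# 2. Keep only MS1 spectra for feature detection, then smooth
+ms1 = oms.MSExperiment()
+for spec in exp:
+    if spec.getMSLevel() == 1:
+        ms1.addSpectrum(spec)
+gauss = oms.GaussFilter()
+gauss.filterExperiment(ms1)
+
+# 3. Peak picking
+picker = oms.PeakPickerHiRes()
+exp_centroid = oms.MSExperiment()
+picker.pickExperiment(ms1, exp_centroid)
+exp_centroid.updateRanges()
+
+# 4. Feature detection (class name varies by pyOpenMS version; older
+#    releases use oms.FeatureFinder with the "centroided" algorithm name)
+ff = oms.FeatureFinderAlgorithmPicked()
+features = oms.FeatureMap()
+ff.run(exp_centroid, features, ff.getParameters(), oms.FeatureMap())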
+
+# 5. Export results
+oms.FeatureXMLFile().store("features.featureXML", features)
+print(f"Detected {features.size()} features")
+```
+
+### Theoretical Peptide Mass Calculation
+
+```python
+import pyopenms as oms
+
+# Calculate masses for a peptide with and without modifications
+# (1.007276 Da is the proton mass)
+peptide = oms.AASequence.fromString("PEPTIDEK")
+print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
+
+# With modification
+modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
+print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
+
+# Calculate for different charge states
+for z in [1, 2, 3]:
+    mz = (peptide.getMonoWeight() + z * 1.007276) / z
+    print(f"[M+{z}H]^{z}+: {mz:.4f}")
+```
+
+## Installation
+
+Ensure pyOpenMS is installed before using this skill:
+
+```bash
+# Via conda (recommended)
+conda install -c bioconda pyopenms
+
+# Via pip
+pip install pyopenms
+```
+
+## Integration with Other Tools
+
+pyOpenMS works alongside the following tools, typically through OpenMS adapters or shared file formats:
+
+- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
+- **Post-processing**: Percolator, MSstats, Epifany
+- **Metabolomics**: SIRIUS, CSI:FingerID
+- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
+- **Visualization**: Matplotlib, Seaborn for plotting
+
+## Resources
+
+### references/
+
+Detailed documentation on core concepts:
+
+- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
+- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
+- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
+
+Load these references when needing detailed information about specific pyOpenMS capabilities.
+
+## Best Practices
+
+1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
+2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
+3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
+4. **Memory**: For large datasets, process spectra iteratively (for example with `OnDiscMSExperiment` on an indexed mzML file) rather than loading entire experiments
+5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
+6. **Modifications**: Use standard modification names from UniMod database
+7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
+
+## Common Patterns
+
+**Algorithm Application Pattern:**
+```python
+# 1. Instantiate algorithm
+algorithm = oms.SomeAlgorithm()
+
+# 2. Get and configure parameters
+params = algorithm.getParameters()
+params.setValue("parameter_name", value)
+algorithm.setParameters(params)
+
+# 3. Apply to data
+algorithm.filterExperiment(exp)  # or .filterPeakMap(), .pickExperiment(), .run(), depending on algorithm
+```
+
+**File I/O Pattern:**
+```python
+# Read (FileHandler picks the reader from the file type)
+exp = oms.MSExperiment()  # Other containers use their own classes, e.g. FeatureXMLFile for FeatureMap
+oms.FileHandler().loadExperiment("input.mzML", exp)
+
+# Process
+# ... manipulate exp ...
+
+# Write
+oms.FileHandler().storeExperiment("output.mzML", exp)
+```
+
+## Getting Help
+
+- **Documentation**: https://pyopenms.readthedocs.io/
+- **API Reference**: Browse class documentation for detailed method signatures
+- **OpenMS Website**: https://www.openms.org/
+- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues