mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-28 07:33:45 +08:00
Add PyOpenms
This commit is contained in:
522
scientific-packages/pyopenms/SKILL.md
Normal file
522
scientific-packages/pyopenms/SKILL.md
Normal file
@@ -0,0 +1,522 @@
|
||||
---
|
||||
name: pyopenms
|
||||
description: Toolkit for mass spectrometry data analysis with pyOpenMS, supporting proteomics and metabolomics workflows including LC-MS/MS data processing, peptide identification, feature detection, quantification, and chemical calculations. Use this skill when: (1) Working with mass spectrometry file formats (mzML, mzXML, FASTA, mzTab, mzIdentML, TraML, pepXML/protXML) and need to read, write, or convert between formats; (2) Processing raw LC-MS/MS data including spectral smoothing, peak picking, noise filtering, and signal processing; (3) Performing proteomics workflows such as peptide digestion simulation, theoretical fragmentation, modification analysis, and protein identification post-processing; (4) Conducting metabolomics analysis including feature detection, adduct annotation, isotope pattern matching, and small molecule identification; (5) Implementing quantitative proteomics pipelines with feature detection, alignment across samples, and statistical analysis; (6) Calculating chemical properties including molecular formulas, isotopic distributions, amino acid properties, and peptide masses; (7) Integrating with search engines (Comet, Mascot, MSGF+) and post-processing tools (Percolator, MSstats); (8) Building custom MS data analysis workflows that require low-level access to spectra, chromatograms, and peak data; (9) Performing quality control on MS data including TIC/BPC calculation, retention time analysis, and data validation; (10) When you need Python-based alternatives to vendor software for MS data processing and analysis.
|
||||
---
|
||||
|
||||
# pyOpenMS
|
||||
|
||||
## Overview
|
||||
|
||||
pyOpenMS is an open-source Python library providing comprehensive tools for mass spectrometry data analysis in proteomics and metabolomics research. It offers Python bindings to the OpenMS C++ library, enabling efficient processing of LC-MS/MS data, peptide identification, feature detection, quantification, and integration with common proteomics tools like Comet, Mascot, MSGF+, Percolator, and MSstats.
|
||||
|
||||
Use this skill when working with mass spectrometry data analysis tasks, processing proteomics or metabolomics datasets, or implementing computational workflows for biomolecular identification and quantification.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. File I/O and Data Import/Export
|
||||
|
||||
Handle diverse mass spectrometry file formats efficiently:
|
||||
|
||||
**Supported Formats:**
|
||||
- **mzML/mzXML**: Primary raw MS data formats (profile or centroid)
|
||||
- **FASTA**: Protein/peptide sequence databases
|
||||
- **mzTab**: Standardized reporting format for identification and quantification
|
||||
- **mzIdentML**: Peptide and protein identification data
|
||||
- **TraML**: Transition lists for targeted experiments
|
||||
- **pepXML/protXML**: Search engine results
|
||||
|
||||
**Reading mzML Files:**
|
||||
```python
|
||||
import pyopenms as oms
|
||||
|
||||
# Load MS data
|
||||
exp = oms.MSExperiment()
|
||||
oms.MzMLFile().load("input_data.mzML", exp)
|
||||
|
||||
# Access basic information
|
||||
print(f"Number of spectra: {exp.getNrSpectra()}")
|
||||
print(f"Number of chromatograms: {exp.getNrChromatograms()}")
|
||||
```
|
||||
|
||||
**Writing mzML Files:**
|
||||
```python
|
||||
# Save processed data
|
||||
oms.MzMLFile().store("output_data.mzML", exp)
|
||||
```
|
||||
|
||||
**File Encoding:** pyOpenMS automatically handles Base64 encoding, zlib compression, and Numpress compression internally.
|
||||
|
||||
### 2. MS Data Structures and Manipulation
|
||||
|
||||
Work with core mass spectrometry data structures. See `references/data_structures.md` for comprehensive details.
|
||||
|
||||
**MSSpectrum** - Individual mass spectrum:
|
||||
```python
|
||||
# Create spectrum with metadata
|
||||
spectrum = oms.MSSpectrum()
|
||||
spectrum.setRT(205.2) # Retention time in seconds
|
||||
spectrum.setMSLevel(2) # MS2 spectrum
|
||||
|
||||
# Set peak data (m/z, intensity arrays)
|
||||
mz_array = [100.5, 200.3, 300.7, 400.2]
|
||||
intensity_array = [1000, 5000, 3000, 2000]
|
||||
spectrum.set_peaks((mz_array, intensity_array))
|
||||
|
||||
# Add precursor information for MS2
|
||||
precursor = oms.Precursor()
|
||||
precursor.setMZ(450.5)
|
||||
precursor.setCharge(2)
|
||||
spectrum.setPrecursors([precursor])
|
||||
```
|
||||
|
||||
**MSExperiment** - Complete LC-MS/MS run:
|
||||
```python
|
||||
# Create experiment and add spectra
|
||||
exp = oms.MSExperiment()
|
||||
exp.addSpectrum(spectrum)
|
||||
|
||||
# Access spectra
|
||||
first_spectrum = exp.getSpectrum(0)
|
||||
for spec in exp:
|
||||
print(f"RT: {spec.getRT()}, MS Level: {spec.getMSLevel()}")
|
||||
```
|
||||
|
||||
**MSChromatogram** - Extracted ion chromatogram:
|
||||
```python
|
||||
# Create chromatogram
|
||||
chrom = oms.MSChromatogram()
|
||||
chrom.set_peaks(([10.5, 11.2, 11.8], [1000, 5000, 3000])) # RT, intensity
|
||||
exp.addChromatogram(chrom)
|
||||
```
|
||||
|
||||
**Efficient Peak Access:**
|
||||
```python
|
||||
# Get peaks as numpy arrays for fast processing
|
||||
mz_array, intensity_array = spectrum.get_peaks()
|
||||
|
||||
# Modify and set back
|
||||
intensity_array *= 2 # Double all intensities
|
||||
spectrum.set_peaks((mz_array, intensity_array))
|
||||
```
|
||||
|
||||
### 3. Chemistry and Peptide Handling
|
||||
|
||||
Perform chemical calculations for proteomics and metabolomics. See `references/chemistry.md` for detailed examples.
|
||||
|
||||
**Molecular Formulas and Mass Calculations:**
|
||||
```python
|
||||
# Create empirical formula
|
||||
formula = oms.EmpiricalFormula("C6H12O6") # Glucose
|
||||
print(f"Monoisotopic mass: {formula.getMonoWeight()}")
|
||||
print(f"Average mass: {formula.getAverageWeight()}")
|
||||
|
||||
# Formula arithmetic
|
||||
water = oms.EmpiricalFormula("H2O")
|
||||
dehydrated = formula - water
|
||||
|
||||
# Isotope-specific formulas
|
||||
heavy_carbon = oms.EmpiricalFormula("(13)C6H12O6")
|
||||
```
|
||||
|
||||
**Isotopic Distributions:**
|
||||
```python
|
||||
# Generate coarse isotope pattern (unit mass resolution)
|
||||
coarse_gen = oms.CoarseIsotopePatternGenerator()
|
||||
pattern = coarse_gen.run(formula)
|
||||
|
||||
# Generate fine structure (high resolution)
|
||||
fine_gen = oms.FineIsotopePatternGenerator(0.01) # 0.01 Da resolution
|
||||
fine_pattern = fine_gen.run(formula)
|
||||
```
|
||||
|
||||
**Amino Acids and Residues:**
|
||||
```python
|
||||
# Access residue information
|
||||
res_db = oms.ResidueDB()
|
||||
leucine = res_db.getResidue("Leucine")
|
||||
print(f"L monoisotopic mass: {leucine.getMonoWeight()}")
|
||||
print(f"L formula: {leucine.getFormula()}")
|
||||
print(f"L pKa: {leucine.getPka()}")
|
||||
```
|
||||
|
||||
**Peptide Sequences:**
|
||||
```python
|
||||
# Create peptide sequence
|
||||
peptide = oms.AASequence.fromString("PEPTIDE")
|
||||
print(f"Peptide mass: {peptide.getMonoWeight()}")
|
||||
print(f"Formula: {peptide.getFormula()}")
|
||||
|
||||
# Add modifications
|
||||
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)")
|
||||
print(f"Modified mass: {modified.getMonoWeight()}")
|
||||
|
||||
# Theoretical fragmentation
|
||||
ions = []
|
||||
for i in range(1, peptide.size()):
|
||||
b_ion = peptide.getPrefix(i)
|
||||
y_ion = peptide.getSuffix(i)
|
||||
ions.append(('b', i, b_ion.getMonoWeight()))
|
||||
ions.append(('y', i, y_ion.getMonoWeight()))
|
||||
```
|
||||
|
||||
**Protein Digestion:**
|
||||
```python
|
||||
# Enzymatic digestion
|
||||
dig = oms.ProteaseDigestion()
|
||||
dig.setEnzyme("Trypsin")
|
||||
dig.setMissedCleavages(2)
|
||||
|
||||
protein_seq = oms.AASequence.fromString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
|
||||
peptides = []
|
||||
dig.digest(protein_seq, peptides)
|
||||
|
||||
for pep in peptides:
|
||||
print(f"{pep.toString()}: {pep.getMonoWeight():.2f} Da")
|
||||
```
|
||||
|
||||
**Modifications:**
|
||||
```python
|
||||
# Access modification database
|
||||
mod_db = oms.ModificationsDB()
|
||||
oxidation = mod_db.getModification("Oxidation")
|
||||
print(f"Oxidation mass diff: {oxidation.getDiffMonoMass()}")
|
||||
print(f"Residues: {oxidation.getResidues()}")
|
||||
```
|
||||
|
||||
### 4. Signal Processing and Filtering
|
||||
|
||||
Apply algorithms to process and filter MS data. See `references/algorithms.md` for comprehensive coverage.
|
||||
|
||||
**Spectral Smoothing:**
|
||||
```python
|
||||
# Gaussian smoothing
|
||||
gauss_filter = oms.GaussFilter()
|
||||
params = gauss_filter.getParameters()
|
||||
params.setValue("gaussian_width", 0.2)
|
||||
gauss_filter.setParameters(params)
|
||||
gauss_filter.filterExperiment(exp)
|
||||
|
||||
# Savitzky-Golay filter
|
||||
sg_filter = oms.SavitzkyGolayFilter()
|
||||
sg_filter.filterExperiment(exp)
|
||||
```
|
||||
|
||||
**Peak Filtering:**
|
||||
```python
|
||||
# Keep only N largest peaks per spectrum
|
||||
n_largest = oms.NLargest()
|
||||
params = n_largest.getParameters()
|
||||
params.setValue("n", 100) # Keep top 100 peaks
|
||||
n_largest.setParameters(params)
|
||||
n_largest.filterExperiment(exp)
|
||||
|
||||
# Threshold filtering
|
||||
threshold_filter = oms.ThresholdMower()
|
||||
params = threshold_filter.getParameters()
|
||||
params.setValue("threshold", 1000.0) # Remove peaks below 1000 intensity
|
||||
threshold_filter.setParameters(params)
|
||||
threshold_filter.filterExperiment(exp)
|
||||
|
||||
# Window-based filtering
|
||||
window_filter = oms.WindowMower()
|
||||
params = window_filter.getParameters()
|
||||
params.setValue("windowsize", 50.0) # 50 m/z windows
|
||||
params.setValue("peakcount", 10) # Keep 10 highest per window
|
||||
window_filter.setParameters(params)
|
||||
window_filter.filterExperiment(exp)
|
||||
```
|
||||
|
||||
**Spectrum Normalization:**
|
||||
```python
|
||||
normalizer = oms.Normalizer()
|
||||
normalizer.filterExperiment(exp)
|
||||
```
|
||||
|
||||
**MS Level Filtering:**
|
||||
```python
|
||||
# Keep only MS2 spectra
|
||||
exp.filterMSLevel(2)
|
||||
|
||||
# Filter by retention time range
|
||||
exp.filterRT(100.0, 500.0) # Keep RT between 100-500 seconds
|
||||
|
||||
# Filter by m/z range
|
||||
exp.filterMZ(400.0, 1500.0) # Keep m/z between 400-1500
|
||||
```
|
||||
|
||||
### 5. Feature Detection and Quantification
|
||||
|
||||
Detect and quantify features in LC-MS data:
|
||||
|
||||
**Peak Picking (Centroiding):**
|
||||
```python
|
||||
# Convert profile data to centroid
|
||||
picker = oms.PeakPickerHiRes()
|
||||
params = picker.getParameters()
|
||||
params.setValue("signal_to_noise", 1.0)
|
||||
picker.setParameters(params)
|
||||
|
||||
exp_centroided = oms.MSExperiment()
|
||||
picker.pickExperiment(exp, exp_centroided)
|
||||
```
|
||||
|
||||
**Feature Detection:**
|
||||
```python
|
||||
# Detect features across LC-MS runs
|
||||
feature_finder = oms.FeatureFinderMultiplex()
|
||||
|
||||
features = oms.FeatureMap()
|
||||
feature_finder.run(exp, features, params)
|
||||
|
||||
print(f"Found {features.size()} features")
|
||||
for feature in features:
|
||||
print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.2f}, "
|
||||
f"Intensity: {feature.getIntensity():.0f}")
|
||||
```
|
||||
|
||||
**Feature Linking (Map Alignment):**
|
||||
```python
|
||||
# Link features across multiple samples
|
||||
feature_grouper = oms.FeatureGroupingAlgorithmQT()
|
||||
consensus_map = oms.ConsensusMap()
|
||||
|
||||
# Provide multiple feature maps from different samples
|
||||
feature_maps = [features1, features2, features3]
|
||||
feature_grouper.group(feature_maps, consensus_map)
|
||||
```
|
||||
|
||||
### 6. Peptide Identification Workflows
|
||||
|
||||
Integrate with search engines and process identification results:
|
||||
|
||||
**Database Searching:**
|
||||
```python
|
||||
# Prepare parameters for search engine
|
||||
params = oms.Param()
|
||||
params.setValue("database", "uniprot_human.fasta")
|
||||
params.setValue("precursor_mass_tolerance", 10.0) # ppm
|
||||
params.setValue("fragment_mass_tolerance", 0.5) # Da
|
||||
params.setValue("enzyme", "Trypsin")
|
||||
params.setValue("missed_cleavages", 2)
|
||||
|
||||
# Variable modifications
|
||||
params.setValue("variable_modifications", ["Oxidation (M)", "Phospho (STY)"])
|
||||
|
||||
# Fixed modifications
|
||||
params.setValue("fixed_modifications", ["Carbamidomethyl (C)"])
|
||||
```
|
||||
|
||||
**FDR Control:**
|
||||
```python
|
||||
# False discovery rate estimation
|
||||
fdr = oms.FalseDiscoveryRate()
|
||||
fdr_threshold = 0.01 # 1% FDR
|
||||
|
||||
# Apply to peptide identifications
|
||||
protein_ids = []
|
||||
peptide_ids = []
|
||||
oms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)
|
||||
|
||||
fdr.apply(protein_ids, peptide_ids)
|
||||
```
|
||||
|
||||
### 7. Metabolomics Workflows
|
||||
|
||||
Analyze small molecule data:
|
||||
|
||||
**Adduct Detection:**
|
||||
```python
|
||||
# Common metabolite adducts
|
||||
adducts = ["[M+H]+", "[M+Na]+", "[M+K]+", "[M-H]-", "[M+Cl]-"]
|
||||
|
||||
# Feature annotation with adducts
|
||||
for feature in features:
|
||||
mz = feature.getMZ()
|
||||
# Calculate neutral mass for each adduct hypothesis
|
||||
for adduct in adducts:
|
||||
# Annotation logic
|
||||
pass
|
||||
```
|
||||
|
||||
**Isotope Pattern Matching:**
|
||||
```python
|
||||
# Compare experimental to theoretical isotope patterns
|
||||
experimental_pattern = [] # Extract from feature
|
||||
theoretical = coarse_gen.run(formula)
|
||||
|
||||
# Calculate similarity score
|
||||
similarity = compare_isotope_patterns(experimental_pattern, theoretical)
|
||||
```
|
||||
|
||||
### 8. Quality Control and Visualization
|
||||
|
||||
Monitor data quality and visualize results:
|
||||
|
||||
**Basic Statistics:**
|
||||
```python
|
||||
# Calculate TIC (Total Ion Current)
|
||||
tic_values = []
|
||||
rt_values = []
|
||||
for spectrum in exp:
|
||||
if spectrum.getMSLevel() == 1:
|
||||
tic = sum(spectrum.get_peaks()[1]) # Sum intensities
|
||||
tic_values.append(tic)
|
||||
rt_values.append(spectrum.getRT())
|
||||
|
||||
# Base peak chromatogram
|
||||
bpc_values = []
|
||||
for spectrum in exp:
|
||||
if spectrum.getMSLevel() == 1:
|
||||
max_intensity = max(spectrum.get_peaks()[1]) if spectrum.size() > 0 else 0
|
||||
bpc_values.append(max_intensity)
|
||||
```
|
||||
|
||||
**Plotting (with pyopenms.plotting or matplotlib):**
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Plot TIC
|
||||
plt.figure(figsize=(10, 4))
|
||||
plt.plot(rt_values, tic_values)
|
||||
plt.xlabel('Retention Time (s)')
|
||||
plt.ylabel('Total Ion Current')
|
||||
plt.title('TIC')
|
||||
plt.show()
|
||||
|
||||
# Plot single spectrum
|
||||
spectrum = exp.getSpectrum(0)
|
||||
mz, intensity = spectrum.get_peaks()
|
||||
plt.stem(mz, intensity, basefmt=' ')
|
||||
plt.xlabel('m/z')
|
||||
plt.ylabel('Intensity')
|
||||
plt.title(f'Spectrum at RT {spectrum.getRT():.2f}s')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Complete LC-MS/MS Processing Pipeline
|
||||
|
||||
```python
|
||||
import pyopenms as oms
|
||||
|
||||
# 1. Load data
|
||||
exp = oms.MSExperiment()
|
||||
oms.MzMLFile().load("raw_data.mzML", exp)
|
||||
|
||||
# 2. Filter and smooth
|
||||
exp.filterMSLevel(1) # Keep only MS1 for feature detection
|
||||
gauss = oms.GaussFilter()
|
||||
gauss.filterExperiment(exp)
|
||||
|
||||
# 3. Peak picking
|
||||
picker = oms.PeakPickerHiRes()
|
||||
exp_centroid = oms.MSExperiment()
|
||||
picker.pickExperiment(exp, exp_centroid)
|
||||
|
||||
# 4. Feature detection
|
||||
ff = oms.FeatureFinderMultiplex()
|
||||
features = oms.FeatureMap()
|
||||
ff.run(exp_centroid, features, oms.Param())
|
||||
|
||||
# 5. Export results
|
||||
oms.FeatureXMLFile().store("features.featureXML", features)
|
||||
print(f"Detected {features.size()} features")
|
||||
```
|
||||
|
||||
### Theoretical Peptide Mass Calculation
|
||||
|
||||
```python
|
||||
# Calculate masses for peptide with modifications
|
||||
peptide = oms.AASequence.fromString("PEPTIDEK")
|
||||
print(f"Unmodified [M+H]+: {peptide.getMonoWeight() + 1.007276:.4f}")
|
||||
|
||||
# With modification
|
||||
modified = oms.AASequence.fromString("PEPTIDEM(Oxidation)K")
|
||||
print(f"Oxidized [M+H]+: {modified.getMonoWeight() + 1.007276:.4f}")
|
||||
|
||||
# Calculate for different charge states
|
||||
for z in [1, 2, 3]:
|
||||
mz = (peptide.getMonoWeight() + z * 1.007276) / z
|
||||
print(f"[M+{z}H]^{z}+: {mz:.4f}")
|
||||
```
|
||||
|
||||
## Installation
|
||||
|
||||
Ensure pyOpenMS is installed before using this skill:
|
||||
|
||||
```bash
|
||||
# Via conda (recommended)
|
||||
conda install -c bioconda pyopenms
|
||||
|
||||
# Via pip
|
||||
pip install pyopenms
|
||||
```
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
pyOpenMS integrates seamlessly with:
|
||||
|
||||
- **Search Engines**: Comet, Mascot, MSGF+, MSFragger, Sage, SpectraST
|
||||
- **Post-processing**: Percolator, MSstats, Epiphany
|
||||
- **Metabolomics**: SIRIUS, CSI:FingerID
|
||||
- **Data Analysis**: Pandas, NumPy, SciPy for downstream analysis
|
||||
- **Visualization**: Matplotlib, Seaborn for plotting
|
||||
|
||||
## Resources
|
||||
|
||||
### references/
|
||||
|
||||
Detailed documentation on core concepts:
|
||||
|
||||
- **data_structures.md** - Comprehensive guide to MSExperiment, MSSpectrum, MSChromatogram, and peak data handling
|
||||
- **algorithms.md** - Complete reference for signal processing, filtering, feature detection, and quantification algorithms
|
||||
- **chemistry.md** - In-depth coverage of chemistry calculations, peptide handling, modifications, and isotope distributions
|
||||
|
||||
Load these references when needing detailed information about specific pyOpenMS capabilities.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **File Format**: Always use mzML for raw MS data (standardized, well-supported)
|
||||
2. **Peak Access**: Use `get_peaks()` and `set_peaks()` with numpy arrays for efficient processing
|
||||
3. **Parameters**: Always check and configure algorithm parameters via `getParameters()` and `setParameters()`
|
||||
4. **Memory**: For large datasets, process spectra iteratively rather than loading entire experiments
|
||||
5. **Validation**: Check data integrity (MS levels, RT ordering, precursor information) after loading
|
||||
6. **Modifications**: Use standard modification names from UniMod database
|
||||
7. **Units**: RT in seconds, m/z in Thomson (Da/charge), intensity in arbitrary units
|
||||
|
||||
## Common Patterns
|
||||
|
||||
**Algorithm Application Pattern:**
|
||||
```python
|
||||
# 1. Instantiate algorithm
|
||||
algorithm = oms.SomeAlgorithm()
|
||||
|
||||
# 2. Get and configure parameters
|
||||
params = algorithm.getParameters()
|
||||
params.setValue("parameter_name", value)
|
||||
algorithm.setParameters(params)
|
||||
|
||||
# 3. Apply to data
|
||||
algorithm.filterExperiment(exp) # or .process(), .run(), depending on algorithm
|
||||
```
|
||||
|
||||
**File I/O Pattern:**
|
||||
```python
|
||||
# Read
|
||||
data_container = oms.DataContainer() # MSExperiment, FeatureMap, etc.
|
||||
oms.FileHandler().load("input.format", data_container)
|
||||
|
||||
# Process
|
||||
# ... manipulate data_container ...
|
||||
|
||||
# Write
|
||||
oms.FileHandler().store("output.format", data_container)
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
- **Documentation**: https://pyopenms.readthedocs.io/
|
||||
- **API Reference**: Browse class documentation for detailed method signatures
|
||||
- **OpenMS Website**: https://www.openms.org/
|
||||
- **GitHub Issues**: https://github.com/OpenMS/OpenMS/issues
|
||||
Reference in New Issue
Block a user