Mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git (synced 2026-01-26)
Commit: Add more scientific skills

scientific-packages/diffdock/SKILL.md (new file, 477 lines)

---
name: diffdock
description: This skill provides comprehensive guidance for using DiffDock, a state-of-the-art diffusion-based molecular docking tool that predicts protein-ligand binding poses. Use this skill when users request molecular docking simulations, protein-ligand binding predictions, virtual screening, structure-based drug design tasks, or need to predict how small molecules bind to protein targets. This skill applies to tasks involving PDB files, SMILES strings, protein sequences, ligand structure files, or batch docking of compound libraries.
---

# DiffDock: Molecular Docking with Diffusion Models

## Overview

DiffDock is a diffusion-based deep learning tool for molecular docking that predicts the 3D binding poses of small-molecule ligands on protein targets. It represents the state of the art in computational docking, a core step in structure-based drug discovery and chemical biology.

**Core Capabilities:**
- Predict ligand binding poses with high accuracy using deep learning
- Accept protein structures (PDB files) or sequences (folded via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)

**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine it with scoring functions (GNINA, MM/GBSA) for affinity assessment.

## When to Use DiffDock

Invoke this skill when users request:

- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files plus SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs

## Installation and Environment Setup

### Check Environment Status

Before proceeding with DiffDock tasks, verify the environment setup:

```bash
# Use the provided setup checker
python scripts/setup_check.py
```

This script validates the Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.

### Installation Options

**Option 1: Conda (Recommended)**
```bash
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
```

**Option 2: Docker**
```bash
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
```

**Important Notes:**
- GPU strongly recommended (10-100x speedup vs CPU)
- The first run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present

## Core Workflows

### Workflow 1: Single Protein-Ligand Docking

**Use Case:** Dock one ligand to one protein target

**Input Requirements:**
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)

**Command:**
```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand "CC(=O)Oc1ccccc1C(=O)O" \
  --out_dir results/single_docking/
```

**Alternative (protein sequence):**
```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
  --ligand ligand.sdf \
  --out_dir results/sequence_docking/
```

**Output Structure:**
```
results/single_docking/
├── rank_1.sdf             # Top-ranked pose
├── rank_2.sdf             # Second-ranked pose
├── ...
├── rank_10.sdf            # 10th pose (default: 10 samples)
└── confidence_scores.txt
```
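
To sanity-check an output programmatically, the top-ranked pose can be loaded with RDKit. A minimal sketch, assuming RDKit is installed and the output layout shown above:

```python
# Minimal sketch: inspect the top-ranked pose with RDKit.
from rdkit import Chem

# Path follows the output layout above.
mol = Chem.SDMolSupplier("results/single_docking/rank_1.sdf", removeHs=False)[0]
if mol is not None:
    print(f"Atoms: {mol.GetNumAtoms()}, SMILES: {Chem.MolToSmiles(mol)}")
```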

### Workflow 2: Batch Processing Multiple Complexes

**Use Case:** Dock multiple ligands to proteins, e.g., in virtual screening campaigns

**Step 1: Prepare Batch CSV**

Use the provided script to create or validate batch input:

```bash
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv

# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
```

**CSV Format:**
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
```

**Required Columns:**
- `complex_name`: Unique identifier
- `protein_path`: PDB file path (leave empty if using a sequence)
- `ligand_description`: SMILES string or ligand file path
- `protein_sequence`: Amino acid sequence (leave empty if using a PDB file)

**Step 2: Run Batch Docking**

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv batch_input.csv \
  --out_dir results/batch/ \
  --batch_size 10
```

**For Large Virtual Screening (>100 compounds):**

Pre-compute protein embeddings for faster processing:
```bash
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
  --protein_ligand_csv screening_input.csv \
  --out_file protein_embeddings.pt

# Run with pre-computed embeddings
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv screening_input.csv \
  --esm_embeddings_path protein_embeddings.pt \
  --out_dir results/screening/
```

### Workflow 3: Analyzing Results

After docking completes, analyze confidence scores and rank predictions:

```bash
# Analyze all results
python scripts/analyze_results.py results/batch/

# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5

# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0

# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv

# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
```

The analysis script:
- Parses confidence scores from all predictions
- Classifies each as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis

## Confidence Score Interpretation

**Understanding Scores:**

| Score Range | Confidence Level | Interpretation |
|-------------|------------------|----------------|
| **> 0** | High | Strong prediction, likely accurate |
| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
| **< -1.5** | Low | Uncertain prediction, requires validation |

**Critical Notes:**
1. **Confidence ≠ Affinity**: High confidence means the model is certain about the structure, NOT that binding is strong
2. **Context Matters**: Adjust expectations for:
   - Large ligands (>500 Da): Lower confidence expected
   - Multiple protein chains: May decrease confidence
   - Novel protein families: May underperform
3. **Multiple Samples**: Review the top 3-5 predictions and look for consensus (these cutoffs are shown in code below)
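
As one illustration, these cutoffs translate directly into code; this is the same classification used by `scripts/analyze_results.py` later in this skill:

```python
# Confidence thresholds used throughout this skill.
def classify_confidence(score: float) -> str:
    if score > 0:
        return "High"      # strong prediction, likely accurate
    elif score > -1.5:
        return "Moderate"  # reasonable prediction, validate carefully
    return "Low"           # uncertain prediction, requires validation

print(classify_confidence(0.3))   # High
print(classify_confidence(-2.1))  # Low
```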

**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool

## Parameter Customization

### Using Custom Configuration

Create a custom configuration for specific use cases:

```bash
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml

# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
  --config my_config.yaml \
  --protein_ligand_csv input.csv \
  --out_dir results/
```

### Key Parameters to Adjust

**Sampling Density:**
- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime

**Inference Steps:**
- `inference_steps: 20` → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower

**Temperature Parameters (control diversity):**
- `temp_sampling_tor: 7.04` → Increase to 8-10 for flexible ligands, decrease to 5-6 for rigid ligands
- Higher temperature = more diverse poses

**Presets Available in Template:**
1. High Accuracy: More samples + steps, lower temperature
2. Fast Screening: Fewer samples, faster
3. Flexible Ligands: Increased torsion temperature
4. Rigid Ligands: Decreased torsion temperature

**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool

## Advanced Techniques

### Ensemble Docking (Protein Flexibility)

For proteins with known flexibility, dock to multiple conformations:

```python
# Create ensemble CSV
import pandas as pd

conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

Run docking with increased sampling:
```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv ensemble_input.csv \
  --samples_per_complex 20 \
  --out_dir results/ensemble/
```

### Integration with Scoring Functions

DiffDock generates poses; combine it with other tools for affinity:

**GNINA (Fast neural network scoring):**
```bash
for pose in results/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only
done
```

**MM/GBSA (More accurate, slower):**
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization

**Free Energy Calculations (Most accurate):**
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations

**Recommended Workflow:**
1. DiffDock → Generate poses with confidence scores
2. Visual inspection → Check structural plausibility
3. GNINA or MM/GBSA → Rescore and rank by affinity
4. Experimental validation → Biochemical assays

## Limitations and Scope

**DiffDock IS Designed For:**
- Small molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single or multi-chain proteins

**DiffDock IS NOT Designed For:**
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained, use with caution

**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool

## Troubleshooting

### Common Issues

**Issue: Low confidence scores across all predictions**
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate the protein structure

**Issue: Out of memory errors**
- Cause: GPU memory insufficient for the batch size
- Solution: Reduce the batch size (e.g., `--batch_size 2`) or process fewer complexes at once

**Issue: Slow performance**
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"` and use a GPU

**Issue: Unrealistic binding poses**
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solution: Check the protein for missing residues, remove waters far from the site, consider specifying the binding site

**Issue: "Module not found" errors**
- Cause: Missing dependencies or wrong environment
- Solution: Run `python scripts/setup_check.py` to diagnose

### Performance Optimization

**For Best Results:**
1. Use a GPU (essential for practical use)
2. Pre-compute ESM embeddings for repeated protein use
3. Batch process multiple complexes together
4. Start with default parameters, then tune if needed
5. Validate protein structures (resolve missing residues)
6. Use canonical SMILES for ligands (see the sketch below)
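
For point 6, a canonical SMILES can be produced with RDKit. A minimal sketch, assuming RDKit is installed:

```python
# Minimal sketch: canonicalize a ligand SMILES with RDKit before docking.
from rdkit import Chem

smiles = "OC(=O)c1ccccc1OC(C)=O"  # aspirin, written non-canonically
mol = Chem.MolFromSmiles(smiles)   # returns None for invalid SMILES
if mol is None:
    raise ValueError(f"Invalid SMILES: {smiles}")
print(Chem.MolToSmiles(mol))       # canonical form: CC(=O)Oc1ccccc1C(=O)O
```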

## Graphical User Interface

For interactive use, launch the web interface:

```bash
python app/main.py
# Navigate to http://localhost:7860
```

Or use the online demo without installation:
- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Resources

### Helper Scripts (`scripts/`)

**`prepare_batch_csv.py`**: Create and validate batch input CSV files
- Create templates with example entries
- Validate file paths and SMILES strings
- Check for required columns and format issues

**`analyze_results.py`**: Analyze confidence scores and rank predictions
- Parse results from single or batch runs
- Generate statistical summaries
- Export to CSV for downstream analysis
- Identify top predictions across complexes

**`setup_check.py`**: Verify DiffDock environment setup
- Check Python version and dependencies
- Verify PyTorch and CUDA availability
- Test RDKit and PyTorch Geometric installation
- Provide installation instructions if needed

### Reference Documentation (`references/`)

**`parameters_reference.md`**: Complete parameter documentation
- All command-line options and configuration parameters
- Default values and acceptable ranges
- Temperature parameters for controlling diversity
- Model checkpoint locations and version flags

Read this file when users need:
- Detailed parameter explanations
- Fine-tuning guidance for specific systems
- Alternative sampling strategies

**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
- Detailed confidence score interpretation
- When to trust predictions
- Scope and limitations of DiffDock
- Integration with complementary tools
- Troubleshooting prediction quality

Read this file when users need:
- Help interpreting confidence scores
- Understanding when NOT to use DiffDock
- Guidance on combining with other tools
- Validation strategies

**`workflows_examples.md`**: Comprehensive workflow examples
- Detailed installation instructions
- Step-by-step examples for all workflows
- Advanced integration patterns
- Troubleshooting common issues
- Best practices and optimization tips

Read this file when users need:
- Complete workflow examples with code
- Integration with GNINA, OpenMM, or other tools
- Virtual screening workflows
- Ensemble docking procedures

### Assets (`assets/`)

**`batch_template.csv`**: Template for batch processing
- Pre-formatted CSV with required columns
- Example entries showing different input types
- Ready to customize with actual data

**`custom_inference_config.yaml`**: Configuration template
- Annotated YAML with all parameters
- Four preset configurations for common use cases
- Detailed comments explaining each parameter
- Ready to customize and use

## Best Practices

1. **Always verify the environment** with `setup_check.py` before starting large jobs
2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
3. **Start with defaults**, then tune parameters based on system-specific needs
4. **Generate multiple samples** (10-40) for robust predictions
5. **Visually inspect** top poses before downstream analysis
6. **Combine with scoring functions** for affinity assessment
7. **Use confidence scores** for initial ranking, not final decisions
8. **Pre-compute embeddings** for virtual screening campaigns
9. **Document parameters** used for reproducibility
10. **Validate results** experimentally when possible

## Citations

When using DiffDock, cite the appropriate papers:

**DiffDock-L (current default model):**
```
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
ICLR 2024, arXiv:2402.18396
```

**Original DiffDock:**
```
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776
```

## Additional Resources

- **GitHub Repository**: https://github.com/gcorso/DiffDock
- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
- **Original Paper**: https://arxiv.org/abs/2210.01776

scientific-packages/diffdock/assets/batch_template.csv (new file, 4 lines)

complex_name,protein_path,ligand_description,protein_sequence
example_1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
example_2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
example_3,protein3.pdb,ligand3.sdf,

scientific-packages/diffdock/assets/custom_inference_config.yaml (new file, 90 lines)

# DiffDock Custom Inference Configuration Template
# Copy and modify this file to customize inference parameters

# Model paths (usually don't need to change these)
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
ckpt: best_ema_inference_epoch_model.pt
confidence_ckpt: best_model_epoch75.pt

# Model version flags
old_score_model: false  # Set to true to use original DiffDock instead of DiffDock-L
old_filtering_model: true

# Inference steps
inference_steps: 20  # Increase for potentially better accuracy (e.g., 25-30)
actual_steps: 19
no_final_step_noise: true

# Sampling parameters
samples_per_complex: 10  # Increase for difficult cases (e.g., 20-40)
sigma_schedule: expbeta
initial_noise_std_proportion: 1.46

# Temperature controls - Adjust these to balance exploration vs accuracy
# Higher values = more diverse predictions, lower values = more focused predictions

# Sampling temperatures
temp_sampling_tr: 1.17   # Translation sampling temperature
temp_sampling_rot: 2.06  # Rotation sampling temperature
temp_sampling_tor: 7.04  # Torsion sampling temperature (increase for flexible ligands)

# Psi angle temperatures
temp_psi_tr: 0.73
temp_psi_rot: 0.90
temp_psi_tor: 0.59

# Sigma data temperatures
temp_sigma_data_tr: 0.93
temp_sigma_data_rot: 0.75
temp_sigma_data_tor: 0.69

# Feature flags
no_model: false
no_random: false
ode: false  # Set to true to use ODE solver instead of SDE
different_schedules: false
limit_failures: 5

# Output settings
# save_visualisation: true  # Uncomment to save SDF files

# ============================================================================
# Configuration Presets for Common Use Cases
# ============================================================================

# PRESET 1: High Accuracy (slower, more thorough)
# samples_per_complex: 30
# inference_steps: 25
# temp_sampling_tr: 1.0
# temp_sampling_rot: 1.8
# temp_sampling_tor: 6.5

# PRESET 2: Fast Screening (faster, less thorough)
# samples_per_complex: 5
# inference_steps: 15
# temp_sampling_tr: 1.3
# temp_sampling_rot: 2.2
# temp_sampling_tor: 7.5

# PRESET 3: Flexible Ligands (more conformational diversity)
# samples_per_complex: 20
# inference_steps: 20
# temp_sampling_tr: 1.2
# temp_sampling_rot: 2.1
# temp_sampling_tor: 8.5  # Increased torsion temperature

# PRESET 4: Rigid Ligands (more focused predictions)
# samples_per_complex: 10
# inference_steps: 20
# temp_sampling_tr: 1.1
# temp_sampling_rot: 2.0
# temp_sampling_tor: 6.0  # Decreased torsion temperature

# ============================================================================
# Usage Example
# ============================================================================
# python -m inference \
#   --config custom_inference_config.yaml \
#   --protein_ligand_csv input.csv \
#   --out_dir results/

scientific-packages/diffdock/references/confidence_and_limitations.md (new file, 182 lines)

# DiffDock Confidence Scores and Limitations

This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.

## Confidence Score Interpretation

DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.

### Score Ranges

| Score Range | Confidence Level | Interpretation |
|-------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |

### Important Notes on Confidence Scores

1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
   - High confidence = the model is confident about the structure
   - Does NOT indicate strong/weak binding affinity

2. **Context-Dependent**: Confidence expectations should be adjusted based on system complexity:
   - **Lower expectations** for:
     - Large ligands (>500 Da)
     - Protein complexes with many chains
     - Unbound protein conformations (may require conformational changes)
     - Novel protein families not well-represented in training data
   - **Higher expectations** for:
     - Drug-like small molecules (150-500 Da)
     - Single-chain proteins or well-defined binding sites
     - Proteins similar to those in training data (PDBBind, BindingMOAD)

3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
   - Review top-ranked predictions (by confidence)
   - Consider clustering similar poses (see the sketch after this list)
   - High-confidence consensus across multiple samples strengthens the prediction
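
To make the consensus check concrete, poses can be compared by symmetry-aware RMSD and greedily clustered. A minimal sketch with RDKit; the 2.0 Å cutoff is an illustrative choice, not a DiffDock setting:

```python
# Minimal sketch: greedy RMSD clustering of DiffDock poses (illustrative 2.0 A cutoff).
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import rdMolAlign

poses = [Chem.SDMolSupplier(str(p), removeHs=False)[0]
         for p in sorted(Path("results/single_docking/").glob("rank_*.sdf"))]
poses = [m for m in poses if m is not None]

clusters = []  # each cluster is a list of pose indices
for i, mol in enumerate(poses):
    for cluster in clusters:
        # CalcRMS is symmetry-aware and does not re-align the probe,
        # which is appropriate for docked poses sharing one frame.
        if rdMolAlign.CalcRMS(mol, poses[cluster[0]]) < 2.0:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(f"{len(poses)} poses -> {len(clusters)} clusters")
```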

## What DiffDock Predicts

### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of the ligand in the protein binding site
- **Confidence scores**: The model's certainty about its predictions
- **Multiple conformations**: Various possible binding modes

### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: Strength of the protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: On/off rates, residence time
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: Relative binding to different targets

## Scope and Limitations

### Designed For
- **Small molecule docking**: Organic compounds typically 100-1000 Da
- **Protein targets**: Single or multi-chain proteins
- **Small peptides**: Short peptide ligands (< ~20 residues)
- **Small nucleic acids**: Short oligonucleotides

### NOT Designed For
- **Large biomolecules**: Full protein-protein interactions
  - Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: Irreversible covalent bond formation
- **Metalloprotein specifics**: May not accurately handle metal coordination
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins

### Training Data Considerations

DiffDock was trained on:
- **PDBBind**: Diverse protein-ligand complexes
- **BindingMOAD**: Multi-domain protein structures

**Implications**:
- Best performance on proteins/ligands similar to the training data
- May underperform on:
  - Novel protein families
  - Unusual ligand chemotypes
  - Allosteric sites not well-represented in training data

## Validation and Complementary Tools

### Recommended Workflow

1. **Generate poses with DiffDock**
   - Use confidence scores for initial ranking
   - Consider multiple high-confidence predictions

2. **Visual Inspection**
   - Examine protein-ligand interactions in a molecular viewer
   - Check for reasonable:
     - Hydrogen bonds
     - Hydrophobic interactions
     - Steric complementarity
     - Electrostatic interactions

3. **Scoring and Refinement** (choose one or more):
   - **GNINA**: Deep learning-based scoring function
   - **Molecular mechanics**: Energy minimization and refinement
   - **MM/GBSA or MM/PBSA**: Binding free energy estimation
   - **Free energy calculations**: FEP or TI for accurate affinity prediction

4. **Experimental Validation**
   - Biochemical assays (IC50, Kd measurements)
   - Structural validation (X-ray crystallography, cryo-EM)

### Tools for Binding Affinity Assessment

DiffDock should be combined with these tools for affinity prediction:

- **GNINA**: Fast, accurate scoring function
  - GitHub: github.com/gnina/gnina

- **AutoDock Vina**: Classical docking and scoring
  - Website: vina.scripps.edu

- **Free Energy Calculations**:
  - OpenMM + OpenFE
  - GROMACS + ABFE/RBFE protocols

- **MM/GBSA Tools**:
  - MMPBSA.py (AmberTools)
  - gmx_MMPBSA

## Performance Optimization

### For Best Results

1. **Protein Preparation** (see the sketch after this item):
   - Remove water molecules far from the binding site
   - Resolve missing residues if possible
   - Consider protonation states at physiological pH
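
PDBFixer (from the OpenMM ecosystem) covers the basic cleanup steps. A minimal sketch, assuming pdbfixer is installed; note this coarse version removes all waters, so retain binding-site waters separately if they matter:

```python
# Minimal sketch: basic protein cleanup with PDBFixer (assumes pdbfixer is installed).
from pdbfixer import PDBFixer
from openmm.app import PDBFile

fixer = PDBFixer(filename="protein_raw.pdb")
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.removeHeterogens(keepWater=False)  # drops waters and other heteroatoms
PDBFile.writeFile(fixer.topology, fixer.positions, open("protein_prepared.pdb", "w"))
```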

2. **Ligand Input** (see the sketch below):
   - Provide reasonable 3D conformers when using structure files
   - Use canonical SMILES for consistent results
   - Pre-process with RDKit if needed
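
A reasonable 3D conformer can be generated from SMILES with RDKit before writing an SDF. A minimal sketch:

```python
# Minimal sketch: generate a 3D conformer from SMILES with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol = Chem.AddHs(mol)                      # hydrogens improve embedding quality
AllChem.EmbedMolecule(mol, randomSeed=42)  # ETKDG 3D embedding
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field cleanup

writer = Chem.SDWriter("ligand.sdf")
writer.write(mol)
writer.close()
```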

3. **Computational Resources**:
   - GPU strongly recommended (10-100x speedup)
   - The first run pre-computes lookup tables (takes a few minutes)
   - Batch processing is more efficient than single predictions

4. **Parameter Tuning**:
   - Increase `samples_per_complex` for difficult cases (20-40)
   - Adjust temperature parameters for the diversity/accuracy trade-off
   - Use pre-computed ESM embeddings for repeated predictions

## Common Issues and Troubleshooting

### Low Confidence Scores
- **Large/flexible ligands**: Consider splitting into fragments or use alternative methods
- **Multiple binding sites**: May predict multiple locations with distributed confidence
- **Protein flexibility**: Consider using an ensemble of protein conformations

### Unrealistic Predictions
- **Clashes**: May indicate a need for protein preparation or refinement
- **Surface binding**: Check whether the true binding site is blocked or unclear
- **Unusual poses**: Consider increasing samples to explore more conformations

### Slow Performance
- **Use GPU**: Essential for reasonable runtime
- **Pre-compute embeddings**: Reuse ESM embeddings for the same protein
- **Batch processing**: More efficient than sequential individual predictions
- **Reduce samples**: Lower `samples_per_complex` for quick screening

## Citation and Further Reading

For methodology details and benchmarking results, see:

1. **Original DiffDock Paper** (ICLR 2023):
   - "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
   - Corso et al., arXiv:2210.01776

2. **DiffDock-L Paper** (ICLR 2024):
   - Enhanced model with improved generalization
   - Corso et al., arXiv:2402.18396

3. **PoseBusters Benchmark**:
   - Rigorous docking evaluation framework
   - Used for DiffDock validation

scientific-packages/diffdock/references/parameters_reference.md (new file, 163 lines)

# DiffDock Configuration Parameters Reference

This document provides comprehensive details on all DiffDock configuration parameters and command-line options.

## Model & Checkpoint Settings

### Model Paths
- **`--model_dir`**: Directory containing the score model checkpoint
  - Default: `./workdir/v1.1/score_model`
  - DiffDock-L model (current default)

- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
  - Default: `./workdir/v1.1/confidence_model`

- **`--ckpt`**: Name of the score model checkpoint file
  - Default: `best_ema_inference_epoch_model.pt`

- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
  - Default: `best_model_epoch75.pt`

### Model Version Flags
- **`--old_score_model`**: Use the original DiffDock model instead of DiffDock-L
  - Default: `false` (uses DiffDock-L)

- **`--old_filtering_model`**: Use the legacy confidence filtering approach
  - Default: `true`

## Input/Output Options

### Input Specification
- **`--protein_path`**: Path to a protein PDB file
  - Example: `--protein_path protein.pdb`
  - Alternative to `--protein_sequence`

- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
  - Automatically generates the protein structure from the sequence
  - Alternative to `--protein_path`

- **`--ligand`**: Ligand specification (SMILES string or file path)
  - SMILES string: `--ligand "COc(cc1)ccc1C#N"`
  - File path: `--ligand ligand.sdf` or `.mol2`

- **`--protein_ligand_csv`**: CSV file for batch processing
  - Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
  - Example: `--protein_ligand_csv data/protein_ligand_example.csv`

### Output Control
- **`--out_dir`**: Output directory for predictions
  - Example: `--out_dir results/user_predictions/`

- **`--save_visualisation`**: Export predicted molecules as SDF files
  - Enables visualization of results

## Inference Parameters

### Diffusion Steps
- **`--inference_steps`**: Number of planned inference iterations
  - Default: `20`
  - Higher values may improve accuracy but increase runtime

- **`--actual_steps`**: Actual diffusion steps executed
  - Default: `19`

- **`--no_final_step_noise`**: Omit noise at the final diffusion step
  - Default: `true`

### Sampling Settings
- **`--samples_per_complex`**: Number of samples to generate per complex
  - Default: `10`
  - More samples provide better coverage but increase computation

- **`--sigma_schedule`**: Noise schedule type
  - Default: `expbeta` (exponential-beta)

- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
  - Default: `1.46`

### Temperature Parameters

#### Sampling Temperatures (control diversity of predictions)
- **`--temp_sampling_tr`**: Translation sampling temperature
  - Default: `1.17`

- **`--temp_sampling_rot`**: Rotation sampling temperature
  - Default: `2.06`

- **`--temp_sampling_tor`**: Torsion sampling temperature
  - Default: `7.04`

#### Psi Angle Temperatures
- **`--temp_psi_tr`**: Translation psi temperature
  - Default: `0.73`

- **`--temp_psi_rot`**: Rotation psi temperature
  - Default: `0.90`

- **`--temp_psi_tor`**: Torsion psi temperature
  - Default: `0.59`

#### Sigma Data Temperatures
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
  - Default: `0.93`

- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
  - Default: `0.75`

- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
  - Default: `0.69`

## Processing Options

### Performance
- **`--batch_size`**: Processing batch size
  - Default: `10`
  - Larger values increase throughput but require more memory

- **`--tqdm`**: Enable progress bar visualization
  - Useful for monitoring long-running jobs

### Protein Structure
- **`--chain_cutoff`**: Maximum number of protein chains to process
  - Example: `--chain_cutoff 10`
  - Useful for large multi-chain complexes

- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
  - Speeds up inference by reusing embeddings
  - Optional optimization

### Dataset Options
- **`--split`**: Dataset split to use (train/test/val)
  - Used for evaluation on standard benchmarks

## Advanced Flags

### Debugging & Testing
- **`--no_model`**: Disable model inference (debugging)
  - Default: `false`

- **`--no_random`**: Disable randomization
  - Default: `false`
  - Useful for reproducibility testing

### Alternative Sampling
- **`--ode`**: Use an ODE solver instead of the SDE
  - Default: `false`
  - Alternative sampling approach

- **`--different_schedules`**: Use different noise schedules per component
  - Default: `false`

### Error Handling
- **`--limit_failures`**: Maximum allowed failures before stopping
  - Default: `5`

## Configuration File

All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via the command line:

```bash
python -m inference --config default_inference_args.yaml --samples_per_complex 20
```

Command-line arguments take precedence over configuration file values.

scientific-packages/diffdock/references/workflows_examples.md (new file, 392 lines)

# DiffDock Workflows and Examples

This document provides practical workflows and usage examples for common DiffDock tasks.

## Installation and Setup

### Conda Installation (Recommended)

```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock

# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```

### Docker Installation

```bash
# Pull Docker image
docker pull rbgcsail/diffdock

# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock

# Inside container, activate environment
micromamba activate diffdock
```

### First Run
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.

## Workflow 1: Single Protein-Ligand Docking

### Using PDB File and SMILES String

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_path examples/protein.pdb \
  --ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
  --out_dir results/single_docking/
```

**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf     # Top-ranked prediction
├── index_0_rank_2.sdf     # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf    # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt  # Scores for all predictions
```

### Using Ligand Structure File

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand ligand.sdf \
  --out_dir results/ligand_file/
```

**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit

## Workflow 2: Protein Sequence to Structure Docking

### Using ESMFold for Protein Folding

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
  --ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
  --out_dir results/sequence_docking/
```

**Use Cases**:
- Protein structure not available in the PDB
- Modeling mutations or variants
- De novo protein design validation

**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)

## Workflow 3: Batch Processing Multiple Complexes

### Prepare CSV File

Create `complexes.csv` with the required columns:

```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```

**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to PDB file (leave empty if using a sequence)
- `ligand_description`: SMILES string or path to a ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using a PDB file)

### Run Batch Docking

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv complexes.csv \
  --out_dir results/batch_predictions/ \
  --batch_size 10
```

**Output Structure**:
```
results/batch_predictions/
├── complex1/
│   ├── rank_1.sdf
│   ├── rank_2.sdf
│   └── ...
├── complex2/
│   ├── rank_1.sdf
│   └── ...
└── complex3/
    └── ...
```

## Workflow 4: High-Throughput Virtual Screening

### Setup for Screening Large Ligand Libraries

```python
# generate_screening_csv.py
import pandas as pd

# Load ligand library
ligands = pd.read_csv("ligand_library.csv")  # Contains SMILES

# Create DiffDock input
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands)
}

df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```

### Run Screening

```bash
# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
  --protein_ligand_csv screening_input.csv \
  --out_file protein_embeddings.pt

# Run docking with pre-computed embeddings
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv screening_input.csv \
  --esm_embeddings_path protein_embeddings.pt \
  --out_dir results/virtual_screening/ \
  --batch_size 32
```

### Post-Processing: Extract Top Hits

```python
# analyze_screening_results.py
import os
import pandas as pd

results = []
results_dir = "results/virtual_screening/"

for complex_dir in os.listdir(results_dir):
    confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
    if os.path.exists(confidence_file):
        with open(confidence_file) as f:
            scores = [float(line.strip()) for line in f]
        top_score = max(scores)
        results.append({"complex": complex_dir, "top_confidence": top_score})

# Sort by confidence
df = pd.DataFrame(results)
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```

## Workflow 5: Ensemble Docking with Protein Flexibility

### Prepare Protein Ensemble

```python
# create_ensemble_csv.py
# For proteins with known flexibility, use multiple conformations,
# e.g., MD snapshots or alternative crystal structures.
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb"
]

ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

### Run Ensemble Docking

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv ensemble_input.csv \
  --out_dir results/ensemble_docking/ \
  --samples_per_complex 20  # More samples per conformation
```

## Workflow 6: Integration with Downstream Analysis

### Example: DiffDock + GNINA Rescoring

```bash
# 1. Run DiffDock
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/diffdock_poses/ \
  --save_visualisation

# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```

### Example: DiffDock + OpenMM Energy Minimization

```python
# minimize_poses.py
# Sketch: the ligand-parameterization step is elided; in practice the ligand
# needs a small-molecule force field (e.g., via the openmmforcefields package).
import os

from openmm import LangevinIntegrator
from openmm.app import ForceField, Modeller, PDBFile, Simulation
from rdkit import Chem

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
    if pose_file.endswith('.sdf'):
        # Load ligand
        mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]

        # Combine protein + ligand
        modeller = Modeller(protein.topology, protein.positions)
        # ... add ligand to modeller (requires ligand force-field parameters) ...

        # Create system and minimize
        system = forcefield.createSystem(modeller.topology)
        integrator = LangevinIntegrator(300, 1.0, 0.002)  # K, 1/ps, ps
        simulation = Simulation(modeller.topology, system, integrator)
        simulation.context.setPositions(modeller.positions)
        simulation.minimizeEnergy(maxIterations=1000)

        # Save minimized structure
        positions = simulation.context.getState(getPositions=True).getPositions()
        out_name = f"minimized_{pose_file.replace('.sdf', '')}.pdb"
        with open(out_name, 'w') as out:
            PDBFile.writeFile(simulation.topology, positions, out)
```

## Workflow 7: Using the Graphical Interface

### Launch Web Interface

```bash
python app/main.py
```

### Access Interface
Navigate to `http://localhost:7860` in a web browser.

### Features
- Upload a protein PDB or enter a sequence
- Input a ligand SMILES or upload a structure
- Adjust inference parameters via the GUI
- Visualize results interactively
- Download predictions directly

### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Advanced Configuration

### Custom Inference Settings

Create a custom YAML configuration:

```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model

# Sampling parameters
samples_per_complex: 20  # More samples for better coverage
inference_steps: 25      # More steps for accuracy

# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5

# Output
save_visualisation: true
```

Use the custom configuration:

```bash
python -m inference \
  --config custom_inference.yaml \
  --protein_path protein.pdb \
  --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/custom_config/
```

## Troubleshooting Common Issues

### Issue: Out of Memory Errors

**Solution**: Reduce the batch size
```bash
python -m inference ... --batch_size 2
```

### Issue: Slow Performance

**Solution**: Ensure GPU usage
```python
import torch
print(torch.cuda.is_available())  # Should return True
```

### Issue: Poor Predictions for Large Ligands

**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```

### Issue: Protein with Many Chains

**Solution**: Limit chains or isolate the binding site
```bash
python -m inference ... --chain_cutoff 4
```

Or pre-process the PDB to include only the relevant chains, as in the sketch below.
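
A minimal sketch of that pre-processing with Biopython (assumes Biopython is installed; the chain IDs are illustrative):

```python
# Minimal sketch: keep only selected chains from a PDB file (assumes Biopython).
from Bio.PDB import PDBParser, PDBIO, Select

KEEP = {"A", "B"}  # illustrative chain IDs around the binding site

class ChainSelect(Select):
    def accept_chain(self, chain):
        return chain.id in KEEP

structure = PDBParser(QUIET=True).get_structure("prot", "protein.pdb")
io = PDBIO()
io.set_structure(structure)
io.save("protein_chains_AB.pdb", ChainSelect())
```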

## Best Practices Summary

1. **Start Simple**: Test with a single complex before batch processing
2. **GPU Essential**: Use a GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: For repeated use of the same protein
8. **Combine Tools**: Integrate with scoring functions and energy minimization

scientific-packages/diffdock/scripts/analyze_results.py (new executable file, 334 lines)

#!/usr/bin/env python3
"""
DiffDock Results Analysis Script

This script analyzes DiffDock prediction results, extracting confidence scores,
ranking predictions, and generating summary reports.

Usage:
    python analyze_results.py results/output_dir/
    python analyze_results.py results/ --top 50 --threshold 0.0
    python analyze_results.py results/ --export summary.csv
"""

import argparse
import os
import sys
import json
from pathlib import Path
from collections import defaultdict
import re


def parse_confidence_scores(results_dir):
    """
    Parse confidence scores from a DiffDock output directory.

    Args:
        results_dir: Path to DiffDock results directory

    Returns:
        dict: Dictionary mapping complex names to their predictions and scores
    """
    results = {}
    results_path = Path(results_dir)

    # Check if this is a single complex or batch results
    sdf_files = list(results_path.glob("*.sdf"))

    if sdf_files:
        # Single complex output
        results['single_complex'] = parse_single_complex(results_path)
    else:
        # Batch output - multiple subdirectories
        for subdir in results_path.iterdir():
            if subdir.is_dir():
                complex_results = parse_single_complex(subdir)
                if complex_results:
                    results[subdir.name] = complex_results

    return results


def parse_single_complex(complex_dir):
    """Parse results for a single complex."""
    predictions = []

    # Look for SDF files with rank information
    for sdf_file in complex_dir.glob("*.sdf"):
        filename = sdf_file.name

        # Extract rank from filename (e.g., "rank_1.sdf" or "index_0_rank_1.sdf")
        rank_match = re.search(r'rank_(\d+)', filename)
        if rank_match:
            rank = int(rank_match.group(1))

            # Try to extract confidence score from filename or separate file
            confidence = extract_confidence_score(sdf_file, complex_dir)

            predictions.append({
                'rank': rank,
                'file': sdf_file.name,
                'path': str(sdf_file),
                'confidence': confidence
            })

    # Sort by rank
    predictions.sort(key=lambda x: x['rank'])

    return {'predictions': predictions} if predictions else None


def extract_confidence_score(sdf_file, complex_dir):
    """
    Extract the confidence score for a prediction.

    Tries multiple methods:
    1. Read from the confidence_scores.txt file
    2. Parse from SDF file properties
    3. Extract from the filename if present
    """
    # Method 1: confidence_scores.txt
    confidence_file = complex_dir / "confidence_scores.txt"
    if confidence_file.exists():
        try:
            with open(confidence_file) as f:
                lines = f.readlines()
            # Extract rank from filename
            rank_match = re.search(r'rank_(\d+)', sdf_file.name)
            if rank_match:
                rank = int(rank_match.group(1))
                if rank <= len(lines):
                    return float(lines[rank - 1].strip())
        except Exception:
            pass

    # Method 2: Parse from SDF file
    try:
        with open(sdf_file) as f:
            content = f.read()
        # Look for confidence score in SDF properties
        conf_match = re.search(r'confidence[:\s]+(-?\d+\.?\d*)', content, re.IGNORECASE)
        if conf_match:
            return float(conf_match.group(1))
    except Exception:
        pass

    # Method 3: Filename (e.g., "rank_1_conf_0.95.sdf")
    conf_match = re.search(r'conf_(-?\d+\.?\d*)', sdf_file.name)
    if conf_match:
        return float(conf_match.group(1))

    return None


def classify_confidence(score):
    """Classify a confidence score into categories."""
    if score is None:
        return "Unknown"
    elif score > 0:
        return "High"
    elif score > -1.5:
        return "Moderate"
    else:
        return "Low"


def print_summary(results, top_n=None, min_confidence=None):
    """Print a formatted summary of results."""

    print("\n" + "="*80)
    print("DiffDock Results Summary")
    print("="*80)

    all_predictions = []

    for complex_name, data in results.items():
        predictions = data.get('predictions', [])

        print(f"\n{complex_name}")
        print("-" * 80)

        if not predictions:
            print("  No predictions found")
            continue

        # Filter by confidence if specified
        filtered_predictions = predictions
        if min_confidence is not None:
            filtered_predictions = [p for p in predictions
                                    if p['confidence'] is not None and p['confidence'] >= min_confidence]

        # Limit to top N if specified
        if top_n is not None:
            filtered_predictions = filtered_predictions[:top_n]

        for pred in filtered_predictions:
            confidence = pred['confidence']
            confidence_class = classify_confidence(confidence)

            conf_str = f"{confidence:>7.3f}" if confidence is not None else "    N/A"
            print(f"  Rank {pred['rank']:2d}: Confidence = {conf_str} ({confidence_class:8s}) | {pred['file']}")

            # Add to all predictions for overall statistics
            if confidence is not None:
                all_predictions.append((complex_name, pred['rank'], confidence))

        # Show statistics for this complex
        if filtered_predictions and any(p['confidence'] is not None for p in filtered_predictions):
            confidences = [p['confidence'] for p in filtered_predictions if p['confidence'] is not None]
            print(f"\n  Statistics: {len(filtered_predictions)} predictions")
            print(f"  Mean confidence: {sum(confidences)/len(confidences):.3f}")
            print(f"  Max confidence:  {max(confidences):.3f}")
            print(f"  Min confidence:  {min(confidences):.3f}")

    # Overall statistics
    if all_predictions:
        print("\n" + "="*80)
        print("Overall Statistics")
        print("="*80)

        confidences = [conf for _, _, conf in all_predictions]
        print(f"  Total predictions: {len(all_predictions)}")
        print(f"  Total complexes:   {len(results)}")
        print(f"  Mean confidence:   {sum(confidences)/len(confidences):.3f}")
        print(f"  Max confidence:    {max(confidences):.3f}")
        print(f"  Min confidence:    {min(confidences):.3f}")

        # Confidence distribution
        high = sum(1 for c in confidences if c > 0)
        moderate = sum(1 for c in confidences if -1.5 < c <= 0)
        low = sum(1 for c in confidences if c <= -1.5)

        print("\n  Confidence distribution:")
        print(f"    High     (> 0):       {high:4d} ({100*high/len(confidences):5.1f}%)")
        print(f"    Moderate (-1.5 to 0): {moderate:4d} ({100*moderate/len(confidences):5.1f}%)")
        print(f"    Low      (< -1.5):    {low:4d} ({100*low/len(confidences):5.1f}%)")

    print("\n" + "="*80)


def export_to_csv(results, output_path):
    """Export results to CSV file."""
    import csv

    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['complex_name', 'rank', 'confidence', 'confidence_class', 'file_path'])

        for complex_name, data in results.items():
            predictions = data.get('predictions', [])
            for pred in predictions:
                confidence = pred['confidence']
                confidence_class = classify_confidence(confidence)
                conf_value = confidence if confidence is not None else ''

                writer.writerow([
                    complex_name,
                    pred['rank'],
                    conf_value,
                    confidence_class,
                    pred['path']
                ])

    print(f"✓ Exported results to: {output_path}")
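
# A row of the exported CSV looks like (values illustrative):
#   1abc_ligand,1,0.312,High,results/1abc_ligand/rank_1.sdf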


def get_top_predictions(results, n=10):
    """Get the top N predictions across all complexes, sorted by confidence."""
    all_predictions = []

    for complex_name, data in results.items():
        predictions = data.get('predictions', [])
        for pred in predictions:
            if pred['confidence'] is not None:
                all_predictions.append({
                    'complex': complex_name,
                    **pred
                })

    # Sort by confidence (descending)
    all_predictions.sort(key=lambda x: x['confidence'], reverse=True)

    return all_predictions[:n]


def print_top_predictions(results, n=10):
    """Print top N predictions across all complexes."""
    top_preds = get_top_predictions(results, n)

    print("\n" + "="*80)
    print(f"Top {n} Predictions Across All Complexes")
    print("="*80)

    for i, pred in enumerate(top_preds, 1):
        confidence_class = classify_confidence(pred['confidence'])
        print(f"{i:2d}. {pred['complex']:30s} | Rank {pred['rank']:2d} | "
              f"Confidence: {pred['confidence']:7.3f} ({confidence_class})")

    print("="*80)


def main():
    parser = argparse.ArgumentParser(
        description='Analyze DiffDock prediction results',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all results in directory
  python analyze_results.py results/output_dir/

  # Show only top 5 predictions per complex
  python analyze_results.py results/ --top 5

  # Filter by confidence threshold
  python analyze_results.py results/ --threshold 0.0

  # Export to CSV
  python analyze_results.py results/ --export summary.csv

  # Show top 20 predictions across all complexes
  python analyze_results.py results/ --best 20
"""
    )

    parser.add_argument('results_dir', help='Path to DiffDock results directory')
    parser.add_argument('--top', '-t', type=int,
                        help='Show only top N predictions per complex')
    parser.add_argument('--threshold', type=float,
                        help='Minimum confidence threshold')
    parser.add_argument('--export', '-e', metavar='FILE',
                        help='Export results to CSV file')
    parser.add_argument('--best', '-b', type=int, metavar='N',
                        help='Show top N predictions across all complexes')

    args = parser.parse_args()

    # Validate results directory
    if not os.path.exists(args.results_dir):
        print(f"Error: Results directory not found: {args.results_dir}")
        return 1

    # Parse results
    print(f"Analyzing results in: {args.results_dir}")
    results = parse_confidence_scores(args.results_dir)

    if not results:
        print("No DiffDock results found in directory")
        return 1

    # Print summary
    print_summary(results, top_n=args.top, min_confidence=args.threshold)

    # Print top predictions across all complexes
    if args.best:
        print_top_predictions(results, args.best)

    # Export to CSV if requested
    if args.export:
        export_to_csv(results, args.export)

    return 0


if __name__ == '__main__':
    sys.exit(main())
254
scientific-packages/diffdock/scripts/prepare_batch_csv.py
Executable file
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
DiffDock Batch CSV Preparation and Validation Script

This script helps prepare and validate CSV files for DiffDock batch processing.
It checks for required columns, validates file paths, and ensures SMILES strings
are properly formatted.

Usage:
    python prepare_batch_csv.py input.csv --validate
    python prepare_batch_csv.py --create --output batch_input.csv
"""

import argparse
import os
import sys
import pandas as pd
from pathlib import Path

try:
    from rdkit import Chem
    from rdkit import RDLogger
    RDLogger.DisableLog('rdApp.*')
    RDKIT_AVAILABLE = True
except ImportError:
    RDKIT_AVAILABLE = False
    print("Warning: RDKit not available. SMILES validation will be skipped.")


def validate_smiles(smiles_string):
    """Validate a SMILES string using RDKit."""
    if not RDKIT_AVAILABLE:
        return True, "RDKit not available for validation"

    try:
        mol = Chem.MolFromSmiles(smiles_string)
        if mol is None:
            return False, "Invalid SMILES structure"
        return True, "Valid SMILES"
    except Exception as e:
        return False, str(e)
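
# Example: validate_smiles("CC(=O)Oc1ccccc1C(=O)O") returns (True, "Valid SMILES");
# validate_smiles("C1CC") returns (False, "Invalid SMILES structure") because
# the ring bond is never closed.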


def validate_file_path(file_path, base_dir=None):
    """Validate that a file path exists."""
    if pd.isna(file_path) or file_path == "":
        return True, "Empty (will use protein_sequence)"

    # Handle relative paths
    if base_dir:
        full_path = Path(base_dir) / file_path
    else:
        full_path = Path(file_path)

    if full_path.exists():
        return True, f"File exists: {full_path}"
    else:
        return False, f"File not found: {full_path}"
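
# Example: validate_file_path("") returns (True, "Empty (will use
# protein_sequence)"); a nonexistent "missing.pdb" returns
# (False, "File not found: missing.pdb").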


def validate_csv(csv_path, base_dir=None):
    """
    Validate a DiffDock batch input CSV file.

    Args:
        csv_path: Path to CSV file
        base_dir: Base directory for relative paths (default: CSV directory)

    Returns:
        bool: True if validation passes
        list: List of validation messages
    """
    messages = []
    valid = True

    # Read CSV
    try:
        df = pd.read_csv(csv_path)
        messages.append(f"✓ Successfully read CSV with {len(df)} rows")
    except Exception as e:
        messages.append(f"✗ Error reading CSV: {e}")
        return False, messages

    # Check required columns
    required_cols = ['complex_name', 'protein_path', 'ligand_description', 'protein_sequence']
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        messages.append(f"✗ Missing required columns: {', '.join(missing_cols)}")
        # The per-row checks below index these columns directly, so bail out
        # early instead of raising KeyError.
        messages.append("✗ CSV validation FAILED - please fix issues above")
        return False, messages
    else:
        messages.append("✓ All required columns present")

    # Set base directory
    if base_dir is None:
        base_dir = Path(csv_path).parent

    # Validate each row
    for idx, row in df.iterrows():
        row_msgs = []

        # Check complex name
        if pd.isna(row['complex_name']) or row['complex_name'] == "":
            row_msgs.append("Missing complex_name")
            valid = False

        # Check that either protein_path or protein_sequence is provided
        has_protein_path = not pd.isna(row['protein_path']) and row['protein_path'] != ""
        has_protein_seq = not pd.isna(row['protein_sequence']) and row['protein_sequence'] != ""

        if not has_protein_path and not has_protein_seq:
            row_msgs.append("Must provide either protein_path or protein_sequence")
            valid = False
        elif has_protein_path and has_protein_seq:
            row_msgs.append("Warning: Both protein_path and protein_sequence provided, will use protein_path")

        # Validate protein path if provided
        if has_protein_path:
            file_valid, msg = validate_file_path(row['protein_path'], base_dir)
            if not file_valid:
                row_msgs.append(f"Protein file issue: {msg}")
                valid = False

        # Validate ligand description
        if pd.isna(row['ligand_description']) or row['ligand_description'] == "":
            row_msgs.append("Missing ligand_description")
            valid = False
        else:
            ligand_desc = str(row['ligand_description'])
            # Decide between file path and SMILES. Separators alone are not
            # reliable: stereochemical SMILES such as F/C=C/F also contain "/",
            # so check for known ligand file extensions or an existing file.
            ligand_ext = os.path.splitext(ligand_desc)[1].lower()
            if ligand_ext in ('.sdf', '.mol2', '.mol', '.pdb') or os.path.exists(ligand_desc):
                # Likely a file path
                file_valid, msg = validate_file_path(ligand_desc, base_dir)
                if not file_valid:
                    row_msgs.append(f"Ligand file issue: {msg}")
                    valid = False
            else:
                # Likely a SMILES string
                smiles_valid, msg = validate_smiles(ligand_desc)
                if not smiles_valid:
                    row_msgs.append(f"SMILES issue: {msg}")
                    valid = False

        if row_msgs:
            messages.append(f"\nRow {idx + 1} ({row.get('complex_name', 'unnamed')}):")
            for msg in row_msgs:
                messages.append(f"  - {msg}")

    # Summary
    messages.append(f"\n{'='*60}")
    if valid:
        messages.append("✓ CSV validation PASSED - ready for DiffDock")
    else:
        messages.append("✗ CSV validation FAILED - please fix issues above")

    return valid, messages
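
# A minimal valid input CSV (paths illustrative) looks like:
#   complex_name,protein_path,ligand_description,protein_sequence
#   egfr_aspirin,receptors/egfr.pdb,CC(=O)Oc1ccccc1C(=O)O,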


def create_template_csv(output_path, num_examples=3):
    """Create a template CSV file with example entries."""

    examples = {
        'complex_name': ['example1', 'example2', 'example3'][:num_examples],
        'protein_path': ['protein1.pdb', '', 'protein3.pdb'][:num_examples],
        'ligand_description': [
            'CC(=O)Oc1ccccc1C(=O)O',  # Aspirin SMILES
            'COc1ccc(C#N)cc1',        # Example SMILES
            'ligand.sdf'              # Example file path
        ][:num_examples],
        'protein_sequence': [
            '',  # Empty - using PDB file
            'MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK',  # GFP sequence
            ''   # Empty - using PDB file
        ][:num_examples]
    }

    df = pd.DataFrame(examples)
    df.to_csv(output_path, index=False)

    return df
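
# Note: the example lists above hold three entries, so passing num_examples
# greater than 3 still yields at most three rows (each list is sliced with
# [:num_examples]).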


def main():
    parser = argparse.ArgumentParser(
        description='Prepare and validate DiffDock batch CSV files',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate existing CSV
  python prepare_batch_csv.py input.csv --validate

  # Create template CSV
  python prepare_batch_csv.py --create --output batch_template.csv

  # Create template with 5 example rows
  python prepare_batch_csv.py --create --output template.csv --num-examples 5

  # Validate with custom base directory for relative paths
  python prepare_batch_csv.py input.csv --validate --base-dir /path/to/data/
"""
    )

    parser.add_argument('csv_file', nargs='?', help='CSV file to validate')
    parser.add_argument('--validate', action='store_true',
                        help='Validate the CSV file')
    parser.add_argument('--create', action='store_true',
                        help='Create a template CSV file')
    parser.add_argument('--output', '-o', help='Output path for template CSV')
    parser.add_argument('--num-examples', type=int, default=3,
                        help='Number of example rows in template (default: 3)')
    parser.add_argument('--base-dir', help='Base directory for relative file paths')

    args = parser.parse_args()

    # Create template
    if args.create:
        output_path = args.output or 'diffdock_batch_template.csv'
        df = create_template_csv(output_path, args.num_examples)
        print(f"✓ Created template CSV: {output_path}")
        print("\nTemplate contents:")
        print(df.to_string(index=False))
        print("\nEdit this file with your protein-ligand pairs and run with:")
        print("  python -m inference --config default_inference_args.yaml \\")
        print(f"    --protein_ligand_csv {output_path} --out_dir results/")
        return 0

    # Validate CSV
    if args.validate or args.csv_file:
        if not args.csv_file:
            print("Error: CSV file required for validation")
            parser.print_help()
            return 1

        if not os.path.exists(args.csv_file):
            print(f"Error: CSV file not found: {args.csv_file}")
            return 1

        print(f"Validating: {args.csv_file}")
        print("="*60)

        valid, messages = validate_csv(args.csv_file, args.base_dir)

        for msg in messages:
            print(msg)

        return 0 if valid else 1

    # No action specified
    parser.print_help()
    return 1


if __name__ == '__main__':
    sys.exit(main())
278
scientific-packages/diffdock/scripts/setup_check.py
Executable file
@@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
DiffDock Environment Setup Checker

This script verifies that the DiffDock environment is properly configured
and all dependencies are available.

Usage:
    python setup_check.py
    python setup_check.py --verbose
"""

import argparse
import sys
import os
from pathlib import Path


def check_python_version():
    """Check Python version."""
    version = sys.version_info

    print("Checking Python version...")
    if version.major == 3 and version.minor >= 8:
        print(f"  ✓ Python {version.major}.{version.minor}.{version.micro}")
        return True
    else:
        print(f"  ✗ Python {version.major}.{version.minor}.{version.micro} "
              f"(requires Python 3.8 or higher)")
        return False


def check_package(package_name, import_name=None, version_attr='__version__'):
    """Check if a Python package is installed."""
    if import_name is None:
        import_name = package_name

    try:
        module = __import__(import_name)
        # Resolve dotted attributes (e.g. 'rdBase.__version__') step by step;
        # a plain getattr() cannot follow the dots.
        version = module
        for attr in version_attr.split('.'):
            version = getattr(version, attr, None)
            if version is None:
                version = 'unknown'
                break
        print(f"  ✓ {package_name:20s} (version: {version})")
        return True
    except ImportError:
        print(f"  ✗ {package_name:20s} (not installed)")
        return False
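
# Example: check_package('pandas') prints a line like
# "  ✓ pandas               (version: 2.2.3)" (version illustrative) and
# returns True; a missing package prints "✗ ... (not installed)" instead.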


def check_pytorch():
    """Check PyTorch installation and CUDA availability."""
    print("\nChecking PyTorch...")
    try:
        import torch
        print(f"  ✓ PyTorch version: {torch.__version__}")

        # Check CUDA
        if torch.cuda.is_available():
            print(f"  ✓ CUDA available: {torch.cuda.get_device_name(0)}")
            print(f"    - CUDA version: {torch.version.cuda}")
            print(f"    - Number of GPUs: {torch.cuda.device_count()}")
            return True, True
        else:
            print("  ⚠ CUDA not available (will run on CPU)")
            return True, False
    except ImportError:
        print("  ✗ PyTorch not installed")
        return False, False
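
# On a CUDA machine this returns (True, True) and reports the device name;
# a CPU-only install returns (True, False); a missing PyTorch returns
# (False, False).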


def check_pytorch_geometric():
    """Check PyTorch Geometric installation."""
    print("\nChecking PyTorch Geometric...")
    packages = [
        ('torch-geometric', 'torch_geometric'),
        ('torch-scatter', 'torch_scatter'),
        ('torch-sparse', 'torch_sparse'),
        ('torch-cluster', 'torch_cluster'),
    ]

    all_ok = True
    for pkg_name, import_name in packages:
        if not check_package(pkg_name, import_name):
            all_ok = False

    return all_ok
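
# Note: torch-scatter, torch-sparse, and torch-cluster wheels are built
# against specific PyTorch/CUDA versions; a mismatch typically surfaces here
# as an ImportError even though the packages are installed.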


def check_core_dependencies():
    """Check core DiffDock dependencies."""
    print("\nChecking core dependencies...")

    dependencies = [
        ('numpy', 'numpy'),
        ('scipy', 'scipy'),
        ('pandas', 'pandas'),
        ('rdkit', 'rdkit', 'rdBase.__version__'),
        ('biopython', 'Bio', '__version__'),
        ('pytorch-lightning', 'pytorch_lightning'),
        ('PyYAML', 'yaml'),
    ]

    all_ok = True
    for dep in dependencies:
        pkg_name = dep[0]
        import_name = dep[1]
        version_attr = dep[2] if len(dep) > 2 else '__version__'

        if not check_package(pkg_name, import_name, version_attr):
            all_ok = False

    return all_ok


def check_esm():
    """Check ESM (protein language model) installation."""
    print("\nChecking ESM (for protein sequence folding)...")
    try:
        import esm
        version = esm.__version__ if hasattr(esm, '__version__') else 'unknown'
        print(f"  ✓ ESM installed (version: {version})")
        return True
    except ImportError:
        print("  ⚠ ESM not installed (needed for protein sequence folding)")
        print("    Install with: pip install fair-esm")
        return False


def check_diffdock_installation():
    """Check if DiffDock is properly installed/cloned."""
    print("\nChecking DiffDock installation...")

    # Look for key files
    key_files = [
        'inference.py',
        'default_inference_args.yaml',
        'environment.yml',
    ]

    found_files = []
    missing_files = []

    for filename in key_files:
        if os.path.exists(filename):
            found_files.append(filename)
        else:
            missing_files.append(filename)

    if found_files:
        print("  ✓ Found DiffDock files in current directory:")
        for f in found_files:
            print(f"    - {f}")
    else:
        print("  ⚠ DiffDock files not found in current directory")
        print(f"    Current directory: {os.getcwd()}")
        print("    Make sure you're in the DiffDock repository root")

    # Check for model checkpoints
    model_dir = Path('./workdir/v1.1/score_model')
    confidence_dir = Path('./workdir/v1.1/confidence_model')

    if model_dir.exists() and confidence_dir.exists():
        print("  ✓ Model checkpoints found")
    else:
        print("  ⚠ Model checkpoints not found in ./workdir/v1.1/")
        print("    Models will be downloaded on first run")

    return len(found_files) > 0


def print_installation_instructions():
    """Print installation instructions if setup is incomplete."""
    print("\n" + "="*80)
    print("Installation Instructions")
    print("="*80)

    print("""
If DiffDock is not installed, follow these steps:

1. Clone the repository:
   git clone https://github.com/gcorso/DiffDock.git
   cd DiffDock

2. Create conda environment:
   conda env create --file environment.yml
   conda activate diffdock

3. Verify installation:
   python setup_check.py

For Docker installation:
   docker pull rbgcsail/diffdock
   docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
   micromamba activate diffdock

For more information, visit: https://github.com/gcorso/DiffDock
""")


def print_performance_notes(has_cuda):
    """Print performance notes based on available hardware."""
    print("\n" + "="*80)
    print("Performance Notes")
    print("="*80)

    if has_cuda:
        print("""
✓ GPU detected - DiffDock will run efficiently

Expected performance:
- First run: ~2-5 minutes (pre-computing SO(2)/SO(3) tables)
- Subsequent runs: ~10-60 seconds per complex (depending on settings)
- Batch processing: Highly efficient with GPU
""")
    else:
        print("""
⚠ No GPU detected - DiffDock will run on CPU

Expected performance:
- CPU inference is SIGNIFICANTLY slower than GPU
- Single complex: Several minutes to hours
- Batch processing: Not recommended on CPU

Recommendation: Use GPU for practical applications
- Cloud options: Google Colab, AWS, or other cloud GPU services
- Local: Install CUDA-capable GPU
""")


def main():
    parser = argparse.ArgumentParser(
        description='Check DiffDock environment setup',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Show detailed version information')

    args = parser.parse_args()  # --verbose is currently accepted but unused

    print("="*80)
    print("DiffDock Environment Setup Checker")
    print("="*80)

    checks = []

    # Run all checks
    checks.append(("Python version", check_python_version()))

    pytorch_ok, has_cuda = check_pytorch()
    checks.append(("PyTorch", pytorch_ok))

    checks.append(("PyTorch Geometric", check_pytorch_geometric()))
    checks.append(("Core dependencies", check_core_dependencies()))
    checks.append(("ESM", check_esm()))
    checks.append(("DiffDock files", check_diffdock_installation()))

    # Summary
    print("\n" + "="*80)
    print("Summary")
    print("="*80)

    all_passed = all(result for _, result in checks)

    for check_name, result in checks:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status:8s} - {check_name}")

    if all_passed:
        print("\n✓ All checks passed! DiffDock is ready to use.")
        print_performance_notes(has_cuda)
        return 0
    else:
        print("\n✗ Some checks failed. Please install missing dependencies.")
        print_installation_instructions()
        return 1


if __name__ == '__main__':
    sys.exit(main())