Add scVelo RNA velocity analysis workflow and IQ-TREE reference documentation

- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization.
- Added a script for running RNA velocity analysis with customizable parameters and output options.
- Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation.
- Included references for velocity models and their mathematical framework, along with a comparison of different models.
- Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
2026-03-27 07:09:27 +08:00 · 2026-03-03 07:15:36 -05:00
parent b271271df4
commit 7f94783fab
27 changed files with 6961 additions and 0 deletions
--- a/scientific-skills/bindingdb-database/SKILL.md
+++ b/scientific-skills/bindingdb-database/SKILL.md
@@ -0,0 +1,332 @@
+---
+name: bindingdb-database
+description: Query BindingDB for measured drug-target binding affinities (Ki, Kd, IC50, EC50). Search by target (UniProt ID), compound (SMILES/name), or pathogen. Essential for drug discovery, lead optimization, polypharmacology analysis, and structure-activity relationship (SAR) studies.
+license: CC-BY-3.0
+metadata:
+    skill-author: Kuan-lin Huang
+---
+
+# BindingDB Database
+
+## Overview
+
+BindingDB (https://www.bindingdb.org/) is the primary public database of measured drug-protein binding affinities. It contains over 3 million binding data records for ~1.4 million compounds tested against ~9,200 protein targets, curated from the scientific and patent literature. It stores quantitative binding measurements (Ki, Kd, IC50, EC50) used in drug discovery, pharmacology, and computational chemistry research.
+
+**Key resources:**
+- BindingDB website: https://www.bindingdb.org/
+- REST API: https://www.bindingdb.org/axis2/services/BDBService
+- Downloads: https://www.bindingdb.org/bind/chemsearch/marvin/Download.jsp
+- GitHub: https://github.com/drugilsberg/bindingdb
+
+## When to Use This Skill
+
+Use BindingDB when:
+
+- **Target-based drug discovery**: What known compounds bind to a target protein? What are their affinities?
+- **SAR analysis**: How do structural modifications affect binding affinity for a series of analogs?
+- **Lead compound profiling**: What targets does a compound bind (selectivity/polypharmacology)?
+- **Benchmark datasets**: Obtain curated protein-ligand affinity data for ML model training
+- **Repurposing analysis**: Does an approved drug bind to an unintended target?
+- **Competitive analysis**: What is the best reported affinity for a target class?
+- **Fragment screening**: Find validated binding data for fragments against a target
+
+## Core Capabilities
+
+### 1. BindingDB REST API
+
+Base URL: `https://www.bindingdb.org/axis2/services/BDBService`
+
+```python
+import requests
+
+BASE_URL = "https://www.bindingdb.org/axis2/services/BDBService"
+
+def bindingdb_query(method, params):
+    """Query the BindingDB REST API."""
+    url = f"{BASE_URL}/{method}"
+    # Set a timeout so a stalled connection cannot hang the caller indefinitely
+    response = requests.get(url, params=params, headers={"Accept": "application/json"},
+                            timeout=30)
+    response.raise_for_status()
+    return response.json()
+```
+
+### 2. Query by Target (UniProt ID)
+
+```python
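+def get_ligands_for_target(uniprot_id, affinity_type="Ki", cutoff=10000, unit="nM"):
+    """
+    Get all ligands with measured affinity for a UniProt target.
+
+    Args:
+        uniprot_id: UniProt accession (e.g., "P00519" for ABL1)
+        affinity_type: "Ki", "Kd", "IC50", "EC50"
+        cutoff: Maximum affinity value to return, expressed in `unit`
+        unit: "nM" or "uM"; uM cutoffs are converted to nM before querying
+    """
+    if unit not in ("nM", "uM"):
+        raise ValueError(f"Unsupported unit: {unit!r}")
+    # The query cutoff is expressed in nM, so convert uM input (1 uM = 1000 nM)
+    cutoff_nM = cutoff * 1000 if unit == "uM" else cutoff
+    params = {
+        "uniprot_id": uniprot_id,
+        "affinity_type": affinity_type,
+        "affinity_cutoff": cutoff_nM,
+        "response": "json"
+    }
+    return bindingdb_query("getLigandsByUniprotID", params)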
+
+# Example: Get all compounds binding ABL1 (imatinib target)
+ligands = get_ligands_for_target("P00519", affinity_type="Ki", cutoff=100)
+```
+
+### 3. Query by Compound Name or SMILES
+
+```python
+def search_by_name(compound_name, limit=100):
+    """Search BindingDB for compounds by name."""
+    params = {
+        "compound_name": compound_name,
+        "response": "json",
+        "max_results": limit
+    }
+    return bindingdb_query("getAffinitiesByCompoundName", params)
+
+def search_by_smiles(smiles, similarity=100, limit=50):
+    """
+    Search BindingDB by SMILES string.
+
+    Args:
+        smiles: SMILES string of the compound
+        similarity: Tanimoto similarity threshold (1-100, 100 = exact)
+    """
+    params = {
+        "SMILES": smiles,
+        "similarity": similarity,
+        "response": "json",
+        "max_results": limit
+    }
+    return bindingdb_query("getAffinitiesByBEI", params)
+
+# Example: Search for imatinib binding data
+result = search_by_name("imatinib")
+```
+
+### 4. Download-Based Analysis (Recommended for Large Queries)
+
+For comprehensive analyses, download BindingDB data directly:
+
+```python
+import pandas as pd
+
+def load_bindingdb(filepath="BindingDB_All.tsv"):
+    """
+    Load BindingDB TSV file.
+    Download from: https://www.bindingdb.org/bind/chemsearch/marvin/Download.jsp
+    """
+    # Key columns
+    usecols = [
+        "BindingDB Reactant_set_id",
+        "Ligand SMILES",
+        "Ligand InChI",
+        "Ligand InChI Key",
+        "BindingDB Target Chain  Sequence",
+        "PDB ID(s) for Ligand-Target Complex",
+        "UniProt (SwissProt) Entry Name of Target Chain",
+        "UniProt (SwissProt) Primary ID of Target Chain",
+        "UniProt (TrEMBL) Primary ID of Target Chain",
+        "Ki (nM)",
+        "IC50 (nM)",
+        "Kd (nM)",
+        "EC50 (nM)",
+        "kon (M-1-s-1)",
+        "koff (s-1)",
+        "Target Name",
+        "Target Source Organism According to Curator or DataSource",
+        "Number of Protein Chains in Target (>1 implies a multichain complex)",
+        "PubChem CID",
+        "PubChem SID",
+        "ChEMBL ID of Ligand",
+        "DrugBank ID of Ligand",
+    ]
+
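+    # Use a callable so columns absent from a given release are skipped, not fatal
+    wanted = set(usecols)
+    df = pd.read_csv(filepath, sep="\t", usecols=lambda c: c in wanted,
+                     low_memory=False, on_bad_lines='skip')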
+
+    # Convert affinity columns to numeric
+    for col in ["Ki (nM)", "IC50 (nM)", "Kd (nM)", "EC50 (nM)"]:
+        if col in df.columns:
+            df[col] = pd.to_numeric(df[col], errors='coerce')
+
+    return df
+
+def query_target_affinity(df, uniprot_id, affinity_types=None, max_nm=10000):
+    """Query loaded BindingDB for a specific target."""
+    if affinity_types is None:
+        affinity_types = ["Ki (nM)", "IC50 (nM)", "Kd (nM)"]
+
+    # Filter by UniProt ID
+    mask = df["UniProt (SwissProt) Primary ID of Target Chain"] == uniprot_id
+    target_df = df[mask].copy()
+
+    # Filter by affinity cutoff
+    has_affinity = pd.Series(False, index=target_df.index)
+    for col in affinity_types:
+        if col in target_df.columns:
+            has_affinity |= target_df[col] <= max_nm
+
+    result = target_df[has_affinity][["Ligand SMILES"] + affinity_types +
+                                      ["PubChem CID", "ChEMBL ID of Ligand"]].dropna(how='all')
+    return result.sort_values(affinity_types[0])
+```
+
+### 5. SAR Analysis
+
+```python
+import numpy as np
+import pandas as pd
+
+def sar_analysis(df, target_uniprot, affinity_col="IC50 (nM)"):
+    """
+    Structure-activity relationship analysis for a target.
+    Retrieves all compounds with affinity data and ranks by potency.
+    """
+    target_data = query_target_affinity(df, target_uniprot, [affinity_col])
+
+    if target_data.empty:
+        return target_data
+
+    # Add pAffinity: negative log10 of the affinity in molar (pIC50 for IC50 data)
+    if affinity_col in target_data.columns:
+        # Keep positive values only; log10 is undefined for zero or negative input
+        target_data = target_data[target_data[affinity_col] > 0].copy()
+        target_data["pAffinity"] = -np.log10(target_data[affinity_col] * 1e-9)
+        target_data = target_data.sort_values("pAffinity", ascending=False)
+
+    return target_data
+
+# Most potent compounds against EGFR (P00533)
+# sar = sar_analysis(df, "P00533", "IC50 (nM)")
+# print(sar.head(20))
+```
+
+### 6. Polypharmacology Profile
+
+```python
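+def polypharmacology_profile(df, ligand_smiles_or_name, affinity_cutoff_nM=1000):
+    """
+    Find all targets a compound binds to.
+    Matches on an exact "Ligand SMILES" string only; names and PubChem CIDs
+    are not resolved, so pass the SMILES exactly as stored in BindingDB.
+    """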
+    # Search by ligand SMILES (exact match)
+    mask = df["Ligand SMILES"] == ligand_smiles_or_name
+
+    ligand_data = df[mask].copy()
+
+    # Filter by affinity
+    aff_cols = ["Ki (nM)", "IC50 (nM)", "Kd (nM)"]
+    has_aff = pd.Series(False, index=ligand_data.index)
+    for col in aff_cols:
+        if col in ligand_data.columns:
+            has_aff |= ligand_data[col] <= affinity_cutoff_nM
+
+    result = ligand_data[has_aff][
+        ["Target Name", "UniProt (SwissProt) Primary ID of Target Chain"] + aff_cols
+    ].dropna(how='all')
+
+    return result.sort_values("Ki (nM)")
+```
+
+## Query Workflows
+
+### Workflow 1: Find Best Inhibitors for a Target
+
+```python
+import pandas as pd
+
+def find_best_inhibitors(uniprot_id, affinity_type="IC50 (nM)", top_n=20, df=None):
+    """Find the most potent inhibitors for a target in BindingDB."""
+    if df is None:
+        # Loading the full TSV is slow; pass a preloaded df to reuse it across calls
+        df = load_bindingdb("BindingDB_All.tsv")
+    result = query_target_affinity(df, uniprot_id, [affinity_type])
+
+    if result.empty:
+        print(f"No data found for {uniprot_id}")
+        return result
+
+    result = result.sort_values(affinity_type).head(top_n)
+    print(f"Top {top_n} inhibitors for {uniprot_id} by {affinity_type}:")
+    for _, row in result.iterrows():
+        print(f"  {row['PubChem CID']}: {row[affinity_type]:.1f} nM | SMILES: {row['Ligand SMILES'][:40]}...")
+    return result
+```
+
+### Workflow 2: Selectivity Profiling
+
+1. Get all affinity data for your compound across all targets
+2. Compare affinity ratios between on-target and off-targets
+3. Identify selectivity cliffs (structural changes that improve selectivity)
+4. Cross-reference with ChEMBL for additional selectivity data
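+
+The ratio comparison in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: `selectivity_ratios` is a hypothetical helper that is not part of BindingDB, the median is one assumed way to summarize repeated measurements per target, and the toy values are made up for the example.
+
+```python
+import pandas as pd
+
+def selectivity_ratios(profile, on_target_id, affinity_col="Ki (nM)",
+                       id_col="UniProt (SwissProt) Primary ID of Target Chain"):
+    """Fold-selectivity of each off-target relative to the on-target (hypothetical helper)."""
+    # Several measurements per target are common; summarize them with the median
+    per_target = profile.groupby(id_col)[affinity_col].median().dropna()
+    if on_target_id not in per_target.index:
+        raise KeyError(f"No {affinity_col} data for on-target {on_target_id}")
+    on_value = per_target[on_target_id]
+    # Ratio > 1 means the compound binds the off-target more weakly than the on-target
+    ratios = (per_target.drop(on_target_id) / on_value).sort_values()
+    return ratios.rename("fold_selectivity")
+
+# Toy data with illustrative values only (not taken from BindingDB)
+toy = pd.DataFrame({
+    "UniProt (SwissProt) Primary ID of Target Chain": ["T1", "T1", "T2", "T3"],
+    "Ki (nM)": [1.0, 3.0, 200.0, 20.0],
+})
+ratios = selectivity_ratios(toy, "T1")
+print(ratios)
+```
+
+Sorting in ascending order puts the least selective off-targets first, which are usually the ones that need attention.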
+
+### Workflow 3: Machine Learning Dataset Preparation
+
+```python
+import numpy as np
+import pandas as pd
+
+def prepare_ml_dataset(df, uniprot_ids, affinity_col="IC50 (nM)",
+                        max_affinity_nM=100000, min_count=50):
+    """Prepare BindingDB data for ML model training."""
+    records = []
+    for uid in uniprot_ids:
+        target_df = query_target_affinity(df, uid, [affinity_col], max_affinity_nM)
+        # Skip targets with too few measurements to train on
+        if len(target_df) >= min_count:
+            target_df = target_df.copy()
+            target_df["target"] = uid
+            records.append(target_df)
+
+    if not records:
+        return pd.DataFrame()
+
+    combined = pd.concat(records)
+    # Add pAffinity = -log10(affinity in M); clip at 1e-12 M to avoid log10(0)
+    affinity_M = (combined[affinity_col] * 1e-9).clip(lower=1e-12)
+    combined["pAffinity"] = -np.log10(affinity_M)
+    return combined[["Ligand SMILES", "target", "pAffinity", affinity_col]].dropna()
+```
+
+## Key Data Fields
+
+| Field | Description |
+|-------|-------------|
+| `Ligand SMILES` | 2D structure of the compound |
+| `Ligand InChI Key` | Unique chemical identifier |
+| `Ki (nM)` | Inhibition constant (equilibrium, functional) |
+| `Kd (nM)` | Dissociation constant (thermodynamic, binding) |
+| `IC50 (nM)` | Half-maximal inhibitory concentration |
+| `EC50 (nM)` | Half-maximal effective concentration |
+| `kon (M-1-s-1)` | Association rate constant |
+| `koff (s-1)` | Dissociation rate constant |
+| `UniProt (SwissProt) Primary ID` | Target UniProt accession |
+| `Target Name` | Protein name |
+| `PDB ID(s) for Ligand-Target Complex` | Crystal structures |
+| `PubChem CID` | PubChem compound ID |
+| `ChEMBL ID of Ligand` | ChEMBL compound ID |
+
+## Affinity Interpretation
+
+| Affinity | Classification | Interpretation |
+|----------|---------------|---------------|
+| < 1 nM | Sub-nanomolar | Very potent (picomolar range) |
+| 1–10 nM | Nanomolar | Potent, typical for approved drugs |
+| 10–100 nM | Moderate | Common lead compounds |
+| 100–1000 nM | Weak | Fragment/starting point |
+| > 1000 nM | Very weak | Generally below drug-relevance threshold |
+
+## Best Practices
+
+- **Prefer Ki or Kd for cross-study comparison**: Ki is independent of substrate concentration, and Kd comes from direct binding assays
+- **IC50 context-dependency**: IC50 values depend on substrate concentration (Cheng-Prusoff equation)
+- **Normalize units**: BindingDB reports in nM; verify units when comparing across studies
+- **Filter by target organism**: Use `Target Source Organism` to ensure human protein data
+- **Handle missing values**: Not all compounds have all measurement types
+- **Cross-reference with ChEMBL**: ChEMBL has more curated activity data for medicinal chemistry
+
+## Additional Resources
+
+- **BindingDB website**: https://www.bindingdb.org/
+- **Data downloads**: https://www.bindingdb.org/bind/chemsearch/marvin/Download.jsp
+- **API documentation**: https://www.bindingdb.org/bind/BindingDBRESTfulAPI.jsp
+- **Citation**: Gilson MK et al. (2016) Nucleic Acids Research. PMID: 26481362
+- **Related resources**: ChEMBL (https://www.ebi.ac.uk/chembl/), PubChem BioAssay
--- a/scientific-skills/bindingdb-database/references/affinity_queries.md
+++ b/scientific-skills/bindingdb-database/references/affinity_queries.md
@@ -0,0 +1,178 @@
+# BindingDB Affinity Query Reference
+
+## Affinity Measurement Types
+
+### Ki (Inhibition Constant)
+- **Definition**: Equilibrium constant for inhibitor-enzyme complex dissociation
+- **Equation**: Ki = [E][I]/[EI]
+- **Usage**: Enzyme inhibition; preferred for mechanistic studies
+- **Note**: Independent of substrate concentration (unlike IC50)
+
+### Kd (Dissociation Constant)
+- **Definition**: Thermodynamic binding equilibrium constant
+- **Equation**: Kd = [A][B]/[AB]
+- **Usage**: Direct binding assays (SPR, ITC, fluorescence anisotropy)
+- **Note**: True measure of binding strength; lower = tighter binding
+
+### IC50 (Half-Maximal Inhibitory Concentration)
+- **Definition**: Concentration of inhibitor that reduces target activity by 50%
+- **Usage**: Most common in drug discovery; assay-dependent
+- **Conversion to Ki**: Cheng-Prusoff equation: Ki = IC50 / (1 + [S]/Km)
+- **Note**: Depends on substrate concentration and assay conditions
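+
+The Cheng-Prusoff conversion above can be sketched in a few lines. This is a minimal example that assumes competitive inhibition, with `[S]` and `Km` given in the same concentration unit; `ic50_to_ki` is a hypothetical helper name, not a BindingDB function.
+
+```python
+def ic50_to_ki(ic50_nM, substrate_conc, km):
+    """Estimate Ki (nM) from IC50 (nM) via Cheng-Prusoff: Ki = IC50 / (1 + [S]/Km).
+
+    Assumes competitive inhibition; substrate_conc and km must share a unit.
+    """
+    if km <= 0:
+        raise ValueError("km must be positive")
+    if substrate_conc < 0:
+        raise ValueError("substrate_conc must be non-negative")
+    return ic50_nM / (1.0 + substrate_conc / km)
+
+# With [S] equal to Km the correction factor is 2, so Ki is half the IC50
+ki_at_km = ic50_to_ki(100.0, substrate_conc=10.0, km=10.0)
+# With no substrate present the IC50 already equals Ki
+ki_no_substrate = ic50_to_ki(100.0, substrate_conc=0.0, km=10.0)
+```
+
+Because the correction depends on `[S]/Km`, IC50 values from assays run at different substrate concentrations are not directly comparable until converted.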
+
+### EC50 (Half-Maximal Effective Concentration)
+- **Definition**: Concentration that produces 50% of maximal effect
+- **Usage**: Cell-based assays, agonist studies
+
+### Kinetics Parameters
+- **kon**: Association rate constant (M⁻¹s⁻¹); describes how fast complex forms
+- **koff**: Dissociation rate constant (s⁻¹); describes how fast complex dissociates
+- **Residence time**: τ = 1/koff; longer residence = more sustained effect
+- **Kd from kinetics**: Kd = koff/kon
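+
+The two kinetic relationships above can be computed directly from the `kon (M-1-s-1)` and `koff (s-1)` columns. The sketch below uses a hypothetical helper name, `kinetics_summary`, and illustrative rate constants rather than real measurements.
+
+```python
+def kinetics_summary(kon_per_M_s, koff_per_s):
+    """Return (Kd in nM, residence time in s) from association/dissociation rates."""
+    if kon_per_M_s <= 0 or koff_per_s <= 0:
+        raise ValueError("kon and koff must be positive")
+    kd_M = koff_per_s / kon_per_M_s   # Kd = koff / kon, in molar
+    kd_nM = kd_M * 1e9                # convert M to nM to match BindingDB columns
+    residence_time_s = 1.0 / koff_per_s  # tau = 1 / koff
+    return kd_nM, residence_time_s
+
+# Illustrative values: kon = 1e6 M^-1 s^-1 and koff = 1e-3 s^-1
+# give a Kd of about 1 nM and a residence time of about 1000 s
+kd_nM, tau_s = kinetics_summary(1e6, 1e-3)
+```
+
+Two compounds with the same Kd can differ widely in residence time, so it is worth checking `koff` whenever it is reported.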
+
+## Common API Query Patterns
+
+### By UniProt ID (REST API)
+
+```python
+import requests
+
+def query_by_uniprot(uniprot_id, affinity_type="Ki"):
+    """
+    REST API query for BindingDB affinities by UniProt target ID.
+    """
+    url = "https://www.bindingdb.org/axis2/services/BDBService/getLigandsByUniprotID"
+    params = {
+        "uniprot_id": uniprot_id,
+        "cutoff": "10000",  # nM threshold
+        "affinity_type": affinity_type,
+        "response": "json"
+    }
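+    response = requests.get(url, params=params, timeout=30)
+    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
+    return response.json()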
+
+# Important targets
+COMMON_TARGETS = {
+    "ABL1": "P00519",    # Imatinib, dasatinib target
+    "EGFR": "P00533",    # Erlotinib, gefitinib target
+    "BRAF": "P15056",    # Vemurafenib, dabrafenib target
+    "CDK2": "P24941",    # Cell cycle kinase
+    "HDAC1": "Q13547",   # Histone deacetylase
+    "BRD4": "O60885",    # BET bromodomain reader
+    "MDM2": "Q00987",    # p53 negative regulator
+    "BCL2": "P10415",    # Antiapoptotic protein
+    "PCSK9": "Q8NBP7",   # Cholesterol regulator
+    "JAK2": "O60674",    # Cytokine signaling kinase
+}
+```
+
+### By PubChem CID (REST API)
+
+```python
+def query_by_pubchem_cid(pubchem_cid):
+    """Get all binding data for a specific compound by PubChem CID."""
+    url = "https://www.bindingdb.org/axis2/services/BDBService/getAffinitiesByCID"
+    params = {"cid": pubchem_cid, "response": "json"}
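+    response = requests.get(url, params=params, timeout=30)
+    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
+    return response.json()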
+
+# Example: Imatinib PubChem CID = 5291
+imatinib_data = query_by_pubchem_cid(5291)
+```
+
+### By Target Name
+
+```python
+def query_by_target_name(target_name, affinity_cutoff=100):
+    """Query BindingDB by target name."""
+    url = "https://www.bindingdb.org/axis2/services/BDBService/getAffinitiesByTarget"
+    params = {
+        "target_name": target_name,
+        "cutoff": affinity_cutoff,
+        "response": "json"
+    }
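+    response = requests.get(url, params=params, timeout=30)
+    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
+    return response.json()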
+```
+
+## Dataset Download Guide
+
+### Available Files
+
+| File | Size | Contents |
+|------|------|---------|
+| `BindingDB_All.tsv.zip` | ~3.5 GB | All data: ~2.9M records |
+| `BindingDB_All.sdf.zip` | ~7 GB | All data with 3D structures |
+| `BindingDB_IC50.tsv` | ~1.5 GB | IC50 data only |
+| `BindingDB_Ki.tsv` | ~0.8 GB | Ki data only |
+| `BindingDB_Kd.tsv` | ~0.2 GB | Kd data only |
+| `BindingDB_EC50.tsv` | ~0.5 GB | EC50 data only |
+| `tdc_bindingdb_*` | Various | TDC-formatted subsets |
+
+### Efficient Loading
+
+```python
+import pandas as pd
+
+# For large files, use chunking
+def load_bindingdb_chunked(filepath, uniprot_ids, affinity_col="Ki (nM)", chunk_size=100000):
+    """Load BindingDB in chunks, keeping rows for the given targets that report affinity_col."""
+    results = []
+    for chunk in pd.read_csv(filepath, sep="\t", chunksize=chunk_size,
+                              low_memory=False, on_bad_lines='skip'):
+        # Keep only rows for the requested targets
+        mask = chunk["UniProt (SwissProt) Primary ID of Target Chain"].isin(uniprot_ids)
+        # Keep only rows with a numeric value in the requested affinity column
+        mask &= pd.to_numeric(chunk[affinity_col], errors="coerce").notna()
+        if mask.any():
+            results.append(chunk[mask])
+
+    if results:
+        return pd.concat(results)
+    return pd.DataFrame()
+```
+
+## pKi / pIC50 Conversion
+
+Converting raw affinity to logarithmic scale (common in ML):
+
+```python
+import numpy as np
+
+def to_log_affinity(affinity_nM):
+    """Convert nM affinity to pAffinity (negative log molar)."""
+    affinity_M = affinity_nM * 1e-9  # Convert nM to M
+    return -np.log10(affinity_M)
+
+# Examples:
+# 1 nM   → pAffinity = 9.0
+# 10 nM  → pAffinity = 8.0
+# 100 nM → pAffinity = 7.0
+# 1 μM   → pAffinity = 6.0
+# 10 μM  → pAffinity = 5.0
+```
+
+## Quality Filters
+
+When using BindingDB data for ML or SAR:
+
+```python
+def filter_quality(df):
+    """Apply quality filters to BindingDB data."""
+    # 1. Require valid SMILES
+    df = df[df["Ligand SMILES"].notna() & (df["Ligand SMILES"] != "")]
+
+    # 2. Require valid affinity
+    df = df[df["Ki (nM)"].notna() | df["IC50 (nM)"].notna()]
+
+    # 3. Filter extreme values (artifacts)
+    for col in ["Ki (nM)", "IC50 (nM)", "Kd (nM)"]:
+        if col in df.columns:
+            df = df[~(df[col] > 1e6)]  # Remove > 1 mM (non-specific)
+
+    # 4. Use only human targets
+    if "Target Source Organism According to Curator or DataSource" in df.columns:
+        df = df[df["Target Source Organism According to Curator or DataSource"].str.contains(
+            "Homo sapiens", na=False
+        )]
+
+    return df
+```