Add more scientific skills

2026-01-26 16:58:56 +08:00 · 2025-10-19 14:12:02 -07:00
parent 78d5ac2b56
commit 660c8574d0
210 changed files with 88957 additions and 1 deletions
--- a/scientific-packages/biomni/references/api_reference.md
+++ b/scientific-packages/biomni/references/api_reference.md
@@ -0,0 +1,635 @@
+# Biomni API Reference
+
+This document provides comprehensive API documentation for the Biomni biomedical AI agent system.
+
+## Core Classes
+
+### A1 Agent
+
+The primary agent class for executing biomedical research tasks.
+
+#### Initialization
+
+```python
+from biomni.agent import A1
+
+agent = A1(
+    path='./data',              # Path to biomedical knowledge base
+    llm='claude-sonnet-4-20250514',  # LLM model identifier
+    timeout=None,               # Optional timeout in seconds
+    verbose=True               # Enable detailed logging
+)
+```
+
+**Parameters:**
+
+- `path` (str, required): Directory path where the biomedical knowledge base is stored or will be downloaded. First-time initialization will download ~11GB of data.
+- `llm` (str, optional): LLM model identifier. Defaults to the value in `default_config.llm`. Supports multiple providers (see LLM Providers section).
+- `timeout` (int, optional): Maximum execution time in seconds for agent operations. Overrides `default_config.timeout_seconds`.
+- `verbose` (bool, optional): Enable verbose logging for debugging. Default: True.
+
+**Returns:** A1 agent instance ready for task execution.
+
+#### Methods
+
+##### `go(task_description: str) -> None`
+
+Execute a biomedical research task autonomously.
+
+```python
+agent.go("Analyze this scRNA-seq dataset and identify cell types")
+```
+
+**Parameters:**
+- `task_description` (str, required): Natural language description of the biomedical task to execute. Be specific about:
+  - Data location and format
+  - Desired analysis or output
+  - Any specific methods or parameters
+  - Expected results format
+
+**Behavior:**
+1. Decomposes the task into executable steps
+2. Retrieves relevant biomedical knowledge from the data lake
+3. Generates and executes Python/R code
+4. Provides results and visualizations
+5. Handles errors and retries with refinement
+
+**Notes:**
+- Executes generated code with the privileges of the calling process; run it in a sandboxed environment
+- Long-running tasks may require timeout adjustments
+- Intermediate results are displayed during execution
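+
+The four points listed under `task_description` can be assembled mechanically. The sketch below is one way to do that; `build_task_description` and the example path are hypothetical and not part of Biomni.
+
+```python
+def build_task_description(data_path, data_format, analysis,
+                           methods=None, output_format=None):
+    """Assemble a specific task description from its parts (hypothetical helper)."""
+    parts = [
+        f"Data: {data_path} ({data_format} format).",
+        f"Analysis: {analysis}.",
+    ]
+    # Optional details are appended only when supplied.
+    if methods:
+        parts.append(f"Methods: {methods}.")
+    if output_format:
+        parts.append(f"Expected output: {output_format}.")
+    return " ".join(parts)
+
+
+task = build_task_description(
+    data_path="data/pbmc.h5ad",  # illustrative path, not a bundled dataset
+    data_format="AnnData",
+    analysis="identify cell types",
+    methods="Leiden clustering at resolution 0.5",
+    output_format="a CSV table of cluster labels",
+)
+print(task)
+```
+
+The resulting string can then be passed to `agent.go(task)`.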
+
+##### `save_conversation_history(output_path: str, format: str = 'pdf') -> None`
+
+Export conversation history and execution trace as a formatted report.
+
+```python
+agent.save_conversation_history(
+    output_path='./reports/analysis_log.pdf',
+    format='pdf'
+)
+```
+
+**Parameters:**
+- `output_path` (str, required): File path for the output report
+- `format` (str, optional): Output format. Options: 'pdf', 'markdown'. Default: 'pdf'
+
+**Requirements:**
+- For PDF: Install one of: WeasyPrint, markdown2pdf, or Pandoc
+  ```bash
+  pip install weasyprint  # Recommended
+  # or
+  pip install markdown2pdf
+  # or install Pandoc system-wide
+  ```
+
+**Report Contents:**
+- Task description and parameters
+- Retrieved biomedical knowledge
+- Generated code with execution traces
+- Results, visualizations, and outputs
+- Timestamps and execution metadata
+
+##### `add_mcp(config_path: str) -> None`
+
+Add Model Context Protocol (MCP) tools to extend agent capabilities.
+
+```python
+agent.add_mcp(config_path='./mcp_tools_config.json')
+```
+
+**Parameters:**
+- `config_path` (str, required): Path to MCP configuration JSON file
+
+**MCP Configuration Format:**
+```json
+{
+  "tools": [
+    {
+      "name": "tool_name",
+      "endpoint": "http://localhost:8000/tool",
+      "description": "Tool description for LLM",
+      "parameters": {
+        "param1": "string",
+        "param2": "integer"
+      }
+    }
+  ]
+}
+```
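+
+A configuration file can be checked against the format above before it is passed to `add_mcp`. The following is a minimal sketch; `check_mcp_config` is a hypothetical helper, and treating all four keys as required is an assumption drawn from the example rather than a documented rule.
+
+```python
+import json
+
+# Assumption: every tool entry needs the four keys shown in the example.
+REQUIRED_TOOL_KEYS = {"name", "endpoint", "description", "parameters"}
+
+
+def check_mcp_config(config):
+    """Return a list of problems found in a parsed MCP config dict."""
+    tools = config.get("tools")
+    if not isinstance(tools, list):
+        return ["top-level 'tools' must be a list"]
+    problems = []
+    for i, tool in enumerate(tools):
+        if not isinstance(tool, dict):
+            problems.append(f"tools[{i}] must be an object")
+            continue
+        missing = REQUIRED_TOOL_KEYS - tool.keys()
+        if missing:
+            problems.append(f"tools[{i}] is missing: {sorted(missing)}")
+    return problems
+
+
+example = json.loads("""
+{"tools": [{"name": "tool_name",
+            "endpoint": "http://localhost:8000/tool",
+            "description": "Tool description for LLM",
+            "parameters": {"param1": "string", "param2": "integer"}}]}
+""")
+print(check_mcp_config(example))  # an empty list means no problems were found
+```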
+
+**Use Cases:**
+- Connect to laboratory information systems
+- Integrate proprietary databases
+- Access specialized computational resources
+- Link to institutional data repositories
+
+## Configuration
+
+### default_config
+
+Global configuration object for Biomni settings.
+
+```python
+from biomni.config import default_config
+```
+
+#### Attributes
+
+##### `llm: str`
+
+Default LLM model identifier for all agent instances.
+
+```python
+default_config.llm = "claude-sonnet-4-20250514"
+```
+
+**Supported Models:**
+
+**Anthropic:**
+- `claude-sonnet-4-20250514` (Recommended)
+- `claude-opus-4-20250514`
+- `claude-3-5-sonnet-20241022`
+- `claude-3-opus-20240229`
+
+**OpenAI:**
+- `gpt-4o`
+- `gpt-4`
+- `gpt-4-turbo`
+- `gpt-3.5-turbo`
+
+**Azure OpenAI:**
+- `azure/gpt-4`
+- `azure/<deployment-name>`
+
+**Google Gemini:**
+- `gemini/gemini-pro`
+- `gemini/gemini-1.5-pro`
+
+**Groq:**
+- `groq/llama-3.1-70b-versatile`
+- `groq/mixtral-8x7b-32768`
+
+**Ollama (Local):**
+- `ollama/llama3`
+- `ollama/mistral`
+- `ollama/<model-name>`
+
+**AWS Bedrock:**
+- `bedrock/anthropic.claude-v2`
+- `bedrock/anthropic.claude-3-sonnet`
+
+**Custom/Biomni-R0:**
+- `openai/biomni-r0` (requires local SGLang deployment)
+
+##### `timeout_seconds: int`
+
+Default timeout for agent operations in seconds.
+
+```python
+default_config.timeout_seconds = 1200  # 20 minutes
+```
+
+**Recommended Values:**
+- Simple tasks (QC, basic analysis): 300-600 seconds
+- Medium tasks (differential expression, clustering): 600-1200 seconds
+- Complex tasks (full pipelines, ML models): 1200-3600 seconds
+- Very complex tasks: 3600+ seconds
+
+##### `data_path: str`
+
+Default path to biomedical knowledge base.
+
+```python
+default_config.data_path = "/path/to/biomni/data"
+```
+
+**Storage Requirements:**
+- Initial download: ~11GB
+- Extracted size: ~15GB
+- Additional working space: ~5-10GB recommended
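+
+Free disk space can be checked before the first download. The sketch below uses only the standard library; `has_enough_space` is a hypothetical helper, and the 25 GB figure simply adds the extracted size to the upper end of the recommended working space.
+
+```python
+import shutil
+
+
+def has_enough_space(path, required_gb):
+    """Return True if the filesystem holding `path` has at least required_gb free."""
+    free_gb = shutil.disk_usage(path).free / 1024**3
+    return free_gb >= required_gb
+
+
+# ~15 GB extracted plus up to ~10 GB of working space.
+print(has_enough_space(".", required_gb=25))
+```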
+
+##### `api_base: str`
+
+Custom API endpoint for LLM providers (advanced usage).
+
+```python
+# For local Biomni-R0 deployment
+default_config.api_base = "http://localhost:30000/v1"
+
+# For custom OpenAI-compatible endpoints
+default_config.api_base = "https://your-endpoint.com/v1"
+```
+
+##### `max_retries: int`
+
+Number of retry attempts for failed operations.
+
+```python
+default_config.max_retries = 3
+```
+
+#### Methods
+
+##### `reset() -> None`
+
+Reset all configuration values to system defaults.
+
+```python
+default_config.reset()
+```
+
+## Database Query System
+
+Biomni includes a retrieval-augmented generation (RAG) system for querying the biomedical knowledge base.
+
+### Query Functions
+
+#### `query_genes(query: str, top_k: int = 10) -> List[Dict]`
+
+Query gene information from integrated databases.
+
+```python
+from biomni.database import query_genes
+
+results = query_genes(
+    query="genes involved in p53 pathway",
+    top_k=20
+)
+```
+
+**Parameters:**
+- `query` (str): Natural language or gene identifier query
+- `top_k` (int): Number of results to return
+
+**Returns:** List of dictionaries containing:
+- `gene_symbol`: Official gene symbol
+- `gene_name`: Full gene name
+- `description`: Functional description
+- `pathways`: Associated biological pathways
+- `go_terms`: Gene Ontology annotations
+- `diseases`: Associated diseases
+- `similarity_score`: Relevance score (0-1)
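+
+Because each result carries a `similarity_score`, low-relevance hits can be dropped before the results are used elsewhere. This is a sketch only: `filter_by_score` is a hypothetical helper, and the sample records and their scores are invented to illustrate the shape of the data.
+
+```python
+def filter_by_score(results, min_score=0.5):
+    """Keep results at or above min_score, highest score first."""
+    kept = [r for r in results if r.get("similarity_score", 0.0) >= min_score]
+    return sorted(kept, key=lambda r: r["similarity_score"], reverse=True)
+
+
+# Illustrative records shaped like the return value described above;
+# the scores are made up for this example.
+sample = [
+    {"gene_symbol": "MDM2", "similarity_score": 0.84},
+    {"gene_symbol": "TP53", "similarity_score": 0.91},
+    {"gene_symbol": "ACTB", "similarity_score": 0.12},
+]
+top = filter_by_score(sample, min_score=0.5)
+print([r["gene_symbol"] for r in top])
+```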
+
+#### `query_proteins(query: str, top_k: int = 10) -> List[Dict]`
+
+Query protein information from UniProt and other sources.
+
+```python
+from biomni.database import query_proteins
+
+results = query_proteins(
+    query="kinase proteins in cell cycle",
+    top_k=15
+)
+```
+
+**Returns:** List of dictionaries with protein metadata:
+- `uniprot_id`: UniProt accession
+- `protein_name`: Protein name
+- `function`: Functional annotation
+- `domains`: Protein domains
+- `subcellular_location`: Cellular localization
+- `similarity_score`: Relevance score
+
+#### `query_drugs(query: str, top_k: int = 10) -> List[Dict]`
+
+Query drug and compound information.
+
+```python
+from biomni.database import query_drugs
+
+results = query_drugs(
+    query="FDA approved cancer drugs targeting EGFR",
+    top_k=10
+)
+```
+
+**Returns:** Drug information including:
+- `drug_name`: Common name
+- `drugbank_id`: DrugBank identifier
+- `indication`: Therapeutic indication
+- `mechanism`: Mechanism of action
+- `targets`: Molecular targets
+- `approval_status`: Regulatory status
+- `smiles`: Chemical structure (SMILES notation)
+
+#### `query_diseases(query: str, top_k: int = 10) -> List[Dict]`
+
+Query disease information from clinical databases.
+
+```python
+from biomni.database import query_diseases
+
+results = query_diseases(
+    query="autoimmune diseases affecting joints",
+    top_k=10
+)
+```
+
+**Returns:** Disease data:
+- `disease_name`: Standard disease name
+- `disease_id`: Ontology identifier
+- `symptoms`: Clinical manifestations
+- `associated_genes`: Genetic associations
+- `prevalence`: Epidemiological data
+
+#### `query_pathways(query: str, top_k: int = 10) -> List[Dict]`
+
+Query biological pathways from KEGG, Reactome, and other sources.
+
+```python
+from biomni.database import query_pathways
+
+results = query_pathways(
+    query="immune response signaling pathways",
+    top_k=15
+)
+```
+
+**Returns:** Pathway information:
+- `pathway_name`: Pathway name
+- `pathway_id`: Database identifier
+- `genes`: Genes in pathway
+- `description`: Functional description
+- `source`: Database source (KEGG, Reactome, etc.)
+
+## Data Structures
+
+### TaskResult
+
+Result object returned by complex agent operations.
+
+```python
+from typing import Any, Dict, Optional
+
+class TaskResult:
+    success: bool           # Whether task completed successfully
+    output: Any            # Task output (varies by task)
+    code: str             # Generated code
+    execution_time: float # Execution time in seconds
+    error: Optional[str]  # Error message if failed
+    metadata: Dict        # Additional metadata
+```
+
+### BiomedicalEntity
+
+Base class for biomedical entities in the knowledge base.
+
+```python
+from typing import Dict, List
+
+class BiomedicalEntity:
+    entity_id: str        # Unique identifier
+    entity_type: str      # Type (gene, protein, drug, etc.)
+    name: str            # Entity name
+    description: str     # Description
+    attributes: Dict     # Additional attributes
+    references: List[str] # Literature references
+```
+
+## Utility Functions
+
+### `download_data(path: str, force: bool = False) -> None`
+
+Manually download or update the biomedical knowledge base.
+
+```python
+from biomni.utils import download_data
+
+download_data(
+    path='./data',
+    force=True  # Force re-download
+)
+```
+
+### `validate_environment() -> Dict[str, bool]`
+
+Check if the environment is properly configured.
+
+```python
+from biomni.utils import validate_environment
+
+status = validate_environment()
+# Returns: {
+#   'conda_env': True,
+#   'api_keys': True,
+#   'data_available': True,
+#   'dependencies': True
+# }
+```
+
+### `list_available_models() -> List[str]`
+
+Get a list of available LLM models based on configured API keys.
+
+```python
+from biomni.utils import list_available_models
+
+models = list_available_models()
+# Returns: ['claude-sonnet-4-20250514', 'gpt-4o', ...]
+```
+
+## Error Handling
+
+### Common Exceptions
+
+#### `BiomniConfigError`
+
+Raised when configuration is invalid or incomplete.
+
+```python
+from biomni.agent import A1
+from biomni.exceptions import BiomniConfigError
+
+try:
+    agent = A1(path='./data')
+except BiomniConfigError as e:
+    print(f"Configuration error: {e}")
+```
+
+#### `BiomniExecutionError`
+
+Raised when code generation or execution fails.
+
+```python
+from biomni.exceptions import BiomniExecutionError
+
+try:
+    agent.go("invalid task")
+except BiomniExecutionError as e:
+    print(f"Execution failed: {e}")
+    # Access failed code: e.code
+    # Access error details: e.details
+```
+
+#### `BiomniDataError`
+
+Raised when knowledge base or data access fails.
+
+```python
+from biomni.database import query_genes
+from biomni.exceptions import BiomniDataError
+
+try:
+    results = query_genes("unknown query format")
+except BiomniDataError as e:
+    print(f"Data access error: {e}")
+```
+
+#### `BiomniTimeoutError`
+
+Raised when operations exceed timeout limit.
+
+```python
+from biomni.exceptions import BiomniTimeoutError
+
+try:
+    agent.go("very complex long-running task")
+except BiomniTimeoutError as e:
+    print(f"Task timed out after {e.duration} seconds")
+    # Partial results may be available: e.partial_results
+```
+
+## Best Practices
+
+### Efficient Knowledge Retrieval
+
+Pre-query databases for relevant context before complex tasks:
+
+```python
+from biomni.database import query_genes, query_pathways
+
+# Gather relevant biological context first
+genes = query_genes("cell cycle genes", top_k=50)
+pathways = query_pathways("cell cycle regulation", top_k=20)
+
+# Then execute task with enriched context
+agent.go(f"""
+Analyze the cell cycle progression in this dataset.
+Focus on these genes: {[g['gene_symbol'] for g in genes]}
+Consider these pathways: {[p['pathway_name'] for p in pathways]}
+""")
+```
+
+### Error Recovery
+
+Implement robust error handling for production workflows:
+
+```python
+from biomni.config import default_config
+from biomni.exceptions import BiomniExecutionError, BiomniTimeoutError
+
+max_attempts = 3
+for attempt in range(max_attempts):
+    try:
+        agent.go("complex biomedical task")
+        break
+    except BiomniTimeoutError:
+        # Increase timeout and retry
+        default_config.timeout_seconds *= 2
+        print(f"Timeout, retrying with {default_config.timeout_seconds}s timeout")
+    except BiomniExecutionError as e:
+        # Refine task based on error
+        print(f"Execution failed: {e}, refining task...")
+        # Optionally modify task description
+else:
+    # for/else: runs only when the loop finishes without hitting `break`
+    print("Task failed after max attempts")
+```
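+
+The timeout-doubling part of the retry loop above can also be factored into a reusable function. The sketch below is deliberately independent of Biomni: `run_with_growing_timeout` and `fake_task` are hypothetical names, and the built-in `TimeoutError` stands in for `BiomniTimeoutError`.
+
+```python
+def run_with_growing_timeout(run, initial_timeout, max_attempts=3,
+                             timeout_error=TimeoutError):
+    """Call run(timeout) until it succeeds, doubling the timeout after each timeout."""
+    timeout = initial_timeout
+    for attempt in range(1, max_attempts + 1):
+        try:
+            return run(timeout)
+        except timeout_error:
+            print(f"Attempt {attempt} timed out at {timeout}s")
+            timeout *= 2
+    raise RuntimeError(f"Task failed after {max_attempts} attempts")
+
+
+# Stand-in task that only succeeds once the timeout reaches 1200 seconds.
+def fake_task(timeout):
+    if timeout < 1200:
+        raise TimeoutError
+    return f"finished with timeout={timeout}"
+
+
+result = run_with_growing_timeout(fake_task, initial_timeout=300)
+print(result)
+```
+
+In a real workflow, `run` could create an agent with the given timeout and call `go`, with `timeout_error=BiomniTimeoutError`.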
+
+### Memory Management
+
+For large-scale analyses, manage memory explicitly:
+
+```python
+import gc
+
+num_chunks = 10  # Example value; set this to the number of chunks in your dataset
+
+# Process datasets in chunks
+for chunk_id in range(num_chunks):
+    agent.go(f"Process data chunk {chunk_id} located at data/chunk_{chunk_id}.h5ad")
+
+    # Force garbage collection between chunks
+    gc.collect()
+
+    # Save intermediate results
+    agent.save_conversation_history(f"./reports/chunk_{chunk_id}.pdf")
+```
+
+### Reproducibility
+
+Ensure reproducible analyses by:
+
+1. **Fixing random seeds:**
+```python
+agent.go("Set random seed to 42 for all analyses, then perform clustering...")
+```
+
+2. **Logging configuration:**
+```python
+import json
+from datetime import datetime
+
+from biomni.config import default_config
+
+config_log = {
+    'llm': default_config.llm,
+    'timeout': default_config.timeout_seconds,
+    'data_path': default_config.data_path,
+    'timestamp': datetime.now().isoformat()
+}
+with open('config_log.json', 'w') as f:
+    json.dump(config_log, f, indent=2)
+```
+
+3. **Saving execution traces:**
+```python
+# Always save detailed reports
+agent.save_conversation_history('./reports/full_analysis.pdf')
+```
+
+## Performance Optimization
+
+### Model Selection Strategy
+
+Choose models based on task characteristics:
+
+```python
+# For exploratory, simple tasks
+default_config.llm = "gpt-3.5-turbo"  # Fast, cost-effective
+
+# For standard biomedical analyses
+default_config.llm = "claude-sonnet-4-20250514"  # Recommended
+
+# For complex reasoning and hypothesis generation
+default_config.llm = "claude-opus-4-20250514"  # Highest quality
+
+# For specialized biological reasoning
+default_config.llm = "openai/biomni-r0"  # Requires local deployment
+```
+
+### Timeout Tuning
+
+Set appropriate timeouts based on task complexity:
+
+```python
+# Quick queries and simple analyses
+agent = A1(path='./data', timeout=300)
+
+# Standard workflows
+agent = A1(path='./data', timeout=1200)
+
+# Full pipelines with ML training
+agent = A1(path='./data', timeout=3600)
+```
+
+### Caching and Reuse
+
+Reuse agent instances for multiple related tasks:
+
+```python
+# Create agent once
+agent = A1(path='./data', llm='claude-sonnet-4-20250514')
+
+# Execute multiple related tasks
+tasks = [
+    "Load and QC the scRNA-seq dataset",
+    "Perform clustering with resolution 0.5",
+    "Identify marker genes for each cluster",
+    "Annotate cell types based on markers"
+]
+
+for task in tasks:
+    agent.go(task)
+
+# Save complete workflow
+agent.save_conversation_history('./reports/full_workflow.pdf')
+```