# Biomni API Reference

This document provides comprehensive API documentation for the Biomni biomedical AI agent system.

## Core Classes

### A1 Agent

The primary agent class for executing biomedical research tasks.

#### Initialization

```python
from biomni.agent import A1

agent = A1(
    path='./data',                      # Path to biomedical knowledge base
    llm='claude-sonnet-4-20250514',     # LLM model identifier
    timeout=None,                       # Optional timeout in seconds
    verbose=True                        # Enable detailed logging
)
```

**Parameters:**

- `path` (str, required): Directory path where the biomedical knowledge base is stored or will be downloaded. First-time initialization will download ~11GB of data.
- `llm` (str, optional): LLM model identifier. Defaults to the value in `default_config.llm`. Supports multiple providers (see LLM Providers section).
- `timeout` (int, optional): Maximum execution time in seconds for agent operations. Overrides `default_config.timeout_seconds`.
- `verbose` (bool, optional): Enable verbose logging for debugging. Default: True.

**Returns:** A1 agent instance ready for task execution.

#### Methods

##### `go(task_description: str) -> None`

Execute a biomedical research task autonomously.

```python
agent.go("Analyze this scRNA-seq dataset and identify cell types")
```

**Parameters:**

- `task_description` (str, required): Natural language description of the biomedical task to execute. Be specific about:
  - Data location and format
  - Desired analysis or output
  - Any specific methods or parameters
  - Expected results format

**Behavior:**

1. Decomposes the task into executable steps
2. Retrieves relevant biomedical knowledge from the data lake
3. Generates and executes Python/R code
4. Provides results and visualizations
5. Handles errors and retries with refinement

**Notes:**

- Executes code with system privileges; use in sandboxed environments
- Long-running tasks may require timeout adjustments
- Intermediate results are displayed during execution

##### `save_conversation_history(output_path: str, format: str = 'pdf') -> None`

Export conversation history and execution trace as a formatted report.

```python
agent.save_conversation_history(
    output_path='./reports/analysis_log.pdf',
    format='pdf'
)
```

**Parameters:**

- `output_path` (str, required): File path for the output report
- `format` (str, optional): Output format. Options: 'pdf', 'markdown'. Default: 'pdf'

**Requirements:**

- For PDF: Install one of: WeasyPrint, markdown2pdf, or Pandoc

  ```bash
  pip install weasyprint  # Recommended
  # or
  pip install markdown2pdf
  # or install Pandoc system-wide
  ```

**Report Contents:**

- Task description and parameters
- Retrieved biomedical knowledge
- Generated code with execution traces
- Results, visualizations, and outputs
- Timestamps and execution metadata

##### `add_mcp(config_path: str) -> None`

Add Model Context Protocol (MCP) tools to extend agent capabilities.

```python
agent.add_mcp(config_path='./mcp_tools_config.json')
```

**Parameters:**

- `config_path` (str, required): Path to MCP configuration JSON file

**MCP Configuration Format:**

```json
{
  "tools": [
    {
      "name": "tool_name",
      "endpoint": "http://localhost:8000/tool",
      "description": "Tool description for LLM",
      "parameters": {
        "param1": "string",
        "param2": "integer"
      }
    }
  ]
}
```

**Use Cases:**

- Connect to laboratory information systems
- Integrate proprietary databases
- Access specialized computational resources
- Link to institutional data repositories

## Configuration

### default_config

Global configuration object for Biomni settings.

```python
from biomni.config import default_config
```

#### Attributes

##### `llm: str`

Default LLM model identifier for all agent instances.

```python
default_config.llm = "claude-sonnet-4-20250514"
```

**Supported Models:**

**Anthropic:**

- `claude-sonnet-4-20250514` (Recommended)
- `claude-opus-4-20250514`
- `claude-3-5-sonnet-20241022`
- `claude-3-opus-20240229`

**OpenAI:**

- `gpt-4o`
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`

**Azure OpenAI:**

- `azure/gpt-4`
- `azure/`

**Google Gemini:**

- `gemini/gemini-pro`
- `gemini/gemini-1.5-pro`

**Groq:**

- `groq/llama-3.1-70b-versatile`
- `groq/mixtral-8x7b-32768`

**Ollama (Local):**

- `ollama/llama3`
- `ollama/mistral`
- `ollama/`

**AWS Bedrock:**

- `bedrock/anthropic.claude-v2`
- `bedrock/anthropic.claude-3-sonnet`

**Custom/Biomni-R0:**

- `openai/biomni-r0` (requires local SGLang deployment)

##### `timeout_seconds: int`

Default timeout for agent operations in seconds.

```python
default_config.timeout_seconds = 1200  # 20 minutes
```

**Recommended Values:**

- Simple tasks (QC, basic analysis): 300-600 seconds
- Medium tasks (differential expression, clustering): 600-1200 seconds
- Complex tasks (full pipelines, ML models): 1200-3600 seconds
- Very complex tasks: 3600+ seconds

##### `data_path: str`

Default path to biomedical knowledge base.

```python
default_config.data_path = "/path/to/biomni/data"
```

**Storage Requirements:**

- Initial download: ~11GB
- Extracted size: ~15GB
- Additional working space: ~5-10GB recommended

##### `api_base: str`

Custom API endpoint for LLM providers (advanced usage).

```python
# For local Biomni-R0 deployment
default_config.api_base = "http://localhost:30000/v1"

# For custom OpenAI-compatible endpoints
default_config.api_base = "https://your-endpoint.com/v1"
```

##### `max_retries: int`

Number of retry attempts for failed operations.

```python
default_config.max_retries = 3
```

#### Methods

##### `reset() -> None`

Reset all configuration values to system defaults.

```python
default_config.reset()
```

## Database Query System

Biomni includes a retrieval-augmented generation (RAG) system for querying the biomedical knowledge base.

### Query Functions

#### `query_genes(query: str, top_k: int = 10) -> List[Dict]`

Query gene information from integrated databases.

```python
from biomni.database import query_genes

results = query_genes(
    query="genes involved in p53 pathway",
    top_k=20
)
```

**Parameters:**

- `query` (str): Natural language or gene identifier query
- `top_k` (int): Number of results to return

**Returns:** List of dictionaries containing:

- `gene_symbol`: Official gene symbol
- `gene_name`: Full gene name
- `description`: Functional description
- `pathways`: Associated biological pathways
- `go_terms`: Gene Ontology annotations
- `diseases`: Associated diseases
- `similarity_score`: Relevance score (0-1)

#### `query_proteins(query: str, top_k: int = 10) -> List[Dict]`

Query protein information from UniProt and other sources.

```python
from biomni.database import query_proteins

results = query_proteins(
    query="kinase proteins in cell cycle",
    top_k=15
)
```

**Returns:** List of dictionaries with protein metadata:

- `uniprot_id`: UniProt accession
- `protein_name`: Protein name
- `function`: Functional annotation
- `domains`: Protein domains
- `subcellular_location`: Cellular localization
- `similarity_score`: Relevance score

#### `query_drugs(query: str, top_k: int = 10) -> List[Dict]`

Query drug and compound information.

```python
from biomni.database import query_drugs

results = query_drugs(
    query="FDA approved cancer drugs targeting EGFR",
    top_k=10
)
```

**Returns:** Drug information including:

- `drug_name`: Common name
- `drugbank_id`: DrugBank identifier
- `indication`: Therapeutic indication
- `mechanism`: Mechanism of action
- `targets`: Molecular targets
- `approval_status`: Regulatory status
- `smiles`: Chemical structure (SMILES notation)

#### `query_diseases(query: str, top_k: int = 10) -> List[Dict]`

Query disease information from clinical databases.

```python
from biomni.database import query_diseases

results = query_diseases(
    query="autoimmune diseases affecting joints",
    top_k=10
)
```

**Returns:** Disease data:

- `disease_name`: Standard disease name
- `disease_id`: Ontology identifier
- `symptoms`: Clinical manifestations
- `associated_genes`: Genetic associations
- `prevalence`: Epidemiological data

#### `query_pathways(query: str, top_k: int = 10) -> List[Dict]`

Query biological pathways from KEGG, Reactome, and other sources.

```python
from biomni.database import query_pathways

results = query_pathways(
    query="immune response signaling pathways",
    top_k=15
)
```

**Returns:** Pathway information:

- `pathway_name`: Pathway name
- `pathway_id`: Database identifier
- `genes`: Genes in pathway
- `description`: Functional description
- `source`: Database source (KEGG, Reactome, etc.)

## Data Structures

### TaskResult

Result object returned by complex agent operations.

```python
from typing import Any, Dict, Optional

class TaskResult:
    success: bool              # Whether task completed successfully
    output: Any                # Task output (varies by task)
    code: str                  # Generated code
    execution_time: float      # Execution time in seconds
    error: Optional[str]       # Error message if failed
    metadata: Dict             # Additional metadata
```

### BiomedicalEntity

Base class for biomedical entities in the knowledge base.

```python
from typing import Dict, List

class BiomedicalEntity:
    entity_id: str             # Unique identifier
    entity_type: str           # Type (gene, protein, drug, etc.)
    name: str                  # Entity name
    description: str           # Description
    attributes: Dict           # Additional attributes
    references: List[str]      # Literature references
```

## Utility Functions

### `download_data(path: str, force: bool = False) -> None`

Manually download or update the biomedical knowledge base.

```python
from biomni.utils import download_data

download_data(
    path='./data',
    force=True  # Force re-download
)
```

### `validate_environment() -> Dict[str, bool]`

Check if the environment is properly configured.

```python
from biomni.utils import validate_environment

status = validate_environment()
# Returns: {
#   'conda_env': True,
#   'api_keys': True,
#   'data_available': True,
#   'dependencies': True
# }
```

### `list_available_models() -> List[str]`

Get a list of available LLM models based on configured API keys.

```python
from biomni.utils import list_available_models

models = list_available_models()
# Returns: ['claude-sonnet-4-20250514', 'gpt-4o', ...]
```

## Error Handling

### Common Exceptions

#### `BiomniConfigError`

Raised when configuration is invalid or incomplete.

```python
from biomni.agent import A1
from biomni.exceptions import BiomniConfigError

try:
    agent = A1(path='./data')
except BiomniConfigError as e:
    print(f"Configuration error: {e}")
```

#### `BiomniExecutionError`

Raised when code generation or execution fails.

```python
from biomni.exceptions import BiomniExecutionError

try:
    agent.go("invalid task")
except BiomniExecutionError as e:
    print(f"Execution failed: {e}")
    # Access failed code: e.code
    # Access error details: e.details
```

#### `BiomniDataError`

Raised when knowledge base or data access fails.

```python
from biomni.database import query_genes
from biomni.exceptions import BiomniDataError

try:
    results = query_genes("unknown query format")
except BiomniDataError as e:
    print(f"Data access error: {e}")
```

#### `BiomniTimeoutError`

Raised when operations exceed timeout limit.

```python
from biomni.exceptions import BiomniTimeoutError

try:
    agent.go("very complex long-running task")
except BiomniTimeoutError as e:
    print(f"Task timed out after {e.duration} seconds")
    # Partial results may be available: e.partial_results
```

## Best Practices

### Efficient Knowledge Retrieval

Pre-query databases for relevant context before complex tasks:

```python
from biomni.database import query_genes, query_pathways

# Gather relevant biological context first
genes = query_genes("cell cycle genes", top_k=50)
pathways = query_pathways("cell cycle regulation", top_k=20)

# Then execute task with enriched context
agent.go(f"""
Analyze the cell cycle progression in this dataset.
Focus on these genes: {[g['gene_symbol'] for g in genes]}
Consider these pathways: {[p['pathway_name'] for p in pathways]}
""")
```

### Error Recovery

Implement robust error handling for production workflows:

```python
from biomni.config import default_config
from biomni.exceptions import BiomniExecutionError, BiomniTimeoutError

max_attempts = 3
for attempt in range(max_attempts):
    try:
        agent.go("complex biomedical task")
        break
    except BiomniTimeoutError:
        # Increase timeout and retry
        default_config.timeout_seconds *= 2
        print(f"Timeout, retrying with {default_config.timeout_seconds}s timeout")
    except BiomniExecutionError as e:
        # Refine task based on error
        print(f"Execution failed: {e}, refining task...")
        # Optionally modify task description
else:
    # Runs only if the loop never hit `break`
    print("Task failed after max attempts")
```

### Memory Management

For large-scale analyses, manage memory explicitly:

```python
import gc

num_chunks = 10  # Example value: set to the number of chunks in your dataset

# Process datasets in chunks
for chunk_id in range(num_chunks):
    agent.go(f"Process data chunk {chunk_id} located at data/chunk_{chunk_id}.h5ad")

    # Force garbage collection between chunks
    gc.collect()

    # Save intermediate results
    agent.save_conversation_history(f"./reports/chunk_{chunk_id}.pdf")
```

### Reproducibility

Ensure reproducible analyses by:

1. **Fixing random seeds:**

   ```python
   agent.go("Set random seed to 42 for all analyses, then perform clustering...")
   ```

2. **Logging configuration:**

   ```python
   import json
   from datetime import datetime

   from biomni.config import default_config

   config_log = {
       'llm': default_config.llm,
       'timeout': default_config.timeout_seconds,
       'data_path': default_config.data_path,
       'timestamp': datetime.now().isoformat()
   }
   with open('config_log.json', 'w') as f:
       json.dump(config_log, f, indent=2)
   ```

3. **Saving execution traces:**

   ```python
   # Always save detailed reports
   agent.save_conversation_history('./reports/full_analysis.pdf')
   ```

## Performance Optimization

### Model Selection Strategy

Choose models based on task characteristics:

```python
# For exploratory, simple tasks
default_config.llm = "gpt-3.5-turbo"  # Fast, cost-effective

# For standard biomedical analyses
default_config.llm = "claude-sonnet-4-20250514"  # Recommended

# For complex reasoning and hypothesis generation
default_config.llm = "claude-opus-4-20250514"  # Highest quality

# For specialized biological reasoning
default_config.llm = "openai/biomni-r0"  # Requires local deployment
```

### Timeout Tuning

Set appropriate timeouts based on task complexity:

```python
# Quick queries and simple analyses
agent = A1(path='./data', timeout=300)

# Standard workflows
agent = A1(path='./data', timeout=1200)

# Full pipelines with ML training
agent = A1(path='./data', timeout=3600)
```

### Caching and Reuse

Reuse agent instances for multiple related tasks:

```python
# Create agent once
agent = A1(path='./data', llm='claude-sonnet-4-20250514')

# Execute multiple related tasks
tasks = [
    "Load and QC the scRNA-seq dataset",
    "Perform clustering with resolution 0.5",
    "Identify marker genes for each cluster",
    "Annotate cell types based on markers"
]

for task in tasks:
    agent.go(task)

# Save complete workflow
agent.save_conversation_history('./reports/full_workflow.pdf')
```
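
### Combining Model and Timeout Choices

The model-selection and timeout guidance in this section can be kept in one place with a small lookup table. The following is a minimal sketch, not part of the Biomni API: `TaskProfile`, `PROFILES`, `profile_for`, `agent_kwargs`, and the tier names are all hypothetical, and pairing a given model with a given timeout is an assumption made for illustration. The model identifiers and timeout values themselves are the ones used in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical pairing of a model identifier with a timeout in seconds."""
    llm: str
    timeout: int


# Values come from the Model Selection Strategy and Timeout Tuning examples;
# the tier names and the model/timeout pairings are assumptions.
PROFILES = {
    "simple": TaskProfile("gpt-3.5-turbo", 300),
    "standard": TaskProfile("claude-sonnet-4-20250514", 1200),
    "complex": TaskProfile("claude-opus-4-20250514", 3600),
}


def profile_for(tier: str) -> TaskProfile:
    """Return the profile for a tier, raising ValueError for unknown tiers."""
    try:
        return PROFILES[tier]
    except KeyError:
        expected = ", ".join(sorted(PROFILES))
        raise ValueError(f"unknown tier {tier!r}; expected one of: {expected}") from None


def agent_kwargs(tier: str, path: str = "./data") -> dict:
    """Build keyword arguments matching the documented A1(path, llm, timeout) parameters."""
    profile = profile_for(tier)
    return {"path": path, "llm": profile.llm, "timeout": profile.timeout}
```

With this in place, an agent for a standard workflow could be created as `A1(**agent_kwargs("standard"))`, so the model and timeout are changed together in the table instead of at each call site.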