mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-01-26 16:58:56 +08:00

Timothy Kassis 152d0d54de Initial commit

2025-10-19 14:01:29 -07:00

Arboreto API Reference

This document provides comprehensive API documentation for the arboreto package, a Python library for gene regulatory network (GRN) inference.

Overview

Arboreto enables inference of gene regulatory networks from expression data using machine learning algorithms. It supports distributed computing via Dask for scalability from single machines to multi-node clusters.

Current Version: 0.1.5
GitHub: https://github.com/tmoerman/arboreto
License: BSD 3-Clause

Core Algorithms

GRNBoost2

The flagship algorithm for fast gene regulatory network inference using stochastic gradient boosting.

Function: arboreto.algo.grnboost2()

Parameters:

expression_data (pandas.DataFrame or numpy.ndarray): Expression matrix where rows are observations (cells/samples) and columns are genes. Required.
gene_names (list, optional): Gene names in the same order as the matrix columns. Needed when expression_data is a numpy.ndarray; if None, the DataFrame column names are used.
tf_names (list, optional): List of transcription factor names to consider as regulators. If None, all genes are considered potential regulators.
seed (int, optional): Random seed for reproducibility. Recommended when consistent results are needed across runs.
client_or_address (dask.distributed.Client or str, optional): Custom Dask client or scheduler address for distributed computing. If None, creates a default local client.
verbose (bool, optional): Enable verbose output for debugging.

Returns:

pandas.DataFrame with columns ['TF', 'target', 'importance'] representing inferred regulatory links. Each row represents a regulatory relationship with an importance score.

Algorithm Details:

Uses stochastic gradient boosting with early-stopping regularization
Much faster than GENIE3, especially for large datasets (tens of thousands of observations)
Extracts important features from trained regression models to identify regulatory relationships
Recommended as the default choice for most use cases
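The per-target regression idea behind these points can be sketched with scikit-learn. This is a minimal illustration and not arboreto's implementation: the helper links_for_target, the hyperparameters, and the toy data are assumptions made for the example, and early stopping is left out. The importance values are scikit-learn's normalized feature importances, so they may not be on the same scale as arboreto's scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def links_for_target(expression, target, tf_names, seed=0):
    """Regress one target gene on candidate TFs and rank TFs by feature importance.

    Hypothetical helper for illustration only; not part of arboreto.
    """
    # A gene is not used as a regulator of itself.
    regulators = [tf for tf in tf_names if tf != target]
    model = GradientBoostingRegressor(
        n_estimators=50,
        learning_rate=0.1,
        subsample=0.9,  # subsample < 1.0 makes the boosting stochastic
        random_state=seed,
    )
    model.fit(expression[regulators].to_numpy(), expression[target].to_numpy())
    links = pd.DataFrame({
        "TF": regulators,
        "target": target,
        "importance": model.feature_importances_,
    })
    return links.sort_values("importance", ascending=False).reset_index(drop=True)


# Synthetic data: G3 is driven by G1, while G2 is unrelated noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"G1": rng.normal(size=200), "G2": rng.normal(size=200)})
toy["G3"] = 2.0 * toy["G1"] + 0.1 * rng.normal(size=200)

toy_links = links_for_target(toy, target="G3", tf_names=["G1", "G2"])
print(toy_links)
```

A full network would repeat this regression once per target gene and concatenate the resulting tables, which is why the work parallelizes naturally across Dask workers.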

Example:

from arboreto.algo import grnboost2
import pandas as pd

# Load expression data
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
tf_list = ['TF1', 'TF2', 'TF3']  # Optional: specify TFs

# Run inference
network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_list,
    seed=42  # For reproducibility
)

# Save results
network.to_csv('output_network.tsv', sep='\t', index=False)

GENIE3

Classical gene regulatory network inference using Random Forest regression.

Function: arboreto.algo.genie3()

Parameters: Same as GRNBoost2 (see above).

Returns: Same format as GRNBoost2 (see above).

Algorithm Details:

Uses Random Forest or ExtraTrees regression models
Blueprint for multiple regression GRN inference strategy
More computationally expensive than GRNBoost2
Better suited for smaller datasets or when maximum accuracy is needed

When to Use GENIE3 vs GRNBoost2:

Use GRNBoost2: For large datasets, faster results, or when computational resources are limited
Use GENIE3: For smaller datasets, when following established protocols, or for comparison with published results

Module Structure

arboreto.algo

Primary module for typical users. Contains high-level inference functions.

Main Functions:

grnboost2() - Fast GRN inference using gradient boosting
genie3() - Classical GRN inference using Random Forest

arboreto.core

Advanced module for power users. Contains low-level framework components for custom implementations.

Use cases:

Custom inference pipelines
Algorithm modifications
Performance tuning

arboreto.utils

Utility functions for common data processing tasks.

Key Functions:

load_tf_names(filename) - Load transcription factor names from file
- Reads a text file with one TF name per line
- Returns a list of TF names
- Example: tf_names = load_tf_names('transcription_factors.txt')
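The file format described above (one TF name per line) is simple enough to sketch in plain Python. The helpers read_tf_names and restrict_to_measured below are hypothetical and are not arboreto functions; skipping blank lines and dropping TFs that are absent from the expression matrix are assumptions about what a typical workflow might want.

```python
import os
import tempfile


def read_tf_names(path):
    """Read one TF name per line, ignoring blank lines (hypothetical helper)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def restrict_to_measured(tf_names, gene_names):
    """Keep only TFs that appear among the measured genes, preserving order."""
    measured = set(gene_names)
    return [tf for tf in tf_names if tf in measured]


# Demonstration with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    tf_path = os.path.join(tmp_dir, "transcription_factors.txt")
    with open(tf_path, "w") as handle:
        handle.write("Sox2\nOct4\n\nNanog\n")
    tf_names = read_tf_names(tf_path)

measured_tfs = restrict_to_measured(tf_names, ["Sox2", "Nanog", "Gene1"])
print(tf_names, measured_tfs)
```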

Data Format Requirements

Input Format

Expression Matrix:

Format: pandas DataFrame or numpy ndarray
Orientation: Rows = observations (cells/samples), Columns = genes
Convention: Follows scikit-learn format
Gene Names: Column names (DataFrame) or separate gene_names parameter
Data Type: Numeric (float or int)

Common Mistake: If data is transposed (genes as rows), use pandas to transpose:

expression_df = pd.read_csv('data.tsv', sep='\t', index_col=0).T
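A quick orientation check before inference can catch a transposed matrix early. The sketch below assumes a few gene names are known in advance; the helper to_cells_by_genes and the toy matrix are illustrative and not part of arboreto.

```python
import pandas as pd

# Toy matrix stored genes-by-cells, the transposed layout described above.
genes_by_cells = pd.DataFrame(
    [[1.0, 0.0, 3.0], [2.0, 5.0, 0.0]],
    index=["Sox2", "Gene1"],
    columns=["cell_1", "cell_2", "cell_3"],
)


def to_cells_by_genes(matrix, known_genes):
    """Transpose when known gene names are found in the index rather than the columns."""
    known = set(known_genes)
    hits_in_columns = len(known & set(matrix.columns))
    hits_in_index = len(known & set(matrix.index))
    return matrix.T if hits_in_index > hits_in_columns else matrix


expression_df = to_cells_by_genes(genes_by_cells, known_genes=["Sox2"])
print(expression_df.shape)  # rows are now cells, columns are genes
```

Calling the helper on a matrix that is already cells-by-genes returns it unchanged, so it is safe to apply unconditionally.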

Transcription Factor List:

Format: Python list of strings or text file (one TF per line)
Optional: If not provided, all genes are considered potential regulators
Example: ['Sox2', 'Oct4', 'Nanog']

Output Format

Network DataFrame:

Columns:
- TF (str): Transcription factor (regulator) gene name
- target (str): Target gene name
- importance (float): Importance score of the regulatory relationship
Interpretation: Higher importance scores indicate stronger predicted regulatory relationships
Sorting: Typically sorted by importance (descending) for prioritization

Example Output:

TF      target    importance
Sox2    Gene1     15.234
Oct4    Gene1     12.456
Sox2    Gene2     8.901
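Because the output is an ordinary pandas DataFrame, ranking and filtering links needs no arboreto-specific code. The sketch below rebuilds the example table by hand; the importance threshold of 10.0 is an arbitrary value chosen for illustration, not a recommended cutoff.

```python
import pandas as pd

# The example output above, rebuilt as a DataFrame.
network = pd.DataFrame({
    "TF": ["Sox2", "Oct4", "Sox2"],
    "target": ["Gene1", "Gene1", "Gene2"],
    "importance": [15.234, 12.456, 8.901],
})

# Rank all links from strongest to weakest.
ranked = network.sort_values("importance", ascending=False)

# Keep links above an arbitrary illustrative threshold.
strong_links = ranked[ranked["importance"] >= 10.0]

# Keep the single strongest regulator for each target gene.
top_per_target = ranked.groupby("target", sort=False).head(1)

print(strong_links)
print(top_per_target)
```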

Distributed Computing with Dask

Local Execution (Default)

Arboreto automatically creates a local Dask client if none is provided:

network = grnboost2(expression_data=expr_matrix, tf_names=tf_list)

Custom Local Cluster

For better control over resources or multiple inferences:

from arboreto.algo import grnboost2
from dask.distributed import Client, LocalCluster

# Configure cluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'
)
client = Client(cluster)

# Run inference
network = grnboost2(
    expression_data=expr_matrix,
    tf_names=tf_list,
    client_or_address=client
)

# Clean up
client.close()
cluster.close()
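The explicit clean-up calls can also be expressed with context managers, which close the client and cluster even if inference raises an exception. This is a sketch under assumptions: run_on_local_cluster is a hypothetical wrapper, and the dashboard is disabled only to keep the short-lived cluster lightweight.

```python
from dask.distributed import Client, LocalCluster


def run_on_local_cluster(task, n_workers=4, threads_per_worker=2, processes=True):
    """Run task(client) on a temporary local cluster and always clean up.

    Hypothetical wrapper for illustration; not part of arboreto or Dask.
    """
    with LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        processes=processes,
        dashboard_address=None,  # no dashboard for this short-lived cluster
    ) as cluster, Client(cluster) as client:
        return task(client)


# Intended use with arboreto (expr_matrix and tf_list as in the example above):
# network = run_on_local_cluster(
#     lambda client: grnboost2(
#         expression_data=expr_matrix,
#         tf_names=tf_list,
#         client_or_address=client,
#     )
# )
```

When the default processes=True is used from a script, the call generally belongs under an `if __name__ == '__main__':` guard so that worker processes can start cleanly.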

Distributed Cluster

For multi-node computation:

On scheduler node:

dask-scheduler --no-dashboard  # Disables the dashboard to avoid Bokeh errors; older Dask releases use --no-bokeh

On worker nodes:

dask-worker scheduler-address:8786 --local-directory /tmp  # Older Dask releases use --local-dir

In Python script:

from arboreto.algo import grnboost2
from dask.distributed import Client

client = Client('scheduler-address:8786')
network = grnboost2(
    expression_data=expr_matrix,
    tf_names=tf_list,
    client_or_address=client
)

Dask Dashboard

Monitor computation progress via the Dask dashboard:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(dashboard_address=':8787')  # Older Dask releases used diagnostics_port=8787
client = Client(cluster)

# Dashboard available at: http://localhost:8787

Reproducibility

To ensure reproducible results across runs:

network = grnboost2(
    expression_data=expr_matrix,
    tf_names=tf_list,
    seed=42  # Fixed seed ensures identical results
)

Note: Without a fixed seed, results can vary between runs because both algorithms train randomized tree ensembles.
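One way to confirm reproducibility in practice is to run inference twice with the same seed and compare the two tables. The helper same_network below is a hypothetical sketch that ignores row order; the toy tables stand in for two real runs, and exact equality across different machines or library versions is not something this check can promise.

```python
import pandas as pd


def same_network(first, second):
    """Return True if two link tables hold identical links and scores, ignoring row order."""
    key = ["TF", "target"]
    a = first.sort_values(key).reset_index(drop=True)
    b = second.sort_values(key).reset_index(drop=True)
    return a.equals(b)


# Toy stand-ins for the outputs of two runs with the same seed.
run_a = pd.DataFrame({
    "TF": ["Sox2", "Oct4"],
    "target": ["Gene1", "Gene1"],
    "importance": [15.234, 12.456],
})
run_b = run_a.iloc[::-1]  # same links, different row order

# A run whose scores differ, as might happen without a fixed seed.
run_c = run_a.assign(importance=[15.0, 12.456])

print(same_network(run_a, run_b), same_network(run_a, run_c))
```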

Performance Considerations

Memory Management

Expression matrices should fit in memory (RAM)
For very large datasets, consider:
- Using a machine with more RAM
- Distributing across multiple nodes
- Preprocessing to reduce dimensionality
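A rough size estimate helps decide whether a matrix will fit in RAM, and dropping rarely detected genes is one simple way to reduce dimensionality. Both helpers below are illustrative assumptions: dense_size_gib assumes a dense 64-bit float matrix, and the min_fraction cutoff in drop_rarely_expressed is arbitrary rather than a recommended setting.

```python
import pandas as pd


def dense_size_gib(n_observations, n_genes, bytes_per_value=8):
    """Approximate in-memory size of a dense matrix in GiB (float64 by default)."""
    return n_observations * n_genes * bytes_per_value / 2**30


def drop_rarely_expressed(expression, min_fraction=0.01):
    """Keep genes detected (value > 0) in at least min_fraction of observations."""
    detected_fraction = (expression > 0).mean(axis=0)
    return expression.loc[:, detected_fraction >= min_fraction]


# 100,000 cells by 20,000 genes as float64 is roughly 14.9 GiB.
estimate = dense_size_gib(100_000, 20_000)

# Toy matrix: gene A is detected in every cell, B in one of four, C in none.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [0, 0, 0, 5],
    "C": [0, 0, 0, 0],
})
filtered = drop_rarely_expressed(toy, min_fraction=0.5)

print(round(estimate, 1), list(filtered.columns))
```

The estimate covers only the input matrix; workers need additional headroom for model training and intermediate results.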

Worker Configuration

Local execution: Number of workers = number of CPU cores (default)
Custom cluster: Balance workers and threads based on available resources
Distributed execution: Ensure each worker node has enough free space in the local directory passed to dask-worker

Algorithm Choice

GRNBoost2: ~10-100x faster than GENIE3 for large datasets
GENIE3: More established but slower, better for small datasets (<10k observations)

Integration with pySCENIC

Arboreto is a core component of the pySCENIC pipeline for single-cell RNA sequencing analysis:

1. GRN Inference (Arboreto): Infer regulatory links from co-expression using GRNBoost2
2. Regulon Prediction: Prune the network using motif enrichment and identify regulons
3. Regulon Scoring: Score regulon activity in each cell, which supports cell type identification

For pySCENIC workflows, arboreto is typically used in the first step to generate the initial regulatory network.

Common Issues and Solutions

See the main SKILL.md for troubleshooting guidance.