claude-scientific-skills/scientific-packages/arboreto/SKILL.md
Timothy Kassis 152d0d54de Initial commit
2025-10-19 14:01:29 -07:00


---
name: arboreto
description: Toolkit for gene regulatory network (GRN) inference from expression data using machine learning. Use this skill when working with gene expression matrices to infer regulatory relationships, performing single-cell RNA-seq analysis, or integrating with pySCENIC workflows. Supports both GRNBoost2 (fast gradient boosting) and GENIE3 (Random Forest) algorithms with distributed computing via Dask.
---

Arboreto - Gene Regulatory Network Inference

Overview

Arboreto is a Python library for inferring gene regulatory networks (GRNs) from gene expression data using machine learning algorithms. It enables scalable GRN inference from single machines to multi-node clusters using Dask for distributed computing. The skill provides comprehensive support for both GRNBoost2 (fast gradient boosting) and GENIE3 (Random Forest) algorithms.

When to Use This Skill

Apply this skill when:

  • Inferring regulatory relationships between genes from expression data
  • Analyzing single-cell or bulk RNA-seq data to identify transcription factor targets
  • Building the GRN inference component of a pySCENIC pipeline
  • Comparing GRNBoost2 and GENIE3 algorithm performance
  • Setting up distributed computing for large-scale genomic analyses
  • Troubleshooting arboreto installation or runtime issues

Core Capabilities

1. Basic GRN Inference

For standard gene regulatory network inference tasks:

Key considerations:

  • Expression data format: Rows = observations (cells/samples), Columns = genes
  • If data has genes as rows, transpose it first: expression_df.T
  • Always include seed parameter for reproducible results
  • Transcription factor list is optional but recommended for focused analysis
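The orientation rule above is easy to verify on a toy matrix. This sketch (gene and cell names are illustrative placeholders) shows the transpose fix for a file stored genes-as-rows:

```python
import pandas as pd

# Toy matrix written genes-as-rows (a common on-disk layout),
# then transposed to the observations-by-genes orientation
# that arboreto expects.
genes_by_cells = pd.DataFrame(
    {'cell_1': [5, 0, 2], 'cell_2': [3, 1, 0]},
    index=['Sox2', 'Oct4', 'Nanog'],
)

expression_df = genes_by_cells.T  # now rows = cells, columns = genes

assert list(expression_df.columns) == ['Sox2', 'Oct4', 'Nanog']
```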

Typical workflow:

import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

# Load expression data (ensure correct orientation)
expression_data = pd.read_csv('expression_data.tsv', sep='\t', index_col=0)

# Optional: Load TF names
tf_names = load_tf_names('transcription_factors.txt')

# Run inference
network = grnboost2(
    expression_data=expression_data,
    tf_names=tf_names,
    seed=42  # For reproducibility
)

# Save results
network.to_csv('network_output.tsv', sep='\t', index=False)

Output format:

  • DataFrame with columns: ['TF', 'target', 'importance']
  • Higher importance scores indicate stronger predicted regulatory relationships
  • Typically sorted by importance (descending)

Multiprocessing requirement: Any script that calls arboreto must wrap the call in an if __name__ == '__main__': guard, because Dask spawns worker processes that re-import the main module:

if __name__ == '__main__':
    # Arboreto code goes here
    network = grnboost2(expression_data=expr_data, seed=42)

2. Algorithm Selection

GRNBoost2 (Recommended for most cases):

  • ~10-100x faster than GENIE3
  • Uses stochastic gradient boosting with early-stopping
  • Best for: Large datasets (>10k observations), time-sensitive analyses
  • Function: arboreto.algo.grnboost2()

GENIE3:

  • Uses Random Forest regression
  • More established, classical approach
  • Best for: Small datasets, methodological comparisons, reproducing published results
  • Function: arboreto.algo.genie3()

When to compare both algorithms: Use the provided compare_algorithms.py script when:

  • Validating results for critical analyses
  • Benchmarking performance on new datasets
  • Publishing research requiring methodological comparisons

3. Distributed Computing

Local execution (default): Arboreto automatically creates a local Dask client. No configuration needed:

network = grnboost2(expression_data=expr_data)

Custom local cluster (recommended for better control):

from dask.distributed import Client, LocalCluster

# Configure cluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB',
    diagnostics_port=8787  # Dashboard at http://localhost:8787 ('dashboard_address' in newer Dask)
)
client = Client(cluster)

# Run inference
network = grnboost2(
    expression_data=expr_data,
    client_or_address=client
)

# Clean up
client.close()
cluster.close()

Distributed cluster (multi-node): On scheduler node:

dask-scheduler --no-bokeh

On worker nodes:

dask-worker scheduler-address:8786 --local-dir /tmp

In Python:

from dask.distributed import Client

client = Client('scheduler-address:8786')
network = grnboost2(expression_data=expr_data, client_or_address=client)

4. Data Preparation

Common data format issues:

  1. Transposed data (genes as rows instead of columns):
# If genes are rows, transpose
expression_data = pd.read_csv('data.tsv', sep='\t', index_col=0).T
  2. Missing gene names:
# Provide gene names if using numpy array
network = grnboost2(
    expression_data=expr_array,
    gene_names=['Gene1', 'Gene2', 'Gene3', ...],
    seed=42
)
  3. Transcription factor specification:
# Option 1: Python list
tf_names = ['Sox2', 'Oct4', 'Nanog', 'Klf4']

# Option 2: Load from file (one TF per line)
from arboreto.utils import load_tf_names
tf_names = load_tf_names('tf_names.txt')

5. Reproducibility

Always specify a seed for consistent results:

network = grnboost2(expression_data=expr_data, seed=42)

Without a seed, results will vary between runs due to algorithm randomness.
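The effect of a fixed seed can be illustrated with plain NumPy (a generic illustration of seeded randomness, not arboreto internals):

```python
import numpy as np

# Two generators built from the same seed draw identical sequences;
# the same principle is what makes seeded arboreto runs repeatable.
a = np.random.RandomState(42).rand(5)
b = np.random.RandomState(42).rand(5)
assert (a == b).all()

# A different seed yields a different sequence.
c = np.random.RandomState(7).rand(5)
assert not (a == c).all()
```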

6. Result Interpretation

Understanding the output:

  • TF: Transcription factor (regulator) gene
  • target: Target gene being regulated
  • importance: Strength of predicted regulatory relationship

Typical post-processing:

# Filter by importance threshold
high_confidence = network[network['importance'] > 10]

# Get top N predictions
top_predictions = network.head(1000)

# Find all targets of a specific TF
sox2_targets = network[network['TF'] == 'Sox2']

# Count regulations per TF
tf_counts = network['TF'].value_counts()
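The post-processing patterns above can be exercised on a tiny hand-made network in arboreto's output schema (the rows and importance values here are illustrative, not real inference results):

```python
import pandas as pd

# Synthetic network in the ['TF', 'target', 'importance'] schema.
network = pd.DataFrame({
    'TF':         ['Sox2', 'Sox2', 'Oct4', 'Nanog'],
    'target':     ['Nanog', 'Klf4', 'Sox2', 'Klf4'],
    'importance': [25.1, 8.3, 14.7, 3.2],
})

high_confidence = network[network['importance'] > 10]   # 2 rows survive
sox2_targets = network.loc[network['TF'] == 'Sox2', 'target'].tolist()
tf_counts = network['TF'].value_counts()                # regulations per TF

assert sox2_targets == ['Nanog', 'Klf4']
assert tf_counts['Sox2'] == 2
```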

Installation

Recommended (via conda):

conda install -c bioconda arboreto

Via pip:

pip install arboreto

From source:

git clone https://github.com/tmoerman/arboreto.git
cd arboreto
pip install .

Dependencies:

  • pandas
  • numpy
  • scikit-learn
  • scipy
  • dask
  • distributed

Troubleshooting

Issue: Bokeh error when launching Dask scheduler

Error: TypeError: got an unexpected keyword argument 'host'

Solutions:

  • Use dask-scheduler --no-bokeh to disable Bokeh (in recent Dask releases the flag is --no-dashboard)
  • Upgrade to Dask distributed >= 0.20.0

Issue: Workers not connecting to scheduler

Symptoms: Worker processes start but fail to establish connections

Solutions:

  • Remove dask-worker-space directory before restarting workers
  • Specify an explicit local directory when creating the cluster:
cluster = LocalCluster(
    local_directory='/tmp'  # 'local_dir' in older Dask versions
)

Issue: Memory errors with large datasets

Solutions:

  • Increase worker memory limits: memory_limit='8GB'
  • Distribute across more nodes
  • Reduce dataset size through preprocessing (e.g., feature selection)
  • Ensure expression matrix fits in available RAM
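Before raising memory_limit, it helps to estimate the matrix footprint. A back-of-envelope sketch, assuming a dense float64 matrix (dense_matrix_gb is a hypothetical helper, and real usage adds Dask overhead on top):

```python
def dense_matrix_gb(n_obs, n_genes, bytes_per_value=8):
    """Approximate in-memory size of a dense expression matrix in GB:
    observations x genes x bytes per float64 value."""
    return n_obs * n_genes * bytes_per_value / 1e9

# e.g. 50,000 cells x 20,000 genes is roughly 8 GB before any overhead
size_gb = dense_matrix_gb(50_000, 20_000)
assert abs(size_gb - 8.0) < 1e-9
```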

Issue: Inconsistent results across runs

Solution: Always specify a seed parameter:

network = grnboost2(expression_data=expr_data, seed=42)

Issue: Import errors or missing dependencies

Solution: Use conda installation to handle numerical library dependencies:

conda create --name arboreto-env
conda activate arboreto-env
conda install -c bioconda arboreto

Provided Scripts

This skill includes ready-to-use scripts for common workflows:

scripts/basic_grn_inference.py

Command-line tool for standard GRN inference workflow.

Usage:

python scripts/basic_grn_inference.py expression_data.tsv \
    -t tf_names.txt \
    -o network.tsv \
    -s 42 \
    --transpose  # if genes are rows

Features:

  • Automatic data loading and validation
  • Optional TF list specification
  • Configurable output format
  • Data transposition support
  • Summary statistics

scripts/distributed_inference.py

GRN inference with custom Dask cluster configuration.

Usage:

python scripts/distributed_inference.py expression_data.tsv \
    -t tf_names.txt \
    -w 8 \
    -m 4GB \
    --threads 2 \
    --dashboard-port 8787

Features:

  • Configurable worker count and memory limits
  • Dask dashboard integration
  • Thread configuration
  • Resource monitoring

scripts/compare_algorithms.py

Compare GRNBoost2 and GENIE3 side-by-side.

Usage:

python scripts/compare_algorithms.py expression_data.tsv \
    -t tf_names.txt \
    --top-n 100

Features:

  • Runtime comparison
  • Network statistics
  • Prediction overlap analysis
  • Top prediction comparison

Reference Documentation

Detailed API documentation is available in references/api_reference.md, including:

  • Complete parameter descriptions for all functions
  • Data format specifications
  • Distributed computing configuration
  • Performance optimization tips
  • Integration with pySCENIC
  • Comprehensive examples

Load this reference when:

  • Working with advanced Dask configurations
  • Troubleshooting complex deployment scenarios
  • Understanding algorithm internals
  • Optimizing performance for specific use cases

Integration with pySCENIC

Arboreto is the first step in the pySCENIC single-cell analysis pipeline:

  1. GRN Inference (arboreto) ← This skill

    • Input: Expression matrix
    • Output: Regulatory network
  2. Regulon Prediction (pySCENIC)

    • Input: Network from arboreto
    • Output: Refined regulons
  3. Cell Type Identification (pySCENIC)

    • Input: Regulons
    • Output: Cell type scores

When working with pySCENIC, use arboreto to generate the initial network, then pass results to the pySCENIC pipeline.

Best Practices

  1. Always use seed parameter for reproducible research
  2. Validate data orientation (rows = observations, columns = genes)
  3. Specify TF list when known to focus inference and improve speed
  4. Monitor with Dask dashboard for distributed computing
  5. Save intermediate results to avoid re-running long computations
  6. Filter results by importance threshold for downstream analysis
  7. Use GRNBoost2 by default unless specifically requiring GENIE3
  8. Include multiprocessing guard (if __name__ == '__main__':) in all scripts

Quick Reference

Basic inference:

from arboreto.algo import grnboost2
network = grnboost2(expression_data=expr_df, seed=42)

With TF specification:

network = grnboost2(expression_data=expr_df, tf_names=tf_list, seed=42)

With custom Dask client:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
network = grnboost2(expression_data=expr_df, client_or_address=client, seed=42)
client.close()
cluster.close()

Load TF names:

from arboreto.utils import load_tf_names
tf_names = load_tf_names('transcription_factors.txt')

Transpose data:

expression_df = pd.read_csv('data.tsv', sep='\t', index_col=0).T