claude-scientific-skills/scientific-packages/arboreto/SKILL.md
Timothy Kassis 152d0d54de Initial commit
2025-10-19 14:01:29 -07:00


---
name: arboreto
description: Toolkit for gene regulatory network (GRN) inference from expression data using machine learning. Use this skill when working with gene expression matrices to infer regulatory relationships, performing single-cell RNA-seq analysis, or integrating with pySCENIC workflows. Supports both GRNBoost2 (fast gradient boosting) and GENIE3 (Random Forest) algorithms with distributed computing via Dask.
---

Arboreto - Gene Regulatory Network Inference

Overview

Arboreto is a Python library for inferring gene regulatory networks (GRNs) from gene expression data using machine learning algorithms. It enables scalable GRN inference from single machines to multi-node clusters using Dask for distributed computing. The skill provides comprehensive support for both GRNBoost2 (fast gradient boosting) and GENIE3 (Random Forest) algorithms.

When to Use This Skill

Apply this skill when:

  • Inferring regulatory relationships between genes from expression data
  • Analyzing single-cell or bulk RNA-seq data to identify transcription factor targets
  • Building the GRN inference component of a pySCENIC pipeline
  • Comparing GRNBoost2 and GENIE3 algorithm performance
  • Setting up distributed computing for large-scale genomic analyses
  • Troubleshooting arboreto installation or runtime issues

Core Capabilities

1. Basic GRN Inference

For standard gene regulatory network inference tasks:

Key considerations:

  • Expression data format: Rows = observations (cells/samples), Columns = genes
  • If data has genes as rows, transpose it first: expression_df.T
  • Always include seed parameter for reproducible results
  • Transcription factor list is optional but recommended for focused analysis
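The orientation rule above is easy to verify on a toy matrix. This sketch (gene and cell names are illustrative placeholders) shows the transpose fix for a file stored genes-as-rows:

```python
import pandas as pd

# Toy matrix written genes-as-rows (a common on-disk layout),
# then transposed to the observations-by-genes orientation
# that arboreto expects.
genes_by_cells = pd.DataFrame(
    {'cell_1': [5, 0, 2], 'cell_2': [3, 1, 0]},
    index=['Sox2', 'Oct4', 'Nanog'],
)

expression_df = genes_by_cells.T  # now rows = cells, columns = genes

assert list(expression_df.columns) == ['Sox2', 'Oct4', 'Nanog']
```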

Typical workflow:

import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

# Load expression data (ensure correct orientation)
expression_data = pd.read_csv('expression_data.tsv', sep='\t', index_col=0)

# Optional: Load TF names
tf_names = load_tf_names('transcription_factors.txt')

# Run inference
network = grnboost2(
    expression_data=expression_data,
    tf_names=tf_names,
    seed=42  # For reproducibility
)

# Save results
network.to_csv('network_output.tsv', sep='\t', index=False)

Output format:

  • DataFrame with columns: ['TF', 'target', 'importance']
  • Higher importance scores indicate stronger predicted regulatory relationships
  • Typically sorted by importance (descending)

Multiprocessing requirement: Any script that calls arboreto must wrap the call in an if __name__ == '__main__': guard, because Dask spawns worker processes that re-import the main module:

if __name__ == '__main__':
    # Arboreto code goes here
    network = grnboost2(expression_data=expr_data, seed=42)

2. Algorithm Selection

GRNBoost2 (Recommended for most cases):

  • ~10-100x faster than GENIE3
  • Uses stochastic gradient boosting with early-stopping
  • Best for: Large datasets (>10k observations), time-sensitive analyses
  • Function: arboreto.algo.grnboost2()

GENIE3:

  • Uses Random Forest regression
  • More established, classical approach
  • Best for: Small datasets, methodological comparisons, reproducing published results
  • Function: arboreto.algo.genie3()

When to compare both algorithms: Use the provided compare_algorithms.py script when:

  • Validating results for critical analyses
  • Benchmarking performance on new datasets
  • Publishing research requiring methodological comparisons

3. Distributed Computing

Local execution (default): Arboreto automatically creates a local Dask client. No configuration needed:

network = grnboost2(expression_data=expr_data)

Custom local cluster (recommended for better control):

from dask.distributed import Client, LocalCluster

# Configure cluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB',
    diagnostics_port=8787  # Dashboard at http://localhost:8787 ('dashboard_address' in newer Dask)
)
client = Client(cluster)

# Run inference
network = grnboost2(
    expression_data=expr_data,
    client_or_address=client
)

# Clean up
client.close()
cluster.close()

Distributed cluster (multi-node): On scheduler node:

dask-scheduler --no-bokeh

On worker nodes:

dask-worker scheduler-address:8786 --local-dir /tmp

In Python:

from dask.distributed import Client

client = Client('scheduler-address:8786')
network = grnboost2(expression_data=expr_data, client_or_address=client)

4. Data Preparation

Common data format issues:

  1. Transposed data (genes as rows instead of columns):
# If genes are rows, transpose
expression_data = pd.read_csv('data.tsv', sep='\t', index_col=0).T
  2. Missing gene names:
# Provide gene names if using numpy array
network = grnboost2(
    expression_data=expr_array,
    gene_names=['Gene1', 'Gene2', 'Gene3', ...],
    seed=42
)
  3. Transcription factor specification:
# Option 1: Python list
tf_names = ['Sox2', 'Oct4', 'Nanog', 'Klf4']

# Option 2: Load from file (one TF per line)
from arboreto.utils import load_tf_names
tf_names = load_tf_names('tf_names.txt')

5. Reproducibility

Always specify a seed for consistent results:

network = grnboost2(expression_data=expr_data, seed=42)

Without a seed, results will vary between runs due to algorithm randomness.
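The effect of a fixed seed can be illustrated with plain NumPy (a generic illustration of seeded randomness, not arboreto internals):

```python
import numpy as np

# Two generators built from the same seed draw identical sequences;
# the same principle is what makes seeded arboreto runs repeatable.
a = np.random.RandomState(42).rand(5)
b = np.random.RandomState(42).rand(5)
assert (a == b).all()

# A different seed yields a different sequence.
c = np.random.RandomState(7).rand(5)
assert not (a == c).all()
```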

6. Result Interpretation

Understanding the output:

  • TF: Transcription factor (regulator) gene
  • target: Target gene being regulated
  • importance: Strength of predicted regulatory relationship

Typical post-processing:

# Filter by importance threshold
high_confidence = network[network['importance'] > 10]

# Get top N predictions
top_predictions = network.head(1000)

# Find all targets of a specific TF
sox2_targets = network[network['TF'] == 'Sox2']

# Count regulations per TF
tf_counts = network['TF'].value_counts()
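The post-processing patterns above can be exercised on a tiny hand-made network in arboreto's output schema (the rows and importance values here are illustrative, not real inference results):

```python
import pandas as pd

# Synthetic network in the ['TF', 'target', 'importance'] schema.
network = pd.DataFrame({
    'TF':         ['Sox2', 'Sox2', 'Oct4', 'Nanog'],
    'target':     ['Nanog', 'Klf4', 'Sox2', 'Klf4'],
    'importance': [25.1, 8.3, 14.7, 3.2],
})

high_confidence = network[network['importance'] > 10]   # 2 rows survive
sox2_targets = network.loc[network['TF'] == 'Sox2', 'target'].tolist()
tf_counts = network['TF'].value_counts()                # regulations per TF

assert sox2_targets == ['Nanog', 'Klf4']
assert tf_counts['Sox2'] == 2
```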

Installation

Recommended (via conda):

conda install -c bioconda arboreto

Via pip:

pip install arboreto

From source:

git clone https://github.com/tmoerman/arboreto.git
cd arboreto
pip install .

Dependencies:

  • pandas
  • numpy
  • scikit-learn
  • scipy
  • dask
  • distributed

Troubleshooting

Issue: Bokeh error when launching Dask scheduler

Error: TypeError: got an unexpected keyword argument 'host'

Solutions:

  • Use dask-scheduler --no-bokeh to disable Bokeh (in recent Dask releases the flag is --no-dashboard)
  • Upgrade to Dask distributed >= 0.20.0

Issue: Workers not connecting to scheduler

Symptoms: Worker processes start but fail to establish connections

Solutions:

  • Remove dask-worker-space directory before restarting workers
  • Specify an explicit local directory when creating the cluster:
cluster = LocalCluster(
    local_directory='/tmp'  # 'local_dir' in older Dask versions
)

Issue: Memory errors with large datasets

Solutions:

  • Increase worker memory limits: memory_limit='8GB'
  • Distribute across more nodes
  • Reduce dataset size through preprocessing (e.g., feature selection)
  • Ensure expression matrix fits in available RAM
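Before raising memory_limit, it helps to estimate the matrix footprint. A back-of-envelope sketch, assuming a dense float64 matrix (dense_matrix_gb is a hypothetical helper, and real usage adds Dask overhead on top):

```python
def dense_matrix_gb(n_obs, n_genes, bytes_per_value=8):
    """Approximate in-memory size of a dense expression matrix in GB:
    observations x genes x bytes per float64 value."""
    return n_obs * n_genes * bytes_per_value / 1e9

# e.g. 50,000 cells x 20,000 genes is roughly 8 GB before any overhead
size_gb = dense_matrix_gb(50_000, 20_000)
assert abs(size_gb - 8.0) < 1e-9
```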

Issue: Inconsistent results across runs

Solution: Always specify a seed parameter:

network = grnboost2(expression_data=expr_data, seed=42)

Issue: Import errors or missing dependencies

Solution: Use conda installation to handle numerical library dependencies:

conda create --name arboreto-env
conda activate arboreto-env
conda install -c bioconda arboreto

Provided Scripts

This skill includes ready-to-use scripts for common workflows:

scripts/basic_grn_inference.py

Command-line tool for standard GRN inference workflow.

Usage:

python scripts/basic_grn_inference.py expression_data.tsv \
    -t tf_names.txt \
    -o network.tsv \
    -s 42 \
    --transpose  # if genes are rows

Features:

  • Automatic data loading and validation
  • Optional TF list specification
  • Configurable output format
  • Data transposition support
  • Summary statistics

scripts/distributed_inference.py

GRN inference with custom Dask cluster configuration.

Usage:

python scripts/distributed_inference.py expression_data.tsv \
    -t tf_names.txt \
    -w 8 \
    -m 4GB \
    --threads 2 \
    --dashboard-port 8787

Features:

  • Configurable worker count and memory limits
  • Dask dashboard integration
  • Thread configuration
  • Resource monitoring

scripts/compare_algorithms.py

Compare GRNBoost2 and GENIE3 side-by-side.

Usage:

python scripts/compare_algorithms.py expression_data.tsv \
    -t tf_names.txt \
    --top-n 100

Features:

  • Runtime comparison
  • Network statistics
  • Prediction overlap analysis
  • Top prediction comparison

Reference Documentation

Detailed API documentation is available in references/api_reference.md, including:

  • Complete parameter descriptions for all functions
  • Data format specifications
  • Distributed computing configuration
  • Performance optimization tips
  • Integration with pySCENIC
  • Comprehensive examples

Load this reference when:

  • Working with advanced Dask configurations
  • Troubleshooting complex deployment scenarios
  • Understanding algorithm internals
  • Optimizing performance for specific use cases

Integration with pySCENIC

Arboreto is the first step in the pySCENIC single-cell analysis pipeline:

  1. GRN Inference (arboreto) ← This skill

    • Input: Expression matrix
    • Output: Regulatory network
  2. Regulon Prediction (pySCENIC)

    • Input: Network from arboreto
    • Output: Refined regulons
  3. Cell Type Identification (pySCENIC)

    • Input: Regulons
    • Output: Cell type scores

When working with pySCENIC, use arboreto to generate the initial network, then pass results to the pySCENIC pipeline.

Best Practices

  1. Always use seed parameter for reproducible research
  2. Validate data orientation (rows = observations, columns = genes)
  3. Specify TF list when known to focus inference and improve speed
  4. Monitor with Dask dashboard for distributed computing
  5. Save intermediate results to avoid re-running long computations
  6. Filter results by importance threshold for downstream analysis
  7. Use GRNBoost2 by default unless specifically requiring GENIE3
  8. Include multiprocessing guard (if __name__ == '__main__':) in all scripts

Quick Reference

Basic inference:

from arboreto.algo import grnboost2
network = grnboost2(expression_data=expr_df, seed=42)

With TF specification:

network = grnboost2(expression_data=expr_df, tf_names=tf_list, seed=42)

With custom Dask client:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
network = grnboost2(expression_data=expr_df, client_or_address=client, seed=42)
client.close()
cluster.close()

Load TF names:

from arboreto.utils import load_tf_names
tf_names = load_tf_names('transcription_factors.txt')

Transpose data:

expression_df = pd.read_csv('data.tsv', sep='\t', index_col=0).T