Timothy Kassis 152d0d54de Initial commit
2025-10-19 14:01:29 -07:00

AnnData Concatenation Guide

Overview

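The ad.concat() function combines multiple AnnData objects through two fundamental operations:

  1. Concatenation: Stacking the elements of each object in order along the chosen axis (for axis=0: X, layers, obs, obsm)
  2. Merging: Combining the elements on the other axis (for axis=0: var, varm) into one result, according to the merge strategy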

Basic Concatenation

Syntax

import anndata as ad
import numpy as np   # used by the examples in this guide
import pandas as pd  # used by the examples in this guide

combined = ad.concat(
    adatas,              # List (or dict) of AnnData objects
    axis=0,              # 0=observations, 1=variables
    join='inner',        # 'inner' or 'outer'
    merge=None,          # Merge strategy for non-concat axis
    uns_merge=None,      # Merge strategy for .uns
    label=None,          # Column name for source tracking
    keys=None,           # Dataset identifiers
    index_unique=None,   # Separator for unique indices
    fill_value=None,     # Fill value for missing data
    pairwise=False       # Include pairwise matrices
)

Concatenating Observations (Cells)

# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))

combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)

Concatenating Variables (Genes)

# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))

combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)

Join Strategies

Inner Join (Intersection)

Keeps only shared features across all objects.

# Datasets with different genes
adata1 = ad.AnnData(
    np.random.rand(100, 2000),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
    np.random.rand(150, 1800),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)

# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999

Use when:

  • You want to analyze only features measured in all datasets
  • Missing features would compromise analysis
  • You need a complete case analysis

Trade-offs:

  • May lose many features
  • Ensures no missing data
  • Smaller result size
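
Before committing to a join, it can help to check how many features each strategy would keep. The following is a minimal sketch using only pandas index operations; preview_join_sizes is a hypothetical helper written for this guide, not part of anndata.

```python
import pandas as pd

def preview_join_sizes(var_name_lists):
    """Return (n_inner, n_outer) for a list of feature-name collections."""
    indices = [pd.Index(names) for names in var_name_lists]
    inner = indices[0]
    outer = indices[0]
    for idx in indices[1:]:
        inner = inner.intersection(idx)  # features shared by all objects
        outer = outer.union(idx)         # features present in any object
    return len(inner), len(outer)

# Same gene sets as in the inner join example
genes_1 = [f'Gene_{i}' for i in range(2000)]
genes_2 = [f'Gene_{i}' for i in range(200, 2000)]

n_inner, n_outer = preview_join_sizes([genes_1, genes_2])
print(n_inner, n_outer)  # 1800 2000
```

In practice the inputs would be each object's var_names, for example preview_join_sizes([a.var_names for a in adatas]).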

Outer Join (Union)

Keeps all features from all objects. Features that an object lacks are padded with a fill value: by default 0 for sparse matrices and NaN for dense arrays.

# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# X is dense here, so Gene_0 through Gene_199 are NaN for adata2's cells

# Explicit fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=0)

Use when:

  • You want to preserve all features
  • Sparse data is acceptable
  • Features are independent

Trade-offs:

  • Introduces zeros/missing values
  • Larger result size
  • May need imputation

Merge Strategies

Merge strategies control how elements on the non-concatenation axis are combined.

merge=None (Default)

Excludes all non-concatenation axis elements.

# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800

# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns

Use when:

  • Annotations are dataset-specific
  • You'll add new annotations after merging

merge='same'

Keeps only annotations with identical values across datasets.

# Same annotation values for the shared genes (Gene_200 through Gene_1999)
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000

# merge='same': keeps chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns

Use when:

  • Annotations should be consistent
  • You want to validate consistency
  • Shared metadata is important

Note: Comparison occurs after index alignment, so with an inner join only the values at shared indices need to match.

merge='unique'

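Includes annotations for which there is only one possible value: the column either appears in a single object or has identical values in every object that contains it.

# Identifiers agree wherever both datasets have the gene
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]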

# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')

Use when:

  • Each feature has a unique identifier
  • Annotations are feature-specific

merge='first'

Takes the first occurrence of each annotation.

# Different annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800

# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
assert (combined.var['description'] == 'desc1').all()

Use when:

  • One dataset has authoritative annotations
  • Order matters
  • You need a simple resolution strategy

merge='only'

Retains annotations appearing in exactly one object.

# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800

# merge='only': keeps both (each appears in exactly one object);
# columns present in both objects are dropped
combined = ad.concat([adata1, adata2], merge='only')

Use when:

  • Datasets have non-overlapping annotations
  • You want to preserve all unique metadata

Source Tracking

Using label

Add a categorical column to track data origin.

combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch'
)

# Without keys, obs['batch'] holds each object's position as a string
print(combined.obs['batch'].cat.categories)  # ['0', '1', '2']

Using keys

Provide custom names for source tracking.

combined = ad.concat(
    [adata1, adata2, adata3],
    label='study',
    keys=['control', 'treatment_a', 'treatment_b']
)

# Creates obs['study'] with custom names
print(combined.obs['study'].unique())  # ['control', 'treatment_a', 'treatment_b']

Making Indices Unique

Append source identifiers to duplicate observation names.

# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]

# index_unique adds suffix
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)

# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])

Handling Different Attributes

X Matrix and Layers

Follows join strategy. Missing values filled according to fill_value.

# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()

# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers

obs and var DataFrames

  • obs: Concatenated along the concatenation axis (with axis=0)
  • var: Combined according to the merge strategy

adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150

combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells

obsm and varm

With axis=0, obsm arrays are concatenated along the observation axis like obs, while varm entries are combined according to the merge strategy like var.

adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)

obsp and varp

Pairwise matrices along the concatenation axis are excluded by default, because the combined result is often not meaningful. Enable with pairwise=True.

# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)

# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp

# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in padded block diagonal matrix

uns Dictionary

Merged recursively, at any nesting depth, according to the uns_merge argument, which accepts the same strategies as merge. With the default uns_merge=None, the result has an empty uns.

adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}

# uns_merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], uns_merge='same')
# combined.uns['experiment'] = {'lab': 'A'}

Advanced Patterns

Batch Integration Pipeline

import anndata as ad

# Load batches
batches = [
    ad.read_h5ad(f'batch_{i}.h5ad')
    for i in range(5)
]

# Concatenate with tracking
combined = ad.concat(
    batches,
    axis=0,
    join='outer',
    merge='first',
    label='batch_id',
    keys=[f'batch_{i}' for i in range(5)],
    index_unique='-'
)

# Encode batch membership as integer codes (e.g., for use as a covariate)
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes

Multi-Study Meta-Analysis

# Different studies with varying gene coverage
studies = {
    'study_a': ad.read_h5ad('study_a.h5ad'),
    'study_b': ad.read_h5ad('study_b.h5ad'),
    'study_c': ad.read_h5ad('study_c.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(studies.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(studies.keys()),
    merge='unique',
    fill_value=0
)

# Track coverage
for study in studies:
    n_genes = studies[study].n_vars
    combined.uns[f'{study}_n_genes'] = n_genes

Incremental Concatenation

# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]

# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
    chunk_files = all_files[i:i+chunk_size]
    chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
    chunk_combined = ad.concat(chunk_adatas)

    if result is None:
        result = chunk_combined
    else:
        result = ad.concat([result, chunk_combined])

Memory-Efficient On-Disk Concatenation

# Experimental feature for large datasets
from anndata.experimental import concat_on_disk

files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
    files,
    'combined.h5ad',
    join='outer'
)

# Read result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')

Troubleshooting

Issue: Dimension Mismatch

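# Problem: datasets measure different numbers of genes
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))

# concat aligns on var_names rather than on shape, so the default
# inner join keeps only the shared names and can silently drop features
# Solution: use outer join to keep every feature
combined = ad.concat([adata1, adata2], join='outer')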

Issue: Memory Error

# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]

# Solution: read and concatenate incrementally, so only one input
# is held in memory alongside the running result
result = None
for file in many_files:
    adata = ad.read_h5ad(file)
    if result is None:
        result = adata
    else:
        result = ad.concat([result, adata])
        del adata  # Free memory

Issue: Duplicate Indices

# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)

Issue: Lost Annotations

# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2

combined = ad.concat([adata1, adata2])  # merge=None by default
# Solution: use appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')

Performance Tips

  1. Pre-align indices: Ensure consistent naming before concatenation
  2. Use sparse matrices: Convert mostly-zero data to sparse before concatenating
  3. Batch operations: Concatenate in groups for many datasets
  4. Choose inner join: When possible, to reduce result size
  5. Use categoricals: Convert string annotations before concatenating
  6. Consider on-disk: For very large datasets, use concat_on_disk