# AnnData Concatenation Guide

## Overview

The `concat()` function combines multiple AnnData objects through two fundamental operations (sketched below):

1. **Concatenation**: stacking elements along the chosen axis, in order
2. **Merging**: combining the elements on the other (non-concatenation) axis into one result

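A minimal sketch of the two operations side by side (the toy objects, shapes, and names here are illustrative, not part of the library):

```python
import anndata as ad
import numpy as np
import pandas as pd

var = pd.DataFrame({'gene_type': ['coding'] * 3}, index=['g1', 'g2', 'g3'])
a = ad.AnnData(np.ones((2, 3)), obs=pd.DataFrame(index=['c1', 'c2']), var=var)
b = ad.AnnData(np.zeros((2, 3)), obs=pd.DataFrame(index=['c3', 'c4']), var=var.copy())

combined = ad.concat([a, b], merge='same')
print(combined.shape)                      # (4, 3): X matrices stacked (concatenation)
print(combined.var['gene_type'].tolist())  # shared var annotation kept (merging)
```
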
## Basic Concatenation

### Syntax

```python
import anndata as ad

combined = ad.concat(
    adatas,              # List of AnnData objects
    axis=0,              # 0 = observations, 1 = variables
    join='inner',        # 'inner' or 'outer'
    merge=None,          # Merge strategy for the non-concatenation axis
    uns_merge=None,      # Merge strategy for .uns
    label=None,          # Column name for source tracking
    keys=None,           # Dataset identifiers
    index_unique=None,   # Separator for making indices unique
    fill_value=None,     # Fill value for missing data (outer join)
    pairwise=False       # Include pairwise matrices (obsp/varp)
)
```

### Concatenating Observations (Cells)

```python
import numpy as np

# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))

combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```

### Concatenating Variables (Genes)

```python
# Less common: combining different feature sets measured on the same cells
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))

combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```

## Join Strategies

### Inner Join (Intersection)

Keeps only the features shared across all objects.

```python
import pandas as pd

# Datasets with partially overlapping genes
adata1 = ad.AnnData(
    np.random.rand(100, 2000),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
    np.random.rand(150, 1800),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)

# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```

**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise the analysis
- You need a complete-case analysis

**Trade-offs:**
- May lose many features (see the quick check below)
- Guarantees no missing data
- Smaller result size

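Before committing to an inner join, it is worth checking how much would be lost; a minimal sketch using the two objects above:

```python
# var_names is a pandas Index, so set operations come for free
shared = adata1.var_names.intersection(adata2.var_names)
print(f'{len(shared)} shared genes '
      f'({adata1.n_vars - len(shared)} dropped from adata1, '
      f'{adata2.n_vars - len(shared)} from adata2)')
# 1800 shared genes (200 dropped from adata1, 0 from adata2)
```
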
### Outer Join (Union)

Keeps all features from all objects, padding missing entries with a fill value (0 by default).

```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0

# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```

**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent

**Trade-offs:**
- Introduces zeros/missing values
- Larger result size (sparse matrices help; see the sketch below)
- May require imputation downstream

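Because an outer join pads every object up to the union of features, converting dense matrices to sparse beforehand keeps the padded result compact; a minimal sketch, assuming the data is mostly zeros:

```python
from scipy import sparse

# Store X in CSR form so the outer-join padding stays implicit zeros
adata1.X = sparse.csr_matrix(adata1.X)
adata2.X = sparse.csr_matrix(adata2.X)

combined = ad.concat([adata1, adata2], join='outer')
```
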
## Merge Strategies

Merge strategies control how elements on the non-concatenation axis (e.g. `var` when concatenating observations) are combined.

### merge=None (Default)

Excludes all elements on the non-concatenation axis.

```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800

# merge=None: var annotations are excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```

**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging

### merge='same'

Keeps only annotations whose values are identical across datasets.

```python
# Annotation values that agree on the shared genes
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000  # Gene_200..999 on chr1

# merge='same': keeps the chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```

**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important

**Note:** Comparison occurs after index alignment, so only shared indices need to match. In the example above, adata2's genes (Gene_200 through Gene_1999) carry the same chromosome labels as the corresponding genes in adata1.

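Conversely, if the values disagree on any shared index, the whole column is dropped; a minimal sketch continuing the example above:

```python
conflicting = adata2.copy()
conflicting.var['chromosome'] = ['chrX'] * 1800  # disagrees with adata1 on shared genes

dropped = ad.concat([adata1, conflicting], merge='same')
assert 'chromosome' not in dropped.var.columns
```
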
### merge='unique'

Includes annotations that have exactly one possible value for each index after alignment.

```python
# One identifier per gene, consistent across datasets
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]

# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```

**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific

### merge='first'

Takes the first occurrence of each annotation, in the order the objects are passed.

```python
# Conflicting annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800

# merge='first': uses adata1's descriptions wherever both define one
combined = ad.concat([adata1, adata2], merge='first')
```

**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple conflict-resolution strategy

### merge='only'

Retains annotations that appear in exactly one object.

```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800

# merge='only': keeps both columns (neither appears in more than one object)
combined = ad.concat([adata1, adata2], merge='only')
```

**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata

## Source Tracking

### Using label

Add a categorical column to track each observation's dataset of origin.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch'
)

# Creates obs['batch'] with string values '0', '1', '2'
print(combined.obs['batch'].cat.categories)  # ['0', '1', '2']
```

### Using keys

Provide custom names for source tracking.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='study',
    keys=['control', 'treatment_a', 'treatment_b']
)

# Creates obs['study'] with the custom names
print(combined.obs['study'].unique())  # ['control', 'treatment_a', 'treatment_b']
```

### Making Indices Unique

Append source identifiers to duplicate observation names.

```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]

# index_unique appends the key as a suffix
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)

# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```

## Handling Different Attributes

### X Matrix and Layers

Both follow the join strategy; missing values are filled according to `fill_value`.

```python
# Both objects carry a 'counts' layer
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()

# Concatenates both X and the layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```

### obs and var DataFrames

- **obs**: concatenated along the concatenation axis
- **var**: handled by the merge strategy

```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150

combined = ad.concat([adata1, adata2])
# obs['cell_type'] is preserved for all cells
```

### obsm and varm

Multi-dimensional annotations follow the same rules as layers.

```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```

### obsp and varp

Pairwise matrices are excluded by default. Enable with `pairwise=True`.

```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)

# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp

# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in a padded block-diagonal matrix
```

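A quick sanity check on the padded result (continuing the example above; the exact container type of the padded matrix can vary):

```python
pad = combined.obsp['distances']
print(pad.shape)  # (250, 250): the 100x100 and 150x150 blocks sit on the diagonal
# Cross-dataset entries (e.g. rows 0-99 vs columns 100-249) are padding only
```
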
### uns Dictionary

Merged recursively according to `uns_merge`, applying the chosen strategy at any nesting depth.

```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}

# uns_merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], uns_merge='same')
# combined.uns['experiment'] == {'lab': 'A'}
```

## Advanced Patterns

### Batch Integration Pipeline

```python
import anndata as ad

# Load batches
batches = [
    ad.read_h5ad(f'batch_{i}.h5ad')
    for i in range(5)
]

# Concatenate with source tracking
combined = ad.concat(
    batches,
    axis=0,
    join='outer',
    merge='first',
    label='batch_id',
    keys=[f'batch_{i}' for i in range(5)],
    index_unique='-'
)

# Numeric batch codes for downstream integration tools
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```

### Multi-Study Meta-Analysis

```python
# Different studies with varying gene coverage
studies = {
    'study_a': ad.read_h5ad('study_a.h5ad'),
    'study_b': ad.read_h5ad('study_b.h5ad'),
    'study_c': ad.read_h5ad('study_c.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(studies.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(studies.keys()),
    merge='unique',
    fill_value=0
)

# Track per-study gene coverage
for study in studies:
    n_genes = studies[study].n_vars
    combined.uns[f'{study}_n_genes'] = n_genes
```

### Incremental Concatenation

```python
# For many datasets, concatenate in chunks to bound peak memory
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]

# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
    chunk_files = all_files[i:i + chunk_size]
    chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
    chunk_combined = ad.concat(chunk_adatas)

    if result is None:
        result = chunk_combined
    else:
        result = ad.concat([result, chunk_combined])
```

### Memory-Efficient On-Disk Concatenation

```python
# Experimental feature for datasets too large for memory
from anndata.experimental import concat_on_disk

files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
    files,
    'combined.h5ad',
    join='outer'
)

# Read the result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```

## Troubleshooting

### Issue: Dimension Mismatch

```python
# Problem: datasets have different variable sets; the default inner join
# keeps only the shared variables and can silently shrink the result
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))

# Solution: use an outer join to keep all variables
combined = ad.concat([adata1, adata2], join='outer')
```

### Issue: Memory Error

```python
# Problem: too many large objects in memory at once
large_adatas = [ad.read_h5ad(f) for f in many_files]

# Solution: read and concatenate incrementally
result = None
for file in many_files:
    adata = ad.read_h5ad(file)
    if result is None:
        result = adata
    else:
        result = ad.concat([result, adata])
    del adata  # Free memory
```

### Issue: Duplicate Indices

```python
# Problem: same cell names appear in different batches
# Solution: use index_unique to append the batch key
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
```

### Issue: Lost Annotations

```python
# Problem: var annotations disappear after concatenation
adata1.var['important'] = values1  # values1/values2: placeholder annotation lists
adata2.var['important'] = values2

combined = ad.concat([adata1, adata2])  # merge=None by default drops them
# Solution: use an appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```

## Performance Tips

1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups when combining many datasets
4. **Prefer inner join**: Where appropriate, it reduces the result size
5. **Use categoricals**: Convert string annotations to categorical before concatenating (a sketch follows below)
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
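
For tip 5, a minimal sketch (the column name `cell_type` is illustrative):

```python
# Categorical columns are cheaper to store and concatenate than raw strings
for adata in (adata1, adata2):
    adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')

combined = ad.concat([adata1, adata2])
print(combined.obs['cell_type'].dtype)  # category
```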