mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
479 lines
11 KiB
Markdown
479 lines
11 KiB
Markdown
# AnnData Concatenation Guide
|
|
|
|
## Overview
|
|
|
|
The `concat()` function combines multiple AnnData objects through two fundamental operations:
|
|
1. **Concatenation**: Stacking sub-elements in order
|
|
2. **Merging**: Combining collections into one result
|
|
|
|
## Basic Concatenation
|
|
|
|
### Syntax
|
|
```python
|
|
import anndata as ad
|
|
|
|
combined = ad.concat(
|
|
adatas, # List of AnnData objects
|
|
axis=0, # 0=observations, 1=variables
|
|
join='inner', # 'inner' or 'outer'
|
|
merge=None, # Merge strategy for non-concat axis
|
|
label=None, # Column name for source tracking
|
|
keys=None, # Dataset identifiers
|
|
index_unique=None, # Separator for unique indices
|
|
fill_value=None, # Fill value for missing data
|
|
pairwise=False # Include pairwise matrices
|
|
)
|
|
```
|
|
|
|
### Concatenating Observations (Cells)
|
|
```python
|
|
# Most common: combining multiple samples/batches
|
|
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
|
adata2 = ad.AnnData(np.random.rand(150, 2000))
|
|
adata3 = ad.AnnData(np.random.rand(80, 2000))
|
|
|
|
combined = ad.concat([adata1, adata2, adata3], axis=0)
|
|
# Result: (330 observations, 2000 variables)
|
|
```
|
|
|
|
### Concatenating Variables (Genes)
|
|
```python
|
|
# Less common: combining different feature sets
|
|
adata1 = ad.AnnData(np.random.rand(100, 1000))
|
|
adata2 = ad.AnnData(np.random.rand(100, 500))
|
|
|
|
combined = ad.concat([adata1, adata2], axis=1)
|
|
# Result: (100 observations, 1500 variables)
|
|
```
|
|
|
|
## Join Strategies
|
|
|
|
### Inner Join (Intersection)
|
|
|
|
Keeps only shared features across all objects.
|
|
|
|
```python
|
|
# Datasets with different genes
|
|
adata1 = ad.AnnData(
|
|
np.random.rand(100, 2000),
|
|
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
|
|
)
|
|
adata2 = ad.AnnData(
|
|
np.random.rand(150, 1800),
|
|
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
|
|
)
|
|
|
|
# Inner join: only genes present in both
|
|
combined = ad.concat([adata1, adata2], join='inner')
|
|
# Result: (250 observations, 1800 variables)
|
|
# Only Gene_200 through Gene_1999
|
|
```
|
|
|
|
**Use when:**
|
|
- You want to analyze only features measured in all datasets
|
|
- Missing features would compromise analysis
|
|
- You need a complete case analysis
|
|
|
|
**Trade-offs:**
|
|
- May lose many features
|
|
- Ensures no missing data
|
|
- Smaller result size
|
|
|
|
### Outer Join (Union)
|
|
|
|
Keeps all features from all objects, padding with fill values (default 0).
|
|
|
|
```python
|
|
# Outer join: all genes from both datasets
|
|
combined = ad.concat([adata1, adata2], join='outer')
|
|
# Result: (250 observations, 2000 variables)
|
|
# Missing values filled with 0
|
|
|
|
# Custom fill value
|
|
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
|
|
```
|
|
|
|
**Use when:**
|
|
- You want to preserve all features
|
|
- Sparse data is acceptable
|
|
- Features are independent
|
|
|
|
**Trade-offs:**
|
|
- Introduces zeros/missing values
|
|
- Larger result size
|
|
- May need imputation
|
|
|
|
## Merge Strategies
|
|
|
|
Merge strategies control how elements on the non-concatenation axis are combined.
|
|
|
|
### merge=None (Default)
|
|
|
|
Excludes all non-concatenation axis elements.
|
|
|
|
```python
|
|
# Both datasets have var annotations
|
|
adata1.var['gene_type'] = ['protein_coding'] * 2000
|
|
adata2.var['gene_type'] = ['protein_coding'] * 1800
|
|
|
|
# merge=None: var annotations excluded
|
|
combined = ad.concat([adata1, adata2], merge=None)
|
|
assert 'gene_type' not in combined.var.columns
|
|
```
|
|
|
|
**Use when:**
|
|
- Annotations are dataset-specific
|
|
- You'll add new annotations after merging
|
|
|
|
### merge='same'
|
|
|
|
Keeps only annotations with identical values across datasets.
|
|
|
|
```python
|
|
# Same annotation values
|
|
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
|
|
adata2.var['chromosome'] = ['chr1'] * 900 + ['chr2'] * 900
|
|
|
|
# merge='same': keeps chromosome annotation
|
|
combined = ad.concat([adata1, adata2], merge='same')
|
|
assert 'chromosome' in combined.var.columns
|
|
```
|
|
|
|
**Use when:**
|
|
- Annotations should be consistent
|
|
- You want to validate consistency
|
|
- Shared metadata is important
|
|
|
|
**Note:** Comparison occurs after index alignment - only shared indices need to match.
|
|
|
|
### merge='unique'
|
|
|
|
Includes annotations with a single possible value.
|
|
|
|
```python
|
|
# Unique values per gene
|
|
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
|
|
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
|
|
|
|
# merge='unique': keeps ensembl_id
|
|
combined = ad.concat([adata1, adata2], merge='unique')
|
|
```
|
|
|
|
**Use when:**
|
|
- Each feature has a unique identifier
|
|
- Annotations are feature-specific
|
|
|
|
### merge='first'
|
|
|
|
Takes the first occurrence of each annotation.
|
|
|
|
```python
|
|
# Different annotation versions
|
|
adata1.var['description'] = ['desc1'] * 2000
|
|
adata2.var['description'] = ['desc2'] * 2000
|
|
|
|
# merge='first': uses adata1's descriptions
|
|
combined = ad.concat([adata1, adata2], merge='first')
|
|
# Uses descriptions from adata1
|
|
```
|
|
|
|
**Use when:**
|
|
- One dataset has authoritative annotations
|
|
- Order matters
|
|
- You need a simple resolution strategy
|
|
|
|
### merge='only'
|
|
|
|
Retains annotations appearing in exactly one object.
|
|
|
|
```python
|
|
# Dataset-specific annotations
|
|
adata1.var['dataset1_specific'] = ['value'] * 2000
|
|
adata2.var['dataset2_specific'] = ['value'] * 2000
|
|
|
|
# merge='only': keeps both (no conflicts)
|
|
combined = ad.concat([adata1, adata2], merge='only')
|
|
```
|
|
|
|
**Use when:**
|
|
- Datasets have non-overlapping annotations
|
|
- You want to preserve all unique metadata
|
|
|
|
## Source Tracking
|
|
|
|
### Using label
|
|
|
|
Add a categorical column to track data origin.
|
|
|
|
```python
|
|
combined = ad.concat(
|
|
[adata1, adata2, adata3],
|
|
label='batch'
|
|
)
|
|
|
|
# Creates obs['batch'] with values 0, 1, 2
|
|
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
|
|
```
|
|
|
|
### Using keys
|
|
|
|
Provide custom names for source tracking.
|
|
|
|
```python
|
|
combined = ad.concat(
|
|
[adata1, adata2, adata3],
|
|
label='study',
|
|
keys=['control', 'treatment_a', 'treatment_b']
|
|
)
|
|
|
|
# Creates obs['study'] with custom names
|
|
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
|
|
```
|
|
|
|
### Making Indices Unique
|
|
|
|
Append source identifiers to duplicate observation names.
|
|
|
|
```python
|
|
# Both datasets have cells named "Cell_0", "Cell_1", etc.
|
|
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
|
|
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
|
|
|
|
# index_unique adds suffix
|
|
combined = ad.concat(
|
|
[adata1, adata2],
|
|
keys=['batch1', 'batch2'],
|
|
index_unique='-'
|
|
)
|
|
|
|
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
|
|
print(combined.obs_names[:5])
|
|
```
|
|
|
|
## Handling Different Attributes
|
|
|
|
### X Matrix and Layers
|
|
|
|
Follows join strategy. Missing values filled according to `fill_value`.
|
|
|
|
```python
|
|
# Both have layers
|
|
adata1.layers['counts'] = adata1.X.copy()
|
|
adata2.layers['counts'] = adata2.X.copy()
|
|
|
|
# Concatenates both X and layers
|
|
combined = ad.concat([adata1, adata2])
|
|
assert 'counts' in combined.layers
|
|
```
|
|
|
|
### obs and var DataFrames
|
|
|
|
- **obs**: Concatenated along concatenation axis
|
|
- **var**: Handled by merge strategy
|
|
|
|
```python
|
|
adata1.obs['cell_type'] = ['B cell'] * 100
|
|
adata2.obs['cell_type'] = ['T cell'] * 150
|
|
|
|
combined = ad.concat([adata1, adata2])
|
|
# obs['cell_type'] preserved for all cells
|
|
```
|
|
|
|
### obsm and varm
|
|
|
|
Multi-dimensional annotations follow same rules as layers.
|
|
|
|
```python
|
|
adata1.obsm['X_pca'] = np.random.rand(100, 50)
|
|
adata2.obsm['X_pca'] = np.random.rand(150, 50)
|
|
|
|
combined = ad.concat([adata1, adata2])
|
|
# obsm['X_pca'] concatenated: shape (250, 50)
|
|
```
|
|
|
|
### obsp and varp
|
|
|
|
Pairwise matrices excluded by default. Enable with `pairwise=True`.
|
|
|
|
```python
|
|
# Distance matrices
|
|
adata1.obsp['distances'] = np.random.rand(100, 100)
|
|
adata2.obsp['distances'] = np.random.rand(150, 150)
|
|
|
|
# Excluded by default
|
|
combined = ad.concat([adata1, adata2])
|
|
assert 'distances' not in combined.obsp
|
|
|
|
# Include if needed
|
|
combined = ad.concat([adata1, adata2], pairwise=True)
|
|
# Results in padded block diagonal matrix
|
|
```
|
|
|
|
### uns Dictionary
|
|
|
|
Merged recursively, applying merge strategy at any nesting depth.
|
|
|
|
```python
|
|
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
|
|
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
|
|
|
|
# merge='same' keeps 'lab', excludes 'date'
|
|
combined = ad.concat([adata1, adata2], merge='same')
|
|
# combined.uns['experiment'] = {'lab': 'A'}
|
|
```
|
|
|
|
## Advanced Patterns
|
|
|
|
### Batch Integration Pipeline
|
|
|
|
```python
|
|
import anndata as ad
|
|
|
|
# Load batches
|
|
batches = [
|
|
ad.read_h5ad(f'batch_{i}.h5ad')
|
|
for i in range(5)
|
|
]
|
|
|
|
# Concatenate with tracking
|
|
combined = ad.concat(
|
|
batches,
|
|
axis=0,
|
|
join='outer',
|
|
merge='first',
|
|
label='batch_id',
|
|
keys=[f'batch_{i}' for i in range(5)],
|
|
index_unique='-'
|
|
)
|
|
|
|
# Add batch effects
|
|
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
|
|
```
|
|
|
|
### Multi-Study Meta-Analysis
|
|
|
|
```python
|
|
# Different studies with varying gene coverage
|
|
studies = {
|
|
'study_a': ad.read_h5ad('study_a.h5ad'),
|
|
'study_b': ad.read_h5ad('study_b.h5ad'),
|
|
'study_c': ad.read_h5ad('study_c.h5ad')
|
|
}
|
|
|
|
# Outer join to keep all genes
|
|
combined = ad.concat(
|
|
list(studies.values()),
|
|
axis=0,
|
|
join='outer',
|
|
label='study',
|
|
keys=list(studies.keys()),
|
|
merge='unique',
|
|
fill_value=0
|
|
)
|
|
|
|
# Track coverage
|
|
for study in studies:
|
|
n_genes = studies[study].n_vars
|
|
combined.uns[f'{study}_n_genes'] = n_genes
|
|
```
|
|
|
|
### Incremental Concatenation
|
|
|
|
```python
|
|
# For many datasets, concatenate in batches
|
|
chunk_size = 10
|
|
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
|
|
|
|
# Process in chunks
|
|
result = None
|
|
for i in range(0, len(all_files), chunk_size):
|
|
chunk_files = all_files[i:i+chunk_size]
|
|
chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
|
|
chunk_combined = ad.concat(chunk_adatas)
|
|
|
|
if result is None:
|
|
result = chunk_combined
|
|
else:
|
|
result = ad.concat([result, chunk_combined])
|
|
```
|
|
|
|
### Memory-Efficient On-Disk Concatenation
|
|
|
|
```python
|
|
# Experimental feature for large datasets
|
|
from anndata.experimental import concat_on_disk
|
|
|
|
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
|
|
concat_on_disk(
|
|
files,
|
|
'combined.h5ad',
|
|
join='outer'
|
|
)
|
|
|
|
# Read result in backed mode
|
|
combined = ad.read_h5ad('combined.h5ad', backed='r')
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Issue: Dimension Mismatch
|
|
|
|
```python
|
|
# Error: shapes don't match
|
|
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
|
adata2 = ad.AnnData(np.random.rand(150, 1500))
|
|
|
|
# Solution: use outer join
|
|
combined = ad.concat([adata1, adata2], join='outer')
|
|
```
|
|
|
|
### Issue: Memory Error
|
|
|
|
```python
|
|
# Problem: too many large objects in memory
|
|
large_adatas = [ad.read_h5ad(f) for f in many_files]
|
|
|
|
# Solution: read and concatenate incrementally
|
|
result = None
|
|
for file in many_files:
|
|
adata = ad.read_h5ad(file)
|
|
if result is None:
|
|
result = adata
|
|
else:
|
|
result = ad.concat([result, adata])
|
|
del adata # Free memory
|
|
```
|
|
|
|
### Issue: Duplicate Indices
|
|
|
|
```python
|
|
# Problem: same cell names in different batches
|
|
# Solution: use index_unique
|
|
combined = ad.concat(
|
|
[adata1, adata2],
|
|
keys=['batch1', 'batch2'],
|
|
index_unique='-'
|
|
)
|
|
```
|
|
|
|
### Issue: Lost Annotations
|
|
|
|
```python
|
|
# Problem: annotations disappear
|
|
adata1.var['important'] = values1
|
|
adata2.var['important'] = values2
|
|
|
|
combined = ad.concat([adata1, adata2]) # merge=None by default
|
|
# Solution: use appropriate merge strategy
|
|
combined = ad.concat([adata1, adata2], merge='first')
|
|
```
|
|
|
|
## Performance Tips
|
|
|
|
1. **Pre-align indices**: Ensure consistent naming before concatenation
|
|
2. **Use sparse matrices**: Convert to sparse before concatenating
|
|
3. **Batch operations**: Concatenate in groups for many datasets
|
|
4. **Choose inner join**: When possible, to reduce result size
|
|
5. **Use categoricals**: Convert string annotations before concatenating
|
|
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
|