Initial commit

Timothy Kassis
2025-10-19 14:01:29 -07:00
parent d85386c32b
commit 152d0d54de
15 changed files with 4569 additions and 0 deletions


@@ -0,0 +1,527 @@
---
name: anndata
description: Work with AnnData objects for annotated data matrices, commonly used in single-cell genomics and other scientific domains. This skill should be used when working with .h5ad files, performing single-cell RNA-seq analysis, managing annotated datasets, concatenating multiple datasets, or working with sparse matrices and embeddings in a structured format.
---
# AnnData
## Overview
AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. This skill provides comprehensive guidance for working with AnnData objects, including data creation, manipulation, file I/O, concatenation, and best practices for memory-efficient workflows.
## Core Capabilities
### 1. Creating and Structuring AnnData Objects
Create AnnData objects from various data sources and organize multi-dimensional annotations.
**Basic creation:**
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
# From dense or sparse arrays
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)
# With sparse matrix (memory-efficient)
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
```
**With metadata:**
```python
import pandas as pd
obs_meta = pd.DataFrame({
'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
'batch': ['batch1'] * 50 + ['batch2'] * 50
})
var_meta = pd.DataFrame({
'gene_name': [f'Gene_{i}' for i in range(2000)],
'highly_variable': np.random.choice([True, False], 2000)
})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```
**Understanding the structure** (an inspection sketch follows this list):
- **X**: Primary data matrix (observations × variables)
- **obs**: Row (observation) annotations as DataFrame
- **var**: Column (variable) annotations as DataFrame
- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
- **layers**: Alternative data matrices with same dimensions as X
- **uns**: Unstructured metadata dictionary
- **obsp/varp**: Pairwise relationship matrices (graphs)
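**Inspecting the structure** (a quick sketch; the fields shown assume an object built as above):
```python
# The repr lists every populated slot along with the dimensions
print(adata)                       # AnnData object with n_obs × n_vars = 100 × 2000
print(adata.shape)                 # (100, 2000)
print(adata.obs.columns.tolist())  # e.g. ['cell_type', 'batch']
print(list(adata.obsm.keys()))     # e.g. ['X_pca'] once embeddings are stored
```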
### 2. Adding Annotations and Layers
Organize different data representations and metadata within a single object.
**Cell-level metadata (obs):**
```python
# Flatten the sums so the result is 1-D for dense or sparse X
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)
```
**Gene-level metadata (var):**
```python
adata.var['highly_variable'] = gene_variance > threshold
adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])
```
**Embeddings (obsm/varm):**
```python
# Dimensionality reduction results
adata.obsm['X_pca'] = pca_coordinates # Shape: (n_obs, n_components)
adata.obsm['X_umap'] = umap_coordinates # Shape: (n_obs, 2)
adata.obsm['X_tsne'] = tsne_coordinates
# Gene loadings
adata.varm['PCs'] = principal_components # Shape: (n_vars, n_components)
```
**Alternative data representations (layers):**
```python
# Store multiple versions
adata.layers['counts'] = raw_counts
adata.layers['log1p'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - mean) / std
```
**Unstructured metadata (uns):**
```python
# Analysis parameters
adata.uns['preprocessing'] = {
'normalization': 'TPM',
'min_genes': 200,
'date': '2024-01-15'
}
# Results
adata.uns['pca'] = {'variance_ratio': variance_explained}
```
### 3. Subsetting and Views
Efficiently subset data while managing memory through views and copies.
**Subsetting operations:**
```python
# By observation/variable names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean masks
b_cells = adata[adata.obs.cell_type == 'B']
high_quality = adata[adata.obs.n_genes > 200]
# By position
first_cells = adata[:100, :]
top_genes = adata[:, :500]
# Combined conditions
filtered = adata[
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
adata.var.highly_variable
]
```
**Understanding views:**
- Subsetting returns **views** by default (memory-efficient, shares data with original)
- Assigning to a view triggers copy-on-write: anndata converts the view into an actual copy (emitting an `ImplicitModificationWarning`) rather than silently modifying the original
- Check with `adata.is_view`
- Convert to independent copy with `.copy()`
```python
# View (memory-efficient)
subset = adata[adata.obs.condition == 'treated']
print(subset.is_view) # True
# Independent copy
subset_copy = adata[adata.obs.condition == 'treated'].copy()
print(subset_copy.is_view) # False
```
### 4. File I/O and Backed Mode
Read and write data efficiently, with options for memory-limited environments.
**Writing data:**
```python
# Standard format with compression
adata.write('results.h5ad', compression='gzip')
# Alternative formats
adata.write_zarr('results.zarr') # For cloud storage
adata.write_loom('results.loom') # For compatibility
adata.write_csvs('results/') # As CSV files
```
**Reading data:**
```python
# Load into memory
adata = ad.read_h5ad('results.h5ad')
# Backed mode (disk-backed, memory-efficient)
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(adata.isbacked) # True
print(adata.filename) # Path to file
# Close file connection when done
adata.file.close()
```
**Reading from other formats:**
```python
# 10X format (CellRanger MTX files are genes × cells, so transpose)
adata = ad.read_mtx('matrix.mtx').T
# CSV
adata = ad.read_csv('data.csv')
# Loom
adata = ad.read_loom('data.loom')
```
**Working with backed mode:**
```python
# Read in backed mode for large files
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Process in chunks (chunked_X yields (chunk, start, end) tuples)
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    result = process_chunk(chunk)
# Load to memory if needed
adata_memory = adata.to_memory()
```
### 5. Concatenating Multiple Datasets
Combine multiple AnnData objects with control over how data is merged.
**Basic concatenation:**
```python
# Concatenate observations (most common)
combined = ad.concat([adata1, adata2, adata3], axis=0)
# Concatenate variables (rare)
combined = ad.concat([adata1, adata2], axis=1)
```
**Join strategies:**
```python
# Inner join: only shared variables (no missing data)
combined = ad.concat([adata1, adata2], join='inner')
# Outer join: all variables (missing entries filled with 0 for sparse X, NaN for dense)
combined = ad.concat([adata1, adata2], join='outer')
```
**Tracking data sources:**
```python
# Add source labels
combined = ad.concat(
[adata1, adata2, adata3],
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'
# Make duplicate indices unique
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.
```
**Merge strategies for metadata:**
```python
# merge=None: exclude variable annotations (default)
combined = ad.concat([adata1, adata2], merge=None)
# merge='same': keep only identical annotations
combined = ad.concat([adata1, adata2], merge='same')
# merge='first': use first occurrence
combined = ad.concat([adata1, adata2], merge='first')
# merge='unique': keep annotations with single value
combined = ad.concat([adata1, adata2], merge='unique')
```
**Complete example:**
```python
# Load batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with full tracking
combined = ad.concat(
[batch1, batch2, batch3],
axis=0,
join='outer', # Keep all genes
merge='first', # Use first batch's annotations
label='batch_id', # Track source
keys=['b1', 'b2', 'b3'], # Custom labels
index_unique='-' # Make cell names unique
)
```
### 6. Data Conversion and Extraction
Convert between AnnData and other formats for interoperability.
**To DataFrame:**
```python
# Convert X to DataFrame
df = adata.to_df()
# Convert specific layer
df = adata.to_df(layer='log1p')
```
**Extract vectors:**
```python
# Get 1D arrays from data or annotations
gene_expression = adata.obs_vector('Gene_100')
cell_metadata = adata.obs_vector('n_genes')
```
**Transpose:**
```python
# Swap observations and variables
transposed = adata.T
```
### 7. Memory Optimization
Strategies for working with large datasets efficiently.
**Use sparse matrices:**
```python
from scipy.sparse import csr_matrix, issparse
# Check sparsity (assumes dense X; scipy sparse .size counts stored values only)
if not issparse(adata.X):
    density = (adata.X != 0).sum() / adata.X.size
    if density < 0.3:  # less than 30% non-zero
        adata.X = csr_matrix(adata.X)
```
**Convert strings to categoricals:**
```python
# Automatic conversion
adata.strings_to_categoricals()
# Manual conversion (more control)
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
**Use backed mode:**
```python
# Read without loading into memory
adata = ad.read_h5ad('large_file.h5ad', backed='r')
# Work with subsets
subset = adata[:1000, :500].copy() # Only this subset in memory
```
**Chunked processing:**
```python
# Process data in chunks (chunked_X yields (chunk, start, end) tuples)
results = []
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    result = expensive_computation(chunk)
    results.append(result)
```
## Common Workflows
### Single-Cell RNA-seq Analysis
Complete workflow from loading to analysis:
```python
import anndata as ad
import numpy as np
import pandas as pd
# 1. Load data (10X MTX is genes × cells, so transpose; the TSVs are tab-separated)
adata = ad.read_mtx('matrix.mtx').T
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
# 2. Quality control (flatten the sums so they are 1-D for dense or sparse X)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]
# 3. Filter genes
min_cells = 3
adata = adata[:, np.asarray((adata.X > 0).sum(axis=0)).ravel() >= min_cells]
# 4. Store raw counts
adata.layers['counts'] = adata.X.copy()
# 5. Normalize to 10k counts per cell, then log-transform (dense X assumed)
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
adata.X = np.log1p(adata.X)
# 6. Feature selection
gene_var = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_var > np.percentile(gene_var, 90)
# 7. Dimensionality reduction (example with external tools)
# adata.obsm['X_pca'] = compute_pca(adata.X)
# adata.obsm['X_umap'] = compute_umap(adata.obsm['X_pca'])
# 8. Save results
adata.write('analyzed.h5ad', compression='gzip')
```
### Batch Integration
Combining multiple experimental batches:
```python
# Load batches
batches = [ad.read_h5ad(f'batch_{i}.h5ad') for i in range(3)]
# Concatenate with tracking
combined = ad.concat(
batches,
axis=0,
join='outer',
label='batch',
keys=['batch_0', 'batch_1', 'batch_2'],
index_unique='-'
)
# Add batch as numeric for correction algorithms
combined.obs['batch_numeric'] = combined.obs['batch'].cat.codes
# Perform batch correction (with external tools)
# corrected_pca = run_harmony(combined.obsm['X_pca'], combined.obs['batch'])
# combined.obsm['X_pca_corrected'] = corrected_pca
# Save integrated data
combined.write('integrated.h5ad', compression='gzip')
```
### Memory-Efficient Large Dataset Processing
Working with datasets too large for memory:
```python
# Read in backed mode
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
# Compute statistics in chunks (chunked_X yields (chunk, start, end) tuples)
total = 0
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    total += chunk.sum()
mean_expression = total / (adata.n_obs * adata.n_vars)
# Work with subset
high_quality_cells = adata.obs.n_genes > 1000
subset = adata[high_quality_cells, :].copy()
# Close file
adata.file.close()
```
## Best Practices
### Data Organization
1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
3. **Use uns for metadata**: Analysis parameters, dates, version information
4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
### Subsetting
1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
3. **Validate after subsetting**: Check dimensions and data integrity (see the sketch below)
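A minimal validation sketch (column names are illustrative):
```python
subset = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
].copy()
assert subset.n_obs > 0 and subset.n_vars > 0, "empty subset"
assert subset.obs_names.is_unique, "duplicate observation names"
```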
### File I/O
1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
3. **Close backed files**: Always close file connections when done (see the sketch below)
4. **Use backed mode for large files**: Don't load everything into memory if not needed
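A sketch of the close-when-done pattern for backed files:
```python
adata = ad.read_h5ad('large_file.h5ad', backed='r')
try:
    n_cells = adata.n_obs  # work with the backed object
finally:
    adata.file.close()     # always release the HDF5 handle
```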
### Concatenation
1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
2. **Track sources**: Use `label` and `keys` to track data origin
3. **Handle duplicates**: Use `index_unique` to make observation names unique
4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
### Memory Management
1. **Use sparse matrices**: For data with <30% non-zero values
2. **Convert to categoricals**: For repeated string values
3. **Process in chunks**: For operations on very large matrices
4. **Use backed mode**: Read large files with `backed='r'`
### Naming Conventions
Follow these conventions for consistency:
- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
- **Variables**: Use snake_case like `highly_variable`, `gene_name`
## Reference Documentation
For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory:
1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters.
2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows.
3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns.
## When to Load References
Load reference files into context when:
- Implementing complex concatenation with specific merge strategies
- Troubleshooting errors or unexpected behavior
- Optimizing memory usage for large datasets
- Implementing complete analysis workflows
- Understanding nuances of specific API methods
To search within references without loading them:
```bash
# Example: Search for information about backed mode
grep -r "backed mode" references/
```
## Common Error Patterns
### Memory Errors
**Problem**: "MemoryError: Unable to allocate array"
**Solution**: Use backed mode, sparse matrices, or process in chunks
### Dimension Mismatch
**Problem**: "ValueError: operands could not be broadcast together"
**Solution**: Use outer join in concatenation or align indices before operations
### View Modification
**Problem**: "ValueError: assignment destination is read-only"
**Solution**: Take a `.copy()` of the view (or load backed data with `.to_memory()`) before modifying, as sketched below
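A minimal sketch of this fix (the `condition` column is illustrative):
```python
import numpy as np
# Copy first so the subset owns writable arrays
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = np.log1p(subset.X)  # safe: modifies only the copy
```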
### File Already Open
**Problem**: "OSError: Unable to open file (file is already open)"
**Solution**: Close previous file connection with `adata.file.close()`


@@ -0,0 +1,218 @@
# AnnData API Reference
## Core AnnData Class
The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.
### Core Attributes
| Attribute | Type | Description |
|-----------|------|-------------|
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) |
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) |
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
| **layers** | dict-like | Additional data matrices matching X's dimensions |
| **raw** | AnnData | Stores original versions of X and var before transformations |
### Dimensional Properties
- **n_obs**: Number of observations (sample count)
- **n_vars**: Number of variables/features
- **shape**: Tuple returning (n_obs, n_vars)
- **T**: Transposed view of the entire object
### State Properties
- **isbacked**: Boolean indicating disk-backed storage status
- **is_view**: Boolean identifying whether object is a view of another AnnData
- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode
### Key Methods
#### Construction and Copying
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object
- **`copy(filename=None)`**: Create full copy, optionally stored on disk
#### Subsetting and Views
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default)
- **`.copy()`**: Convert view to independent object
#### Data Access
- **`to_df(layer=None)`**: Generate pandas DataFrame representation
- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations
- **`var_vector(k, layer=None)`**: Extract 1D array for a variable
- **`chunk_X(select=1000, replace=True)`**: Return a chunk of randomly (or explicitly) selected rows of X
- **`chunked_X(chunk_size)`**: Iterate over X in row blocks, yielding `(chunk, start, end)` tuples
#### Transformation
- **`transpose()`**: Return transposed object
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along the observation axis (legacy method; prefer the module-level `ad.concat`)
- **`to_memory(copy=False)`**: Load all backed arrays into RAM
#### File I/O
- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format
- **`write_zarr(store, ...)`**: Export hierarchical Zarr store
- **`write_loom(filename, ...)`**: Output .loom format file
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files
#### Data Management
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
- **`rename_categories(key, categories)`**: Update category labels in annotations
- **`obs_names_make_unique(join='-')`**: Append numeric suffixes to duplicate observation names
- **`var_names_make_unique(join='-')`**: Append numeric suffixes to duplicate variable names
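A short sketch of these cleanup methods in use:
```python
# Suffix duplicated names: Cell_0, Cell_0 -> Cell_0, Cell_0-1
adata.obs_names_make_unique()
adata.var_names_make_unique()
# Convert string obs/var columns to memory-efficient categoricals in place
adata.strings_to_categoricals()
```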
## Module-Level Functions
### Reading Functions
#### Native Formats
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
- **`read_zarr(store)`**: Access hierarchical Zarr array stores
#### Alternative Formats
- **`read_csv(filename, ...)`**: Import from CSV files
- **`read_excel(filename, ...)`**: Import from Excel files
- **`read_hdf(filename, key)`**: Read from HDF5 files
- **`read_loom(filename, ...)`**: Import from .loom files
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
- **`read_text(filename, ...)`**: Import from text files
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format
#### Element-Level Access
- **`read_elem(elem)`**: Retrieve specific components from storage
- **`sparse_dataset(group)`**: Generate backed sparse matrix classes
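A sketch of element-level reading, assuming a release where `read_elem` is exported at the top level (older versions expose it under `anndata.experimental`):
```python
import h5py
from anndata import read_elem

# Load individual components without reading the whole file
with h5py.File('results.h5ad', 'r') as f:
    obs = read_elem(f['obs'])  # just the obs DataFrame
    X = read_elem(f['X'])      # just the data matrix
```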
### Combining Operations
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
- **axis**: 0 (observations) or 1 (variables)
- **join**: 'inner' (intersection) or 'outer' (union)
- **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None)
- **label**: Column name for source tracking
- **keys**: Dataset identifiers for source annotation
- **index_unique**: Separator for making duplicate indices unique
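A single call exercising these parameters (object names are illustrative):
```python
combined = ad.concat(
    [adata1, adata2],
    axis=0,             # stack observations
    join='outer',       # union of variables
    merge='first',      # var annotations from the first object that has them
    label='sample',     # obs column recording the source
    keys=['s1', 's2'],  # values written into obs['sample']
    index_unique='-'    # suffix duplicated obs names with the key
)
```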
### Writing Functions
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays
- **`write_elem(store, k, elem)`**: Write individual components to a storage group
### Experimental Features
- **`AnnCollection`**: Batch processing for large collections
- **`AnnLoader`**: PyTorch DataLoader integration
- **`concat_on_disk(in_files, out_file, ...)`**: Memory-efficient out-of-core concatenation
- **`read_lazy(filename)`**: Lazy loading with deferred computation
- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks
- **`write_dispatched(filename, ...)`**: Custom writing with callbacks
### Configuration
- **`settings`**: Package-wide configuration object
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes
## Common Usage Patterns
### Creating AnnData Objects
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
# From dense array
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)
# From sparse matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
# With metadata
import pandas as pd
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```
### Subsetting
```python
# By names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean mask
b_cells = adata[adata.obs.cell_type == 'B']
# By position
first_five = adata[:5, :100]
# Convert view to copy
adata_copy = adata[:5].copy()
```
### Adding Annotations
```python
# Cell-level metadata
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)
# Gene-level metadata
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)
# Embeddings
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))
# Alternative data representations
adata.layers['log_transformed'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
# Unstructured metadata
adata.uns['experiment_date'] = '2024-01-15'
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
```
### File I/O
```python
# Write to disk
adata.write('my_results.h5ad', compression='gzip')
# Read into memory
adata = ad.read_h5ad('my_results.h5ad')
# Read in backed mode (memory-efficient)
adata = ad.read_h5ad('my_results.h5ad', backed='r')
# Close file connection
adata.file.close()
```
### Concatenation
```python
# Combine multiple datasets
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))
# Simple concatenation
combined = ad.concat([adata1, adata2, adata3], axis=0)
# With source labels
combined = ad.concat(
[adata1, adata2, adata3],
axis=0,
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Inner join (only shared variables)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')
# Outer join (all variables; missing entries padded with the fill value)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
```


@@ -0,0 +1,478 @@
# AnnData Concatenation Guide
## Overview
The `concat()` function combines multiple AnnData objects through two fundamental operations:
1. **Concatenation**: Stacking sub-elements in order
2. **Merging**: Combining collections into one result
## Basic Concatenation
### Syntax
```python
import anndata as ad
import numpy as np
import pandas as pd

combined = ad.concat(
    adatas,             # list of AnnData objects
    axis=0,             # 0=observations, 1=variables
    join='inner',       # 'inner' or 'outer'
    merge=None,         # merge strategy for the non-concatenation axis
    uns_merge=None,     # separate merge strategy for .uns
    label=None,         # column name for source tracking
    keys=None,          # dataset identifiers
    index_unique=None,  # separator for unique indices
    fill_value=None,    # fill value for missing data (outer join)
    pairwise=False      # include pairwise matrices
)
```
### Concatenating Observations (Cells)
```python
# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))
combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```
### Concatenating Variables (Genes)
```python
# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))
combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```
## Join Strategies
### Inner Join (Intersection)
Keeps only shared features across all objects.
```python
# Datasets with different genes
adata1 = ad.AnnData(
np.random.rand(100, 2000),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
np.random.rand(150, 1800),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)
# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```
**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise analysis
- You need a complete case analysis
**Trade-offs:**
- May lose many features
- Ensures no missing data
- Smaller result size
### Outer Join (Union)
Keeps all features from all objects, padding missing entries with a fill value (0 for sparse arrays, NaN for dense ones, unless `fill_value` is set).
```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0
# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```
**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent
**Trade-offs:**
- Introduces zeros/missing values
- Larger result size
- May need imputation
## Merge Strategies
Merge strategies control how elements on the non-concatenation axis are combined.
### merge=None (Default)
Excludes all non-concatenation axis elements.
```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800
# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```
**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging
### merge='same'
Keeps only annotations with identical values across datasets.
```python
# Same annotation values on the shared genes (Gene_200 .. Gene_1999)
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000
# merge='same': keeps chromosome because all shared genes agree
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```
**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important
**Note:** Comparison occurs after index alignment - only shared indices need to match.
### merge='unique'
Includes annotations with a single possible value.
```python
# Unique value per gene; adata2 starts at Gene_200, hence the offset
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(adata1.n_vars)]
adata2.var['ensembl_id'] = [f'ENSG{i + 200:08d}' for i in range(adata2.n_vars)]
# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```
**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific
### merge='first'
Takes the first occurrence of each annotation.
```python
# Different annotation versions
adata1.var['description'] = ['desc1'] * adata1.n_vars
adata2.var['description'] = ['desc2'] * adata2.n_vars
# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
# Uses descriptions from adata1
```
**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple resolution strategy
### merge='only'
Retains annotations appearing in exactly one object.
```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * adata1.n_vars
adata2.var['dataset2_specific'] = ['value'] * adata2.n_vars
# merge='only': keeps both (no conflicts)
combined = ad.concat([adata1, adata2], merge='only')
```
**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata
## Source Tracking
### Using label
Add a categorical column to track data origin.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='batch'
)
# Creates obs['batch'] with values 0, 1, 2
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
```
### Using keys
Provide custom names for source tracking.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='study',
keys=['control', 'treatment_a', 'treatment_b']
)
# Creates obs['study'] with custom names
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
```
### Making Indices Unique
Append source identifiers to duplicate observation names.
```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
# index_unique adds suffix
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```
## Handling Different Attributes
### X Matrix and Layers
Follows join strategy. Missing values filled according to `fill_value`.
```python
# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()
# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```
### obs and var DataFrames
- **obs**: Concatenated along concatenation axis
- **var**: Handled by merge strategy
```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150
combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells
```
### obsm and varm
Multi-dimensional annotations follow same rules as layers.
```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)
combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```
### obsp and varp
Pairwise matrices excluded by default. Enable with `pairwise=True`.
```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)
# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp
# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in padded block diagonal matrix
```
### uns Dictionary
Merged recursively, controlled by the separate `uns_merge` argument (same options as `merge`) and applied at any nesting depth.
```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
# uns_merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], uns_merge='same')
# combined.uns['experiment'] == {'lab': 'A'}
```
## Advanced Patterns
### Batch Integration Pipeline
```python
import anndata as ad
# Load batches
batches = [
ad.read_h5ad(f'batch_{i}.h5ad')
for i in range(5)
]
# Concatenate with tracking
combined = ad.concat(
batches,
axis=0,
join='outer',
merge='first',
label='batch_id',
keys=[f'batch_{i}' for i in range(5)],
index_unique='-'
)
# Encode batch as numeric codes for correction algorithms
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```
### Multi-Study Meta-Analysis
```python
# Different studies with varying gene coverage
studies = {
'study_a': ad.read_h5ad('study_a.h5ad'),
'study_b': ad.read_h5ad('study_b.h5ad'),
'study_c': ad.read_h5ad('study_c.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(studies.values()),
axis=0,
join='outer',
label='study',
keys=list(studies.keys()),
merge='unique',
fill_value=0
)
# Track coverage
for study in studies:
    n_genes = studies[study].n_vars
    combined.uns[f'{study}_n_genes'] = n_genes
```
### Incremental Concatenation
```python
# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
    chunk_files = all_files[i:i + chunk_size]
    chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
    chunk_combined = ad.concat(chunk_adatas)
    if result is None:
        result = chunk_combined
    else:
        result = ad.concat([result, chunk_combined])
```
### Memory-Efficient On-Disk Concatenation
```python
# Experimental feature for large datasets
from anndata.experimental import concat_on_disk
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
files,
'combined.h5ad',
join='outer'
)
# Read result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```
## Troubleshooting
### Issue: Dimension Mismatch
```python
# Objects with different variable sets: the default inner join silently
# keeps only the shared variables rather than erroring
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))
# Solution: use an outer join to keep all variables
combined = ad.concat([adata1, adata2], join='outer')
```
### Issue: Memory Error
```python
# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]
# Solution: read and concatenate incrementally
result = None
for file in many_files:
    adata = ad.read_h5ad(file)
    if result is None:
        result = adata
    else:
        result = ad.concat([result, adata])
    del adata  # Free memory
```
### Issue: Duplicate Indices
```python
# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
```
### Issue: Lost Annotations
```python
# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2
combined = ad.concat([adata1, adata2]) # merge=None by default
# Solution: use appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```
## Performance Tips
1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups for many datasets
4. **Choose inner join**: When possible, to reduce result size
5. **Use categoricals**: Convert string annotations before concatenating
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
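A sketch applying tips 1, 2, and 5 ahead of a concatenation (assumes `adata1` and `adata2` already exist):
```python
import anndata as ad
import pandas as pd
from scipy.sparse import csr_matrix, issparse

for a in (adata1, adata2):
    a.var_names_make_unique()              # tip 1: consistent, unique names
    if not issparse(a.X):
        a.X = csr_matrix(a.X)              # tip 2: sparse concatenates cheaply
    for col in a.obs.columns:
        if a.obs[col].dtype == object:
            a.obs[col] = pd.Categorical(a.obs[col])  # tip 5: categoricals
combined = ad.concat([adata1, adata2], join='inner') # tip 4: smaller result
```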


@@ -0,0 +1,438 @@
# AnnData Workflows and Best Practices
## Common Workflows
### 1. Single-Cell RNA-seq Analysis Workflow
#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd
# Load from 10X format (the MTX on disk is genes × cells, so transpose)
adata = ad.read_mtx('matrix.mtx').T
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]
# Or load from pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```
#### Quality Control
```python
# Calculate QC metrics (flatten so results are 1-D for dense or sparse X)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]
# Filter genes
min_cells = 3
adata = adata[:, np.asarray((adata.X > 0).sum(axis=0)).ravel() >= min_cells]
```
#### Normalization and Preprocessing
```python
# Store raw counts
adata.layers['counts'] = adata.X.copy()
# Normalize to 10k counts per cell (dense X assumed)
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']
# Identify highly variable genes
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```
#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)
# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}
# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```
#### Clustering
```python
# Store cluster assignments
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])
# Store cluster centroids (varm entries need n_vars rows)
centroids = np.array([...])  # shape: (n_vars, n_clusters)
adata.varm['cluster_centroids'] = centroids
```
#### Save Results
```python
# Save complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```
### 2. Batch Integration Workflow
```python
import anndata as ad
# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with batch labels
adata = ad.concat(
[batch1, batch2, batch3],
axis=0,
label='batch',
keys=['batch1', 'batch2', 'batch3'],
index_unique='-'
)
# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)
# Store corrected embeddings afterwards, e.g.:
# adata.obsm['X_pca_corrected'] = corrected_pca
# adata.obsm['X_umap_corrected'] = corrected_umap
```
### 3. Memory-Efficient Large Dataset Workflow
```python
import anndata as ad
# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")
# Work with chunks (chunked_X yields (chunk, start, end) tuples)
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process chunk
    result = process_chunk(chunk)
# Close file when done
adata.file.close()
```
### 4. Multi-Dataset Comparison Workflow
```python
import anndata as ad
import numpy as np
# Load datasets
datasets = {
'study1': ad.read_h5ad('study1.h5ad'),
'study2': ad.read_h5ad('study2.h5ad'),
'study3': ad.read_h5ad('study3.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(datasets.values()),
axis=0,
join='outer',
label='study',
keys=list(datasets.keys()),
merge='first'
)
# Handle missing data (with dense X, the outer join fills gaps with NaN)
combined.X[np.isnan(combined.X)] = 0
# Add dataset-specific metadata
combined.uns['datasets'] = {
'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```
## Best Practices
### Memory Management
#### Use Sparse Matrices
```python
from scipy.sparse import csr_matrix
# Convert to sparse if the data is mostly zeros (dense X assumed)
density = (adata.X != 0).sum() / adata.X.size
if density < 0.3:  # less than 30% non-zero
    adata.X = csr_matrix(adata.X)
```
#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')
# Only load what you need
subset = adata[:1000, :500].copy() # Now in memory
```
#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()
# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
### Data Organization
#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```
#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates
# Gene loadings
adata.varm['PCs'] = principal_components
```
#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
'normalization': 'TPM',
'min_genes': 200,
'min_cells': 3,
'date': '2024-01-15'
}
# Store analysis results
adata.uns['differential_expression'] = {
'method': 't-test',
'p_value_threshold': 0.05
}
```
### Subsetting and Views
#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell'] # View
print(subset.is_view) # True
# Views are memory efficient, but assigning to one triggers copy-on-write:
# the view is materialized as a copy (with an ImplicitModificationWarning)
# and the original adata is left unchanged
subset.obs['new_column'] = value  # subset silently becomes a copy
# Create independent copy when needed
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```
#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()
# Good - chain operations
result = adata[
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
adata.var.highly_variable
].copy()
```
### File I/O
#### Use Compression
```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```
#### Choose the Right Format
```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')
# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')
# Loom for compatibility with other tools
adata.write_loom('data.loom')
```
#### Close File Connections
```python
# Use context manager pattern
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
    # Work with data
    process(adata)
finally:
    adata.file.close()
```
### Concatenation
#### Choose Appropriate Join Strategy
```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')
# Outer join - all features (keeps all data, may introduce zeros)
combined = ad.concat([adata1, adata2], join='outer')
```
#### Track Data Sources
```python
# Add source labels
combined = ad.concat(
[adata1, adata2, adata3],
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Make indices unique
combined = ad.concat(
[adata1, adata2, adata3],
index_unique='-'
)
```
#### Handle Variable-Specific Metadata
```python
# Use merge strategy for var annotations
combined = ad.concat(
[adata1, adata2],
merge='same', # Keep only identical annotations
join='outer'
)
```
### Naming Conventions
#### Use Consistent Naming
```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']
# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']
# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```
#### Make Indices Unique
```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```
### Error Handling
#### Validate Data Structure
```python
# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"
# Check for NaN values (dense X; for sparse X inspect adata.X.data instead)
if np.isnan(adata.X).any():
    print("Warning: NaN values detected")
# Check for negative values in count data
if (adata.X < 0).any():
    print("Warning: Negative values in count data")
```
#### Handle Missing Data
```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
    print("Warning: Missing cell type annotations")
# Fill or remove
adata = adata[~adata.obs['cell_type'].isna()]
```
## Common Pitfalls
### 1. Relying on Implicit View Copying
```python
# SURPRISING - assigning to a view triggers copy-on-write: the view is
# silently materialized as a copy (with an ImplicitModificationWarning);
# the original adata is unchanged, but the cost and intent are hidden
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data
# EXPLICIT - copy first so the behavior is obvious
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data  # Only changes subset
```
### 2. Mixing Backed and In-Memory Operations
```python
# BAD - trying to modify backed data
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100 # Error: can't modify backed data
# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100 # Works
```
### 3. Not Using Categoricals for Metadata
```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000
# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```
### 4. Incorrect Concatenation Axis
```python
# Concatenating observations (cells)
combined = ad.concat([adata1, adata2], axis=0) # Correct
# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1) # Less common
```
### 5. Not Preserving Raw Data
```python
# BAD - loses original data
adata.X = normalized_data
# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```