mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Initial commit

scientific-packages/anndata/references/api_reference.md

# AnnData API Reference

## Core AnnData Class

The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.

### Core Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to the observation axis) |
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to the variable axis) |
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
| **layers** | dict-like | Additional data matrices matching X's dimensions |
| **raw** | AnnData | Stores original versions of X and var before transformations |
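
The alignment contract in this table can be sketched without anndata itself: every component's leading dimension must match either the observation count or the variable count. A minimal NumPy/pandas sketch of those invariants (all names here are illustrative):

```python
import numpy as np
import pandas as pd

n_obs, n_vars = 100, 2000

# Components shaped the way AnnData expects them
X = np.zeros((n_obs, n_vars))                                  # primary matrix
obs = pd.DataFrame(index=[f"Cell_{i}" for i in range(n_obs)])  # per-observation
var = pd.DataFrame(index=[f"Gene_{i}" for i in range(n_vars)]) # per-variable
obsm_pca = np.zeros((n_obs, 50))    # obs-aligned, arbitrary width
obsp_dist = np.zeros((n_obs, n_obs))  # pairwise: square in n_obs
layer = np.zeros_like(X)            # layers match X exactly

# The invariants AnnData enforces at construction time
assert X.shape == (len(obs), len(var))
assert obsm_pca.shape[0] == n_obs
assert obsp_dist.shape == (n_obs, n_obs)
assert layer.shape == X.shape
```

Passing components that violate any of these shape checks raises at `AnnData(...)` construction time rather than later in an analysis.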

### Dimensional Properties

- **n_obs**: Number of observations (sample count)
- **n_vars**: Number of variables/features
- **shape**: Tuple of (n_obs, n_vars)
- **T**: Transposed view of the entire object

### State Properties

- **isbacked**: Boolean indicating disk-backed storage status
- **is_view**: Boolean identifying whether the object is a view of another AnnData
- **filename**: Path to the backing .h5ad file; setting this enables disk-backed mode

### Key Methods

#### Construction and Copying
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create a new AnnData object
- **`copy(filename=None)`**: Create a full copy, optionally stored on disk

#### Subsetting and Views
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns a view by default)
- **`.copy()`**: Convert a view to an independent object

#### Data Access
- **`to_df(layer=None)`**: Generate a pandas DataFrame representation
- **`obs_vector(k, layer=None)`**: Extract a 1D observation-aligned array from X, layers, or annotations
- **`var_vector(k, layer=None)`**: Extract a 1D variable-aligned array
- **`chunk_X(select, replace=True)`**: Return a chunk of the data matrix with random or specified row indices
- **`chunked_X(chunk_size)`**: Iterate over the data matrix in row-wise chunks

#### Transformation
- **`transpose()`**: Return the transposed object
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along the observation axis (deprecated; prefer the module-level `concat`)
- **`to_memory(copy=False)`**: Load all backed arrays into RAM

#### File I/O
- **`write_h5ad(filename, compression=None)`**: Save in .h5ad HDF5 format (pass `compression='gzip'` to compress)
- **`write_zarr(store, ...)`**: Export a hierarchical Zarr store
- **`write_loom(filename, ...)`**: Output a .loom format file
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files

#### Data Management
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
- **`rename_categories(key, categories)`**: Update category labels in annotations
- **`obs_names_make_unique(join='-')`**: Append numeric suffixes to duplicate observation names
- **`var_names_make_unique(join='-')`**: Append numeric suffixes to duplicate variable names

## Module-Level Functions

### Reading Functions

#### Native Formats
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
- **`read_zarr(store)`**: Access hierarchical Zarr array stores

#### Alternative Formats
- **`read_csv(filename, ...)`**: Import from CSV files
- **`read_excel(filename, ...)`**: Import from Excel files
- **`read_hdf(filename, key)`**: Read from HDF5 files
- **`read_loom(filename, ...)`**: Import from .loom files
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
- **`read_text(filename, ...)`**: Import from text files
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format

#### Element-Level Access
- **`read_elem(elem)`**: Retrieve a specific component from storage
- **`sparse_dataset(group)`**: Create a backed sparse matrix from an on-disk group

### Combining Operations
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
  - **axis**: 0 (observations) or 1 (variables)
  - **join**: 'inner' (intersection) or 'outer' (union)
  - **merge**: Strategy for the non-concatenation axis ('same', 'unique', 'first', 'only', or None)
  - **label**: Column name for source tracking
  - **keys**: Dataset identifiers for source annotation
  - **index_unique**: Separator for making duplicate indices unique

### Writing Functions
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
- **`write_zarr(store, adata, ...)`**: Save as hierarchical Zarr arrays
- **`write_elem(store, k, elem)`**: Write an individual component to a storage group

### Experimental Features
- **`AnnCollection`**: Batch processing for large collections
- **`AnnLoader`**: PyTorch DataLoader integration
- **`concat_on_disk(in_files, out_file, ...)`**: Memory-efficient out-of-core concatenation
- **`read_lazy(filename)`**: Lazy loading with deferred computation
- **`read_dispatched(store, callback)`**: Custom reading with callbacks
- **`write_dispatched(store, key, elem, callback)`**: Custom writing with callbacks

### Configuration
- **`settings`**: Package-wide configuration object
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes

## Common Usage Patterns

### Creating AnnData Objects

```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

# From dense array
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)

# From sparse matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)

# With metadata
import pandas as pd
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```

### Subsetting

```python
# By names (assumes obs_names/var_names are set accordingly)
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]

# By boolean mask
b_cells = adata[adata.obs.cell_type == 'B']

# By position
first_five = adata[:5, :100]

# Convert view to copy
adata_copy = adata[:5].copy()
```

### Adding Annotations

```python
# Cell-level metadata
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)

# Gene-level metadata
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)

# Embeddings
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))

# Alternative data representations (dense X assumed)
adata.layers['log_transformed'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)

# Unstructured metadata
adata.uns['experiment_date'] = '2024-01-15'
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
```

### File I/O

```python
# Write to disk
adata.write('my_results.h5ad', compression='gzip')

# Read into memory
adata = ad.read_h5ad('my_results.h5ad')

# Read in backed mode (memory-efficient)
adata = ad.read_h5ad('my_results.h5ad', backed='r')

# Close the file connection
adata.file.close()
```

### Concatenation

```python
# Combine multiple datasets
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))

# Simple concatenation
combined = ad.concat([adata1, adata2, adata3], axis=0)

# With source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Inner join (only shared variables)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')

# Outer join (all variables, missing entries padded with zeros)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
```

scientific-packages/anndata/references/concatenation_guide.md
# AnnData Concatenation Guide

## Overview

The `concat()` function combines multiple AnnData objects through two fundamental operations:
1. **Concatenation**: stacking the elements along the chosen axis, in the order given
2. **Merging**: combining the elements aligned to the alternative (non-concatenation) axis into a single result

## Basic Concatenation

### Syntax
```python
import anndata as ad

combined = ad.concat(
    adatas,              # List of AnnData objects
    axis=0,              # 0=observations, 1=variables
    join='inner',        # 'inner' or 'outer'
    merge=None,          # Merge strategy for non-concat axis
    label=None,          # Column name for source tracking
    keys=None,           # Dataset identifiers
    index_unique=None,   # Separator for unique indices
    fill_value=None,     # Fill value for missing data
    pairwise=False       # Include pairwise matrices
)
```

### Concatenating Observations (Cells)
```python
import numpy as np

# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))

combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```

### Concatenating Variables (Genes)
```python
# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))

combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```

## Join Strategies

### Inner Join (Intersection)

Keeps only features shared across all objects.

```python
import pandas as pd

# Datasets with different genes
adata1 = ad.AnnData(
    np.random.rand(100, 2000),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
    np.random.rand(150, 1800),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)

# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```

**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise the analysis
- You need a complete-case analysis

**Trade-offs:**
- May lose many features
- Ensures no missing data
- Smaller result size

### Outer Join (Union)

Keeps all features from all objects, padding missing entries with fill values (default 0).

```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0

# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```

**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent

**Trade-offs:**
- Introduces zeros/missing values
- Larger result size
- May need imputation
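
The index arithmetic behind the two join strategies can be checked with plain pandas, independent of anndata (gene names as in the examples above):

```python
import pandas as pd

genes_a = pd.Index([f"Gene_{i}" for i in range(2000)])       # first dataset's vars
genes_b = pd.Index([f"Gene_{i}" for i in range(200, 2000)])  # second dataset's vars

inner = genes_a.intersection(genes_b)  # features kept by join='inner'
outer = genes_a.union(genes_b)         # features kept by join='outer'

print(len(inner))  # 1800 shared genes
print(len(outer))  # 2000 genes total
```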

## Merge Strategies

Merge strategies control how elements on the non-concatenation axis are combined.

### merge=None (Default)

Excludes all non-concatenation axis elements.

```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800

# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```

**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging

### merge='same'

Keeps only annotations with identical values across datasets.

```python
# Annotation values agree on the shared genes: in both objects,
# Gene_200 through Gene_999 are chr1 and Gene_1000 through Gene_1999 are chr2
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000

# merge='same': keeps the chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```

**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important

**Note:** Comparison occurs after index alignment, so only the values at shared indices need to match.
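
The alignment rule can be reproduced with pandas alone: `merge='same'` keeps an annotation only if the per-feature Series agree at every shared index (gene names and values here are illustrative):

```python
import pandas as pd

chrom_a = pd.Series(
    ['chr1'] * 1000 + ['chr2'] * 1000,
    index=[f'Gene_{i}' for i in range(2000)],
)
chrom_b = pd.Series(
    ['chr1'] * 800 + ['chr2'] * 1000,
    index=[f'Gene_{i}' for i in range(200, 2000)],
)

# Align on the shared index first, then compare values
shared = chrom_a.index.intersection(chrom_b.index)
agrees = chrom_a.loc[shared].equals(chrom_b.loc[shared])
print(agrees)  # True: the annotation would survive merge='same'
```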

### merge='unique'

Includes annotations for which each index has a single possible value.

```python
# Unique, consistent identifiers per gene
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]

# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```

**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific

### merge='first'

Takes the first occurrence of each annotation.

```python
# Different annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800

# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
```

**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple resolution strategy

### merge='only'

Retains annotations appearing in exactly one object.

```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800

# merge='only': keeps both (no conflicts)
combined = ad.concat([adata1, adata2], merge='only')
```

**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata

## Source Tracking

### Using label

Add a categorical column to track data origin.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch'
)

# Creates obs['batch'] with values 0, 1, 2
print(combined.obs['batch'].cat.categories)  # ['0', '1', '2']
```

### Using keys

Provide custom names for source tracking.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='study',
    keys=['control', 'treatment_a', 'treatment_b']
)

# Creates obs['study'] with custom names
print(combined.obs['study'].unique())  # ['control', 'treatment_a', 'treatment_b']
```

### Making Indices Unique

Append source identifiers to duplicate observation names.

```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]

# index_unique adds a suffix
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)

# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```

## Handling Different Attributes

### X Matrix and Layers

These follow the join strategy; missing values are filled according to `fill_value`.

```python
# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()

# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```

### obs and var DataFrames

- **obs**: Concatenated along the concatenation axis
- **var**: Handled by the merge strategy

```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150

combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells
```

### obsm and varm

Multi-dimensional annotations follow the same rules as layers.

```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```

### obsp and varp

Pairwise matrices are excluded by default. Enable with `pairwise=True`.

```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)

# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp

# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in a padded block-diagonal matrix
```
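
The block-diagonal shape that `pairwise=True` produces can be sketched directly with scipy, without anndata: each dataset's pairwise matrix lands on the diagonal, and cross-dataset entries stay zero.

```python
import numpy as np
from scipy.sparse import block_diag

rng = np.random.default_rng(0)
d1 = rng.random((100, 100))   # first dataset's pairwise matrix
d2 = rng.random((150, 150))   # second dataset's pairwise matrix

# What concat(..., pairwise=True) conceptually builds
combined = block_diag((d1, d2), format="csr")

print(combined.shape)  # (250, 250)
# The cross-dataset blocks are all zeros
assert combined[:100, 100:].nnz == 0
assert combined[100:, :100].nnz == 0
```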

### uns Dictionary

Merged recursively, applying the merge strategy at every nesting depth.

```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}

# merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], merge='same')
# combined.uns['experiment'] == {'lab': 'A'}
```
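
The recursive 'same' rule amounts to a nested dictionary intersection; a minimal pure-Python sketch (the helper name `same_merge` is illustrative, not anndata's API):

```python
def same_merge(dicts):
    """Keep keys whose values are identical in every dict, recursing into
    nested dicts (a simplified sketch of uns merging with merge='same')."""
    keys = set(dicts[0]).intersection(*[set(d) for d in dicts[1:]])
    out = {}
    for k in sorted(keys):
        vals = [d[k] for d in dicts]
        if all(isinstance(v, dict) for v in vals):
            nested = same_merge(vals)   # recurse into nested mappings
            if nested:
                out[k] = nested
        elif all(v == vals[0] for v in vals[1:]):
            out[k] = vals[0]            # identical everywhere: keep
    return out

u1 = {'experiment': {'date': '2024-01', 'lab': 'A'}}
u2 = {'experiment': {'date': '2024-02', 'lab': 'A'}}
print(same_merge([u1, u2]))  # {'experiment': {'lab': 'A'}}
```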

## Advanced Patterns

### Batch Integration Pipeline

```python
import anndata as ad

# Load batches
batches = [
    ad.read_h5ad(f'batch_{i}.h5ad')
    for i in range(5)
]

# Concatenate with tracking
combined = ad.concat(
    batches,
    axis=0,
    join='outer',
    merge='first',
    label='batch_id',
    keys=[f'batch_{i}' for i in range(5)],
    index_unique='-'
)

# Encode batches numerically
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```

### Multi-Study Meta-Analysis

```python
# Different studies with varying gene coverage
studies = {
    'study_a': ad.read_h5ad('study_a.h5ad'),
    'study_b': ad.read_h5ad('study_b.h5ad'),
    'study_c': ad.read_h5ad('study_c.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(studies.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(studies.keys()),
    merge='unique',
    fill_value=0
)

# Track coverage
for study in studies:
    n_genes = studies[study].n_vars
    combined.uns[f'{study}_n_genes'] = n_genes
```

### Incremental Concatenation

```python
# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]

# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
    chunk_files = all_files[i:i + chunk_size]
    chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
    chunk_combined = ad.concat(chunk_adatas)

    if result is None:
        result = chunk_combined
    else:
        result = ad.concat([result, chunk_combined])
```

### Memory-Efficient On-Disk Concatenation

```python
# Experimental feature for large datasets
from anndata.experimental import concat_on_disk

files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
    files,
    'combined.h5ad',
    join='outer'
)

# Read the result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```

## Troubleshooting

### Issue: Dimension Mismatch

```python
# Problem: datasets with different variable dimensions
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))

# Solution: use an outer join to keep all variables
combined = ad.concat([adata1, adata2], join='outer')
```

### Issue: Memory Error

```python
# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]

# Solution: read and concatenate incrementally
result = None
for file in many_files:
    adata = ad.read_h5ad(file)
    if result is None:
        result = adata
    else:
        result = ad.concat([result, adata])
        del adata  # Free memory
```

### Issue: Duplicate Indices

```python
# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
```

### Issue: Lost Annotations

```python
# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2

combined = ad.concat([adata1, adata2])  # merge=None by default drops them
# Solution: use an appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```

## Performance Tips

1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups for many datasets
4. **Choose inner join**: When possible, to reduce result size
5. **Use categoricals**: Convert string annotations before concatenating
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
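
Tip 1 in practice often means normalizing feature names before concatenating, e.g. stripping Ensembl version suffixes so the same gene matches across datasets (a pandas sketch; the suffix pattern is an assumption about your data):

```python
import pandas as pd

# Two studies naming the same genes with different version suffixes
vars_a = pd.Index(['ENSG00000000003.14', 'ENSG00000000005.6'])
vars_b = pd.Index(['ENSG00000000003.15', 'ENSG00000000005.5'])

# Without normalization nothing is shared; an inner join would be empty
assert len(vars_a.intersection(vars_b)) == 0

# Strip the version suffix before concatenating
strip = lambda idx: idx.str.replace(r'\.\d+$', '', regex=True)
vars_a, vars_b = strip(vars_a), strip(vars_b)
print(list(vars_a.intersection(vars_b)))
# ['ENSG00000000003', 'ENSG00000000005']
```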

# AnnData Workflows and Best Practices

## Common Workflows

### 1. Single-Cell RNA-seq Analysis Workflow

#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd

# Load from 10X format (the .mtx matrix is genes × cells, so transpose)
adata = ad.read_mtx('matrix.mtx').T
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]

# Or load from a pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```

#### Quality Control
```python
# Calculate QC metrics (dense X assumed)
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)

# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]

# Filter genes
min_cells = 3
adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
```
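
If X is a scipy sparse matrix rather than a dense array, the sums above return `np.matrix` objects that don't drop into `obs` cleanly; the same QC metrics can be computed sparse-safely like this (toy data for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
X = csr_matrix(rng.poisson(0.1, size=(5, 8)))  # toy cells × genes counts

# .sum on sparse returns a np.matrix; flatten it to a 1D array
n_genes = np.asarray((X > 0).sum(axis=1)).ravel()        # genes per cell
total_counts = np.asarray(X.sum(axis=1)).ravel()         # counts per cell
cells_per_gene = np.asarray((X > 0).sum(axis=0)).ravel() # cells per gene

assert n_genes.shape == (5,) and cells_per_gene.shape == (8,)
```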

#### Normalization and Preprocessing
```python
# Store raw counts
adata.layers['counts'] = adata.X.copy()

# Normalize to counts per 10,000 per cell
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4

# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']

# Identify highly variable genes (simple variance cutoff)
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```
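
The normalization step above rescales each cell so that its counts sum to 10,000 (CP10K); that invariant is easy to verify on a toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(4, 6)).astype(float) + 1  # avoid all-zero cells
total = counts.sum(axis=1)

# Same arithmetic as the workflow: divide by per-cell total, scale by 1e4
normalized = counts / total[:, None] * 1e4

# Every cell now sums to exactly 10,000
assert np.allclose(normalized.sum(axis=1), 1e4)
```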

#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)

# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}

# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```

#### Clustering
```python
# Store cluster assignments (one label per cell)
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])

# Store cluster centroids; varm rows must align with variables,
# so store the (n_clusters × n_vars) centroid matrix transposed
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids.T
```

#### Save Results
```python
# Save the complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```

### 2. Batch Integration Workflow

```python
import anndata as ad

# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    index_unique='-'
)

# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)

# Store corrected embeddings
adata.obsm['X_pca_corrected'] = corrected_pca
adata.obsm['X_umap_corrected'] = corrected_umap
```

### 3. Memory-Efficient Large Dataset Workflow

```python
import anndata as ad

# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")

# Work with row-wise chunks
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process chunk (rows start..end)
    result = process_chunk(chunk)

# Close the file when done
adata.file.close()
```

### 4. Multi-Dataset Comparison Workflow

```python
import anndata as ad

# Load datasets
datasets = {
    'study1': ad.read_h5ad('study1.h5ad'),
    'study2': ad.read_h5ad('study2.h5ad'),
    'study3': ad.read_h5ad('study3.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(datasets.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(datasets.keys()),
    merge='first'
)

# Handle missing data (dense X assumed)
combined.X[np.isnan(combined.X)] = 0

# Add dataset-specific metadata
combined.uns['datasets'] = {
    'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
    'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
    'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```

## Best Practices

### Memory Management

#### Use Sparse Matrices
```python
from scipy.sparse import csr_matrix

# Convert to sparse if the data is mostly zeros
density = (adata.X != 0).mean()  # fraction of non-zero entries
if density < 0.3:  # Less than 30% non-zero
    adata.X = csr_matrix(adata.X)
```
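
The memory payoff is easy to quantify: a CSR matrix costs roughly one value plus one column index per non-zero entry, instead of one value per cell of the full matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.random((1000, 1000))
dense[dense > 0.05] = 0.0        # keep ~5% non-zero, typical for scRNA-seq

sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"dense: {dense_bytes:,} B, sparse: {sparse_bytes:,} B")
assert sparse_bytes < dense_bytes / 5   # large savings at ~5% density
```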

#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Only load what you need
subset = adata[:1000, :500].copy()  # Now in memory
```

#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()

# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
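
Why this helps: a categorical stores each distinct string once plus a small integer code per row, so columns with few distinct values shrink dramatically. A quick pandas check:

```python
import pandas as pd

# ~100,000 rows but only three distinct labels
labels = pd.Series(['B cell', 'T cell', 'Monocyte'] * 33_000 + ['B cell'])

as_object = labels.memory_usage(deep=True)
as_categorical = labels.astype('category').memory_usage(deep=True)

print(f"object: {as_object:,} B, categorical: {as_categorical:,} B")
assert as_categorical < as_object / 10
```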

### Data Organization

#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```

#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates

# Gene loadings
adata.varm['PCs'] = principal_components
```

#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'min_cells': 3,
    'date': '2024-01-15'
}

# Store analysis results
adata.uns['differential_expression'] = {
    'method': 't-test',
    'p_value_threshold': 0.05
}
```

### Subsetting and Views

#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell']  # View
print(subset.is_view)  # True

# Views are memory-efficient and copy-on-write: assigning to a view
# raises an ImplicitModificationWarning and silently converts it to a
# full copy, leaving the original adata unchanged
subset.obs['new_column'] = value
print(subset.is_view)  # False after modification

# Create an independent copy up front when you intend to modify
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```

#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()

# Good - combine the masks and subset once
result = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
].copy()
```
|
||||
|
||||
### File I/O

#### Use Compression

```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```

#### Choose the Right Format

```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')

# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')

# Loom for compatibility with other tools
adata.write_loom('data.loom')
```

#### Close File Connections

```python
# Use a try/finally pattern so the file handle is always released
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
    # Work with data
    process(adata)
finally:
    adata.file.close()
```

### Concatenation

#### Choose Appropriate Join Strategy

```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join - all features (keeps all data, may introduce zeros)
combined = ad.concat([adata1, adata2], join='outer')
```

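The inner/outer semantics follow the same logic as pandas concatenation, so the trade-off can be previewed on plain DataFrames (a standalone sketch; note that pandas fills missing entries with NaN, while `ad.concat` may introduce NaN or zeros depending on the matrix type):

```python
import pandas as pd

# Two "datasets" with partially overlapping features
df1 = pd.DataFrame({'geneA': [1, 2], 'geneB': [3, 4]})
df2 = pd.DataFrame({'geneB': [5], 'geneC': [6]})

inner = pd.concat([df1, df2], join='inner', ignore_index=True)
outer = pd.concat([df1, df2], join='outer', ignore_index=True)

print(list(inner.columns))  # ['geneB'] - only shared features survive
print(list(outer.columns))  # ['geneA', 'geneB', 'geneC'] - missing entries become NaN
```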
#### Track Data Sources

```python
# Add source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Make indices unique
combined = ad.concat(
    [adata1, adata2, adata3],
    index_unique='-'
)
```

#### Handle Variable-Specific Metadata

```python
# Use a merge strategy for var annotations
combined = ad.concat(
    [adata1, adata2],
    merge='same', # Keep only identical annotations
    join='outer'
)
```

### Naming Conventions

#### Use Consistent Naming

```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']

# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']

# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```

#### Make Indices Unique

```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```

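These methods de-duplicate by appending a numeric suffix to repeated names after the first occurrence. A minimal re-implementation of that scheme for illustration — `make_unique` is a hypothetical helper, not part of the anndata API, and the real methods handle additional edge cases (such as suffixed names that already exist):

```python
def make_unique(names, join='-'):
    """Append '-1', '-2', ... to repeated names, keeping the first as-is."""
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}{join}{count}")
        seen[name] = count + 1
    return out

print(make_unique(['MT-CO1', 'ACTB', 'ACTB', 'ACTB']))
# ['MT-CO1', 'ACTB', 'ACTB-1', 'ACTB-2']
```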
### Error Handling

#### Validate Data Structure

```python
import numpy as np
from scipy import sparse

# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"

# np.isnan() raises a TypeError on sparse matrices, so inspect the
# stored values directly (implicit zeros are never NaN or negative)
values = adata.X.data if sparse.issparse(adata.X) else adata.X

# Check for NaN values
if np.isnan(values).any():
    print("Warning: NaN values detected")

# Check for negative values in count data
if (values < 0).any():
    print("Warning: Negative values in count data")
```

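The same checks can be wrapped into a reusable helper that accepts either dense or sparse input (a sketch; `validate_matrix` is an illustrative name, not an anndata API):

```python
import numpy as np
from scipy import sparse

def validate_matrix(X):
    """Return a list of warnings for NaNs or negative entries in X."""
    values = X.data if sparse.issparse(X) else np.asarray(X)
    warnings = []
    if np.isnan(values).any():
        warnings.append("NaN values detected")
    if (values < 0).any():
        warnings.append("Negative values in count data")
    return warnings

dense = np.array([[1.0, -2.0], [np.nan, 3.0]])
print(validate_matrix(dense))
# ['NaN values detected', 'Negative values in count data']
print(validate_matrix(sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])))
# []
```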
#### Handle Missing Data

```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
    print("Warning: Missing cell type annotations")
    # Fill or remove
    adata = adata[~adata.obs['cell_type'].isna()]
```

## Common Pitfalls

### 1. Forgetting to Copy Views

```python
# RISKY - relies on implicit view behavior
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data # copy-on-write silently turns subset into a
                            # copy (with an ImplicitModificationWarning),
                            # and in-place edits to arrays taken from a
                            # view can still touch the original adata

# GOOD - make the copy explicit
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data # Only changes subset
```

### 2. Mixing Backed and In-Memory Operations

```python
# BAD - trying to modify backed data
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100 # Error: can't modify data backed in read-only mode

# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100 # Works
```

### 3. Not Using Categoricals for Metadata

```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000

# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```

### 4. Incorrect Concatenation Axis

```python
# Concatenating observations (cells) - the common case
combined = ad.concat([adata1, adata2], axis=0) # Correct

# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1) # Less common
```

### 5. Not Preserving Raw Data

```python
# BAD - loses original data
adata.X = normalized_data

# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```