mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
Improve the anndata skill
@@ -1,527 +1,394 @@
---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---
|
||||
# AnnData
|
||||
|
||||
## Overview
|
||||
|
||||
AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. It stores experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

## When to Use This Skill
|
||||
Use this skill when:
|
||||
- Creating, reading, or writing AnnData objects
|
||||
- Working with h5ad, zarr, or other genomics data formats
|
||||
- Performing single-cell RNA-seq analysis
|
||||
- Managing large datasets with sparse matrices or backed mode
|
||||
- Concatenating multiple datasets or experimental batches
|
||||
- Subsetting, filtering, or transforming annotated data
|
||||
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
|
||||
|
||||
## Installation
|
||||
```bash
|
||||
pip install anndata
|
||||
|
||||
# With optional dependencies
|
||||
pip install anndata[dev,test,doc]
|
||||
```
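To confirm the installation, a quick check (assumes a standard Python environment):

```python
import anndata as ad

print(ad.__version__)  # e.g. '0.10.x'
```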
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Creating an AnnData object
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# From dense or sparse arrays
|
||||
counts = np.random.poisson(1, size=(100, 2000))
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# With sparse matrix (memory-efficient)
|
||||
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
|
||||
adata = ad.AnnData(counts)
|
||||
```
|
||||
|
||||
**With metadata:**
```python
import pandas as pd

# Observation (cell) annotations: one row per cell
obs = pd.DataFrame({
    'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
    'batch': ['batch1'] * 50 + ['batch2'] * 50
}, index=[f'cell_{i}' for i in range(100)])

# Variable (gene) annotations: one row per gene
var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)],
    'highly_variable': np.random.choice([True, False], 2000)
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=counts, obs=obs, var=var)
```
|
||||
|
||||
**Understanding the structure:**
|
||||
- **X**: Primary data matrix (observations × variables)
|
||||
- **obs**: Row (observation) annotations as DataFrame
|
||||
- **var**: Column (variable) annotations as DataFrame
|
||||
- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
|
||||
- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
|
||||
- **layers**: Alternative data matrices with same dimensions as X
|
||||
- **uns**: Unstructured metadata dictionary
|
||||
- **obsp/varp**: Pairwise relationship matrices (graphs)
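To make this layout concrete, here is a minimal sketch (shapes and key names are illustrative):

```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.random.rand(100, 2000))

adata.layers['log1p'] = np.log1p(adata.X)          # same shape as X
adata.obsm['X_pca'] = np.zeros((adata.n_obs, 50))  # one row per observation
adata.varm['PCs'] = np.zeros((adata.n_vars, 50))   # one row per variable
adata.uns['note'] = 'free-form metadata'

print(adata)  # the summary lists obs, var, uns, obsm, varm, layers
```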
|
||||
|
||||
### 2. Adding Annotations and Layers
|
||||
|
||||
Organize different data representations and metadata within a single object.
|
||||
|
||||
**Cell-level metadata (obs):**
```python
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)
```

**Gene-level metadata (var):**
```python
adata.var['highly_variable'] = gene_variance > threshold
adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])
```
|
||||
|
||||
**Embeddings (obsm/varm):**
|
||||
```python
|
||||
# Dimensionality reduction results
|
||||
adata.obsm['X_pca'] = pca_coordinates # Shape: (n_obs, n_components)
|
||||
adata.obsm['X_umap'] = umap_coordinates # Shape: (n_obs, 2)
|
||||
adata.obsm['X_tsne'] = tsne_coordinates
|
||||
|
||||
# Gene loadings
|
||||
adata.varm['PCs'] = principal_components # Shape: (n_vars, n_components)
|
||||
```
|
||||
|
||||
**Alternative data representations (layers):**
|
||||
```python
|
||||
# Store multiple versions
|
||||
adata.layers['counts'] = raw_counts
|
||||
adata.layers['log1p'] = np.log1p(adata.X)
|
||||
adata.layers['scaled'] = (adata.X - mean) / std
|
||||
```
|
||||
|
||||
**Unstructured metadata (uns):**
|
||||
```python
|
||||
# Analysis parameters
|
||||
adata.uns['preprocessing'] = {
|
||||
'normalization': 'TPM',
|
||||
'min_genes': 200,
|
||||
'date': '2024-01-15'
|
||||
}
|
||||
|
||||
# Results
|
||||
adata.uns['pca'] = {'variance_ratio': variance_explained}
|
||||
```
|
||||
|
||||
### 3. Subsetting and Views
|
||||
|
||||
Efficiently subset data while managing memory through views and copies.
|
||||
|
||||
**Subsetting operations:**
|
||||
```python
|
||||
# By observation/variable names
|
||||
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
|
||||
|
||||
# By boolean masks
|
||||
b_cells = adata[adata.obs.cell_type == 'B']
|
||||
high_quality = adata[adata.obs.n_genes > 200]
|
||||
|
||||
# By position
|
||||
first_cells = adata[:100, :]
|
||||
top_genes = adata[:, :500]
|
||||
|
||||
# Combined conditions
|
||||
filtered = adata[
|
||||
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
|
||||
adata.var.highly_variable
|
||||
]
|
||||
```
|
||||
|
||||
**Understanding views:**
|
||||
- Subsetting returns **views** by default (memory-efficient, shares data with original)
|
||||
- Modifying a view affects the original object
|
||||
- Check with `adata.is_view`
|
||||
- Convert to independent copy with `.copy()`
|
||||
|
||||
```python
|
||||
# View (memory-efficient)
|
||||
subset = adata[adata.obs.condition == 'treated']
|
||||
print(subset.is_view) # True
|
||||
|
||||
# Independent copy
|
||||
subset_copy = adata[adata.obs.condition == 'treated'].copy()
|
||||
print(subset_copy.is_view) # False
|
||||
```
|
||||
|
||||
### 4. File I/O and Backed Mode
|
||||
|
||||
Read and write data efficiently, with options for memory-limited environments.
|
||||
|
||||
**Writing data:**
|
||||
```python
|
||||
# Standard format with compression
|
||||
adata.write('results.h5ad', compression='gzip')
|
||||
|
||||
# Alternative formats
|
||||
adata.write_zarr('results.zarr') # For cloud storage
|
||||
adata.write_loom('results.loom') # For compatibility
|
||||
adata.write_csvs('results/') # As CSV files
|
||||
```
|
||||
|
||||
**Reading data:**
|
||||
```python
|
||||
# Load into memory
|
||||
adata = ad.read_h5ad('results.h5ad')
|
||||
|
||||
# Backed mode (disk-backed, memory-efficient)
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
print(adata.isbacked) # True
|
||||
print(adata.filename) # Path to file
|
||||
|
||||
# Close file connection when done
|
||||
adata.file.close()
|
||||
```
|
||||
|
||||
**Reading from other formats:**
|
||||
```python
|
||||
# 10X / Matrix Market format
adata = ad.read_mtx('matrix.mtx')

# CSV
adata = ad.read_csv('data.csv')

# Loom
adata = ad.read_loom('data.loom')

# 10X HDF5 (reading Cell Ranger .h5 files is provided by scanpy)
# import scanpy as sc
# adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
|
||||
```
|
||||
|
||||
**Working with backed mode:**
```python
# Read in backed mode for large files
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Process in chunks
for chunk in adata.chunk_X(chunk_size=1000):
    result = process_chunk(chunk)

# Load to memory if needed
adata_memory = adata.to_memory()
```
|
||||
|
||||
### 5. Concatenating Multiple Datasets
|
||||
|
||||
Combine multiple AnnData objects with control over how data is merged.
|
||||
|
||||
**Basic concatenation:**
```python
# Concatenate observations (most common)
combined = ad.concat([adata1, adata2, adata3], axis=0)

# Concatenate variables (rare)
combined = ad.concat([adata1, adata2], axis=1)
```
|
||||
|
||||
**Join strategies:**
```python
# Inner join: only shared variables (no missing data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join: all variables (fills missing with 0)
combined = ad.concat([adata1, adata2], join='outer')
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

**See**: `references/data_structure.md` for comprehensive information on:
|
||||
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
|
||||
- Creating AnnData objects from various sources
|
||||
- Accessing and manipulating data components
|
||||
- Memory-efficient practices
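By analogy with the other capability sections, a few common commands for inspecting the structure (a hedged sketch; an existing `adata` object is assumed):

```python
# Dimensions and components
print(adata.shape)           # (n_obs, n_vars)
print(adata.obs.columns)     # observation annotations
print(adata.var.columns)     # variable annotations
print(list(adata.layers))    # alternative matrices
print(list(adata.obsm))      # multi-dimensional annotations, e.g. 'X_pca'

# Remove a component that is no longer needed
del adata.layers['scaled']
```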
|
||||
|
||||
### 2. Input/Output Operations
|
||||
|
||||
Read and write data in various formats with support for compression, backed mode, and cloud storage.
|
||||
|
||||
**See**: `references/io_operations.md` for details on:
|
||||
- Native formats (h5ad, zarr)
|
||||
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
|
||||
- Backed mode for large datasets
|
||||
- Remote data access
|
||||
- Format conversion
|
||||
- Performance optimization
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Read/write h5ad
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
adata.write_h5ad('output.h5ad', compression='gzip')
|
||||
|
||||
# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
|
||||
|
||||
# Read MTX format
|
||||
adata = ad.read_mtx('matrix.mtx').T
|
||||
```
|
||||
|
||||
### 3. Concatenation
|
||||
|
||||
Combine multiple AnnData objects along observations or variables with flexible join strategies.
|
||||
|
||||
**See**: `references/concatenation.md` for comprehensive coverage of:
|
||||
- Basic concatenation (axis=0 for observations, axis=1 for variables)
|
||||
- Join types (inner, outer)
|
||||
- Merge strategies (same, unique, first, only)
|
||||
- Tracking data sources with labels
|
||||
- Lazy concatenation (AnnCollection)
|
||||
- On-disk concatenation for large datasets
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Concatenate observations (combine samples), with source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)
# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'

# Make duplicate indices unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.
|
||||
```
|
||||
|
||||
**Merge strategies for metadata:**
|
||||
```python
|
||||
# merge=None: exclude variable annotations (default)
|
||||
combined = ad.concat([adata1, adata2], merge=None)
|
||||
|
||||
# merge='same': keep only identical annotations
|
||||
combined = ad.concat([adata1, adata2], merge='same')
|
||||
|
||||
# merge='first': use first occurrence
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
|
||||
# merge='unique': keep annotations with single value
|
||||
combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
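To make the merge behaviour concrete, a brief hedged sketch of what ends up in `combined.var` (it reuses the `adata1`/`adata2` objects from the examples above; column names are illustrative):

```python
adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(adata1.n_vars)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(adata2.n_vars)]
adata1.var['source'] = 'reference_v1'
adata2.var['source'] = 'reference_v2'   # differs between the objects

same = ad.concat([adata1, adata2], merge='same')
print(same.var.columns)  # 'gene_id' kept, 'source' dropped (values differ)

first = ad.concat([adata1, adata2], merge='first')
print(first.var['source'].iloc[0])  # 'reference_v1' (first occurrence wins)
```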
|
||||
|
||||
**Complete example:**
```python
# Load batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with full tracking
combined = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    join='outer',             # Keep all genes
    merge='first',            # Use first batch's annotations
    label='batch_id',         # Track source
    keys=['b1', 'b2', 'b3'],  # Custom labels
    index_unique='-'          # Make cell names unique
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
    [adata1, adata2],
    join_obs='outer',
    label='dataset'
)
```
|
||||
|
||||
### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently, and convert between AnnData and other formats for interoperability.
|
||||
**See**: `references/manipulation.md` for detailed guidance on:
|
||||
- Subsetting (by indices, names, boolean masks, metadata conditions)
|
||||
- Transposition
|
||||
- Copying (full copies vs views)
|
||||
- Renaming (observations, variables, categories)
|
||||
- Type conversions (strings to categoricals, sparse/dense)
|
||||
- Adding/removing data components
|
||||
- Reordering
|
||||
- Quality control filtering
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Convert X to DataFrame
|
||||
df = adata.to_df()
|
||||
# Subset by metadata
|
||||
filtered = adata[adata.obs['quality_score'] > 0.8]
|
||||
hv_genes = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Convert specific layer
|
||||
df = adata.to_df(layer='log1p')
|
||||
# Transpose
|
||||
adata_T = adata.T
|
||||
|
||||
# Copy vs view
|
||||
view = adata[0:100, :] # View (lightweight reference)
|
||||
copy = adata[0:100, :].copy() # Independent copy
|
||||
|
||||
# Convert strings to categoricals
|
||||
adata.strings_to_categoricals()
|
||||
```
|
||||
|
||||
**Extract vectors:**
|
||||
```python
|
||||
# Get 1D arrays from data or annotations
|
||||
gene_expression = adata.obs_vector('Gene_100')
|
||||
cell_metadata = adata.obs_vector('n_genes')
|
||||
```
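`var_vector` is the variable-axis counterpart. A hedged sketch (the observation name and column name are illustrative):

```python
# Values of one observation across all variables (a row of X)
one_cell_profile = adata.var_vector('Cell_5')

# A column taken from adata.var
per_gene_flag = adata.var_vector('highly_variable')
```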
|
||||
|
||||
**Transpose:**
|
||||
```python
|
||||
# Swap observations and variables
|
||||
transposed = adata.T
|
||||
```
|
||||
|
||||
### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.
|
||||
|
||||
**See**: `references/best_practices.md` for guidelines on:
|
||||
- Memory management (sparse matrices, categoricals, backed mode)
|
||||
- Views vs copies
|
||||
- Data storage optimization
|
||||
- Performance optimization
|
||||
- Working with raw data
|
||||
- Metadata management
|
||||
- Reproducibility
|
||||
- Error handling
|
||||
- Integration with other tools
|
||||
- Common pitfalls and solutions
|
||||
|
||||
Key recommendations:
|
||||
```python
|
||||
# Use sparse matrices for sparse data
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Check sparsity
|
||||
density = (adata.X != 0).sum() / adata.X.size
|
||||
if density < 0.3: # Less than 30% non-zero
|
||||
adata.X = csr_matrix(adata.X)
|
||||
```
|
||||
|
||||
**Convert strings to categoricals:**
```python
# Automatic conversion
adata.strings_to_categoricals()

# Manual conversion (more control)
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```

**Store raw before filtering:**
```python
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
|
||||
|
||||
**Use backed mode:**
```python
# Read without loading into memory
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Work with subsets
subset = adata[:1000, :500].copy()  # Only this subset in memory
```

## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Preprocessing
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
|
||||
|
||||
# Dimensionality reduction
|
||||
sc.pp.pca(adata, n_comps=50)
|
||||
sc.pp.neighbors(adata, n_neighbors=15)
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata)
|
||||
|
||||
# Visualization
|
||||
sc.pl.umap(adata, color=['cell_type', 'leiden'])
|
||||
```
|
||||
|
||||
**Chunked processing:**
```python
# Process data in chunks
results = []
for chunk in adata.chunk_X(chunk_size=1000):
    result = expensive_computation(chunk)
    results.append(result)
```

### Muon (Multimodal data)
```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
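A brief usage note (hedged): the individual modalities inside a `MuData` container remain ordinary AnnData objects.

```python
rna = mdata.mod['rna']          # access a modality by name
print(mdata.n_obs, rna.n_vars)  # observations are shared across modalities
```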
|
||||
|
||||
### PyTorch integration
|
||||
```python
|
||||
from anndata.experimental import AnnLoader
|
||||
|
||||
# Create DataLoader for deep learning
|
||||
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
|
||||
|
||||
for batch in dataloader:
|
||||
X = batch.X
|
||||
# Train model
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Single-Cell RNA-seq Analysis

Complete workflow from loading to analysis:

```python
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc

# 1. Load data (10X HDF5 via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw data and counts
adata.raw = adata.copy()
adata.layers['counts'] = adata.X.copy()

# 4. Normalize and select highly variable genes
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad', compression='gzip')
```
|
||||
|
||||
### Batch Integration

Combining multiple experimental batches:

```python
import scanpy as sc

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner',
    index_unique='-'
)

# Add batch as numeric codes for correction algorithms that need them
adata.obs['batch_numeric'] = adata.obs['batch'].cat.codes

# Apply batch correction
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Save integrated data
adata.write_h5ad('integrated.h5ad', compression='gzip')
```
|
||||
|
||||
### Memory-Efficient Large Dataset Processing

Working with datasets too large for memory:

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only the filtered subset into memory
adata_subset = high_quality.to_memory()
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

# Close the file connection when done
adata.file.close()
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
|
||||
### Data Organization
|
||||
|
||||
1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
|
||||
2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
|
||||
3. **Use uns for metadata**: Analysis parameters, dates, version information
|
||||
4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
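A combined sketch of these organization recommendations (hedged; `raw_counts`, `umap_coordinates`, and the parameter values reuse illustrative names from the examples above):

```python
adata.layers['counts'] = raw_counts                                        # alternative representation
adata.obsm['X_umap'] = umap_coordinates                                    # multi-dimensional annotation
adata.uns['preprocessing'] = {'normalization': 'TPM', 'min_genes': 200}    # analysis parameters
adata.obs['condition'] = adata.obs['condition'].astype('category')         # categorical metadata
```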
|
||||
|
||||
### Subsetting
|
||||
|
||||
1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
|
||||
2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
|
||||
3. **Validate after subsetting**: Check dimensions and data integrity
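For example, a hedged sketch of combining conditions in one call and then validating the result:

```python
mask = (adata.obs['batch'] == 'batch1') & (adata.obs['n_genes'] > 200)
subset = adata[mask, adata.var['highly_variable']].copy()

assert subset.n_obs == int(mask.sum())
assert subset.X.shape == (subset.n_obs, subset.n_vars)
```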
|
||||
|
||||
### File I/O
|
||||
|
||||
1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
|
||||
2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
|
||||
3. **Close backed files**: Always close file connections when done
|
||||
4. **Use backed mode for large files**: Don't load everything into memory if not needed
|
||||
|
||||
### Concatenation
|
||||
|
||||
1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
|
||||
2. **Track sources**: Use `label` and `keys` to track data origin
|
||||
3. **Handle duplicates**: Use `index_unique` to make observation names unique
|
||||
4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
|
||||
|
||||
### Memory Management
|
||||
|
||||
1. **Use sparse matrices**: For data with <30% non-zero values
|
||||
2. **Convert to categoricals**: For repeated string values
|
||||
3. **Process in chunks**: For operations on very large matrices
|
||||
4. **Use backed mode**: Read large files with `backed='r'`
|
||||
|
||||
### Naming Conventions
|
||||
|
||||
Follow these conventions for consistency:
|
||||
|
||||
- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
|
||||
- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
|
||||
- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
|
||||
- **Variables**: Use snake_case like `highly_variable`, `gene_name`
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory:
|
||||
|
||||
1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters.
|
||||
|
||||
2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows.
|
||||
|
||||
3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns.
|
||||
|
||||
## When to Load References
|
||||
|
||||
Load reference files into context when:
|
||||
- Implementing complex concatenation with specific merge strategies
|
||||
- Troubleshooting errors or unexpected behavior
|
||||
- Optimizing memory usage for large datasets
|
||||
- Implementing complete analysis workflows
|
||||
- Understanding nuances of specific API methods
|
||||
|
||||
To search within references without loading them:

```bash
# Example: Search for information about backed mode
grep -r "backed mode" references/
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```
|
||||
|
||||
### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

## Common Error Patterns

### Memory Errors
**Problem**: "MemoryError: Unable to allocate array"
**Solution**: Use backed mode, sparse matrices, or process in chunks
|
||||
|
||||
### Dimension Mismatch
**Problem**: "ValueError: operands could not be broadcast together"
**Solution**: Use outer join in concatenation or align indices before operations

### Index alignment issues

Always align external data on index:

```python
# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

### View Modification
**Problem**: "ValueError: assignment destination is read-only"
**Solution**: Convert view to copy with `.copy()` before modification

### File Already Open
**Problem**: "OSError: Unable to open file (file is already open)"
**Solution**: Close previous file connection with `adata.file.close()`
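A short hedged sketch of the last two fixes:

```python
# View Modification: copy before writing
subset = adata[adata.obs['condition'] == 'treated'].copy()
subset.obs['flag'] = True  # safe, does not touch the original

# File Already Open: close the backed file before re-opening
adata.file.close()
adata = ad.read_h5ad('file.h5ad', backed='r')
```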
|
||||
## Additional Resources
|
||||
|
||||
- **Official documentation**: https://anndata.readthedocs.io/
|
||||
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
|
||||
- **Scverse ecosystem**: https://scverse.org/
|
||||
- **GitHub repository**: https://github.com/scverse/anndata
|
||||
|
||||
@@ -1,218 +0,0 @@
|
||||
# AnnData API Reference
|
||||
|
||||
## Core AnnData Class
|
||||
|
||||
The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.
|
||||
|
||||
### Core Attributes
|
||||
|
||||
| Attribute | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
|
||||
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
|
||||
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
|
||||
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
|
||||
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) |
|
||||
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) |
|
||||
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
|
||||
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
|
||||
| **layers** | dict-like | Additional data matrices matching X's dimensions |
|
||||
| **raw** | AnnData | Stores original versions of X and var before transformations |
|
||||
|
||||
### Dimensional Properties
|
||||
|
||||
- **n_obs**: Number of observations (sample count)
|
||||
- **n_vars**: Number of variables/features
|
||||
- **shape**: Tuple returning (n_obs, n_vars)
|
||||
- **T**: Transposed view of the entire object
|
||||
|
||||
### State Properties
|
||||
|
||||
- **isbacked**: Boolean indicating disk-backed storage status
|
||||
- **is_view**: Boolean identifying whether object is a view of another AnnData
|
||||
- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode
|
||||
|
||||
### Key Methods
|
||||
|
||||
#### Construction and Copying
|
||||
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object
|
||||
- **`copy(filename=None)`**: Create full copy, optionally stored on disk
|
||||
|
||||
#### Subsetting and Views
|
||||
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default)
|
||||
- **`.copy()`**: Convert view to independent object
|
||||
|
||||
#### Data Access
|
||||
- **`to_df(layer=None)`**: Generate pandas DataFrame representation
|
||||
- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations
|
||||
- **`var_vector(k, layer=None)`**: Extract 1D array for a variable
|
||||
- **`chunk_X(chunk_size)`**: Iterate over data matrix in chunks
|
||||
- **`chunked_X(chunk_size)`**: Context manager for chunked iteration
|
||||
|
||||
#### Transformation
|
||||
- **`transpose()`**: Return transposed object
|
||||
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along observation axis
|
||||
- **`to_memory(copy=False)`**: Load all backed arrays into RAM
|
||||
|
||||
#### File I/O
|
||||
- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format
|
||||
- **`write_zarr(store, ...)`**: Export hierarchical Zarr store
|
||||
- **`write_loom(filename, ...)`**: Output .loom format file
|
||||
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files
|
||||
|
||||
#### Data Management
|
||||
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
|
||||
- **`rename_categories(key, categories)`**: Update category labels in annotations
|
||||
- **`obs_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate observation names
|
||||
- **`var_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate variable names
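A brief hedged sketch of these data-management helpers in use (category labels are illustrative):

```python
# Convert repeated strings in obs/var to categoricals
adata.strings_to_categoricals()

# Rename the categories of a categorical annotation (new labels must match the existing count)
adata.rename_categories('cell_type', ['B cell', 'T cell', 'Monocyte'])

# De-duplicate names by appending numeric suffixes
adata.obs_names_make_unique(sep='-')
adata.var_names_make_unique(sep='-')
```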
|
||||
|
||||
## Module-Level Functions
|
||||
|
||||
### Reading Functions
|
||||
|
||||
#### Native Formats
|
||||
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
|
||||
- **`read_zarr(store)`**: Access hierarchical Zarr array stores
|
||||
|
||||
#### Alternative Formats
|
||||
- **`read_csv(filename, ...)`**: Import from CSV files
|
||||
- **`read_excel(filename, ...)`**: Import from Excel files
|
||||
- **`read_hdf(filename, key)`**: Read from HDF5 files
|
||||
- **`read_loom(filename, ...)`**: Import from .loom files
|
||||
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
|
||||
- **`read_text(filename, ...)`**: Import from text files
|
||||
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format
|
||||
|
||||
#### Element-Level Access
|
||||
- **`read_elem(elem)`**: Retrieve specific components from storage
|
||||
- **`sparse_dataset(group)`**: Generate backed sparse matrix classes
|
||||
|
||||
### Combining Operations
|
||||
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
|
||||
- **axis**: 0 (observations) or 1 (variables)
|
||||
- **join**: 'inner' (intersection) or 'outer' (union)
|
||||
- **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None)
|
||||
- **label**: Column name for source tracking
|
||||
- **keys**: Dataset identifiers for source annotation
|
||||
- **index_unique**: Separator for making duplicate indices unique
|
||||
|
||||
### Writing Functions
|
||||
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
|
||||
- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays
|
||||
- **`write_elem(elem, ...)`**: Write individual components
|
||||
|
||||
### Experimental Features
|
||||
- **`AnnCollection`**: Batch processing for large collections
|
||||
- **`AnnLoader`**: PyTorch DataLoader integration
|
||||
- **`concat_on_disk(*adatas, filename, ...)`**: Memory-efficient out-of-core concatenation
|
||||
- **`read_lazy(filename)`**: Lazy loading with deferred computation
|
||||
- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks
|
||||
- **`write_dispatched(filename, ...)`**: Custom writing with callbacks
|
||||
|
||||
### Configuration
|
||||
- **`settings`**: Package-wide configuration object
|
||||
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes
|
||||
|
||||
## Common Usage Patterns
|
||||
|
||||
### Creating AnnData Objects
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# From dense array
|
||||
counts = np.random.poisson(1, size=(100, 2000))
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# From sparse matrix
|
||||
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# With metadata
|
||||
import pandas as pd
|
||||
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
|
||||
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
|
||||
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
|
||||
```
|
||||
|
||||
### Subsetting
|
||||
|
||||
```python
|
||||
# By names
|
||||
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
|
||||
|
||||
# By boolean mask
|
||||
b_cells = adata[adata.obs.cell_type == 'B']
|
||||
|
||||
# By position
|
||||
first_five = adata[:5, :100]
|
||||
|
||||
# Convert view to copy
|
||||
adata_copy = adata[:5].copy()
|
||||
```
|
||||
|
||||
### Adding Annotations
|
||||
|
||||
```python
|
||||
# Cell-level metadata
|
||||
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)
|
||||
|
||||
# Gene-level metadata
|
||||
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)
|
||||
|
||||
# Embeddings
|
||||
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
|
||||
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))
|
||||
|
||||
# Alternative data representations
|
||||
adata.layers['log_transformed'] = np.log1p(adata.X)
|
||||
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
|
||||
|
||||
# Unstructured metadata
|
||||
adata.uns['experiment_date'] = '2024-01-15'
|
||||
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
|
||||
```
|
||||
|
||||
### File I/O
|
||||
|
||||
```python
|
||||
# Write to disk
|
||||
adata.write('my_results.h5ad', compression='gzip')
|
||||
|
||||
# Read into memory
|
||||
adata = ad.read_h5ad('my_results.h5ad')
|
||||
|
||||
# Read in backed mode (memory-efficient)
|
||||
adata = ad.read_h5ad('my_results.h5ad', backed='r')
|
||||
|
||||
# Close file connection
|
||||
adata.file.close()
|
||||
```
|
||||
|
||||
### Concatenation
|
||||
|
||||
```python
|
||||
# Combine multiple datasets
|
||||
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
|
||||
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
|
||||
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))
|
||||
|
||||
# Simple concatenation
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
|
||||
# With source labels
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
axis=0,
|
||||
label='dataset',
|
||||
keys=['exp1', 'exp2', 'exp3']
|
||||
)
|
||||
|
||||
# Inner join (only shared variables)
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')
|
||||
|
||||
# Outer join (all variables, pad with zeros)
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
|
||||
```
|
||||
525 scientific-packages/anndata/references/best_practices.md (new file)
@@ -0,0 +1,525 @@
|
||||
# Best Practices
|
||||
|
||||
Guidelines for efficient and effective use of AnnData.
|
||||
|
||||
## Memory Management
|
||||
|
||||
### Use sparse matrices for sparse data
|
||||
```python
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
import anndata as ad
|
||||
|
||||
# Check data sparsity
|
||||
data = np.random.rand(1000, 2000)
|
||||
sparsity = 1 - np.count_nonzero(data) / data.size
|
||||
print(f"Sparsity: {sparsity:.2%}")
|
||||
|
||||
# Convert to sparse if >50% zeros
|
||||
if sparsity > 0.5:
|
||||
adata = ad.AnnData(X=csr_matrix(data))
|
||||
else:
|
||||
adata = ad.AnnData(X=data)
|
||||
|
||||
# Benefits: 10-100x memory reduction for sparse genomics data
|
||||
```
|
||||
|
||||
### Convert strings to categoricals
|
||||
```python
|
||||
# Inefficient: string columns use lots of memory
|
||||
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']
|
||||
|
||||
# Efficient: convert to categorical
|
||||
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
|
||||
|
||||
# Convert all string columns
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# Benefits: 10-50x memory reduction for repeated strings
|
||||
```
|
||||
|
||||
### Use backed mode for large datasets
|
||||
```python
|
||||
# Don't load entire dataset into memory
|
||||
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
|
||||
|
||||
# Work with metadata
|
||||
filtered = adata[adata.obs['quality'] > 0.8]
|
||||
|
||||
# Load only filtered subset
|
||||
adata_subset = filtered.to_memory()
|
||||
|
||||
# Benefits: Work with datasets larger than RAM
|
||||
```
|
||||
|
||||
## Views vs Copies
|
||||
|
||||
### Understanding views
|
||||
```python
|
||||
# Subsetting creates a view by default
|
||||
subset = adata[0:100, :]
|
||||
print(subset.is_view) # True
|
||||
|
||||
# Views don't copy data (memory efficient)
|
||||
# But modifications can affect original
|
||||
|
||||
# Check if object is a view
|
||||
if adata.is_view:
|
||||
adata = adata.copy() # Make independent
|
||||
```
|
||||
|
||||
### When to use views
|
||||
```python
|
||||
# Good: Read-only operations on subsets
|
||||
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()
|
||||
|
||||
# Good: Temporary analysis
|
||||
temp_subset = adata[:100, :]
|
||||
result = analyze(temp_subset.X)
|
||||
```
|
||||
|
||||
### When to use copies
|
||||
```python
|
||||
# Create independent copy for modifications
|
||||
adata_filtered = adata[keep_cells, :].copy()
|
||||
|
||||
# Safe to modify without affecting original
|
||||
adata_filtered.obs['new_column'] = values
|
||||
|
||||
# Always copy when:
|
||||
# - Storing subset for later use
|
||||
# - Modifying subset data
|
||||
# - Passing to function that modifies data
|
||||
```
|
||||
|
||||
## Data Storage Best Practices
|
||||
|
||||
### Choose the right format
|
||||
|
||||
**H5AD (HDF5) - Default choice**
|
||||
```python
|
||||
adata.write_h5ad('data.h5ad', compression='gzip')
|
||||
```
|
||||
- Fast random access
|
||||
- Supports backed mode
|
||||
- Good compression
|
||||
- Best for: Most use cases
|
||||
|
||||
**Zarr - Cloud and parallel access**
|
||||
```python
|
||||
adata.write_zarr('data.zarr', chunks=(100, 100))
|
||||
```
|
||||
- Excellent for cloud storage (S3, GCS)
|
||||
- Supports parallel I/O
|
||||
- Good compression
|
||||
- Best for: Large datasets, cloud workflows, parallel processing
|
||||
|
||||
**CSV - Interoperability**
|
||||
```python
|
||||
adata.write_csvs('output_dir/')
|
||||
```
|
||||
- Human readable
|
||||
- Compatible with all tools
|
||||
- Large file sizes, slow
|
||||
- Best for: Sharing with non-Python tools, small datasets
|
||||
|
||||
### Optimize file size
|
||||
```python
|
||||
# Before saving, optimize:
|
||||
|
||||
# 1. Convert to sparse if appropriate
|
||||
from scipy.sparse import csr_matrix, issparse
|
||||
if not issparse(adata.X):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5:
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# 2. Convert strings to categoricals
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# 3. Use compression
|
||||
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
|
||||
|
||||
# Typical results: 5-20x file size reduction
|
||||
```
|
||||
|
||||
## Backed Mode Strategies
|
||||
|
||||
### Read-only analysis
|
||||
```python
|
||||
# Open in read-only backed mode
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
|
||||
# Perform filtering without loading data
|
||||
high_quality = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# Load only filtered data
|
||||
adata_filtered = high_quality.to_memory()
|
||||
```
|
||||
|
||||
### Read-write modifications
|
||||
```python
|
||||
# Open in read-write backed mode
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r+')
|
||||
|
||||
# Modify metadata (written to disk)
|
||||
adata.obs['new_annotation'] = values
|
||||
|
||||
# X remains on disk, modifications saved immediately
|
||||
```
|
||||
|
||||
### Chunked processing
|
||||
```python
|
||||
# Process large dataset in chunks
|
||||
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
|
||||
|
||||
results = []
|
||||
chunk_size = 1000
|
||||
|
||||
for i in range(0, adata.n_obs, chunk_size):
|
||||
chunk = adata[i:i+chunk_size, :].to_memory()
|
||||
result = process(chunk)
|
||||
results.append(result)
|
||||
|
||||
final_result = combine(results)
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Subsetting performance
|
||||
```python
|
||||
# Fast: Boolean indexing with arrays
|
||||
mask = np.array(adata.obs['quality'] > 0.5)
|
||||
subset = adata[mask, :]
|
||||
|
||||
# Slow: Boolean indexing with Series (creates view chain)
|
||||
subset = adata[adata.obs['quality'] > 0.5, :]
|
||||
|
||||
# Fastest: Integer indices
|
||||
indices = np.where(adata.obs['quality'] > 0.5)[0]
|
||||
subset = adata[indices, :]
|
||||
```
|
||||
|
||||
### Avoid repeated subsetting
|
||||
```python
|
||||
# Inefficient: Multiple subset operations
|
||||
for cell_type in ['A', 'B', 'C']:
|
||||
subset = adata[adata.obs['cell_type'] == cell_type]
|
||||
process(subset)
|
||||
|
||||
# Efficient: Group and process
|
||||
groups = adata.obs.groupby('cell_type').groups
|
||||
for cell_type, indices in groups.items():
|
||||
subset = adata[indices, :]
|
||||
process(subset)
|
||||
```
|
||||
|
||||
### Use chunked operations for large matrices
|
||||
```python
|
||||
# Process X in chunks
|
||||
for chunk in adata.chunked_X(chunk_size=1000):
|
||||
result = compute(chunk)
|
||||
|
||||
# More memory efficient than loading full X
|
||||
```
|
||||
|
||||
## Working with Raw Data
|
||||
|
||||
### Store raw before filtering
|
||||
```python
|
||||
# Original data with all genes
|
||||
adata = ad.AnnData(X=counts)
|
||||
|
||||
# Store raw before filtering
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# Filter to highly variable genes
|
||||
adata = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Later: access original data
|
||||
original_expression = adata.raw.X
|
||||
all_genes = adata.raw.var_names
|
||||
```
|
||||
|
||||
### When to use raw
|
||||
```python
|
||||
# Use raw for:
|
||||
# - Differential expression on filtered genes
|
||||
# - Visualization of specific genes not in filtered set
|
||||
# - Accessing original counts after normalization
|
||||
|
||||
# Access raw data
|
||||
if adata.raw is not None:
|
||||
gene_expr = adata.raw[:, 'GENE_NAME'].X
|
||||
else:
|
||||
gene_expr = adata[:, 'GENE_NAME'].X
|
||||
```
|
||||
|
||||
## Metadata Management
|
||||
|
||||
### Naming conventions
|
||||
```python
|
||||
# Consistent naming improves usability
|
||||
|
||||
# Observation metadata (obs):
|
||||
# - cell_id, sample_id
|
||||
# - cell_type, tissue, condition
|
||||
# - n_genes, n_counts, percent_mito
|
||||
# - cluster, leiden, louvain
|
||||
|
||||
# Variable metadata (var):
|
||||
# - gene_id, gene_name
|
||||
# - highly_variable, n_cells
|
||||
# - mean_expression, dispersion
|
||||
|
||||
# Embeddings (obsm):
|
||||
# - X_pca, X_umap, X_tsne
|
||||
# - X_diffmap, X_draw_graph_fr
|
||||
|
||||
# Follow conventions from scanpy/scverse ecosystem
|
||||
```
|
||||
|
||||
### Document metadata
|
||||
```python
|
||||
# Store metadata descriptions in uns
|
||||
adata.uns['metadata_descriptions'] = {
|
||||
'cell_type': 'Cell type annotation from automated clustering',
|
||||
'quality_score': 'QC score from scrublet (0-1, higher is better)',
|
||||
'batch': 'Experimental batch identifier'
|
||||
}
|
||||
|
||||
# Store processing history
|
||||
adata.uns['processing_steps'] = [
|
||||
'Raw counts loaded from 10X',
|
||||
'Filtered: n_genes > 200, n_counts < 50000',
|
||||
'Normalized to 10000 counts per cell',
|
||||
'Log transformed'
|
||||
]
|
||||
```
|
||||
|
||||
## Reproducibility
|
||||
|
||||
### Set random seeds
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# Set seed for reproducible results
|
||||
np.random.seed(42)
|
||||
|
||||
# Document in uns
|
||||
adata.uns['random_seed'] = 42
|
||||
```
|
||||
|
||||
### Store parameters
|
||||
```python
|
||||
# Store analysis parameters in uns
|
||||
adata.uns['pca'] = {
|
||||
'n_comps': 50,
|
||||
'svd_solver': 'arpack',
|
||||
'random_state': 42
|
||||
}
|
||||
|
||||
adata.uns['neighbors'] = {
|
||||
'n_neighbors': 15,
|
||||
'n_pcs': 50,
|
||||
'metric': 'euclidean',
|
||||
'method': 'umap'
|
||||
}
|
||||
```
|
||||
|
||||
### Version tracking
|
||||
```python
|
||||
import anndata
|
||||
import scanpy
|
||||
import numpy
import sys
|
||||
|
||||
# Store versions
|
||||
adata.uns['versions'] = {
|
||||
'anndata': anndata.__version__,
|
||||
'scanpy': scanpy.__version__,
|
||||
'numpy': numpy.__version__,
|
||||
'python': sys.version
|
||||
}
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Check data validity
|
||||
```python
|
||||
# Verify dimensions
|
||||
assert adata.n_obs == len(adata.obs)
|
||||
assert adata.n_vars == len(adata.var)
|
||||
assert adata.X.shape == (adata.n_obs, adata.n_vars)
|
||||
|
||||
# Check for NaN values
|
||||
has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any()
|
||||
if has_nan:
|
||||
print("Warning: Data contains NaN values")
|
||||
|
||||
# Check for negative values (if counts expected)
|
||||
has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any()
|
||||
if has_negative:
|
||||
print("Warning: Data contains negative values")
|
||||
```
|
||||
|
||||
### Validate metadata
|
||||
```python
|
||||
# Check for missing values
|
||||
missing_obs = adata.obs.isnull().sum()
|
||||
if missing_obs.any():
|
||||
print("Missing values in obs:")
|
||||
print(missing_obs[missing_obs > 0])
|
||||
|
||||
# Verify indices are unique
|
||||
assert adata.obs_names.is_unique, "Observation names not unique"
|
||||
assert adata.var_names.is_unique, "Variable names not unique"
|
||||
|
||||
# Check metadata alignment
|
||||
assert len(adata.obs) == adata.n_obs
|
||||
assert len(adata.var) == adata.n_vars
|
||||
```
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
### Scanpy integration
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# AnnData is native format for scanpy
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
sc.pp.highly_variable_genes(adata)
|
||||
sc.pp.pca(adata)
|
||||
sc.pp.neighbors(adata)
|
||||
sc.tl.umap(adata)
|
||||
```
|
||||
|
||||
### Pandas integration
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Convert to DataFrame
|
||||
df = adata.to_df()
|
||||
|
||||
# Create from DataFrame
|
||||
adata = ad.AnnData(df)
|
||||
|
||||
# Work with metadata as DataFrames
|
||||
adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True)
|
||||
```
|
||||
|
||||
### PyTorch integration
|
||||
```python
|
||||
from anndata.experimental import AnnLoader
|
||||
|
||||
# Create PyTorch DataLoader
|
||||
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
|
||||
|
||||
# Iterate in training loop
|
||||
for batch in dataloader:
|
||||
X = batch.X
|
||||
# Train model on batch
|
||||
```
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### Pitfall 1: Modifying views
|
||||
```python
|
||||
# Wrong: Modifying view can affect original
|
||||
subset = adata[:100, :]
|
||||
subset.X = new_data # May modify adata.X!
|
||||
|
||||
# Correct: Copy before modifying
|
||||
subset = adata[:100, :].copy()
|
||||
subset.X = new_data # Independent copy
|
||||
```
|
||||
|
||||
### Pitfall 2: Index misalignment
|
||||
```python
|
||||
# Wrong: Assuming order matches
|
||||
external_data = pd.read_csv('data.csv')
|
||||
adata.obs['new_col'] = external_data['values'] # May misalign!
|
||||
|
||||
# Correct: Align on index
|
||||
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
|
||||
```
|
||||
|
||||
### Pitfall 3: Mixing sparse and dense
|
||||
```python
|
||||
# Wrong: Converting sparse to dense uses huge memory
|
||||
result = adata.X + 1 # Converts sparse to dense!
|
||||
|
||||
# Correct: Use sparse operations
|
||||
from scipy.sparse import issparse
|
||||
if issparse(adata.X):
|
||||
result = adata.X.copy()
|
||||
result.data += 1
|
||||
```
|
||||
|
||||
### Pitfall 4: Not handling views
|
||||
```python
|
||||
# Wrong: Assuming subset is independent
|
||||
subset = adata[mask, :]
|
||||
del adata # subset may become invalid!
|
||||
|
||||
# Correct: Copy when needed
|
||||
subset = adata[mask, :].copy()
|
||||
del adata # subset remains valid
|
||||
```
|
||||
|
||||
### Pitfall 5: Ignoring memory constraints
|
||||
```python
|
||||
# Wrong: Loading huge dataset into memory
|
||||
adata = ad.read_h5ad('100GB_file.h5ad') # OOM error!
|
||||
|
||||
# Correct: Use backed mode
|
||||
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
|
||||
subset = adata[adata.obs['keep']].to_memory()
|
||||
```
|
||||
|
||||
## Workflow Example
|
||||
|
||||
Complete best-practices workflow:
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix, issparse
|
||||
|
||||
# 1. Load with backed mode if large
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
|
||||
# 2. Quick metadata check without loading data
|
||||
print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes")
|
||||
|
||||
# 3. Filter based on metadata
|
||||
high_quality = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# 4. Load filtered subset to memory
|
||||
adata = high_quality.to_memory()
|
||||
|
||||
# 5. Convert to optimal storage types
|
||||
adata.strings_to_categoricals()
|
||||
if not issparse(adata.X):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5:
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# 6. Store raw before filtering genes
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# 7. Filter to highly variable genes
|
||||
adata = adata[:, adata.var['highly_variable']].copy()
|
||||
|
||||
# 8. Document processing
|
||||
adata.uns['processing'] = {
|
||||
'filtered': 'quality_score > 0.8',
|
||||
'n_hvg': adata.n_vars,
|
||||
'date': '2025-11-03'
|
||||
}
|
||||
|
||||
# 9. Save optimized
|
||||
adata.write_h5ad('processed.h5ad', compression='gzip')
|
||||
```
|
||||
396 scientific-packages/anndata/references/concatenation.md (new file)
@@ -0,0 +1,396 @@
|
||||
# Concatenating AnnData Objects
|
||||
|
||||
Combine multiple AnnData objects along either observations or variables axis.
|
||||
|
||||
## Basic Concatenation
|
||||
|
||||
### Concatenate along observations (stack cells/samples)
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
# Create multiple AnnData objects
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata2 = ad.AnnData(X=np.random.rand(150, 50))
|
||||
adata3 = ad.AnnData(X=np.random.rand(200, 50))
|
||||
|
||||
# Concatenate along observations (axis=0, default)
|
||||
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
|
||||
print(adata_combined.shape) # (450, 50)
|
||||
```
|
||||
|
||||
### Concatenate along variables (stack genes/features)
|
||||
```python
|
||||
# Create objects with same observations, different variables
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata2 = ad.AnnData(X=np.random.rand(100, 30))
|
||||
adata3 = ad.AnnData(X=np.random.rand(100, 70))
|
||||
|
||||
# Concatenate along variables (axis=1)
|
||||
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)
|
||||
|
||||
print(adata_combined.shape) # (100, 150)
|
||||
```
|
||||
|
||||
## Join Types
|
||||
|
||||
### Inner join (intersection)
|
||||
Keep only variables/observations present in all objects.
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create objects with different variables
|
||||
adata1 = ad.AnnData(
|
||||
X=np.random.rand(100, 50),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
X=np.random.rand(150, 60),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
|
||||
)
|
||||
|
||||
# Inner join: only genes 10-49 are kept (overlap)
|
||||
adata_inner = ad.concat([adata1, adata2], join='inner')
|
||||
print(adata_inner.n_vars) # 40 genes (overlap)
|
||||
```
|
||||
|
||||
### Outer join (union)
|
||||
Keep all variables/observations, filling missing values.
|
||||
|
||||
```python
|
||||
# Outer join: all genes are kept
|
||||
adata_outer = ad.concat([adata1, adata2], join='outer')
|
||||
print(adata_outer.n_vars) # 70 genes (union)
|
||||
|
||||
# Missing values are filled with appropriate defaults:
|
||||
# - 0 for sparse matrices
|
||||
# - NaN for dense matrices
|
||||
```
|
||||
|
||||
### Fill values for outer joins
|
||||
```python
|
||||
# Specify fill value for missing data
|
||||
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)
|
||||
```
|
||||
|
||||
## Tracking Data Sources
|
||||
|
||||
### Add batch labels
|
||||
```python
|
||||
# Label which object each observation came from
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='batch', # Column name for labels
|
||||
keys=['batch1', 'batch2', 'batch3'] # Labels for each object
|
||||
)
|
||||
|
||||
print(adata_combined.obs['batch'].value_counts())
|
||||
# batch1 100
|
||||
# batch2 150
|
||||
# batch3 200
|
||||
```
|
||||
|
||||
### Automatic batch labels
|
||||
```python
|
||||
# If keys not provided, uses integer indices
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='dataset'
|
||||
)
|
||||
# dataset column contains: 0, 1, 2
|
||||
```
|
||||
|
||||
## Merge Strategies
|
||||
|
||||
Control how metadata from different objects is combined using the `merge` parameter.
|
||||
|
||||
### merge=None (default for observations)
|
||||
Exclude metadata on non-concatenation axis.
|
||||
|
||||
```python
|
||||
# When concatenating observations, var metadata must match
|
||||
adata1.var['gene_type'] = 'protein_coding'
|
||||
adata2.var['gene_type'] = 'protein_coding'
|
||||
|
||||
# var is kept only if identical across all objects
|
||||
adata_combined = ad.concat([adata1, adata2], merge=None)
|
||||
```
|
||||
|
||||
### merge='same'
|
||||
Keep metadata that is identical across all objects.
|
||||
|
||||
```python
|
||||
adata1.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 25 else 'chr2' for g in adata1.var_names]
adata2.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 25 else 'chr2' for g in adata2.var_names]
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA'  # Different

# 'chromosome' is kept (values agree on the shared genes), 'type' is excluded (values differ)
adata_combined = ad.concat([adata1, adata2], merge='same')
|
||||
```
|
||||
|
||||
### merge='unique'
|
||||
Keep metadata columns where each key has exactly one value.
|
||||
|
||||
```python
|
||||
adata1.var['gene_id'] = [f'ENSG{int(g.split("_")[1]):05d}' for g in adata1.var_names]
adata2.var['gene_id'] = [f'ENSG{int(g.split("_")[1]):05d}' for g in adata2.var_names]

# gene_id is kept (each gene has exactly one value across the objects)
adata_combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
|
||||
|
||||
### merge='first'
|
||||
Take values from the first object containing each key.
|
||||
|
||||
```python
|
||||
adata1.var['description'] = ['Desc1'] * adata1.n_vars
adata2.var['description'] = ['Desc2'] * adata2.n_vars
|
||||
|
||||
# Uses descriptions from adata1
|
||||
adata_combined = ad.concat([adata1, adata2], merge='first')
|
||||
```
|
||||
|
||||
### merge='only'
|
||||
Keep metadata that appears in only one object.
|
||||
|
||||
```python
|
||||
adata1.var['adata1_specific'] = [1] * adata1.n_vars
adata2.var['adata2_specific'] = [2] * adata2.n_vars
|
||||
|
||||
# Both metadata columns are kept
|
||||
adata_combined = ad.concat([adata1, adata2], merge='only')
|
||||
```
|
||||
|
||||
## Handling Index Conflicts
|
||||
|
||||
### Make indices unique
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create objects with overlapping observation names
|
||||
adata1 = ad.AnnData(
|
||||
X=np.random.rand(3, 10),
|
||||
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
X=np.random.rand(3, 10),
|
||||
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
|
||||
)
|
||||
|
||||
# Make indices unique by appending batch keys
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
label='batch',
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='_' # Separator for making indices unique
|
||||
)
|
||||
|
||||
print(adata_combined.obs_names)
|
||||
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
|
||||
# 'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']
|
||||
```
|
||||
|
||||
## Concatenating Layers
|
||||
|
||||
```python
|
||||
# Objects with layers
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata1.layers['normalized'] = np.random.rand(100, 50)
|
||||
adata1.layers['scaled'] = np.random.rand(100, 50)
|
||||
|
||||
adata2 = ad.AnnData(X=np.random.rand(150, 50))
|
||||
adata2.layers['normalized'] = np.random.rand(150, 50)
|
||||
adata2.layers['scaled'] = np.random.rand(150, 50)
|
||||
|
||||
# Layers are concatenated automatically if present in all objects
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
|
||||
print(adata_combined.layers.keys())
|
||||
# dict_keys(['normalized', 'scaled'])
|
||||
```
|
||||
|
||||
## Concatenating Multi-dimensional Annotations
|
||||
|
||||
### obsm/varm
|
||||
```python
|
||||
# Objects with embeddings
|
||||
adata1.obsm['X_pca'] = np.random.rand(100, 50)
|
||||
adata2.obsm['X_pca'] = np.random.rand(150, 50)
|
||||
|
||||
# obsm is concatenated along observation axis
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
print(adata_combined.obsm['X_pca'].shape) # (250, 50)
|
||||
```
|
||||
|
||||
### obsp/varp (pairwise annotations)
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Pairwise matrices
|
||||
adata1.obsp['connectivities'] = csr_matrix((100, 100))
|
||||
adata2.obsp['connectivities'] = csr_matrix((150, 150))
|
||||
|
||||
# By default, obsp is NOT concatenated (set pairwise=True to include)
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
# adata_combined.obsp is empty
|
||||
|
||||
# Include pairwise data (creates block diagonal matrix)
|
||||
adata_combined = ad.concat([adata1, adata2], pairwise=True)
|
||||
print(adata_combined.obsp['connectivities'].shape) # (250, 250)
|
||||
```
|
||||
|
||||
## Concatenating uns (unstructured)
|
||||
|
||||
Unstructured metadata is merged recursively:
|
||||
|
||||
```python
|
||||
adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
|
||||
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}
|
||||
|
||||
# Using merge='unique' for uns
|
||||
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
|
||||
# 'date' is kept (same value), 'batch' might be excluded (different values)
|
||||
```
|
||||
|
||||
## Lazy Concatenation (AnnCollection)
|
||||
|
||||
For very large datasets, use lazy concatenation that doesn't load all data:
|
||||
|
||||
```python
|
||||
import anndata as ad
from anndata.experimental import AnnCollection

# Open files in backed mode (metadata only; X stays on disk)
files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
adatas = [ad.read_h5ad(f, backed='r') for f in files]

# Build the collection without loading the data matrices
collection = AnnCollection(
    adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)
|
||||
|
||||
# Access data lazily
|
||||
print(collection.n_obs) # Total observations
|
||||
print(collection.obs.head()) # Metadata loaded, not X
|
||||
|
||||
# Convert to regular AnnData when needed (loads all data)
|
||||
adata = collection.to_adata()
|
||||
```
|
||||
|
||||
### Working with AnnCollection
|
||||
```python
# Subset without loading data
subset = collection[collection.obs['cell_type'] == 'T cell']

# The backed objects opened above remain available for per-dataset access
for adata in adatas:
    print(adata.shape)

# Access a specific dataset
first_dataset = adatas[0]
```
|
||||
|
||||
## Concatenation on Disk
|
||||
|
||||
For datasets too large for memory, concatenate directly on disk:
|
||||
|
||||
```python
|
||||
from anndata.experimental import concat_on_disk
|
||||
|
||||
# Concatenate without loading into memory
|
||||
concat_on_disk(
|
||||
['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
|
||||
'combined.h5ad',
|
||||
join='outer'
|
||||
)
|
||||
|
||||
# Load result in backed mode
|
||||
adata = ad.read_h5ad('combined.h5ad', backed='r')
|
||||
```
|
||||
|
||||
## Common Concatenation Patterns
|
||||
|
||||
### Combine technical replicates
|
||||
```python
|
||||
# Multiple runs of the same samples
|
||||
replicates = [adata_run1, adata_run2, adata_run3]
|
||||
adata_combined = ad.concat(
|
||||
replicates,
|
||||
label='technical_replicate',
|
||||
keys=['rep1', 'rep2', 'rep3'],
|
||||
join='inner' # Keep only genes measured in all runs
|
||||
)
|
||||
```
|
||||
|
||||
### Combine batches from experiment
|
||||
```python
|
||||
# Different experimental batches
|
||||
batches = [adata_batch1, adata_batch2, adata_batch3]
|
||||
adata_combined = ad.concat(
|
||||
batches,
|
||||
label='batch',
|
||||
keys=['batch1', 'batch2', 'batch3'],
|
||||
join='outer' # Keep all genes
|
||||
)
|
||||
|
||||
# Later: apply batch correction
|
||||
```
|
||||
|
||||
### Merge multi-modal data
|
||||
```python
|
||||
# Different measurement modalities (e.g., RNA + protein)
|
||||
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
adata_protein = ad.AnnData(X=np.random.rand(100, 50))
|
||||
|
||||
# Concatenate along variables to combine modalities
|
||||
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)
|
||||
|
||||
# Add labels to distinguish modalities
|
||||
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Check compatibility before concatenating**
|
||||
```python
|
||||
# Verify shapes are compatible
|
||||
print([adata.n_vars for adata in [adata1, adata2, adata3]])
|
||||
|
||||
# Check variable names match
|
||||
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])
|
||||
```
|
||||
|
||||
2. **Use appropriate join type**
|
||||
- `inner`: When you need the same features across all samples (most stringent)
|
||||
- `outer`: When you want to preserve all features (most inclusive)
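
A quick way to gauge which join fits (a small sketch, assuming two in-memory objects named `adata1` and `adata2`):

```python
# Check how much the gene sets overlap before choosing the join
shared = set(adata1.var_names) & set(adata2.var_names)
print(f"{len(shared)} shared genes of {adata1.n_vars} and {adata2.n_vars}")
# Large overlap: 'inner' loses little; small overlap: 'outer' preserves more information
```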
|
||||
|
||||
3. **Track data sources**
|
||||
Always use `label` and `keys` to track which observations came from which dataset.
|
||||
|
||||
4. **Consider memory usage**
|
||||
- For large datasets, use `AnnCollection` or `concat_on_disk`
|
||||
- Consider backed mode for the result
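
A minimal sketch of the backed-result idea, assuming the combined object already exists in memory:

```python
# Write the combined object once, then reopen it in backed mode for downstream queries
adata_combined.write_h5ad('combined.h5ad', compression='gzip')
adata_combined = ad.read_h5ad('combined.h5ad', backed='r')
```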
|
||||
|
||||
5. **Handle batch effects**
|
||||
Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:
|
||||
```python
|
||||
# After concatenation, apply batch correction
|
||||
import scanpy as sc
|
||||
sc.pp.combat(adata_combined, key='batch')
|
||||
```
|
||||
|
||||
6. **Validate results**
|
||||
```python
|
||||
# Check dimensions
|
||||
print(adata_combined.shape)
|
||||
|
||||
# Check batch distribution
|
||||
print(adata_combined.obs['batch'].value_counts())
|
||||
|
||||
# Verify metadata integrity
|
||||
print(adata_combined.var.head())
|
||||
print(adata_combined.obs.head())
|
||||
```
|
||||
@@ -1,478 +0,0 @@
|
||||
# AnnData Concatenation Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The `concat()` function combines multiple AnnData objects through two fundamental operations:
|
||||
1. **Concatenation**: Stacking sub-elements in order
|
||||
2. **Merging**: Combining collections into one result
|
||||
|
||||
## Basic Concatenation
|
||||
|
||||
### Syntax
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
combined = ad.concat(
|
||||
adatas, # List of AnnData objects
|
||||
axis=0, # 0=observations, 1=variables
|
||||
join='inner', # 'inner' or 'outer'
|
||||
merge=None, # Merge strategy for non-concat axis
|
||||
label=None, # Column name for source tracking
|
||||
keys=None, # Dataset identifiers
|
||||
index_unique=None, # Separator for unique indices
|
||||
fill_value=None, # Fill value for missing data
|
||||
pairwise=False # Include pairwise matrices
|
||||
)
|
||||
```
|
||||
|
||||
### Concatenating Observations (Cells)
|
||||
```python
|
||||
# Most common: combining multiple samples/batches
|
||||
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
||||
adata2 = ad.AnnData(np.random.rand(150, 2000))
|
||||
adata3 = ad.AnnData(np.random.rand(80, 2000))
|
||||
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
# Result: (330 observations, 2000 variables)
|
||||
```
|
||||
|
||||
### Concatenating Variables (Genes)
|
||||
```python
|
||||
# Less common: combining different feature sets
|
||||
adata1 = ad.AnnData(np.random.rand(100, 1000))
|
||||
adata2 = ad.AnnData(np.random.rand(100, 500))
|
||||
|
||||
combined = ad.concat([adata1, adata2], axis=1)
|
||||
# Result: (100 observations, 1500 variables)
|
||||
```
|
||||
|
||||
## Join Strategies
|
||||
|
||||
### Inner Join (Intersection)
|
||||
|
||||
Keeps only shared features across all objects.
|
||||
|
||||
```python
|
||||
# Datasets with different genes
|
||||
adata1 = ad.AnnData(
|
||||
np.random.rand(100, 2000),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
np.random.rand(150, 1800),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
|
||||
)
|
||||
|
||||
# Inner join: only genes present in both
|
||||
combined = ad.concat([adata1, adata2], join='inner')
|
||||
# Result: (250 observations, 1800 variables)
|
||||
# Only Gene_200 through Gene_1999
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- You want to analyze only features measured in all datasets
|
||||
- Missing features would compromise analysis
|
||||
- You need a complete case analysis
|
||||
|
||||
**Trade-offs:**
|
||||
- May lose many features
|
||||
- Ensures no missing data
|
||||
- Smaller result size
|
||||
|
||||
### Outer Join (Union)
|
||||
|
||||
Keeps all features from all objects, padding with fill values (default 0).
|
||||
|
||||
```python
|
||||
# Outer join: all genes from both datasets
|
||||
combined = ad.concat([adata1, adata2], join='outer')
|
||||
# Result: (250 observations, 2000 variables)
|
||||
# Missing values filled with 0
|
||||
|
||||
# Custom fill value
|
||||
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- You want to preserve all features
|
||||
- Sparse data is acceptable
|
||||
- Features are independent
|
||||
|
||||
**Trade-offs:**
|
||||
- Introduces zeros/missing values
|
||||
- Larger result size
|
||||
- May need imputation
|
||||
|
||||
## Merge Strategies
|
||||
|
||||
Merge strategies control how elements on the non-concatenation axis are combined.
|
||||
|
||||
### merge=None (Default)
|
||||
|
||||
Excludes all non-concatenation axis elements.
|
||||
|
||||
```python
|
||||
# Both datasets have var annotations
|
||||
adata1.var['gene_type'] = ['protein_coding'] * 2000
|
||||
adata2.var['gene_type'] = ['protein_coding'] * 1800
|
||||
|
||||
# merge=None: var annotations excluded
|
||||
combined = ad.concat([adata1, adata2], merge=None)
|
||||
assert 'gene_type' not in combined.var.columns
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Annotations are dataset-specific
|
||||
- You'll add new annotations after merging
|
||||
|
||||
### merge='same'
|
||||
|
||||
Keeps only annotations with identical values across datasets.
|
||||
|
||||
```python
# Same annotation values for each gene, derived from the gene name
adata1.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 1000 else 'chr2' for g in adata1.var_names]
adata2.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 1000 else 'chr2' for g in adata2.var_names]

# merge='same': keeps chromosome annotation because the values agree on the shared genes
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```
|
||||
|
||||
**Use when:**
|
||||
- Annotations should be consistent
|
||||
- You want to validate consistency
|
||||
- Shared metadata is important
|
||||
|
||||
**Note:** Comparison occurs after index alignment - only shared indices need to match.
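
A small illustration of this note, using hypothetical gene names:

```python
import anndata as ad
import numpy as np
import pandas as pd

a1 = ad.AnnData(np.random.rand(5, 3), var=pd.DataFrame({'type': ['a', 'b', 'c']}, index=['g1', 'g2', 'g3']))
a2 = ad.AnnData(np.random.rand(5, 3), var=pd.DataFrame({'type': ['b', 'c', 'd']}, index=['g2', 'g3', 'g4']))

# Inner join keeps only g2 and g3; 'type' agrees on those shared genes, so it survives merge='same'
combined = ad.concat([a1, a2], join='inner', merge='same')
print(combined.var)
```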
|
||||
|
||||
### merge='unique'
|
||||
|
||||
Includes annotations with a single possible value.
|
||||
|
||||
```python
|
||||
# Unique values per gene
|
||||
adata1.var['ensembl_id'] = [f'ENSG{int(g.split("_")[1]):08d}' for g in adata1.var_names]
adata2.var['ensembl_id'] = [f'ENSG{int(g.split("_")[1]):08d}' for g in adata2.var_names]
|
||||
|
||||
# merge='unique': keeps ensembl_id
|
||||
combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Each feature has a unique identifier
|
||||
- Annotations are feature-specific
|
||||
|
||||
### merge='first'
|
||||
|
||||
Takes the first occurrence of each annotation.
|
||||
|
||||
```python
|
||||
# Different annotation versions
|
||||
adata1.var['description'] = ['desc1'] * adata1.n_vars
adata2.var['description'] = ['desc2'] * adata2.n_vars
|
||||
|
||||
# merge='first': uses adata1's descriptions
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
# Uses descriptions from adata1
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- One dataset has authoritative annotations
|
||||
- Order matters
|
||||
- You need a simple resolution strategy
|
||||
|
||||
### merge='only'
|
||||
|
||||
Retains annotations appearing in exactly one object.
|
||||
|
||||
```python
|
||||
# Dataset-specific annotations
|
||||
adata1.var['dataset1_specific'] = ['value'] * adata1.n_vars
adata2.var['dataset2_specific'] = ['value'] * adata2.n_vars
|
||||
|
||||
# merge='only': keeps both (no conflicts)
|
||||
combined = ad.concat([adata1, adata2], merge='only')
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Datasets have non-overlapping annotations
|
||||
- You want to preserve all unique metadata
|
||||
|
||||
## Source Tracking
|
||||
|
||||
### Using label
|
||||
|
||||
Add a categorical column to track data origin.
|
||||
|
||||
```python
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='batch'
|
||||
)
|
||||
|
||||
# Creates obs['batch'] with values 0, 1, 2
|
||||
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
|
||||
```
|
||||
|
||||
### Using keys
|
||||
|
||||
Provide custom names for source tracking.
|
||||
|
||||
```python
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='study',
|
||||
keys=['control', 'treatment_a', 'treatment_b']
|
||||
)
|
||||
|
||||
# Creates obs['study'] with custom names
|
||||
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
|
||||
```
|
||||
|
||||
### Making Indices Unique
|
||||
|
||||
Append source identifiers to duplicate observation names.
|
||||
|
||||
```python
|
||||
# Both datasets have cells named "Cell_0", "Cell_1", etc.
|
||||
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
|
||||
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
|
||||
|
||||
# index_unique adds suffix
|
||||
combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='-'
|
||||
)
|
||||
|
||||
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
|
||||
print(combined.obs_names[:5])
|
||||
```
|
||||
|
||||
## Handling Different Attributes
|
||||
|
||||
### X Matrix and Layers
|
||||
|
||||
Follows join strategy. Missing values filled according to `fill_value`.
|
||||
|
||||
```python
|
||||
# Both have layers
|
||||
adata1.layers['counts'] = adata1.X.copy()
|
||||
adata2.layers['counts'] = adata2.X.copy()
|
||||
|
||||
# Concatenates both X and layers
|
||||
combined = ad.concat([adata1, adata2])
|
||||
assert 'counts' in combined.layers
|
||||
```
|
||||
|
||||
### obs and var DataFrames
|
||||
|
||||
- **obs**: Concatenated along concatenation axis
|
||||
- **var**: Handled by merge strategy
|
||||
|
||||
```python
|
||||
adata1.obs['cell_type'] = ['B cell'] * 100
|
||||
adata2.obs['cell_type'] = ['T cell'] * 150
|
||||
|
||||
combined = ad.concat([adata1, adata2])
|
||||
# obs['cell_type'] preserved for all cells
|
||||
```
|
||||
|
||||
### obsm and varm
|
||||
|
||||
Multi-dimensional annotations follow same rules as layers.
|
||||
|
||||
```python
|
||||
adata1.obsm['X_pca'] = np.random.rand(100, 50)
|
||||
adata2.obsm['X_pca'] = np.random.rand(150, 50)
|
||||
|
||||
combined = ad.concat([adata1, adata2])
|
||||
# obsm['X_pca'] concatenated: shape (250, 50)
|
||||
```
|
||||
|
||||
### obsp and varp
|
||||
|
||||
Pairwise matrices excluded by default. Enable with `pairwise=True`.
|
||||
|
||||
```python
|
||||
# Distance matrices
|
||||
adata1.obsp['distances'] = np.random.rand(100, 100)
|
||||
adata2.obsp['distances'] = np.random.rand(150, 150)
|
||||
|
||||
# Excluded by default
|
||||
combined = ad.concat([adata1, adata2])
|
||||
assert 'distances' not in combined.obsp
|
||||
|
||||
# Include if needed
|
||||
combined = ad.concat([adata1, adata2], pairwise=True)
|
||||
# Results in padded block diagonal matrix
|
||||
```
|
||||
|
||||
### uns Dictionary
|
||||
|
||||
Merged recursively, applying merge strategy at any nesting depth.
|
||||
|
||||
```python
|
||||
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
|
||||
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
|
||||
|
||||
# merge='same' keeps 'lab', excludes 'date'
|
||||
combined = ad.concat([adata1, adata2], merge='same')
|
||||
# combined.uns['experiment'] = {'lab': 'A'}
|
||||
```
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Batch Integration Pipeline
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
# Load batches
|
||||
batches = [
|
||||
ad.read_h5ad(f'batch_{i}.h5ad')
|
||||
for i in range(5)
|
||||
]
|
||||
|
||||
# Concatenate with tracking
|
||||
combined = ad.concat(
|
||||
batches,
|
||||
axis=0,
|
||||
join='outer',
|
||||
merge='first',
|
||||
label='batch_id',
|
||||
keys=[f'batch_{i}' for i in range(5)],
|
||||
index_unique='-'
|
||||
)
|
||||
|
||||
# Add batch effects
|
||||
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
|
||||
```
|
||||
|
||||
### Multi-Study Meta-Analysis
|
||||
|
||||
```python
|
||||
# Different studies with varying gene coverage
|
||||
studies = {
|
||||
'study_a': ad.read_h5ad('study_a.h5ad'),
|
||||
'study_b': ad.read_h5ad('study_b.h5ad'),
|
||||
'study_c': ad.read_h5ad('study_c.h5ad')
|
||||
}
|
||||
|
||||
# Outer join to keep all genes
|
||||
combined = ad.concat(
|
||||
list(studies.values()),
|
||||
axis=0,
|
||||
join='outer',
|
||||
label='study',
|
||||
keys=list(studies.keys()),
|
||||
merge='unique',
|
||||
fill_value=0
|
||||
)
|
||||
|
||||
# Track coverage
|
||||
for study in studies:
|
||||
n_genes = studies[study].n_vars
|
||||
combined.uns[f'{study}_n_genes'] = n_genes
|
||||
```
|
||||
|
||||
### Incremental Concatenation
|
||||
|
||||
```python
|
||||
# For many datasets, concatenate in batches
|
||||
chunk_size = 10
|
||||
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
|
||||
|
||||
# Process in chunks
|
||||
result = None
|
||||
for i in range(0, len(all_files), chunk_size):
|
||||
chunk_files = all_files[i:i+chunk_size]
|
||||
chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
|
||||
chunk_combined = ad.concat(chunk_adatas)
|
||||
|
||||
if result is None:
|
||||
result = chunk_combined
|
||||
else:
|
||||
result = ad.concat([result, chunk_combined])
|
||||
```
|
||||
|
||||
### Memory-Efficient On-Disk Concatenation
|
||||
|
||||
```python
|
||||
# Experimental feature for large datasets
|
||||
from anndata.experimental import concat_on_disk
|
||||
|
||||
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
|
||||
concat_on_disk(
|
||||
files,
|
||||
'combined.h5ad',
|
||||
join='outer'
|
||||
)
|
||||
|
||||
# Read result in backed mode
|
||||
combined = ad.read_h5ad('combined.h5ad', backed='r')
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Dimension Mismatch
|
||||
|
||||
```python
|
||||
# Error: shapes don't match
|
||||
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
||||
adata2 = ad.AnnData(np.random.rand(150, 1500))
|
||||
|
||||
# Solution: use outer join
|
||||
combined = ad.concat([adata1, adata2], join='outer')
|
||||
```
|
||||
|
||||
### Issue: Memory Error
|
||||
|
||||
```python
|
||||
# Problem: too many large objects in memory
|
||||
large_adatas = [ad.read_h5ad(f) for f in many_files]
|
||||
|
||||
# Solution: read and concatenate incrementally
|
||||
result = None
|
||||
for file in many_files:
|
||||
adata = ad.read_h5ad(file)
|
||||
if result is None:
|
||||
result = adata
|
||||
else:
|
||||
result = ad.concat([result, adata])
|
||||
del adata # Free memory
|
||||
```
|
||||
|
||||
### Issue: Duplicate Indices
|
||||
|
||||
```python
|
||||
# Problem: same cell names in different batches
|
||||
# Solution: use index_unique
|
||||
combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='-'
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Lost Annotations
|
||||
|
||||
```python
|
||||
# Problem: annotations disappear
|
||||
adata1.var['important'] = values1
|
||||
adata2.var['important'] = values2
|
||||
|
||||
combined = ad.concat([adata1, adata2]) # merge=None by default
|
||||
# Solution: use appropriate merge strategy
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Pre-align indices**: Ensure consistent naming before concatenation
|
||||
2. **Use sparse matrices**: Convert to sparse before concatenating
|
||||
3. **Batch operations**: Concatenate in groups for many datasets
|
||||
4. **Choose inner join**: When possible, to reduce result size
|
||||
5. **Use categoricals**: Convert string annotations before concatenating
|
||||
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
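
A short sketch combining tips 2 and 5, assuming the objects to concatenate are already in memory:

```python
from scipy.sparse import csr_matrix, issparse

# Convert to sparse and categoricals before concatenating
for a in (adata1, adata2):
    if not issparse(a.X):
        a.X = csr_matrix(a.X)
    a.strings_to_categoricals()

combined = ad.concat([adata1, adata2], join='inner')
```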
|
||||
scientific-packages/anndata/references/data_structure.md (new file, 314 lines)
|
||||
# AnnData Object Structure
|
||||
|
||||
The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.
|
||||
|
||||
## Core Components
|
||||
|
||||
### X (Data Matrix)
|
||||
The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
# Create with dense array
|
||||
adata = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
|
||||
# Create with sparse matrix (recommended for large, sparse data)
|
||||
from scipy.sparse import csr_matrix
|
||||
sparse_data = csr_matrix(np.random.rand(100, 2000))
|
||||
adata = ad.AnnData(X=sparse_data)
|
||||
```
|
||||
|
||||
Access data:
|
||||
```python
|
||||
# Full matrix (caution with large datasets)
|
||||
full_data = adata.X
|
||||
|
||||
# Single observation
|
||||
obs_data = adata.X[0, :]
|
||||
|
||||
# Single variable across all observations
|
||||
var_data = adata.X[:, 0]
|
||||
```
|
||||
|
||||
### obs (Observation Annotations)
|
||||
DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create AnnData with observation metadata
|
||||
obs_df = pd.DataFrame({
|
||||
'cell_type': ['T cell', 'B cell', 'Monocyte'],
|
||||
'treatment': ['control', 'treated', 'control'],
|
||||
'timepoint': [0, 24, 24]
|
||||
}, index=['cell_1', 'cell_2', 'cell_3'])
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)
|
||||
|
||||
# Access observation metadata
|
||||
print(adata.obs['cell_type'])
|
||||
print(adata.obs.loc['cell_1'])
|
||||
```
|
||||
|
||||
### var (Variable Annotations)
|
||||
DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.
|
||||
|
||||
```python
|
||||
# Create AnnData with variable metadata
|
||||
var_df = pd.DataFrame({
|
||||
'gene_name': ['ACTB', 'GAPDH', 'TP53'],
|
||||
'chromosome': ['7', '12', '17'],
|
||||
'highly_variable': [True, False, True]
|
||||
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)
|
||||
|
||||
# Access variable metadata
|
||||
print(adata.var['gene_name'])
|
||||
print(adata.var.loc['ENSG00001'])
|
||||
```
|
||||
|
||||
### layers (Alternative Data Representations)
|
||||
Dictionary storing alternative matrices with the same dimensions as X.
|
||||
|
||||
```python
|
||||
# Store raw counts, normalized data, and scaled data
|
||||
adata = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
|
||||
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
|
||||
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()
|
||||
|
||||
# Access layers
|
||||
raw_data = adata.layers['raw_counts']
|
||||
normalized_data = adata.layers['normalized']
|
||||
```
|
||||
|
||||
Common layer uses:
|
||||
- `raw_counts`: Original count data before normalization
|
||||
- `normalized`: Log-normalized or TPM values
|
||||
- `scaled`: Z-scored values for analysis
|
||||
- `imputed`: Data after imputation
|
||||
|
||||
### obsm (Multi-dimensional Observation Annotations)
|
||||
Dictionary storing multi-dimensional arrays aligned to observations.
|
||||
|
||||
```python
|
||||
# Store PCA coordinates and UMAP embeddings
|
||||
adata.obsm['X_pca'] = np.random.rand(100, 50) # 50 principal components
|
||||
adata.obsm['X_umap'] = np.random.rand(100, 2) # 2D UMAP coordinates
|
||||
adata.obsm['X_tsne'] = np.random.rand(100, 2) # 2D t-SNE coordinates
|
||||
|
||||
# Access embeddings
|
||||
pca_coords = adata.obsm['X_pca']
|
||||
umap_coords = adata.obsm['X_umap']
|
||||
```
|
||||
|
||||
Common obsm uses:
|
||||
- `X_pca`: Principal component coordinates
|
||||
- `X_umap`: UMAP embedding coordinates
|
||||
- `X_tsne`: t-SNE embedding coordinates
|
||||
- `X_diffmap`: Diffusion map coordinates
|
||||
- `protein_expression`: Protein abundance measurements (CITE-seq)
|
||||
|
||||
### varm (Multi-dimensional Variable Annotations)
|
||||
Dictionary storing multi-dimensional arrays aligned to variables.
|
||||
|
||||
```python
|
||||
# Store PCA loadings
|
||||
adata.varm['PCs'] = np.random.rand(2000, 50) # Loadings for 50 components
|
||||
adata.varm['gene_modules'] = np.random.rand(2000, 10) # Gene module scores
|
||||
|
||||
# Access loadings
|
||||
pc_loadings = adata.varm['PCs']
|
||||
```
|
||||
|
||||
Common varm uses:
|
||||
- `PCs`: Principal component loadings
|
||||
- `gene_modules`: Gene co-expression module assignments
|
||||
|
||||
### obsp (Pairwise Observation Relationships)
|
||||
Dictionary storing sparse matrices representing relationships between observations.
|
||||
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Store k-nearest neighbor graph
|
||||
n_obs = 100
|
||||
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
|
||||
adata.obsp['connectivities'] = knn_graph
|
||||
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
|
||||
|
||||
# Access graphs
|
||||
knn_connections = adata.obsp['connectivities']
|
||||
distances = adata.obsp['distances']
|
||||
```
|
||||
|
||||
Common obsp uses:
|
||||
- `connectivities`: Cell-cell neighborhood graph
|
||||
- `distances`: Pairwise distances between cells
|
||||
|
||||
### varp (Pairwise Variable Relationships)
|
||||
Dictionary storing sparse matrices representing relationships between variables.
|
||||
|
||||
```python
|
||||
# Store gene-gene correlation matrix
|
||||
n_vars = 2000
|
||||
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
|
||||
adata.varp['correlations'] = gene_corr
|
||||
|
||||
# Access correlations
|
||||
gene_correlations = adata.varp['correlations']
|
||||
```
|
||||
|
||||
### uns (Unstructured Annotations)
|
||||
Dictionary storing arbitrary unstructured metadata.
|
||||
|
||||
```python
|
||||
# Store analysis parameters and results
|
||||
adata.uns['experiment_date'] = '2025-11-03'
|
||||
adata.uns['pca'] = {
|
||||
'variance_ratio': [0.15, 0.10, 0.08],
|
||||
'params': {'n_comps': 50}
|
||||
}
|
||||
adata.uns['neighbors'] = {
|
||||
'params': {'n_neighbors': 15, 'method': 'umap'},
|
||||
'connectivities_key': 'connectivities'
|
||||
}
|
||||
|
||||
# Access unstructured data
|
||||
exp_date = adata.uns['experiment_date']
|
||||
pca_params = adata.uns['pca']['params']
|
||||
```
|
||||
|
||||
Common uns uses:
|
||||
- Analysis parameters and settings
|
||||
- Color palettes for plotting
|
||||
- Cluster information
|
||||
- Tool-specific metadata
|
||||
|
||||
### raw (Original Data Snapshot)
|
||||
Optional attribute preserving the original data matrix and variable annotations before filtering.
|
||||
|
||||
```python
|
||||
# Create AnnData and store raw state
|
||||
adata = ad.AnnData(X=np.random.rand(100, 5000))
|
||||
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]
|
||||
|
||||
# Store raw state before filtering
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# Filter to highly variable genes
|
||||
highly_variable_mask = np.random.rand(5000) > 0.5
|
||||
adata = adata[:, highly_variable_mask]
|
||||
|
||||
# Access original data
|
||||
original_matrix = adata.raw.X
|
||||
original_var = adata.raw.var
|
||||
```
|
||||
|
||||
## Object Properties
|
||||
|
||||
```python
|
||||
# Dimensions
|
||||
n_observations = adata.n_obs
|
||||
n_variables = adata.n_vars
|
||||
shape = adata.shape # (n_obs, n_vars)
|
||||
|
||||
# Index information
|
||||
obs_names = adata.obs_names # Observation identifiers
|
||||
var_names = adata.var_names # Variable identifiers
|
||||
|
||||
# Storage mode
|
||||
is_view = adata.is_view # True if this is a view of another object
|
||||
is_backed = adata.isbacked # True if backed by on-disk storage
|
||||
filename = adata.filename # Path to backing file (if backed)
|
||||
```
|
||||
|
||||
## Creating AnnData Objects
|
||||
|
||||
### From arrays and DataFrames
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
# Minimal creation
|
||||
X = np.random.rand(100, 2000)
|
||||
adata = ad.AnnData(X)
|
||||
|
||||
# With metadata
|
||||
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
|
||||
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
|
||||
adata = ad.AnnData(X=X, obs=obs, var=var)
|
||||
|
||||
# With all components
|
||||
adata = ad.AnnData(
|
||||
X=X,
|
||||
obs=obs,
|
||||
var=var,
|
||||
layers={'raw': np.random.randint(0, 100, (100, 2000))},
|
||||
obsm={'X_pca': np.random.rand(100, 50)},
|
||||
uns={'experiment': 'test'}
|
||||
)
|
||||
```
|
||||
|
||||
### From DataFrame
|
||||
```python
|
||||
# Create from pandas DataFrame (genes as columns, cells as rows)
|
||||
df = pd.DataFrame(
|
||||
np.random.rand(100, 50),
|
||||
columns=[f'Gene_{i}' for i in range(50)],
|
||||
index=[f'Cell_{i}' for i in range(100)]
|
||||
)
|
||||
adata = ad.AnnData(df)
|
||||
```
|
||||
|
||||
## Data Access Patterns
|
||||
|
||||
### Vector extraction
|
||||
```python
|
||||
# Get observation annotation as array
|
||||
cell_types = adata.obs_vector('cell_type')
|
||||
|
||||
# Get variable values across observations
|
||||
gene_expression = adata.obs_vector('ACTB') # If ACTB is in var_names
|
||||
|
||||
# Get variable annotation as array
|
||||
gene_names = adata.var_vector('gene_name')
|
||||
```
|
||||
|
||||
### Subsetting
|
||||
```python
|
||||
# By index
|
||||
subset = adata[0:10, 0:100] # First 10 obs, first 100 vars
|
||||
|
||||
# By name
|
||||
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]
|
||||
|
||||
# By boolean mask
|
||||
high_count_cells = adata.obs['total_counts'] > 1000
|
||||
subset = adata[high_count_cells, :]
|
||||
|
||||
# By observation metadata
|
||||
t_cells = adata[adata.obs['cell_type'] == 'T cell']
|
||||
```
|
||||
|
||||
## Memory Considerations
|
||||
|
||||
The AnnData structure is designed for memory efficiency:
|
||||
- Sparse matrices reduce memory for sparse data
|
||||
- Views avoid copying data when possible
|
||||
- Backed mode enables working with data larger than RAM
|
||||
- Categorical annotations reduce memory for discrete values
|
||||
|
||||
```python
|
||||
# Convert strings to categoricals (more memory efficient)
|
||||
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# Check if object is a view (doesn't own data)
|
||||
if adata.is_view:
|
||||
adata = adata.copy() # Create independent copy
|
||||
```
|
||||
scientific-packages/anndata/references/io_operations.md (new file, 404 lines)
|
||||
# Input/Output Operations
|
||||
|
||||
AnnData provides comprehensive I/O functionality for reading and writing data in various formats.
|
||||
|
||||
## Native Formats
|
||||
|
||||
### H5AD (HDF5-based)
|
||||
The recommended native format for AnnData objects, providing efficient storage and fast access.
|
||||
|
||||
#### Writing H5AD files
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
# Write to file
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# Write with compression
|
||||
adata.write_h5ad('data.h5ad', compression='gzip')
|
||||
|
||||
# Write with specific compression level (0-9, higher = more compression)
|
||||
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
|
||||
```
|
||||
|
||||
#### Reading H5AD files
|
||||
```python
|
||||
# Read entire file into memory
|
||||
adata = ad.read_h5ad('data.h5ad')
|
||||
|
||||
# Read in backed mode (lazy loading for large files)
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r') # Read-only
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r+') # Read-write
|
||||
|
||||
# Backed mode enables working with datasets larger than RAM
|
||||
# Only accessed data is loaded into memory
|
||||
```
|
||||
|
||||
#### Backed mode operations
|
||||
```python
|
||||
# Open in backed mode
|
||||
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
|
||||
|
||||
# Access metadata without loading X into memory
|
||||
print(adata.obs.head())
|
||||
print(adata.var.head())
|
||||
|
||||
# Subset operations create views
|
||||
subset = adata[:100, :500] # View, no data loaded
|
||||
|
||||
# Load specific data into memory
|
||||
X_subset = subset.X[:] # Now loads this subset
|
||||
|
||||
# Convert entire backed object to memory
|
||||
adata_memory = adata.to_memory()
|
||||
```
|
||||
|
||||
### Zarr
|
||||
Hierarchical array storage format, optimized for cloud storage and parallel I/O.
|
||||
|
||||
#### Writing Zarr
|
||||
```python
|
||||
# Write to Zarr store
|
||||
adata.write_zarr('data.zarr')
|
||||
|
||||
# Write with specific chunks (important for performance)
|
||||
adata.write_zarr('data.zarr', chunks=(100, 100))
|
||||
```
|
||||
|
||||
#### Reading Zarr
|
||||
```python
|
||||
# Read Zarr store
|
||||
adata = ad.read_zarr('data.zarr')
|
||||
```
|
||||
|
||||
#### Remote Zarr access
|
||||
```python
|
||||
import fsspec
|
||||
|
||||
# Access Zarr from S3
|
||||
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
|
||||
adata = ad.read_zarr(store)
|
||||
|
||||
# Access Zarr from URL
|
||||
store = fsspec.get_mapper('https://example.com/data.zarr')
|
||||
adata = ad.read_zarr(store)
|
||||
```
|
||||
|
||||
## Alternative Input Formats
|
||||
|
||||
### CSV/TSV
|
||||
```python
|
||||
# Read CSV (genes as columns, cells as rows)
|
||||
adata = ad.read_csv('data.csv')
|
||||
|
||||
# Read with custom delimiter
|
||||
adata = ad.read_csv('data.tsv', delimiter='\t')
|
||||
|
||||
# Specify that first column is row names
|
||||
adata = ad.read_csv('data.csv', first_column_names=True)
|
||||
```
|
||||
|
||||
### Excel
|
||||
```python
# Read an Excel sheet (the sheet name or index is required)
adata = ad.read_excel('data.xlsx', sheet='Sheet1')
```
|
||||
|
||||
### Matrix Market (MTX)
|
||||
Common format for sparse matrices in genomics.
|
||||
|
||||
```python
import pandas as pd

# Read MTX (loads only the matrix; genes.tsv and barcodes.tsv must be attached manually)
adata = ad.read_mtx('matrix.mtx')

# Transpose if needed (MTX often has genes as rows)
adata = adata.T

# Attach gene and barcode names from the companion files
adata.var_names = pd.read_csv('genes.tsv', header=None, sep='\t')[0]
adata.obs_names = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0]
```
|
||||
|
||||
### 10X Genomics formats

These convenience readers are provided by Scanpy rather than by AnnData itself:

```python
import scanpy as sc

# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Specify genome if multiple present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')
```
|
||||
|
||||
### Loom
|
||||
```python
|
||||
# Read Loom file
|
||||
adata = ad.read_loom('data.loom')
|
||||
|
||||
# Read with specific observation and variable annotations
|
||||
adata = ad.read_loom(
|
||||
'data.loom',
|
||||
obs_names='CellID',
|
||||
var_names='Gene'
|
||||
)
|
||||
```
|
||||
|
||||
### Text files
|
||||
```python
|
||||
# Read generic text file
|
||||
adata = ad.read_text('data.txt', delimiter='\t')
|
||||
|
||||
# Read with custom parameters
|
||||
adata = ad.read_text(
|
||||
'data.txt',
|
||||
delimiter=',',
|
||||
first_column_names=True,
|
||||
dtype='float32'
|
||||
)
|
||||
```
|
||||
|
||||
### UMI tools
|
||||
```python
|
||||
# Read UMI tools format
|
||||
adata = ad.read_umi_tools('counts.tsv')
|
||||
```
|
||||
|
||||
### HDF5 (generic)
|
||||
```python
|
||||
# Read from HDF5 file (not h5ad format)
|
||||
adata = ad.read_hdf('data.h5', key='dataset')
|
||||
```
|
||||
|
||||
## Alternative Output Formats
|
||||
|
||||
### CSV
|
||||
```python
|
||||
# Write to CSV files (creates multiple files)
|
||||
adata.write_csvs('output_dir/')
|
||||
|
||||
# This creates:
|
||||
# - output_dir/X.csv (expression matrix)
|
||||
# - output_dir/obs.csv (observation annotations)
|
||||
# - output_dir/var.csv (variable annotations)
|
||||
# - output_dir/uns.csv (unstructured annotations, if possible)
|
||||
|
||||
# Skip certain components
|
||||
adata.write_csvs('output_dir/', skip_data=True) # Skip X matrix
|
||||
```
|
||||
|
||||
### Loom
|
||||
```python
|
||||
# Write to Loom format
|
||||
adata.write_loom('output.loom')
|
||||
```
|
||||
|
||||
## Reading Specific Elements
|
||||
|
||||
For fine-grained control, read specific elements from storage:
|
||||
|
||||
```python
import h5py
from anndata.experimental import read_elem  # exposed as anndata.io.read_elem in newer releases

# read_elem takes an open HDF5/Zarr group or dataset, not a path string
with h5py.File('data.h5ad', 'r') as f:
    # Read just observation annotations
    obs = read_elem(f['obs'])

    # Read specific layer
    layer = read_elem(f['layers/normalized'])

    # Read unstructured data element
    params = read_elem(f['uns/pca_params'])
```
|
||||
|
||||
## Writing Specific Elements
|
||||
|
||||
```python
|
||||
from anndata import write_elem
|
||||
import h5py
|
||||
|
||||
# Write element to existing file
|
||||
with h5py.File('data.h5ad', 'a') as f:
|
||||
write_elem(f, 'new_layer', adata.X.copy())
|
||||
```
|
||||
|
||||
## Lazy Operations
|
||||
|
||||
For very large datasets, use lazy reading to avoid loading entire datasets:
|
||||
|
||||
```python
import h5py
from anndata.experimental import read_elem_lazy

# Lazy read of an on-disk element (returns a dask array)
f = h5py.File('large_data.h5ad', 'r')
X_lazy = read_elem_lazy(f['X'])

# Compute only when needed
subset = X_lazy[:100, :100].compute()
```
|
||||
|
||||
## Common I/O Patterns
|
||||
|
||||
### Convert between formats
|
||||
```python
|
||||
# MTX to H5AD
|
||||
adata = ad.read_mtx('matrix.mtx').T
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# CSV to H5AD
|
||||
adata = ad.read_csv('data.csv')
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# H5AD to Zarr
|
||||
adata = ad.read_h5ad('data.h5ad')
|
||||
adata.write_zarr('data.zarr')
|
||||
```
|
||||
|
||||
### Load metadata without data
|
||||
```python
|
||||
# Backed mode allows inspecting metadata without loading X
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
|
||||
print(adata.obs.columns)
|
||||
print(adata.var.columns)
|
||||
# X is not loaded into memory
|
||||
```
|
||||
|
||||
### Append to existing file
|
||||
```python
# Open in read-write mode
adata = ad.read_h5ad('data.h5ad', backed='r+')

# Modify metadata in memory
adata.obs['new_column'] = ['annotated'] * adata.n_obs

# Persist the changes back to the backing file
adata.write()
```
|
||||
|
||||
### Download from URL
|
||||
```python
import urllib.request
import anndata as ad

# read_h5ad expects a local path, so download remote .h5ad files first
url = 'https://example.com/data.h5ad'
urllib.request.urlretrieve(url, 'local_file.h5ad')
adata = ad.read_h5ad('local_file.h5ad')

# For streaming access to remote data, use a Zarr store (see "Remote Zarr access" above)
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
### Reading
|
||||
- Use `backed='r'` for large files you only need to query
|
||||
- Use `backed='r+'` if you need to modify metadata without loading all data
|
||||
- H5AD format is generally fastest for random access
|
||||
- Zarr is better for cloud storage and parallel access
|
||||
- Consider compression for storage, but note it may slow down reading
|
||||
|
||||
### Writing
|
||||
- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'`
|
||||
- LZF compression is faster but compresses less than GZIP
|
||||
- For Zarr, tune chunk sizes based on access patterns:
|
||||
- Larger chunks for sequential reads
|
||||
- Smaller chunks for random access
|
||||
- Convert string columns to categorical before writing (smaller files)
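
The compression trade-off in practice (a small sketch; the file names are illustrative):

```python
# Heavier compression for archival copies
adata.write_h5ad('archive.h5ad', compression='gzip', compression_opts=9)

# Faster, lighter compression for working files
adata.write_h5ad('working.h5ad', compression='lzf')
```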
|
||||
|
||||
### Memory management
|
||||
```python
|
||||
# Convert strings to categoricals (reduces file size and memory)
|
||||
adata.strings_to_categoricals()
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# Use sparse matrices for sparse data
|
||||
from scipy.sparse import csr_matrix
|
||||
if isinstance(adata.X, np.ndarray):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5: # If more than 50% zeros
|
||||
adata.X = csr_matrix(adata.X)
|
||||
```
|
||||
|
||||
## Handling Large Datasets
|
||||
|
||||
### Strategy 1: Backed mode
|
||||
```python
|
||||
# Work with dataset larger than RAM
|
||||
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
|
||||
|
||||
# Filter based on metadata (fast, no data loading)
|
||||
filtered = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# Load filtered subset into memory
|
||||
adata_memory = filtered.to_memory()
|
||||
```
|
||||
|
||||
### Strategy 2: Chunked processing
|
||||
```python
|
||||
# Process data in chunks
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
|
||||
chunk_size = 1000
|
||||
results = []
|
||||
|
||||
for i in range(0, adata.n_obs, chunk_size):
|
||||
chunk = adata[i:i+chunk_size, :].to_memory()
|
||||
# Process chunk
|
||||
result = process(chunk)
|
||||
results.append(result)
|
||||
```
|
||||
|
||||
### Strategy 3: Use AnnCollection
|
||||
```python
import anndata as ad
from anndata.experimental import AnnCollection

# Open each file in backed mode, then build the collection (X stays on disk)
adatas = [ad.read_h5ad(f'dataset_{i}.h5ad', backed='r') for i in range(10)]
collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)

# Process collection lazily
# Data is loaded only when accessed
```
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Issue: Out of memory when reading
|
||||
**Solution**: Use backed mode or read in chunks
|
||||
```python
|
||||
adata = ad.read_h5ad('file.h5ad', backed='r')
|
||||
```
|
||||
|
||||
### Issue: Slow reading from cloud storage
|
||||
**Solution**: Use Zarr format with appropriate chunking
|
||||
```python
|
||||
adata.write_zarr('data.zarr', chunks=(1000, 1000))
|
||||
```
|
||||
|
||||
### Issue: Large file sizes
|
||||
**Solution**: Use compression and convert to sparse/categorical
|
||||
```python
|
||||
adata.strings_to_categoricals()
|
||||
from scipy.sparse import csr_matrix
|
||||
adata.X = csr_matrix(adata.X)
|
||||
adata.write_h5ad('compressed.h5ad', compression='gzip')
|
||||
```
|
||||
|
||||
### Issue: Cannot modify backed object
|
||||
**Solution**: Either load to memory or open in 'r+' mode
|
||||
```python
|
||||
# Option 1: Load to memory
|
||||
adata = adata.to_memory()
|
||||
|
||||
# Option 2: Open in read-write mode
|
||||
adata = ad.read_h5ad('file.h5ad', backed='r+')
|
||||
```
|
||||
scientific-packages/anndata/references/manipulation.md (new file, 516 lines)
|
||||
# Data Manipulation
|
||||
|
||||
Operations for transforming, subsetting, and manipulating AnnData objects.
|
||||
|
||||
## Subsetting
|
||||
|
||||
### By indices
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(1000, 2000))
|
||||
|
||||
# Integer indices
|
||||
subset = adata[0:100, 0:500] # First 100 obs, first 500 vars
|
||||
|
||||
# List of indices
|
||||
obs_indices = [0, 10, 20, 30, 40]
|
||||
var_indices = [0, 1, 2, 3, 4]
|
||||
subset = adata[obs_indices, var_indices]
|
||||
|
||||
# Single observation or variable
|
||||
single_obs = adata[0, :]
|
||||
single_var = adata[:, 0]
|
||||
```
|
||||
|
||||
### By names
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create with named indices
|
||||
obs_names = [f'cell_{i}' for i in range(1000)]
|
||||
var_names = [f'gene_{i}' for i in range(2000)]
|
||||
adata = ad.AnnData(
|
||||
X=np.random.rand(1000, 2000),
|
||||
obs=pd.DataFrame(index=obs_names),
|
||||
var=pd.DataFrame(index=var_names)
|
||||
)
|
||||
|
||||
# Subset by observation names
|
||||
subset = adata[['cell_0', 'cell_1', 'cell_2'], :]
|
||||
|
||||
# Subset by variable names
|
||||
subset = adata[:, ['gene_0', 'gene_10', 'gene_20']]
|
||||
|
||||
# Both axes
|
||||
subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']]
|
||||
```
|
||||
|
||||
### By boolean masks
|
||||
```python
|
||||
# Create boolean masks
|
||||
high_count_obs = np.random.rand(1000) > 0.5
|
||||
high_var_genes = np.random.rand(2000) > 0.7
|
||||
|
||||
# Subset using masks
|
||||
subset = adata[high_count_obs, :]
|
||||
subset = adata[:, high_var_genes]
|
||||
subset = adata[high_count_obs, high_var_genes]
|
||||
```
|
||||
|
||||
### By metadata conditions
|
||||
```python
|
||||
# Add metadata
|
||||
adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000)
|
||||
adata.obs['quality_score'] = np.random.rand(1000)
|
||||
adata.var['highly_variable'] = np.random.rand(2000) > 0.8
|
||||
|
||||
# Filter by cell type
|
||||
type_a_cells = adata[adata.obs['cell_type'] == 'A']
|
||||
|
||||
# Filter by multiple conditions
|
||||
high_quality_a_cells = adata[
|
||||
(adata.obs['cell_type'] == 'A') &
|
||||
(adata.obs['quality_score'] > 0.7)
|
||||
]
|
||||
|
||||
# Filter by variable metadata
|
||||
hv_genes = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Complex conditions
|
||||
filtered = adata[
|
||||
(adata.obs['quality_score'] > 0.5) &
|
||||
(adata.obs['cell_type'].isin(['A', 'B'])),
|
||||
adata.var['highly_variable']
|
||||
]
|
||||
```
|
||||
|
||||
## Transposition
|
||||
|
||||
```python
|
||||
# Transpose AnnData object (swap observations and variables)
|
||||
adata_T = adata.T
|
||||
|
||||
# Shape changes
|
||||
print(adata.shape) # (1000, 2000)
|
||||
print(adata_T.shape) # (2000, 1000)
|
||||
|
||||
# obs and var are swapped
|
||||
print(adata.obs.head()) # Observation metadata
|
||||
print(adata_T.var.head()) # Same data, now as variable metadata
|
||||
|
||||
# Useful when data is in opposite orientation
|
||||
# Common with some file formats where genes are rows
|
||||
```
|
||||
|
||||
## Copying
|
||||
|
||||
### Full copy
|
||||
```python
|
||||
# Create independent copy
|
||||
adata_copy = adata.copy()
|
||||
|
||||
# Modifications to copy don't affect original
|
||||
adata_copy.obs['new_column'] = 1
|
||||
print('new_column' in adata.obs.columns) # False
|
||||
```
|
||||
|
||||
### Shallow copy
|
||||
```python
|
||||
# View (doesn't copy data, modifications affect original)
|
||||
adata_view = adata[0:100, :]
|
||||
|
||||
# Check if object is a view
|
||||
print(adata_view.is_view) # True
|
||||
|
||||
# Convert view to independent copy
|
||||
adata_independent = adata_view.copy()
|
||||
print(adata_independent.is_view) # False
|
||||
```
|
||||
|
||||
## Renaming
|
||||
|
||||
### Rename observations and variables
|
||||
```python
|
||||
# Rename all observations
|
||||
adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)]
|
||||
|
||||
# Rename all variables
|
||||
adata.var_names = [f'new_gene_{i}' for i in range(adata.n_vars)]
|
||||
|
||||
# Make names unique (add suffix to duplicates)
|
||||
adata.obs_names_make_unique()
|
||||
adata.var_names_make_unique()
|
||||
```
|
||||
|
||||
### Rename categories
|
||||
```python
|
||||
# Create categorical column
|
||||
adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A'])
|
||||
|
||||
# Rename categories positionally (same order as the existing categories)
adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C'])

# Or map old names to new names via pandas
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories({
    'Type_A': 'T_cell',
    'Type_B': 'B_cell',
    'Type_C': 'Monocyte'
})
|
||||
```
|
||||
|
||||
## Type Conversions
|
||||
|
||||
### Strings to categoricals
|
||||
```python
|
||||
# Convert string columns to categorical (more memory efficient)
|
||||
adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500
|
||||
adata.obs['tissue'] = ['brain', 'liver'] * 500
|
||||
|
||||
# Convert all string columns to categorical
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
print(adata.obs['cell_type'].dtype) # category
|
||||
print(adata.obs['tissue'].dtype) # category
|
||||
```
|
||||
|
||||
### Sparse to dense and vice versa
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Dense to sparse
|
||||
if not isinstance(adata.X, csr_matrix):
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# Sparse to dense
|
||||
if isinstance(adata.X, csr_matrix):
|
||||
adata.X = adata.X.toarray()
|
||||
|
||||
# Convert layer
|
||||
adata.layers['normalized'] = csr_matrix(adata.layers['normalized'])
|
||||
```
|
||||
|
||||
## Chunked Operations
|
||||
|
||||
Process large datasets in chunks:
|
||||
|
||||
```python
# Iterate through X in chunks of observations
chunk_size = 100
for chunk, start, end in adata.chunked_X(chunk_size):
    # chunk holds rows start:end of X
    result = process_chunk(chunk)
```
|
||||
|
||||
## Extracting Vectors
|
||||
|
||||
### Get observation vectors
|
||||
```python
|
||||
# Get observation metadata as array
|
||||
cell_types = adata.obs_vector('cell_type')
|
||||
|
||||
# Get gene expression across observations
|
||||
actb_expression = adata.obs_vector('ACTB') # If ACTB in var_names
|
||||
```
|
||||
|
||||
### Get variable vectors
|
||||
```python
|
||||
# Get variable metadata as array
|
||||
gene_names = adata.var_vector('gene_name')
|
||||
```

## Adding/Modifying Data

### Add observations
```python
# Create new observations
new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars))
new_obs.var_names = adata.var_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_obs], axis=0)
```

### Add variables
```python
# Create new variables
new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100))
new_vars.obs_names = adata.obs_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_vars], axis=1)
```

### Add metadata columns
```python
# Add observation annotation
adata.obs['new_score'] = np.random.rand(adata.n_obs)

# Add variable annotation
adata.var['new_label'] = ['label'] * adata.n_vars

# Add from external data
external_data = pd.read_csv('metadata.csv', index_col=0)
adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column']
```

### Add layers
```python
# Add new layers
adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape)
adata.layers['log_transformed'] = np.log1p(adata.X)

# Replace a layer
adata.layers['normalized'] = new_normalized_data
```

### Add embeddings
```python
# Add PCA
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)

# Add UMAP
adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2)

# Add multiple embeddings
adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2)
adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10)
```

### Add pairwise relationships
```python
from scipy.sparse import csr_matrix

# Add nearest neighbor graph
n_obs = adata.n_obs
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph

# Add distance matrix
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
```
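
The random graphs above are only placeholders; in practice these matrices usually come from a neighbour search on an embedding. A sketch using scikit-learn, assuming `adata.obsm['X_pca']` has already been computed:

```python
from sklearn.neighbors import kneighbors_graph

# Sparse kNN connectivity and distance graphs built from the PCA embedding
adata.obsp['connectivities'] = kneighbors_graph(
    adata.obsm['X_pca'], n_neighbors=15, mode='connectivity'
)
adata.obsp['distances'] = kneighbors_graph(
    adata.obsm['X_pca'], n_neighbors=15, mode='distance'
)
```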

### Add unstructured data
```python
# Add analysis parameters
adata.uns['pca'] = {
    'variance': [0.2, 0.15, 0.1],
    'variance_ratio': [0.4, 0.3, 0.2],
    'params': {'n_comps': 50}
}

# Add color schemes
adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF']
```

## Removing Data

### Remove observations or variables
```python
# Keep only specific observations
keep_obs = adata.obs['quality_score'] > 0.5
adata = adata[keep_obs, :]

# Remove specific variables
remove_vars = adata.var['low_count']
adata = adata[:, ~remove_vars]
```

### Remove metadata columns
```python
# Remove observation column
adata.obs.drop('unwanted_column', axis=1, inplace=True)

# Remove variable column
adata.var.drop('unwanted_column', axis=1, inplace=True)
```

### Remove layers
```python
# Remove specific layer
del adata.layers['unwanted_layer']

# Remove all layers
adata.layers = {}
```

### Remove embeddings
```python
# Remove specific embedding
del adata.obsm['X_tsne']

# Remove all embeddings
adata.obsm = {}
```

### Remove unstructured data
```python
# Remove specific key
del adata.uns['unwanted_key']

# Remove all unstructured data
adata.uns = {}
```

## Reordering

### Sort observations
```python
# Sort by observation metadata
adata = adata[adata.obs.sort_values('quality_score').index, :]

# Sort by observation names
adata = adata[sorted(adata.obs_names), :]
```

### Sort variables
```python
# Sort by variable metadata
adata = adata[:, adata.var.sort_values('gene_name').index]

# Sort by variable names
adata = adata[:, sorted(adata.var_names)]
```

### Reorder to match external list
```python
# Reorder observations to match external list
desired_order = ['cell_10', 'cell_5', 'cell_20', ...]
adata = adata[desired_order, :]

# Reorder variables
desired_genes = ['TP53', 'ACTB', 'GAPDH', ...]
adata = adata[:, desired_genes]
```

## Data Transformations

### Normalize
```python
# These examples assume a dense adata.X (see the sparse-aware sketch below)

# Total count normalization (CPM/TPM-like)
total_counts = adata.X.sum(axis=1)
adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6

# Log transformation
adata.layers['log1p'] = np.log1p(adata.X)

# Z-score normalization
mean = adata.X.mean(axis=0)
std = adata.X.std(axis=0)
adata.layers['scaled'] = (adata.X - mean) / std
```
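
With a sparse `X`, the broadcasting above either densifies the matrix or fails outright. A minimal sparse-aware sketch (the `cpm_normalize` helper name is hypothetical, not part of the AnnData API):

```python
import numpy as np
from scipy.sparse import issparse


def cpm_normalize(adata, target_sum=1e6, out_layer='normalized'):
    """Scale each cell to target_sum total counts, keeping sparse input sparse."""
    counts = np.asarray(adata.X.sum(axis=1)).ravel()
    scale = target_sum / np.maximum(counts, 1)  # guard against empty cells
    if issparse(adata.X):
        adata.layers[out_layer] = adata.X.multiply(scale[:, None]).tocsr()
    else:
        adata.layers[out_layer] = adata.X * scale[:, None]


cpm_normalize(adata)
```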

### Filter
```python
# Filter cells by total counts
total_counts = np.array(adata.X.sum(axis=1)).flatten()
adata.obs['total_counts'] = total_counts
adata = adata[adata.obs['total_counts'] > 1000, :]

# Filter genes by detection rate
detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs
adata.var['detection_rate'] = np.array(detection_rate).flatten()
adata = adata[:, adata.var['detection_rate'] > 0.01]
```

## Working with Views

Views are lightweight references to subsets of data that don't copy the underlying matrix:

```python
# Create view
view = adata[0:100, 0:500]
print(view.is_view)  # True

# Views allow read access
data = view.X

# Modifying a view triggers copy-on-write: AnnData materializes the view as a
# copy (raising ImplicitModificationWarning) instead of writing into the original,
# so make the copy explicit when you intend to modify a subset

# Convert view to independent copy
independent = view.copy()

# Force AnnData to be a copy, not a view
adata = adata.copy()
```

## Merging Metadata

```python
# Merge external metadata
external_metadata = pd.read_csv('additional_metadata.csv', index_col=0)

# Join metadata on the index (DataFrame.join defaults to a left join,
# so all adata observations are kept)
adata.obs = adata.obs.join(external_metadata)

# Equivalent explicit left join via merge
adata.obs = adata.obs.merge(
    external_metadata,
    left_index=True,
    right_index=True,
    how='left'
)
```

## Common Manipulation Patterns

### Quality control filtering
```python
# Calculate QC metrics (np.asarray(...).ravel() keeps this working for sparse X)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.var['n_cells'] = np.asarray((adata.X > 0).sum(axis=0)).ravel()

# Filter low-quality cells
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['total_counts'] < 50000, :]

# Filter rarely detected genes
adata = adata[:, adata.var['n_cells'] >= 3]
```

### Select highly variable genes
```python
# Mark highly variable genes (simple variance cutoff; assumes dense X)
gene_variance = np.var(adata.X, axis=0)
adata.var['variance'] = np.array(gene_variance).flatten()
adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90)

# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']].copy()
```

### Downsample
```python
# Random sampling of observations
np.random.seed(42)
n_sample = 500
sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False)
adata_downsampled = adata[sample_indices, :].copy()

# Stratified sampling by cell type
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
    range(adata.n_obs),
    test_size=0.2,
    stratify=adata.obs['cell_type']
)
adata_train = adata[train_idx, :].copy()
adata_test = adata[test_idx, :].copy()
```

### Split train/test
```python
# Random train/test split
np.random.seed(42)
n_obs = adata.n_obs
train_size = int(0.8 * n_obs)
indices = np.random.permutation(n_obs)
train_indices = indices[:train_size]
test_indices = indices[train_size:]

adata_train = adata[train_indices, :].copy()
adata_test = adata[test_indices, :].copy()
```

# AnnData Workflows and Best Practices

## Common Workflows

### 1. Single-Cell RNA-seq Analysis Workflow

#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd

# Load from 10X format (matrix.mtx is genes x cells, so transpose to cells x genes)
adata = ad.read_mtx('matrix.mtx').T
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]

# Or load from pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```

#### Quality Control
```python
# Calculate QC metrics (sparse-safe)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()

# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]

# Filter genes
min_cells = 3
adata = adata[:, np.asarray((adata.X > 0).sum(axis=0)).ravel() >= min_cells]
```

#### Normalization and Preprocessing
```python
# (dense adata.X assumed throughout this snippet)

# Store raw counts
adata.layers['counts'] = adata.X.copy()

# Normalize to 10,000 counts per cell
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4

# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']

# Identify highly variable genes
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```

#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)

# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}

# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```

#### Clustering
```python
# Store cluster assignments
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])

# Store cluster centroids (varm entries must have n_vars rows, i.e. genes x clusters)
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids
```

#### Save Results
```python
# Save complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```

### 2. Batch Integration Workflow

```python
import anndata as ad

# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    index_unique='-'
)

# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)

# Store corrected embeddings
adata.obsm['X_pca_corrected'] = corrected_pca
adata.obsm['X_umap_corrected'] = corrected_umap
```
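
Any correction method's output can be stored the same way as the `corrected_pca` placeholder above. Purely as an illustration (not a substitute for Harmony or Scanorama), here is a sketch that centres each batch's PCA coordinates, assuming `adata.obsm['X_pca']` already exists:

```python
import numpy as np

# Crude per-batch centering of the PCA embedding, stored under a new key
X_pca = np.asarray(adata.obsm['X_pca'])
corrected_pca = X_pca.copy()
for batch in adata.obs['batch'].unique():
    mask = (adata.obs['batch'] == batch).to_numpy()
    corrected_pca[mask] -= corrected_pca[mask].mean(axis=0)
adata.obsm['X_pca_corrected'] = corrected_pca
```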

### 3. Memory-Efficient Large Dataset Workflow

```python
import anndata as ad

# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")

# Work with chunks (chunked_X yields (chunk, start, end) tuples)
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process chunk (process_chunk is a placeholder for your own function)
    result = process_chunk(chunk)

# Close file when done
adata.file.close()
```

### 4. Multi-Dataset Comparison Workflow

```python
import anndata as ad
import numpy as np

# Load datasets
datasets = {
    'study1': ad.read_h5ad('study1.h5ad'),
    'study2': ad.read_h5ad('study2.h5ad'),
    'study3': ad.read_h5ad('study3.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(datasets.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(datasets.keys()),
    merge='first'
)

# Handle missing data (outer joins pad dense matrices with NaN; assumes dense X)
combined.X[np.isnan(combined.X)] = 0

# Add dataset-specific metadata
combined.uns['datasets'] = {
    'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
    'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
    'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```

## Best Practices

### Memory Management

#### Use Sparse Matrices
```python
import numpy as np
from scipy.sparse import csr_matrix

# Convert to sparse if the data is mostly zeros
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.3:  # less than 30% non-zero
    adata.X = csr_matrix(adata.X)
```

#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Only load what you need
subset = adata[:1000, :500].copy()  # now in memory
```

#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()

# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```

### Data Organization

#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```

#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates

# Gene loadings
adata.varm['PCs'] = principal_components
```

#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'min_cells': 3,
    'date': '2024-01-15'
}

# Store analysis results
adata.uns['differential_expression'] = {
    'method': 't-test',
    'p_value_threshold': 0.05
}
```

### Subsetting and Views

#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell']  # view
print(subset.is_view)  # True

# Views are memory efficient, but modifying one triggers copy-on-write:
# the view is silently materialized as a copy (ImplicitModificationWarning)
subset.obs['new_column'] = value  # 'value' is a placeholder

# Create an independent copy explicitly when you intend to modify the subset
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```

#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()

# Good - combine the masks in a single subsetting step
result = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
].copy()
```

### File I/O

#### Use Compression
```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```

#### Choose the Right Format
```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')

# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')

# Loom for compatibility with other tools
adata.write_loom('data.loom')
```

#### Close File Connections
```python
# Use try/finally so the backing file is always closed
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
    # Work with data
    process(adata)
finally:
    adata.file.close()
```
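
If this pattern recurs, a small wrapper makes it reusable. A sketch with a hypothetical `backed_h5ad` helper (not part of the AnnData API):

```python
from contextlib import contextmanager

import anndata as ad


@contextmanager
def backed_h5ad(path, mode='r'):
    """Open an .h5ad file in backed mode and close the backing file on exit."""
    adata = ad.read_h5ad(path, backed=mode)
    try:
        yield adata
    finally:
        adata.file.close()


with backed_h5ad('file.h5ad') as adata:
    print(adata.shape)
```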

### Concatenation

#### Choose Appropriate Join Strategy
```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join - all features (keeps all data, pads missing entries
# with zeros for sparse matrices and NaN for dense arrays)
combined = ad.concat([adata1, adata2], join='outer')
```
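
A toy sketch of the difference, using throwaway objects so the resulting shapes are easy to see:

```python
import anndata as ad
import numpy as np

a = ad.AnnData(np.ones((2, 3)))
a.obs_names = ['c1', 'c2']
a.var_names = ['g1', 'g2', 'g3']

b = ad.AnnData(np.ones((2, 2)))
b.obs_names = ['c3', 'c4']
b.var_names = ['g2', 'g3']

print(ad.concat([a, b], join='inner').shape)  # (4, 2) - only shared genes
print(ad.concat([a, b], join='outer').shape)  # (4, 3) - union of genes
```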

#### Track Data Sources
```python
# Add source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Make indices unique
combined = ad.concat(
    [adata1, adata2, adata3],
    index_unique='-'
)
```

#### Handle Variable-Specific Metadata
```python
# Use a merge strategy for var annotations
combined = ad.concat(
    [adata1, adata2],
    merge='same',  # keep only annotations that are identical across datasets
    join='outer'
)
```

### Naming Conventions

#### Use Consistent Naming
```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']

# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']

# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```

#### Make Indices Unique
```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```

### Error Handling

#### Validate Data Structure
```python
# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"

# Check for NaN values (assumes dense X; see the sparse-aware sketch below)
if np.isnan(adata.X).any():
    print("Warning: NaN values detected")

# Check for negative values in count data
if (adata.X < 0).any():
    print("Warning: Negative values in count data")
```
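
For sparse matrices, `np.isnan` cannot be applied to the matrix object directly. A sketch of checks that work either way (the `basic_checks` helper name is hypothetical):

```python
import numpy as np
from scipy.sparse import issparse


def basic_checks(adata):
    """Sanity checks on X that work for dense or sparse storage."""
    assert adata.n_obs > 0 and adata.n_vars > 0, "Empty AnnData object"
    values = adata.X.data if issparse(adata.X) else np.asarray(adata.X)
    if np.isnan(values).any():
        print("Warning: NaN values detected")
    if (values < 0).any():
        print("Warning: negative values in count data")


basic_checks(adata)
```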

#### Handle Missing Data
```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
    print("Warning: Missing cell type annotations")
    # Fill or remove
    adata = adata[~adata.obs['cell_type'].isna()]
```

## Common Pitfalls

### 1. Forgetting to Copy Views
```python
# RISKY - assigning to a view relies on implicit copy-on-write
# (ImplicitModificationWarning; easy to misread which object actually changed)
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data

# GOOD - make the copy explicit, then modify
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data  # only changes the copy
```

### 2. Mixing Backed and In-Memory Operations
```python
# BAD - trying to modify data opened in read-only backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100  # Error: can't modify backed data

# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100  # Works
```

### 3. Not Using Categoricals for Metadata
```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000

# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```

### 4. Incorrect Concatenation Axis
```python
# Concatenating observations (cells) - the usual case
combined = ad.concat([adata1, adata2], axis=0)

# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1)
```

### 5. Not Preserving Raw Data
```python
# BAD - loses original data
adata.X = normalized_data

# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```